AWS Outage Impacting Multiple Wyze Services - 12/15/21

Whew! Always harder trying to catch back up. Sorry for my delay!

I helped send a bunch of emails, push notifications, and in-app messages today to folks with devices that had a harder time reconnecting than others (Wyze Video Doorbell, Wyze Thermostat, Wyze Sprinkler Controller, and Wyze Switch). I’m hoping that helps!

Here’s some basic troubleshooting to try for devices I’ve seen mentioned. If this doesn’t do it for you, please reach out to support.wyze.com so we can help (I won’t be super useful if these steps don’t do it).

Wyze Video Doorbell - Flip the breaker associated with the device to power cycle it. Or, if you prefer, disconnect it from the wall and press the Reset button on the back of the device.

Wyze Thermostat - Take the thermostat off its base and put it back on.

Wyze Sprinkler Controller - Unplug the power cable and plug it back in.

Wyze Switch - Either flip the breaker the device is on, or press and hold the switch for 20 seconds and go through the setup process again.

Wyze Robot Vacuum - Take the vacuum off the charging station, then press and hold the Power button to turn it off. Wait 30 seconds, then place it back on the charging station to reboot it.

Wyze Plug - Unplugging it and plugging it back in will power cycle it. If you need to reset it, keep in mind that plugs purchased before 2021 need about 90 seconds before you can start the regular setup flow. We are working on fixing that!

We identified a few device bugs that prevented them from reconnecting as expected. I am confident these bugs will be prioritized for fixes (though I’m not sure how complex they will be).

To address some of the conversation here: sending too many emails at once that link to pages or services can overload those systems with traffic and take them down. Because of this, it takes hours for us to contact all of our customers once we start sending direct messages. While we do send out marketing emails and such regularly, those are often to segmented audiences and they also take hours to complete. It has an interesting effect on launch day traffic!

But that’s why folks were getting the push notifications so late in the day. It was compounded by my initial delay, caused by the optimistic progress reports from AWS before we discovered how complicated the Wyze recovery would be. My key takeaway is to trust no one and build drafts even if I may not need them. That should help speed things up. I am sorry for the delay on that front.

The following is stuff I’m sharing for transparency. I know some people find process details interesting, and I’m also including my next steps to improve outage communication. If that’s not your preferred reading material, please feel free to skip it! :slightly_smiling_face:

I tend to get the Service Status page updated first and then do the community posts, because those are much faster than the emails, in-app notifications, and push notifications. That lets me broadcast information and catch at least some folks as quickly as possible while I work on the other components. Even with everyone working as fast as possible, it can take a LOT of time from start to finish to gather everything I need and have someone review my work (especially since I am newly trained in those systems and don’t have a lot of practice with them). I am also balancing talking to the devs so I stay updated, keeping an eye on the customer experience and reporting internally, and managing the updates on multiple platforms during this time.

I’m not going to say that I’ve had this 100% right in previous outages. But this stress-tested my system in a pretty major way because of these key differences:

  1. It started before I woke up instead of during my day, when people are winding down. Seems like most outages tend to start in the afternoon. This meant I had to catch up on information, figure out what had already been done, and generally chase a moving outage train.

  2. The widespread nature of the outage. The impact scope of past outages generally has not been this significant, which makes them easier to keep up with and balance. The long duration before customers began seeing improvement was a related complication.

  3. We had to wait for answers and updates from AWS, which turned into a giant, urgent game of telephone. Not that it’s fun when an outage is triggered by something on our end, but it is easier to communicate, figure out what’s going on, and get updates quickly. I have a habit of sitting in on the engineering calls so I can pull information for the updates while they work.

I set up an outage SOP and have been maintaining it for a few years now. My next step is to get other folks added into the systems I work in (some, like our email/push notification/in-app notification system, have limited seats available) and then run drills so that the folks I train know what to do, when to do it, and where to go, with confidence.

It was mentioned earlier in this thread that I should be getting more help with these things. That’s a super fair point and a conclusion I also reached. The beginning stages of an outage especially have a LOT going on, and spreading the load will help our community. We have grown enough that I cannot expect to field all of the steps solo for big events anymore.

But I am probably still gonna kick people out when their work days are done as long as things are less volatile. There are things I am willing to do myself but would not ask or expect someone else to do. Staying up until 4 in the morning working is definitely one of them.

I’m reading through the feedback and taking it seriously. This includes the parts that were legitimately under my control. I really appreciate the kindness and candor that many of you approached this with. It’s definitely time for me to flip this part to working smarter instead of harder.
