Earlier today, AWS announced that it would begin rebooting instances across its EC2 service over the coming days, likely in order to patch its underlying (Xen) hypervisors. This maintenance event is happening on very short notice; here’s what you need to know as an AWS (and Scalr) customer.
What You Need To Know
With earlier AWS maintenance events, it was possible to stop and restart an instance in order to have it migrate to a patched host (and avoid an uncontrolled reboot). This time, however, AWS is not guaranteeing that restarting an instance will have that effect.
Here’s why. AWS hosts can be split into two groups: hosts that have already been patched or don’t need the patch (let’s call these “good hosts”), and hosts that still need the patch (let’s call these “bad hosts”).
If one of your instances is on a “bad host”, you’ll want to move it to a “good host” by stopping it and restarting it. Unfortunately, so does every other AWS customer. When the “good host” capacity runs out (and it may already have), instances you restart will land on “bad hosts” again, and will still have to go through a reboot.
Are Your Instances Affected?
You can see which (if any) of your instances are affected by logging in to the EC2 Console and opening the “Events” tab. Be mindful that this list may not be updated in real time (though AWS is reportedly working on this).
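If you would rather check from a script than from the console, the same information is exposed by the EC2 DescribeInstanceStatus API. The following is a minimal sketch, assuming you use Python with the boto3 SDK and have AWS credentials configured (neither is required by anything in this post). The function names `scheduled_events` and `fetch_scheduled_events` are ours, for illustration only.

```python
from typing import Any, Dict, List


def scheduled_events(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a DescribeInstanceStatus response into one row per scheduled event."""
    rows = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            rows.append({
                "instance_id": status["InstanceId"],
                "code": event.get("Code"),            # e.g. "system-reboot"
                "not_before": event.get("NotBefore"),  # earliest start, if given
                "description": event.get("Description"),
            })
    return rows


def fetch_scheduled_events(region: str) -> List[Dict[str, Any]]:
    """Query EC2 for scheduled events (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parser above works without boto3

    client = boto3.client("ec2", region_name=region)
    paginator = client.get_paginator("describe_instance_status")
    rows = []
    # IncludeAllInstances also returns instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        rows.extend(scheduled_events(page))
    return rows
```

Calling `fetch_scheduled_events("us-east-1")` would return one dictionary per scheduled event in that region. As with the console, the results may lag behind what AWS has actually scheduled.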
What Can You Do?
For critical instances (ones hosting services that you can’t afford to have go down unexpectedly), it’s worthwhile to stop and restart them as soon as possible, on the off chance that the instance restarts on a “good host”.
Note that when the patching begins (in the coming days), it’s possible that more “good host” capacity will become available, so you should definitely consider re-trying later if you can’t get one of your instances migrated to a “good host” just now.
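The stop-and-restart attempt can be scripted so that it is easy to retry later. Here is a rough sketch, again assuming Python with boto3 (our choice, not something AWS or Scalr prescribes). `stop_start` and `has_scheduled_event` are illustrative names. The code treats “no scheduled events after the restart” as a sign that the instance landed on a “good host”. That is only a heuristic, since the event list may not be updated in real time. Note that only EBS-backed instances can be stopped, and that data on ephemeral volumes does not survive a stop.

```python
from typing import Any


def has_scheduled_event(client: Any, instance_id: str) -> bool:
    """Return True if EC2 still reports any scheduled event for the instance."""
    resp = client.describe_instance_status(
        InstanceIds=[instance_id], IncludeAllInstances=True
    )
    return any(s.get("Events") for s in resp.get("InstanceStatuses", []))


def stop_start(client: Any, instance_id: str) -> bool:
    """Stop and start one instance; return True if no event remains afterwards.

    `client` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    client.stop_instances(InstanceIds=[instance_id])
    client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    client.start_instances(InstanceIds=[instance_id])
    client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    # Heuristic only: the event list can lag behind the actual host state.
    return not has_scheduled_event(client, instance_id)
```

If `stop_start` returns False, the instance most likely landed on a “bad host” again, and it is worth calling it again later, once more patched capacity is available.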
Consider Transitioning To An Unaffected Instance Type
Finally, if that’s an option for you, you might want to transition your instances to one of the newer generation instance types that aren’t affected (e.g. m3’s). Doing so requires terminating the instance and launching a replacement, which means that:
If you aren’t automating the deployment of software on your instances (e.g. using Chef), and if you’ve made modifications to your instances since launching them (such as installing packages), you should re-image first (using an instance snapshot).
Data stored on ephemeral volumes will be lost unless you first copy it elsewhere, e.g. to a persistent EBS volume.
If you decide to go down this route, then here’s how Scalr lets you do the migration in a controlled manner (by migrating a Farm Role to a new instance type):
Snapshot the instance. Use the following options:
Choose “Replace Role on all Farms”
Check “Do not replace already running servers”
Don’t change the Role name
Wait for the snapshot to complete
Update the Instance Type
Increase “Minimum Instances” in the Scaling Tab by 1
Wait for the new instance to come online
Check that the new instance is functional, and migrate the data from one of your old instances to the new one
Terminate the old instance you migrated (Scalr will then launch an additional new instance). Repeat the migrate-data-then-terminate process until you have migrated all your instances.
Once you’re done, reset “Minimum Instances” to its original value
What If You Can’t Migrate?
If you’re unable to migrate before the maintenance starts, you’ll need to closely monitor your instances throughout the maintenance event, and ensure that as soon as your instances come back up, all the services you expect to have running (e.g. web servers) are indeed running.
In other words, you need to make sure you have init scripts for anything you want running on those instances, because when your instance is rebooted, all the running processes will be stopped.
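One simple way to verify that services came back after a reboot is to check that their TCP ports accept connections. Below is a minimal sketch in Python using only the standard library. The function names and the service-to-port mapping are illustrative assumptions, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket
from typing import Dict


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host: str, services: Dict[str, int]) -> Dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}
```

For example, `check_services("10.0.0.12", {"http": 80, "ssh": 22})` (a made-up address) returns a dictionary telling you which of the two ports responded. Run it in a loop during the maintenance window to spot services that failed to start.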
What Is Going On?
At this point, we can only speculate about what exactly is going on. The very short timeline leads us to believe AWS is dealing with a security issue (it might be one of Xen Security Advisories #104 through #106, or #108; the last of these is still confidential at this point, and is scheduled to be published on October 1st).