Troubleshooting Failed EC2 Instance Status Checks

There are two status checks for EC2 instances: the System Status Check and the Instance Status Check. The System Status Check reports the health of the underlying AWS infrastructure that hosts the instance. If this check is failing, the only option for an end user is to stop and start the instance so that it migrates to a healthy host. The Instance Status Check reports the health of the instance as seen by the hypervisor. This check validates that the operating system on the instance is accepting traffic. Rebooting the instance, or terminating it if it belongs to an Auto Scaling group, usually resolves a failed Instance Status Check. But what if it is a recurring problem? How can you troubleshoot an EC2 instance with no network connectivity?
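To make the distinction between the two checks concrete, here is a minimal sketch that reads both results for one instance and maps them to the actions described above. It assumes `ec2` is a boto3 EC2 client (or any object with the same `describe_instance_status` method); the function name `diagnose_status_checks` and the wording of the suggested actions are illustrative, not part of any AWS API.

```python
def diagnose_status_checks(ec2, instance_id):
    """Return (failed_check, suggested_action) for one instance.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    response = ec2.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,  # also report instances that are not running
    )
    statuses = response["InstanceStatuses"]
    if not statuses:
        raise LookupError(f"no status returned for {instance_id}")

    system = statuses[0]["SystemStatus"]["Status"]
    instance = statuses[0]["InstanceStatus"]["Status"]

    # A failing system check points at the underlying host.
    if system == "impaired":
        return "system", "stop and start the instance to move it to a healthy host"
    # A failing instance check points at the OS or its network configuration.
    if instance == "impaired":
        return "instance", "reboot the instance, then investigate the OS if it recurs"
    if system == "ok" and instance == "ok":
        return None, "no action needed"
    # Any other value (for example while checks are still initializing).
    return "unknown", "wait for the checks to finish and look again"
```

With boto3 installed and credentials configured, the first argument would typically be `boto3.client("ec2")`.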

I recently ran into a bizarre problem while working with a customer: we had dozens of instances that started failing the Instance Status Check at around the same time. We first tried rebooting the instances, but after an hour or so we would see the same problem again. After a couple of rounds, we decided to contact AWS Support, and the support engineer shared an interesting technique for accessing these instances: attach a second network interface. It seems like a simple idea, but it had never occurred to me. A second network interface turned out to be the key to solving our problem.

Attaching a Secondary Interface in EC2

Here are the steps for reference.

1. Make a note of the Subnet ID that the instance is using and go to the Network Interfaces tab in the EC2 console.

2. In the Network Interfaces tab, click "Create Network Interface" and fill out the form with the correct Subnet and Security Group. Optionally add a description and a static private IP.

3. I found that the easiest way to attach the Network Interface was to go back to the Instance in the EC2 console and select "Attach Network Interface" in the Actions menu.

4. Select the Interface that was created in the previous step.

The Instance should now have two private IPs.

Assuming the OS on the Instance is running, the Instance should now be accessible via the Secondary IP.
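For anyone who has to repeat this across dozens of instances, the console steps above can be sketched as a script. This is only an outline under a few assumptions: `ec2` is a boto3 EC2 client, the new interface reuses the security groups already on the instance (the console steps leave that choice to you), and the helper name `attach_secondary_interface` is hypothetical.

```python
def attach_secondary_interface(ec2, instance_id, description="troubleshooting access"):
    """Create a network interface in the instance's subnet and attach it.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns (network_interface_id, private_ip).
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    # Step 1: note the subnet the instance is using.
    subnet_id = instance["SubnetId"]
    # Assumption: reuse the instance's own security groups for the new interface.
    group_ids = [group["GroupId"] for group in instance["SecurityGroups"]]

    # Step 2: create the interface with that subnet and those security groups.
    eni = ec2.create_network_interface(
        SubnetId=subnet_id,
        Groups=group_ids,
        Description=description,
    )["NetworkInterface"]

    # Steps 3 and 4: attach it to the instance at the first unused device index.
    used = {
        ni["Attachment"]["DeviceIndex"]
        for ni in instance.get("NetworkInterfaces", [])
    }
    device_index = next(i for i in range(len(used) + 1) if i not in used)
    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterfaceId"],
        InstanceId=instance_id,
        DeviceIndex=device_index,
    )
    return eni["NetworkInterfaceId"], eni.get("PrivateIpAddress")
```

The returned private IP is the secondary address to connect to. Creating the interface in the instance's own subnet also keeps it in the same Availability Zone, which attaching requires.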

After attaching the network interface and logging in, we were able to see exactly what had happened and recreate the issue on test instances. The issue was related to changing the system time. At boot, the time zone was set to EST, but the clock was set to GMT. When we updated the time from GMT to EST, the next DHCP request would fail, and the network interface would fall back to an autoconfigured link-local IP address (169.254.x.x). Ultimately, this was fixed by updating the Xen guest agent and setting a registry key as described here.
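Once you are logged in through the secondary interface, a quick way to confirm this symptom is to test whether the primary interface's address falls in the 169.254.0.0/16 link-local range. The sketch below uses only Python's standard `ipaddress` module; the function name `is_autoconfigured` and the sample addresses are illustrative, and how you collect the interface's address is left to the tooling on the instance.

```python
import ipaddress


def is_autoconfigured(address):
    """Return True if `address` is an IPv4 link-local (169.254.0.0/16) address."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and ip.is_link_local


# Illustrative addresses: one healthy private IP and one autoconfigured IP.
for candidate in ("10.0.1.25", "169.254.10.20"):
    label = "autoconfigured" if is_autoconfigured(candidate) else "normal"
    print(f"{candidate}: {label}")
```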

This technique may not work in every case, but it can be effective for instances with misconfigured or accidentally disabled network interfaces.
