Introduction
Troubleshooting issues in an Auto Scaling environment is rarely easy, especially when instances are terminated before anyone can log in and inspect them. The challenge grows even larger when those instances run in private subnets and do not allow SSH access.
In this troubleshooting-focused blog, we walk through a real-world scenario where a new application deployment caused EC2 instances to fail Application Load Balancer (ALB) health checks. As a result, Auto Scaling kept replacing instances in a loop. More importantly, this article explains how to troubleshoot the issue quickly and safely, without changing the overall architecture.
Environment Overview
Before looking at the problem, it helps to understand the environment:
- A CMS application running on Amazon EC2
- EC2 instances managed by an Auto Scaling group
- Instances placed in private subnets
- An Application Load Balancer deployed in public subnets
- AWS Systems Manager Session Manager used for access
- Application logs sent to Amazon CloudWatch Logs
This setup follows common AWS best practices for security, scalability, and operations.
Problem Description
After deploying a new version of the CMS application, the team noticed unusual behavior:
- New EC2 instances launched successfully
- ALB health checks started failing soon after launch
- The ALB marked the instances as unhealthy
- Auto Scaling terminated and replaced those instances
- The same pattern repeated again and again
At first glance, everything appeared to work as designed. Auto Scaling replaced unhealthy instances, and the ALB protected users from bad targets. However, the application never reached a stable state.
CloudWatch Logs showed that the application started, but they did not show a clear error that explained the health check failures.
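When the logs are in CloudWatch, tailing them live is often faster than clicking through the console. A minimal sketch using the AWS CLI v2, assuming a log group named /cms/application (a placeholder; substitute the group your instances actually write to):

```shell
# Tail the application's CloudWatch log group in near real time.
# /cms/application is an assumed name for this environment.
aws logs tail /cms/application --since 15m --follow
```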
Why Troubleshooting Was Difficult
Several basic checks were performed first. For example, the team reviewed application logs, verified health check settings, and confirmed security group rules. However, none of these steps revealed the real issue.
The main problem was timing.
Auto Scaling terminated the unhealthy instances too quickly.
Because of this, there was no chance to:
- Log in to the instance
- Check whether the application was listening on the correct port
- Test the health check endpoint locally
- Review configuration files and environment variables
As a result, traditional troubleshooting methods were not enough.
Key Troubleshooting Insight
When Auto Scaling replaces instances too fast, adding more logs or redeploying the application often does not help. Instead, the focus should be on keeping the failing instance alive long enough to inspect it.
Therefore, the main goal became clear:
Temporarily stop Auto Scaling from terminating unhealthy instances, while still protecting production traffic.
Step-by-Step Troubleshooting Approach
Step 1: Suspend Instance Termination
The first step was to suspend the Terminate process in the Auto Scaling group.
As a result:
- Auto Scaling continued launching instances
- The ALB continued running health checks
- Unhealthy instances were no longer terminated automatically
This created a safe window for troubleshooting. In addition, the ALB still routed traffic only to healthy targets, so users were not affected.
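Suspending only the Terminate process can be done with a single AWS CLI call. The group name cms-asg below is a placeholder for your Auto Scaling group:

```shell
# Suspend only the Terminate process so Auto Scaling stops replacing
# unhealthy instances; launches and ALB health checks continue as normal.
aws autoscaling suspend-processes \
  --auto-scaling-group-name cms-asg \
  --scaling-processes Terminate

# Verify which processes are currently suspended.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names cms-asg \
  --query "AutoScalingGroups[0].SuspendedProcesses"
```

Suspending Terminate alone, rather than all processes, keeps the rest of the group's behavior intact during the investigation.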
Step 2: Log In Using Session Manager
Next, the team logged in to one of the unhealthy instances using AWS Systems Manager Session Manager.
This worked well because:
- The instance was in a private subnet
- No SSH access or bastion host was required
- Access was controlled using IAM permissions
Now, troubleshooting could happen directly on the instance that failed the health checks.
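Opening a session from the CLI looks like the following; the instance ID is a placeholder, and the instance needs the SSM agent plus an instance profile that permits Session Manager:

```shell
# Open an interactive shell on the unhealthy instance via Session Manager.
# No SSH keys, bastion host, or inbound security group rules are required.
aws ssm start-session --target i-0123456789abcdef0
```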
Step 3: Check Application and Health Endpoints
Once logged in, the team performed several direct checks:
- Confirmed that the application process was running
- Verified the application was listening on the expected port
- Tested the ALB health check endpoint locally
- Reviewed startup logs and configuration files
- Checked environment variables and external dependencies
Testing the health endpoint locally, for example, quickly showed how the application responded during startup.
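The checks above can be run directly on the instance. The port (8080), health check path (/health), and service name (cms.service) are assumptions for illustration; use the values your target group and systemd unit are actually configured with:

```shell
# Is anything listening on the port the target group expects?
sudo ss -tlnp | grep 8080

# Does the health check endpoint return 200, and how quickly?
curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  http://localhost:8080/health

# Review the most recent startup output from the application service.
sudo journalctl -u cms.service --since "10 minutes ago"
```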
Root Cause and Fix
The investigation revealed a mismatch between the new application version and the ALB health check configuration. The new version needed more time to initialize, and during that startup window the health check endpoint returned a non-200 response, so the ALB marked each new instance unhealthy before the application was ready.
After updating the application configuration and redeploying:
- New instances passed ALB health checks
- Targets moved to a healthy state
- Traffic flowed normally
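In this case the fix was on the application side, but when an application legitimately needs more startup time, the infrastructure can also be given more patience. A sketch of that option, with illustrative values and the same placeholder group name (cms-asg); the target group ARN would come from your own environment:

```shell
# Give new instances a longer grace period before Auto Scaling acts
# on ELB health status (value in seconds; 300 is illustrative).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name cms-asg \
  --health-check-grace-period 300

# Tolerate more consecutive failures before the ALB marks a target
# unhealthy (both values are illustrative).
aws elbv2 modify-target-group \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --unhealthy-threshold-count 5 \
  --health-check-interval-seconds 30
```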
Cleanup and Validation
After confirming the fix, the final steps were simple:
- Re-enable the Auto Scaling Terminate process
- Monitor instance health and scaling events
- Confirm the application remained stable under normal load
As a result, the environment returned to normal operation.
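The cleanup can be sketched with two CLI calls, again using cms-asg as a placeholder group name:

```shell
# Resume normal termination behavior once the fix is verified.
aws autoscaling resume-processes \
  --auto-scaling-group-name cms-asg \
  --scaling-processes Terminate

# Watch recent scaling activity to confirm instances stay healthy.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name cms-asg \
  --max-items 5
```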
Key Takeaways
- ALB health check failures can be hard to debug in Auto Scaling environments
- Logs alone do not always show the full picture
- Keeping a failing instance alive is often the fastest way to find the root cause
- Suspending Auto Scaling termination is a safe and effective troubleshooting technique
Conclusion
Modern cloud environments move fast, and so do their failure modes. When Auto Scaling removes instances before you can inspect them, troubleshooting becomes frustrating.
However, by temporarily suspending instance termination and using AWS Systems Manager Session Manager, engineers can gain direct access to failing instances and resolve issues faster. This approach keeps production traffic safe while giving teams the visibility they need to fix problems with confidence.
This technique is a valuable troubleshooting skill for any architect or engineer working with Application Load Balancers and Auto Scaling on AWS.