Introduction
Troubleshooting issues in an Auto Scaling environment is rarely easy, especially when instances are terminated before anyone can log in and inspect them. The challenge grows even larger when those instances run in private subnets and do not allow SSH access.
In this troubleshooting-focused blog, we walk through a real-world scenario where a new application deployment caused EC2 instances to fail Application Load Balancer (ALB) health checks. As a result, Auto Scaling kept replacing instances in a loop. More importantly, this article explains how to troubleshoot the issue quickly and safely, without changing the overall architecture.
Environment Overview
Before looking at the problem, it helps to understand the environment:
- A CMS application running on Amazon EC2
- EC2 instances managed by an Auto Scaling group
- Instances placed in private subnets
- An Application Load Balancer deployed in public subnets
- AWS Systems Manager Session Manager used for access
- Application logs sent to Amazon CloudWatch Logs
This setup follows common AWS best practices for security, scalability, and operations.
Problem Description
After deploying a new version of the CMS application, the team noticed unusual behavior:
- New EC2 instances launched successfully
- ALB health checks started failing soon after launch
- The ALB marked the instances as unhealthy
- Auto Scaling terminated and replaced those instances
- The same pattern repeated again and again
At first glance, everything appeared to work as designed. Auto Scaling replaced unhealthy instances, and the ALB protected users from bad targets. However, the application never reached a stable state.
CloudWatch Logs showed that the application started, but they did not show a clear error that explained the health check failures.
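When the logs are in CloudWatch, tailing them live is often faster than clicking through the console. A minimal sketch using the AWS CLI v2, assuming a log group named /cms/application (a placeholder; substitute the group your instances actually write to):

```shell
# Tail the application's CloudWatch log group in near real time.
# /cms/application is an assumed name for this environment.
aws logs tail /cms/application --since 15m --follow
```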
Why Troubleshooting Was Difficult
Several basic checks were performed first. For example, the team reviewed application logs, verified health check settings, and confirmed security group rules. However, none of these steps revealed the real issue.
The main problem was timing.
Auto Scaling terminated the unhealthy instances too quickly.
Because of this, there was no chance to:
- Log in to the instance
- Check whether the application was listening on the correct port
- Test the health check endpoint locally
- Review configuration files and environment variables
As a result, traditional troubleshooting methods were not enough.
Key Troubleshooting Insight
When Auto Scaling replaces instances too fast, adding more logs or redeploying the application often does not help. Instead, the focus should be on keeping the failing instance alive long enough to inspect it.
Therefore, the main goal became clear:
Temporarily stop Auto Scaling from terminating unhealthy instances, while still protecting production traffic.
Step-by-Step Troubleshooting Approach
Step 1: Suspend Instance Termination
The first step was to suspend the Terminate process in the Auto Scaling group.
As a result:
- Auto Scaling continued launching instances
- The ALB continued running health checks
- Unhealthy instances were no longer terminated automatically
This created a safe window for troubleshooting. In addition, the ALB still routed traffic only to healthy targets, so users were not affected.
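Suspending only the Terminate process can be done with a single AWS CLI call. The group name cms-asg below is a placeholder for your Auto Scaling group:

```shell
# Suspend only the Terminate process so Auto Scaling stops replacing
# unhealthy instances; launches and ALB health checks continue as normal.
aws autoscaling suspend-processes \
  --auto-scaling-group-name cms-asg \
  --scaling-processes Terminate

# Verify which processes are currently suspended.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names cms-asg \
  --query "AutoScalingGroups[0].SuspendedProcesses"
```

Suspending Terminate alone, rather than all processes, keeps the rest of the group's behavior intact during the investigation.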
Step 2: Log In Using Session Manager
Next, the team logged in to one of the unhealthy instances using AWS Systems Manager Session Manager.
This worked well because:
- The instance was in a private subnet
- No SSH access or bastion host was required
- Access was controlled using IAM permissions
Now, troubleshooting could happen directly on the instance that failed the health checks.
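Opening a session from the CLI looks like the following; the instance ID is a placeholder, and the instance needs the SSM agent plus an instance profile that permits Session Manager:

```shell
# Open an interactive shell on the unhealthy instance via Session Manager.
# No SSH keys, bastion host, or inbound security group rules are required.
aws ssm start-session --target i-0123456789abcdef0
```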
Step 3: Check Application and Health Endpoints
Once logged in, the team performed several direct checks:
- Confirmed that the application process was running
- Verified the application was listening on the expected port
- Tested the ALB health check endpoint locally
- Reviewed startup logs and configuration files
- Checked environment variables and external dependencies
Testing the health endpoint locally, for example, quickly showed how the application responded during startup.
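The checks above can be run directly on the instance. The port (8080), health check path (/health), and service name (cms.service) are assumptions for illustration; use the values your target group and systemd unit are actually configured with:

```shell
# Is anything listening on the port the target group expects?
sudo ss -tlnp | grep 8080

# Does the health check endpoint return 200, and how quickly?
curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  http://localhost:8080/health

# Review the most recent startup output from the application service.
sudo journalctl -u cms.service --since "10 minutes ago"
```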
Root Cause and Fix
The investigation revealed a mismatch between the new application version and the ALB health check configuration. The new version needed more time to initialize, and during that startup window the health check endpoint returned a non-200 response, so the ALB marked each new instance unhealthy before the application was ready.
After updating the application configuration and redeploying:
- New instances passed ALB health checks
- Targets moved to a healthy state
- Traffic flowed normally
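In this case the fix was on the application side, but when an application legitimately needs more startup time, the infrastructure can also be given more patience. A sketch of that option, with illustrative values and the same placeholder group name (cms-asg); the target group ARN would come from your own environment:

```shell
# Give new instances a longer grace period before Auto Scaling acts
# on ELB health status (value in seconds; 300 is illustrative).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name cms-asg \
  --health-check-grace-period 300

# Tolerate more consecutive failures before the ALB marks a target
# unhealthy (both values are illustrative).
aws elbv2 modify-target-group \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --unhealthy-threshold-count 5 \
  --health-check-interval-seconds 30
```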
Cleanup and Validation
After confirming the fix, the final steps were simple:
- Re-enable the Auto Scaling Terminate process
- Monitor instance health and scaling events
- Confirm the application remained stable under normal load
As a result, the environment returned to normal operation.
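The cleanup can be sketched with two CLI calls, again using cms-asg as a placeholder group name:

```shell
# Resume normal termination behavior once the fix is verified.
aws autoscaling resume-processes \
  --auto-scaling-group-name cms-asg \
  --scaling-processes Terminate

# Watch recent scaling activity to confirm instances stay healthy.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name cms-asg \
  --max-items 5
```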
Key Takeaways
- ALB health check failures can be hard to debug in Auto Scaling environments
- Logs alone do not always show the full picture
- Keeping a failing instance alive is often the fastest way to find the root cause
- Suspending Auto Scaling termination is a safe and effective troubleshooting technique
Conclusion
Modern cloud environments move fast, and so do their failure modes. When Auto Scaling removes instances before you can inspect them, troubleshooting becomes frustrating.
However, by temporarily suspending instance termination and using AWS Systems Manager Session Manager, engineers can gain direct access to failing instances and resolve issues faster. This approach keeps production traffic safe while giving teams the visibility they need to fix problems with confidence.
This technique is a valuable troubleshooting skill for any architect or engineer working with Application Load Balancers and Auto Scaling on AWS.