# ECS Auto Scaling: High-Resolution Metrics vs. Traditional Scaling Approaches

The launch of high-resolution (20-second) metrics for Amazon ECS reshapes the auto scaling calculus for latency-sensitive financial workloads. This article compares the four primary scaling approaches — target tracking with high resolution, step scaling, scheduled scaling, and predictive — with concrete trade-offs, real numbers, and a decision matrix for production environments.

- URL: https://fernando.moretes.com/blog/ecs-auto-scaling-metricas-de-alta-resolucao-vs-abordagens-tradicionais-amazon-ecs-i

- Markdown: https://fernando.moretes.com/blog/ecs-auto-scaling-metricas-de-alta-resolucao-vs-abordagens-tradicionais-amazon-ecs-i/article.md?lang=en

- Published: 2026-06-18T21:06:38.000Z

- Category: AWS & Cloud

- Tags: ECS, Auto Scaling, CloudWatch, Fargate, FinancialGrade, Observability, Kubernetes, CostOptimization

- Reading time: 9 min

- Source: [Amazon ECS introduces new high-resolution metrics for faster service auto scaling](https://aws.amazon.com/blogs/aws/amazon-ecs-introduces-new-high-resolution-metrics-for-faster-service-auto-scaling/)

---

When Amazon ECS announced support for high-resolution metrics at 20-second intervals and cut scale-out trigger time from 363 to 86 seconds — a 4.2x improvement — the immediate question for anyone operating financial systems wasn't 'should I enable this?' but rather 'does this replace my current strategy or complement it?' I work with payment and trading platforms that need to absorb demand spikes in windows of seconds, not minutes. The answer is not trivial: each scaling approach carries a distinct profile of cost, operational complexity, over-provisioning risk, and failure exposure window. This article runs an honest bake-off across the four viable strategies.

## The Real Problem: The Exposure Window in Financial Systems

In payment systems, the window between the onset of a demand spike and the moment new tasks are ready to receive traffic is the highest operational risk interval. During that period, existing tasks absorb load beyond their nominal sizing — which in Fargate means CPU throttling and in EC2 means host-level resource contention. For a payment gateway processing 8,000 TPS under normal conditions, a 3x spike during Black Friday can push error rates from 0.01% to 3% in under two minutes if scaling doesn't react in time.

The AWS benchmark is revealing in this context: before high-resolution metrics, the full cycle of detection → policy evaluation → Application Auto Scaling trigger → task provisioning averaged 386 seconds with standard target tracking. With 20-second metrics, that cycle dropped to 109 seconds. But the number that actually matters for reliability engineers is the **risk exposure window**: the 277 seconds saved represent nearly 5 fewer minutes of the system operating above nominal capacity.

This has direct implications for baseline sizing. If you kept 40% idle capacity as a safety buffer because scaling took 6 minutes, you can now recalibrate that buffer to 15-20% with the same risk profile — which in Fargate with 200 tasks at 2 vCPU/4GB represents meaningful monthly savings. The trade-off, as always, lives in the configuration details and the incremental CloudWatch cost.

## Measurable Impact of High-Resolution Metrics

- **4.2x** — Faster scale-out trigger. From 363s to 86s in detection and trigger time (AWS benchmark)
- **109s** — Full scaling cycle. From demand spike to new task ready — 72% faster than the previous 386s
- **~20%** — Potential baseline task reduction. Idle capacity buffer can be reduced without compromising availability SLO
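
The headline figures above can be re-derived from the four benchmark timings quoted in this article. The snippet below is only an arithmetic check; the constant names are illustrative and not part of any AWS API.

```python
# Benchmark timings as quoted in the article (seconds).
TRIGGER_BEFORE_S = 363   # scale-out trigger with 60s metrics
TRIGGER_AFTER_S = 86     # scale-out trigger with 20s metrics
CYCLE_BEFORE_S = 386     # full cycle (spike to task ready) with 60s metrics
CYCLE_AFTER_S = 109      # full cycle with 20s metrics


def speedup(before: float, after: float) -> float:
    """How many times faster the new timing is."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage of the original time that was removed."""
    return (before - after) / before * 100


trigger_speedup = speedup(TRIGGER_BEFORE_S, TRIGGER_AFTER_S)
cycle_saved_s = CYCLE_BEFORE_S - CYCLE_AFTER_S
cycle_reduction = reduction_pct(CYCLE_BEFORE_S, CYCLE_AFTER_S)

print(f"trigger speedup: {trigger_speedup:.1f}x")
print(f"cycle time saved: {cycle_saved_s}s ({cycle_reduction:.0f}% shorter)")
```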

## The Four Strategies: What Each One Actually Delivers

**Target Tracking with High Resolution (20s)** is the new default option for reactive workloads. You define a target — say, 60% average CPU — and Application Auto Scaling adjusts task count to maintain that target. With 20-second metrics via `ECSServiceAverageCPUUtilizationHighResolution` or `ECSServiceAverageMemoryUtilizationHighResolution`, the CloudWatch Alarm evaluation period drops from 3 periods of 60s (3 minutes) to 3 periods of 20s (1 minute), which explains most of the gain. The additional cost is real: CloudWatch charges for high-resolution metrics per metric/month beyond the free tier, and each ECS service publishes multiple dimensions.

**Step Scaling** was for years the choice for teams needing granular control over scaling response. You define utilization bands and corresponding actions: if CPU > 70%, add 5 tasks; if CPU > 85%, add 15 tasks. The problem is maintenance complexity: each shift in load profile potentially requires retuning the bands. In financial environments with dozens of services, this becomes technical debt quickly. With high-resolution target tracking delivering aggressive behavior without that manual configuration, the step scaling use case narrows considerably.
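
To make the band logic concrete, here is a minimal sketch of how the two example bands above map a CPU reading to a scaling action. It models only the decision rule, not the Application Auto Scaling service itself, and `step_adjustment` is a hypothetical helper name.

```python
from typing import List, Tuple

# (lower CPU bound in percent, tasks to add), taken from the example above.
BANDS: List[Tuple[float, int]] = [(70.0, 5), (85.0, 15)]


def step_adjustment(cpu_pct: float, bands: List[Tuple[float, int]] = BANDS) -> int:
    """Return the tasks to add for the highest band whose bound is exceeded."""
    adjustment = 0
    for lower_bound, tasks_to_add in sorted(bands):
        if cpu_pct > lower_bound:
            adjustment = tasks_to_add
    return adjustment


for reading in (60.0, 75.0, 90.0):
    print(f"CPU {reading:.0f}% -> add {step_adjustment(reading)} tasks")
```

Every new band is another tuple to calibrate, which is where the maintenance cost described above comes from.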

**Scheduled Scaling** remains indispensable for predictable events: market open at 9am, end-of-day batch at 6pm, scheduled marketing campaigns. It's not an alternative to target tracking — it's a complement. The combination of scheduled scaling to pre-warm capacity + high-resolution target tracking to absorb intraday variation is the pattern I recommend for financial platforms.
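
As a rough sketch of the pre-warm half of that pattern, the function below builds the arguments for the Application Auto Scaling `put_scheduled_action` call. The cluster name, service name, capacities, time zone, and the 08:45 schedule are all assumptions chosen for illustration; the code only builds a dictionary and does not call AWS.

```python
from typing import Any, Dict


def prewarm_action(cluster: str, service: str, name: str, schedule: str,
                   min_tasks: int, max_tasks: int,
                   timezone: str = "America/New_York") -> Dict[str, Any]:
    """Build keyword arguments for put_scheduled_action (hypothetical values)."""
    return {
        "ServiceNamespace": "ecs",
        "ScheduledActionName": name,
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "Schedule": schedule,
        "Timezone": timezone,
        # Raising MinCapacity forces the service up before the event starts.
        "ScalableTargetAction": {"MinCapacity": min_tasks, "MaxCapacity": max_tasks},
    }


# Pre-warm 15 minutes before a 9am market open on weekdays.
market_open = prewarm_action(
    "payments", "gateway", "prewarm-market-open",
    "cron(45 8 ? * MON-FRI *)", min_tasks=120, max_tasks=300,
)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("application-autoscaling").put_scheduled_action(**market_open)
print(market_open["Schedule"])
```

A second scheduled action that lowers `MinCapacity` after the event would hand control back to target tracking for the rest of the day.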

**Predictive Scaling** uses ML to analyze 14 days of historical patterns and provision capacity proactively. It's powerful for workloads with stable seasonality, but has important limitations: it requires at least two weeks of history, doesn't react to unexpected events, and can generate over-provisioning during low-demand periods if historical patterns include atypical spikes.

## Head-to-Head Comparison: ECS Auto Scaling Strategies
| Criterion | High-Res Target Tracking (20s) | Step Scaling (60s) | Scheduled Scaling | Predictive Scaling |
| --- | --- | --- | --- | --- |
| Trigger time (scale-out) | ~86s (AWS benchmark) | ~180-360s typical | Pre-defined (no reaction time) | Anticipatory (minutes ahead) |
| Configuration complexity | Low: 1 metric, 1 target | High: bands, cooldowns, adjustments | Medium: requires event calendar | Low after setup: ML manages |
| Additional CloudWatch cost | Yes: paid high-resolution metrics | No: standard metrics free | No: no additional metrics | Minimal: uses existing history |
| Reaction to unexpected spikes | Excellent: ~86s trigger, ~109s end-to-end | Good: if bands well calibrated | None: follows the schedule only | Weak: doesn't detect anomalies |
| Over-provisioning risk | Low: precise target tracking | Medium: depends on band tuning | High: fixed pre-allocated capacity | Medium-high: atypical historical patterns |
| Suitability for financial systems | High: unpredictable spikes, strict SLO | Medium: useful as granular fallback | High: predictable market events | Medium: stable seasonality workloads |
| Required operational maturity | Low: simple to operate | High: requires experienced SRE | Medium: requires calendar management | Medium: requires ML model validation |

## Technical Configuration: What Actually Changes in the Stack

Enabling high-resolution metrics isn't just a console toggle: it has implications across the entire observability chain. When you activate `20-second resolution metrics` in the Monitoring section of an ECS service, the service's utilization metrics are published to CloudWatch every 20 seconds instead of every 60. (CloudWatch itself only distinguishes standard from high-resolution storage: the `StorageResolution` field accepts 1 or 60, not 20.) Associated alarms therefore need `Period: 20` and `EvaluationPeriods: 3` to get the 1-minute evaluation behavior. Any existing alarm with `Period: 60` will continue working but won't benefit from the higher resolution.

In Application Auto Scaling, the target tracking policy needs to explicitly reference the new metrics: `ECSServiceAverageCPUUtilizationHighResolution` instead of `ECSServiceAverageCPUUtilization`. This is a silent breaking change for anyone using CloudFormation or Terraform without updating the `AWS::ApplicationAutoScaling::ScalingPolicy` resource — the service will continue scaling, but with the old 60-second metric.
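
For reference, this is roughly what the updated policy looks like when expressed as arguments to the `put_scaling_policy` API. It is a sketch under assumptions: the predefined metric name is taken from this article and should be checked against current AWS documentation, and the cluster name, service name, target, and cooldown values are illustrative.

```python
from typing import Any, Dict

# Metric name as given in the article; treat it as an assumption and verify it
# against the Application Auto Scaling documentation before relying on it.
HIGH_RES_CPU_METRIC = "ECSServiceAverageCPUUtilizationHighResolution"


def high_res_cpu_policy(cluster: str, service: str, target_pct: float = 60.0,
                        scale_out_cooldown_s: int = 120,
                        scale_in_cooldown_s: int = 300) -> Dict[str, Any]:
    """Build keyword arguments for put_scaling_policy (hypothetical values)."""
    return {
        "PolicyName": f"{service}-cpu-high-res",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_pct,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": HIGH_RES_CPU_METRIC,
            },
            "ScaleOutCooldown": scale_out_cooldown_s,
            "ScaleInCooldown": scale_in_cooldown_s,
        },
    }


policy = high_res_cpu_policy("payments", "gateway")
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
print(policy["TargetTrackingScalingPolicyConfiguration"]["PredefinedMetricSpecification"])
```

The same field, `PredefinedMetricType`, is the one to change in a CloudFormation or Terraform definition.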

From an IAM perspective, no new permissions are needed: `cloudwatch:PutMetricData` and `application-autoscaling:PutScalingPolicy` already cover the case. For audit purposes in financial environments, I still recommend adding an `aws:RequestedRegion` condition to the IAM policy that grants Application Auto Scaling actions, so that scaling policies cannot be created inadvertently in non-approved regions. That is a cost and compliance vector I've seen in production.
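
A minimal sketch of such a region guard, written as a Python function that renders the IAM policy document. The statement ID, the choice of actions, and the region list are assumptions for illustration; `aws:RequestedRegion` is a standard global condition key.

```python
import json
from typing import Any, Dict, Iterable


def region_restricted_scaling_policy(approved_regions: Iterable[str]) -> Dict[str, Any]:
    """Allow scaling-policy changes only in the approved regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ScalingPoliciesInApprovedRegionsOnly",  # hypothetical Sid
                "Effect": "Allow",
                "Action": [
                    "application-autoscaling:RegisterScalableTarget",
                    "application-autoscaling:PutScalingPolicy",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"aws:RequestedRegion": sorted(approved_regions)}
                },
            }
        ],
    }


iam_policy = region_restricted_scaling_policy(["us-east-1", "sa-east-1"])
print(json.dumps(iam_policy, indent=2))
```

Because this is an `Allow` with a condition, it only limits what this statement grants. An organization-wide guarantee would more likely use an explicit `Deny` with `StringNotEquals` in a service control policy.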

In CloudFormation, the update sequence matters: first update the `AWS::ECS::Service` with the high-resolution metrics configuration, wait for the deployment to complete (the service needs to generate at least one data point at the new resolution), and only then update the `ScalingPolicy`. Making both changes in the same stack update can result in a high-resolution alarm pointing to a service that hasn't yet published metrics at that resolution, generating an `INSUFFICIENT_DATA` period that blocks scaling.
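
If the two updates are driven from a deployment script rather than by hand, the waiting step can be sketched as a small polling loop. The `has_datapoint` callback is hypothetical: in practice it would wrap a CloudWatch query for the service's new metric and return `True` once at least one data point exists.

```python
import time
from typing import Callable


def wait_for_first_datapoint(has_datapoint: Callable[[], bool],
                             timeout_s: float = 300.0,
                             poll_s: float = 20.0,
                             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until has_datapoint() is True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if has_datapoint():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Usage with a stand-in check that succeeds on the third poll.
answers = iter([False, False, True])
ready = wait_for_first_datapoint(lambda: next(answers), sleep=lambda _s: None)
print("safe to update the ScalingPolicy:", ready)
```

Only when this returns `True` would the script go on to update the `ScalingPolicy`; on `False` it should stop and leave the existing policy in place.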

## Decision Flow and Timing: Four ECS Scaling Strategies

Full scaling cycle comparison for each strategy, from spike detection to new task ready for traffic. Timings reflect AWS benchmarks and operational experience in financial environments.

### 🟢 High-Res Target Tracking — 20s metrics

- CloudWatch metric published every 20s
- CloudWatch alarm evaluates 3 × 20s = 60s
- Application Auto Scaling triggers at ~86s
- New ECS task ready at ~109s

### 🟡 Step Scaling — 60s metrics

- CloudWatch metric published every 60s
- CloudWatch alarm evaluates 3 × 60s = 180s
- Application Auto Scaling triggers at ~180-360s
- New ECS task ready at ~386s

### 🔵 Scheduled Scaling — proactive

- Schedule rule (cron or rate expression)
- Pre-warmed tasks ready before the spike (t < 0)

### 🟣 Predictive Scaling — ML-driven

- ML model trained on 14-day history
- Capacity forecast generated minutes ahead
- Tasks pre-provisioned before traffic arrives

### Flows

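- Demand spike -> CloudWatch 20s metric: publishes 20s metric
- CloudWatch 20s metric -> CloudWatch alarm (3 × 20s): evaluates 3 periods
- CloudWatch alarm (3 × 20s) -> Application Auto Scaling: triggers policy
- Application Auto Scaling -> new ECS task: provisions task
- Demand spike -> CloudWatch 60s metric: publishes 60s metric
- CloudWatch 60s metric -> CloudWatch alarm (3 × 60s): evaluates 3 periods
- CloudWatch alarm (3 × 60s) -> Application Auto Scaling: triggers policy
- Application Auto Scaling -> new ECS task: provisions task
- Schedule rule -> pre-warmed tasks: scales before spike
- ML model -> capacity forecast: generates forecast
- Capacity forecast -> pre-provisioned tasks: pre-provisions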

## Observability and SLOs: What to Monitor with High Resolution

Adopting 20-second metrics changes what makes sense to monitor — and how. With 60-second metrics, a CloudWatch Dashboard with 1-minute refresh was sufficient to track scaling behavior in real time. With 20 seconds, you need dashboards with 10-20 second refresh to capture the available granularity, which has cost implications for GetMetricData if you have automations or observability tools querying that data at high frequency.

For SLOs in financial systems, high-resolution metrics open the possibility of defining **error budget burn rate** with smaller windows. If your SLO is 99.95% monthly availability, the error budget is ~21.6 minutes. With scaling reacting in 109 seconds instead of 386, each spike event consumes less error budget — meaning you can tolerate more events before triggering accelerated burn rate alerts.
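
The error budget arithmetic can be worked through in a few lines. The sketch assumes a 30-day month and, as a deliberately pessimistic simplification, counts the entire scaling window of each spike as budget burn. Real burn depends on the error rate during that window.

```python
def error_budget_seconds(slo_pct: float, days: int = 30) -> float:
    """Allowed unavailability for the period, in seconds."""
    return (100.0 - slo_pct) / 100.0 * days * 24 * 3600


def events_within_budget(budget_s: float, exposure_s: float) -> int:
    """Whole spike events that fit in the budget if each burns exposure_s."""
    return int(budget_s // exposure_s)


budget_s = error_budget_seconds(99.95)
events_60s_metrics = events_within_budget(budget_s, 386)
events_20s_metrics = events_within_budget(budget_s, 109)

print(f"monthly budget: {budget_s / 60:.1f} minutes")
print(f"worst-case events tolerated: {events_60s_metrics} vs {events_20s_metrics}")
```

Under these assumptions the same budget absorbs roughly three to four times as many spike events with the faster cycle.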

In OpenTelemetry, I recommend instrumenting the scaling cycle itself as a span: capture the timestamp of the first data point above threshold, the timestamp of the Application Auto Scaling trigger (available via EventBridge with `source: aws.application-autoscaling`), and the timestamp of the first successful health check of the new task. These three points define the real **scaling latency** of your platform — an SLI that few teams measure explicitly but that is critical for understanding behavior under load.
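
Before wiring this into a tracing backend, the three timestamps can be modeled as a plain record. The sketch below uses only the standard library; the class and field names are hypothetical, and the example timings reuse the benchmark figures from this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScalingCycle:
    breach_at: datetime    # first data point above the threshold
    trigger_at: datetime   # Application Auto Scaling trigger event
    healthy_at: datetime   # first successful health check of the new task

    @property
    def detection_latency(self) -> timedelta:
        return self.trigger_at - self.breach_at

    @property
    def provisioning_latency(self) -> timedelta:
        return self.healthy_at - self.trigger_at

    @property
    def scaling_latency(self) -> timedelta:
        """The end-to-end SLI: threshold breach to task serving traffic."""
        return self.healthy_at - self.breach_at


t0 = datetime(2026, 6, 18, 14, 0, 0, tzinfo=timezone.utc)
cycle = ScalingCycle(
    breach_at=t0,
    trigger_at=t0 + timedelta(seconds=86),
    healthy_at=t0 + timedelta(seconds=109),
)
print(f"scaling latency: {cycle.scaling_latency.total_seconds():.0f}s")
```

Splitting detection from provisioning shows whether a slow cycle comes from metric evaluation or from task startup, which need different fixes.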

An important operational detail is the cooldown. Set `ScaleInCooldown` and `ScaleOutCooldown` explicitly instead of relying on defaults, and confirm the values your policy actually uses. With 20-second metrics and a scale-out cooldown of 0, scale-out can be triggered several times in sequence before the new tasks are ready, resulting in temporary over-scaling. I recommend configuring `ScaleOutCooldown: 120` to avoid this behavior in services with task startup times above 30 seconds.
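
A toy model helps show why the cooldown matters. It assumes the alarm stays in breach until the first new task is ready and that a scale-out can fire at every evaluation unless a cooldown blocks it. This is a simplification for intuition, not a description of how Application Auto Scaling actually schedules activities.

```python
def scale_out_activities(startup_s: int, cooldown_s: int, period_s: int = 20) -> int:
    """Count scale-out activities that can fire before the first task is ready."""
    interval = max(cooldown_s, period_s)  # earliest gap between two activities
    count = 0
    elapsed = 0
    while elapsed < startup_s:
        count += 1
        elapsed += interval
    return count


no_cooldown = scale_out_activities(startup_s=60, cooldown_s=0)
with_cooldown = scale_out_activities(startup_s=60, cooldown_s=120)
print(f"cooldown 0s: {no_cooldown} activities, cooldown 120s: {with_cooldown}")
```

In this model a 60-second task startup with no cooldown allows three consecutive scale-outs, while a 120-second cooldown allows one.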

## Decision Matrix: Which Strategy for Which Context

### High-Resolution Target Tracking (20s)

**Pros**
- 86s trigger — smallest risk exposure window
- Simple config: 1 metric, 1 target, no bands
- Allows reducing task baseline by 15-25%
- Replaces step scaling in most use cases

**Cons**
- Additional CloudWatch cost for high-resolution metrics
- Over-scaling risk with misconfigured cooldown
- Requires explicit update of existing IaC policies

**Verdict:** Default choice for workloads with unpredictable spikes and strict SLO

### Step Scaling (60s)

**Pros**
- Granular control over scaling magnitude per band
- No additional CloudWatch cost
- Useful when different spike intensities require distinct responses

**Cons**
- Trigger time 2-4x slower than high resolution
- High maintenance complexity — requires frequent retuning
- Use case narrows with high-resolution target tracking arrival

**Verdict:** Use only when dramatically different scaling responses per utilization band are required

### Scheduled Scaling

**Pros**
- Zero reaction latency — capacity ready before spike
- No additional metrics cost
- Ideal for market events and scheduled campaigns

**Cons**
- Useless for unplanned spikes
- Guaranteed over-provisioning during low-demand periods
- Requires active event calendar management

**Verdict:** Mandatory complement to target tracking in financial platforms with predictable events

### Predictive Scaling

**Pros**
- Anticipates capacity before spike based on historical patterns
- Reduces dependency on manual schedule configuration
- Combines well with target tracking as a safety layer

**Cons**
- Requires 14 days of history — doesn't work for new services
- Doesn't react to unexpected events or demand anomalies
- Can generate over-provisioning if history contains atypical spikes

**Verdict:** Best for workloads with stable, well-documented seasonality; use with target tracking as reactive fallback

> **The Composition Pattern I Recommend for Financial Production:** The correct answer for most financial systems is not to choose one strategy — it's to compose three. **Scheduled Scaling** to pre-warm capacity for predictable events (market open, end-of-day batch, campaigns). **High-Resolution Target Tracking** as the primary reactive layer to absorb intraday variation and unexpected spikes. **Predictive Scaling** disabled by default, enabled only for services with more than 30 days of stable history and reviewed quarterly. Step Scaling should be actively deprecated — if you still have step scaling policies in production, high-resolution target tracking is the natural replacement with less code to maintain.

## Real Cost: Calculating the CloudWatch vs. Compute Savings Trade-off

The cost argument for high-resolution metrics is more favorable than it first appears, but requires an honest calculation. CloudWatch charges high-resolution metrics at the custom metrics tier: $0.30 per metric/month for the first 10,000 metrics. Each ECS service with high resolution enabled publishes approximately 4-6 metrics (CPU, Memory, RequestCount per service and cluster dimension). For a platform with 50 ECS services, that's 200-300 additional metrics, or $60-90/month in CloudWatch.

On the savings side, the argument is more robust. Consider a Fargate service running 100 tasks at 2 vCPU/4GB in us-east-1: cost per task is approximately $0.09/hour ($64.80/month per task). If the baseline can be reduced from 100 to 80 tasks (-20%) because reactive scaling is 4x faster, the monthly savings are 20 tasks × $64.80 = $1,296/month. The additional CloudWatch cost of $60-90 represents less than 7% of the compute savings — an extremely favorable ROI.
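
The trade-off above reduces to a few multiplications, reproduced here so the inputs can be swapped for your own. All prices and counts are the article's approximations, and the function names are illustrative.

```python
HOURS_PER_MONTH = 720  # 30-day month, matching the $64.80 per-task figure


def cloudwatch_metric_cost(services: int, metrics_per_service: int,
                           price_per_metric: float = 0.30) -> float:
    """Monthly cost of the additional high-resolution metrics."""
    return services * metrics_per_service * price_per_metric


def fargate_savings(tasks_removed: int, hourly_task_cost: float = 0.09) -> float:
    """Monthly compute saved by removing tasks from the baseline."""
    return tasks_removed * hourly_task_cost * HOURS_PER_MONTH


cw_low = cloudwatch_metric_cost(50, 4)
cw_high = cloudwatch_metric_cost(50, 6)
savings = fargate_savings(20)
worst_case_ratio = cw_high / savings

print(f"CloudWatch: ${cw_low:.0f}-${cw_high:.0f}/month")
print(f"Fargate savings: ${savings:,.0f}/month")
print(f"CloudWatch cost as share of savings: {worst_case_ratio:.1%}")
```

Note that the comparison sets the CloudWatch cost of 50 services against the savings of a single 100-task service, so it is conservative for platforms where several services can shrink their baseline.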

The calculation changes for services with many custom metric dimensions or for accounts with high GetMetricData volume. If you have dashboards, alerts, and automations querying high-resolution metrics at high frequency, the API cost can exceed the metrics storage cost. I recommend using **CloudWatch Metric Streams** to export metrics to an analytical data store (S3 + Athena or Datadog) instead of querying via GetMetricData in a loop — this reduces API cost by 60-80% for observability use cases.

For multi-account environments with AWS Organizations, billing is consolidated but CloudWatch charges still accrue per account. Consider centralizing scaling alarms in a dedicated observability account using **CloudWatch cross-account observability**: this simplifies management and lets you compare scaling behavior across dev, staging, and production accounts in a single view.

## Anti-Patterns I've Seen in Production

- Enabling high-resolution metrics without updating the scaling policy: the service publishes 20s metrics but the alarm still evaluates at 60s, so you pay the additional cost with no speed gain.
- Keeping ScaleOutCooldown at 0 with high resolution — results in cascading multiple scale-outs before previous tasks are ready, generating 2-3x over-scaling.
- Using step scaling and high-resolution target tracking on the same service simultaneously — policies compete and the resulting behavior is unpredictable; choose one.
- Not instrumenting the scaling cycle in the observability system — without measuring scaling latency as an SLI, you don't know if the high-resolution investment is delivering the expected benefit.
- Applying high resolution to all services indiscriminately — background services with relaxed SLOs don't justify the additional cost; reserve for services on the critical latency path.
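
Two of these anti-patterns can be checked mechanically. The sketch below audits a list of policy records shaped like the items returned by `describe_scaling_policies`. The sample records are invented, and the check for a `HighResolution` suffix relies on the metric names used in this article.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def audit_policies(policies: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (resource id, finding) pairs for two common misconfigurations."""
    findings: List[Tuple[str, str]] = []
    types_by_resource = defaultdict(set)
    for policy in policies:
        resource = policy["ResourceId"]
        types_by_resource[resource].add(policy["PolicyType"])
        cfg = policy.get("TargetTrackingScalingPolicyConfiguration", {})
        metric = cfg.get("PredefinedMetricSpecification", {}).get("PredefinedMetricType", "")
        if metric.endswith("HighResolution") and cfg.get("ScaleOutCooldown") in (None, 0):
            findings.append((resource, "high-resolution policy with ScaleOutCooldown unset or 0"))
    for resource, types in sorted(types_by_resource.items()):
        if {"StepScaling", "TargetTrackingScaling"} <= types:
            findings.append((resource, "step scaling and target tracking on the same service"))
    return findings


def _tracking(resource: str, cooldown: int) -> Dict[str, Any]:
    """Build an invented target tracking record for the demo."""
    return {
        "ResourceId": resource,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilizationHighResolution"
            },
            "ScaleOutCooldown": cooldown,
        },
    }


sample = [
    _tracking("service/payments/gateway", 0),
    {"ResourceId": "service/payments/gateway", "PolicyType": "StepScaling"},
    _tracking("service/payments/ledger", 120),
]
findings = audit_policies(sample)
for resource, finding in findings:
    print(f"{resource}: {finding}")
```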

> **Curator's Note:** In practice, what impressed me most about this launch wasn't the 4.2x number — it was the operational simplification. I've maintained step scaling policies on payment platforms with 12 utilization bands, each manually calibrated after production incidents. That was technical debt disguised as control. High-resolution target tracking delivers equivalent or superior behavior with a fraction of the configuration. My immediate recommendation for any team operating ECS in financial production: audit your step scaling policies, identify which can be replaced by high-resolution target tracking, and measure scaling latency before and after — you'll want that data to justify the change to stakeholders and to correctly calibrate cooldowns. The hard-won lesson: configuration complexity is not synonymous with control; often it's the opposite.

## Verdict: Migrate to High Resolution, But in the Right Sequence

Target tracking with 20-second high-resolution metrics is the dominant choice for ECS services on the critical latency path in financial environments. The ROI is clear: the additional CloudWatch cost ($60-90/month for 50 services) is outweighed by the compute savings from a 15-25% smaller task baseline, and reducing the risk exposure window from 386s to 109s has direct value in the SLO error budget. The migration should follow this sequence: (1) enable high-resolution metrics on the ECS service, (2) wait for deployment and first data point, (3) update the ScalingPolicy to reference the new metrics, (4) configure ScaleOutCooldown between 60 and 120s depending on task startup time, (5) instrument scaling latency as an SLI in your observability system. Step scaling should be actively deprecated. Scheduled scaling remains a mandatory complement for predictable events. Predictive scaling is a third layer for services with stable seasonality and sufficient history. The production pattern for high-availability financial platforms is the composition of these three strategies, not the choice of one.

## References

- [Amazon ECS introduces new high-resolution metrics for faster service auto scaling](https://aws.amazon.com/blogs/aws/amazon-ecs-introduces-new-high-resolution-metrics-for-faster-service-auto-scaling/)
- [ECS Service Auto Scaling — AWS Documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html)
- [Application Auto Scaling — Target Tracking Scaling Policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html)
- [CloudWatch High-Resolution Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html#high-resolution-metrics)
- [CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/)
- [AWS Well-Architected — Performance Efficiency Pillar](https://docs.aws.amazon.com/wellarchitected/latest/performance-efficiency-pillar/welcome.html)
- [CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html)
- [Site Reliability Engineering — Google (SLO/Error Budget)](https://sre.google/sre-book/service-level-objectives/)