Understanding High Availability Fundamentals
High availability means a system stays operational with minimal interruption, typically achieving 99.9% uptime or higher. The core principle is eliminating single points of failure through redundancy at the infrastructure, application, and data levels.
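To make "99.9% or higher" concrete, standard "nines" arithmetic converts an availability percentage into an annual downtime budget. A minimal sketch (the 365.25-day year and the function name are my assumptions):

```python
# Translate an availability percentage into an annual downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum minutes of downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} min/year")
```

Note the jump between tiers: 99% allows roughly 87 hours of downtime per year, while 99.9% allows under 9 hours.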
Key Metrics You Need to Know
Recovery Time Objective (RTO) defines the maximum acceptable downtime. Recovery Point Objective (RPO) specifies acceptable data loss measured in time. On the exam, you'll calculate availability percentages across multiple systems.
For example, two independent systems that must both be available (in series) combine to 0.99 × 0.99 = 98.01%; chaining dependencies lowers overall availability, while redundancy raises it. AWS services include built-in redundancy across Availability Zones (AZs) within regions.
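That arithmetic generalizes to any number of components: multiply availabilities for serial dependencies, and multiply failure probabilities for redundant copies. A minimal sketch (function names are illustrative):

```python
def serial_availability(*components: float) -> float:
    """All components must be up; availabilities given as fractions."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(*components: float) -> float:
    """System is up if at least one redundant component is up."""
    all_down = 1.0
    for a in components:
        all_down *= (1 - a)
    return 1 - all_down

print(round(serial_availability(0.99, 0.99), 4))    # 0.9801
print(round(parallel_availability(0.99, 0.99), 4))  # 0.9999
```

The same two 99% systems yield 98.01% when both are required, but 99.99% when either one suffices, which is why redundancy is the core HA technique.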
Availability vs. Reliability
Availability measures the percentage of time a system is operational, while reliability measures the probability that it performs correctly without failure over a given period. The exam expects you to distinguish these two concepts.
For high availability, design multi-tier architectures where each layer maintains availability independently. Common patterns include active-active configurations where traffic distributes across multiple instances, and active-passive setups where a standby resource takes over upon primary failure.
AWS Load Balancing and Auto Scaling for High Availability
Elastic Load Balancing (ELB) is fundamental to AWS high availability. It distributes incoming traffic across multiple targets, preventing overload on single resources.
Three Load Balancer Types
- Application Load Balancer (ALB): Best for HTTP/HTTPS with path-based routing
- Network Load Balancer (NLB): Layer 4 (TCP/UDP) for extreme performance and ultra-low latency
- Classic Load Balancer (CLB): Previous-generation option for legacy application support
Health checks automatically detect unhealthy instances and remove them from the pool. Only operational targets receive traffic.
Auto Scaling Groups for High Availability
Auto Scaling Groups (ASGs) automatically adjust EC2 instance count based on demand. The exam tests three critical components:
- Launch templates define instance configurations
- Scaling policies trigger scale-out or scale-in events
- Lifecycle hooks allow custom actions during scaling transitions
For high availability, configure ASGs across multiple AZs. Instance failures in one zone don't affect overall capacity.
Connection Draining and Capacity
Connection draining (also called deregistration delay) allows in-flight requests to complete before an instance is removed. Min, desired, and max capacity settings must align with availability requirements. The exam tests scenarios where you must calculate the capacity required to absorb the loss of an entire AZ.
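That zone-failure calculation can be sketched as follows (the even-spread assumption and the helper name are mine, not from the exam guide):

```python
import math

def instances_needed(total_required: int, az_count: int, az_failures: int = 1) -> int:
    """Total instances to launch, spread evenly across AZs, so that
    losing `az_failures` zones still leaves `total_required` capacity."""
    surviving = az_count - az_failures
    per_az = math.ceil(total_required / surviving)
    return per_az * az_count

# e.g. 6 instances' worth of load across 3 AZs, tolerating 1 AZ failure:
print(instances_needed(6, 3))  # 9 -> 3 per AZ; losing one AZ still leaves 6
```

The over-provisioning cost is the price of availability: fewer, larger AZs require proportionally more spare capacity to survive a single-zone loss.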
Multi-Region and Multi-AZ Deployment Strategies
Multi-AZ deployments automatically replicate resources across Availability Zones within a single region. This protects against datacenter failures.
AWS databases like RDS offer Multi-AZ configurations with synchronous standby replicas. Failover typically completes within 60 to 120 seconds. For stateless applications, deploying across multiple AZs with load balancing provides high availability without complex failover logic.
Multi-Region for Disaster Recovery
Multi-region architecture extends protection beyond single-region failures. Use Route 53 health checks to automatically redirect traffic to healthy regions. Active-active configurations serve traffic simultaneously from multiple regions. Active-passive configurations maintain a standby region activated only during primary failure.
The exam tests your ability to select appropriate strategies based on business requirements. Multi-region deployments increase complexity and cost while reducing RPO and RTO.
Replication Technologies
RPO targets determine replication frequency: critical systems need frequent snapshots or continuous replication, while less critical systems tolerate more data loss and longer recovery times.
- DynamoDB Global Tables: Sub-second active-active sync
- S3 cross-region replication: Asynchronous replication
- Aurora Global Database: Millisecond RPO with managed replication
Failover must be automated for tight RTO requirements. Manual failover suffices for longer RTOs. The exam includes scenarios requiring you to recommend strategies based on RTO/RPO targets and cost constraints.
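Those scenarios map roughly onto AWS's four classic disaster-recovery strategies (backup and restore, pilot light, warm standby, multi-site active-active). A hedged sketch of the selection logic, where the hour thresholds are illustrative assumptions rather than AWS guidance:

```python
def dr_strategy(rto_hours: float, rpo_hours: float) -> str:
    """Pick the cheapest DR strategy that satisfies the tighter of RTO/RPO.
    Thresholds are illustrative; real choices also weigh cost and complexity."""
    tightest = min(rto_hours, rpo_hours)
    if tightest < 0.25:
        return "multi-site active-active"
    if tightest < 1:
        return "warm standby"
    if tightest < 8:
        return "pilot light"
    return "backup and restore"

print(dr_strategy(rto_hours=24, rpo_hours=24))    # backup and restore
print(dr_strategy(rto_hours=0.1, rpo_hours=0.1))  # multi-site active-active
```

Tighter targets push you toward strategies that keep more infrastructure running continuously, trading cost for speed of recovery.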
Data Persistence and Backup Strategies for Reliability
High availability requires robust backup and recovery mechanisms; data protection must survive even complete infrastructure failures.
AWS backup solutions operate at different tiers:
- EBS snapshots: Point-in-time recovery for volumes
- RDS automated backups: Transaction logs for point-in-time restore up to 35 days
- S3 versioning: Protection against accidental deletion
- S3 cross-region replication: Asynchronous regional protection
Backup strategies must balance RPO requirements with storage costs. Incremental snapshots reduce storage by capturing only changed blocks since the last snapshot.
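The savings from incremental snapshots are easy to estimate. A rough sketch, where the daily change rate and snapshot count are illustrative assumptions, not AWS pricing or behavior guarantees:

```python
def full_snapshot_storage(volume_gb: float, snapshots: int) -> float:
    """Storage if every snapshot were a full copy of the volume."""
    return volume_gb * snapshots

def incremental_storage(volume_gb: float, snapshots: int, change_rate: float) -> float:
    """First snapshot is full; each later one stores only changed blocks."""
    return volume_gb + volume_gb * change_rate * (snapshots - 1)

# 100 GB volume, 30 daily snapshots, ~5% of blocks changing per day:
print(full_snapshot_storage(100, 30))              # 3000 GB
print(round(incremental_storage(100, 30, 0.05)))   # ~245 GB
```

With a modest change rate, a month of incrementals costs a fraction of full copies, which is why EBS snapshots work this way.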
AWS Backup and Recovery Testing
AWS Backup centralizes backup policies across EC2, RDS, EFS, and DynamoDB. Backup plans define frequency, retention periods, and backup windows. Recovery testing is critical. The exam includes questions about validating backup restore procedures.
RPO targets drive backup frequency: hourly backups achieve a lower RPO than daily backups. For databases, point-in-time recovery (PITR) uses transaction logs to supplement snapshots, allowing recovery to any second within the retention window.
Long-Term Retention and Playbooks
AWS Backup lifecycle policies automatically transition backups to cold storage, optimizing costs for long-term retention. Some services, such as DynamoDB, also offer service-native options like on-demand backups and point-in-time recovery alongside AWS Backup.
Document disaster recovery playbooks specifying RPO, RTO, and recovery procedures for each critical system.
Monitoring, Alerting, and Incident Response
High availability requires continuous monitoring that detects problems before they impact users. CloudWatch provides metrics for all AWS services with customizable alarms.
The SysOps exam heavily emphasizes CloudWatch fundamentals:
- Metric dimensions: Filter data by resource
- Statistic types: Average, Sum, Minimum, Maximum, SampleCount
- Periods: Affect metric granularity
Setting appropriate alarm thresholds requires understanding normal baselines. Thresholds set too low generate false positives. Overly high thresholds miss real problems. Composite alarms aggregate multiple alarms to reduce false positives for complex conditions.
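One common heuristic for deriving a threshold from a baseline is mean plus a few standard deviations. This is an assumption of mine, not an AWS recommendation, and real baselines also need to account for daily and weekly seasonality:

```python
import statistics

def alarm_threshold(baseline: list, k: float = 3.0) -> float:
    """Mean + k sample standard deviations: a simple heuristic for a
    high-is-bad metric threshold derived from observed normal behavior."""
    return statistics.mean(baseline) + k * statistics.stdev(baseline)

cpu_baseline = [40, 42, 38, 41, 39, 43, 40]  # hypothetical CPU % samples
print(round(alarm_threshold(cpu_baseline), 1))
```

Raising `k` trades sensitivity for fewer false positives, which is the same trade-off composite alarms address at the alarm-aggregation level.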
Application and Log Monitoring
CloudWatch Logs centralizes logs from EC2, Lambda, and on-premises sources. Log Insights queries help identify issues across thousands of log files quickly. CloudTrail tracks API calls and configuration changes, helping investigate how configurations degraded.
Automated Incident Response
EventBridge automates incident response by triggering Lambda functions or SSM documents when specific events occur. Systems Manager Automation enables complex response workflows like replacing failed instances or executing remediation scripts.
The exam tests your ability to determine appropriate metrics for detecting high availability issues: CPU and network utilization indicate capacity problems, application-level metrics reveal errors, and custom metrics monitor business-critical processes. Correlating multiple metric sources helps diagnose root causes.
