AWS Data Analytics Service Landscape and Core Components
The AWS analytics ecosystem includes multiple services designed for different data processing patterns and scales. Each service solves specific problems within the larger architecture.
Foundational Analytics Services
Amazon Redshift is the data warehouse solution optimized for complex analytical queries on structured data at petabyte scale. It uses columnar storage and massively parallel processing to deliver fast query performance on historical data.
Amazon Athena is a serverless query service that analyzes data directly in Amazon S3 using SQL. You pay only for data scanned, making it ideal for ad-hoc queries without infrastructure overhead.
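As a rough sketch of how an ad-hoc Athena query looks in practice, the snippet below submits SQL through the boto3 Athena API and checks its status. The database, table, and S3 result location are hypothetical placeholders, not fixed names.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and results bucket for illustration.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS requests "
                "FROM web_logs WHERE year = '2024' GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll the execution state before fetching results.
query_id = response["QueryExecutionId"]
status = athena.get_query_execution(QueryExecutionId=query_id)
state = status["QueryExecution"]["Status"]["State"]  # QUEUED, RUNNING, SUCCEEDED, ...
```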
AWS Glue is the fully managed extract, transform, and load service. It automates data preparation, schema discovery, and integration through crawlers and transformation jobs.
Real-Time and Visualization Services
Amazon Kinesis handles real-time data streaming with sub-second latency for continuous data ingestion and processing. Stream data immediately to analytics tools or storage systems.
Amazon QuickSight provides business intelligence and visualization capabilities. It transforms processed data into dashboards and actionable insights for decision-makers.
Why Service Combinations Matter
The Solutions Architect exam tests your ability to architect end-to-end solutions, not just individual services. Common scenarios include:
- Use Redshift for batch analytics with consistent, repeated queries
- Use Athena for variable, one-time analytical queries
- Use Kinesis for real-time streaming when millisecond latency matters
- Use Glue to prepare data before loading into any analytical service
You must balance performance, cost, and operational complexity when recommending service combinations. For example, maintaining a Redshift cluster costs money even during idle periods, while Athena charges only per query.
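A back-of-envelope comparison makes this trade-off concrete. The figures below are illustrative assumptions (roughly $5 per TB scanned for Athena and a small always-on Redshift node rate), so check current regional pricing before relying on them.

```python
# Illustrative cost comparison; rates and volumes are example assumptions.
athena_price_per_tb = 5.00      # USD per TB scanned (example rate)
redshift_node_hourly = 0.25     # USD per node-hour (example dc2.large-class rate)

monthly_tb_scanned = 2.0        # ad-hoc queries scan ~2 TB per month
athena_monthly = monthly_tb_scanned * athena_price_per_tb

nodes, hours_per_month = 2, 730
redshift_monthly = nodes * hours_per_month * redshift_node_hourly

print(f"Athena (pay per query): ${athena_monthly:,.2f}/month")
print(f"Redshift (always-on):   ${redshift_monthly:,.2f}/month")
```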
Data Pipeline Architecture and ETL Processes
Building effective data pipelines requires understanding how data flows from source systems through transformation stages to final consumption. Each stage adds value by cleaning, enriching, or organizing data.
AWS Glue's Role in Data Pipelines
AWS Glue Crawlers automatically scan data sources and discover schemas. They populate the Glue Data Catalog, a centralized metadata repository. This eliminates manual schema maintenance and enables multiple services like Athena and Redshift Spectrum to access consistent information.
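A minimal sketch of creating and starting a crawler with boto3 is shown below; the IAM role ARN, database name, and S3 path are placeholders you would replace with your own.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical role, catalog database, and data-lake path.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # discover new partitions nightly at 02:00 UTC
)

glue.start_crawler(Name="sales-raw-crawler")
```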
Glue Jobs perform transformations using Apache Spark or Python. They automatically scale infrastructure based on workload demands, charging only for execution time. When you need repeatable, scheduled transformations, Glue handles scaling automatically without manual intervention.
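The outline below shows the typical shape of a Glue Spark job script: read a catalog table, apply a simple transformation, and write partitioned Parquet back to S3. Table, column, and bucket names are hypothetical.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw table the crawler registered (hypothetical names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="raw_sales"
)

# Drop an unneeded column and write curated, partitioned Parquet to S3.
cleaned = raw.drop_fields(["unused_column"])
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/curated/sales/",
                        "partitionKeys": ["sale_date"]},
    format="parquet",
)

job.commit()
```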
Batch vs. Streaming Pipeline Patterns
Batch processing collects data and transforms it on a schedule. Common triggers include S3 events or Lambda functions running at specific times. This pattern works well for daily reports or data that changes infrequently.
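One common wiring for this pattern is an S3 event invoking a small Lambda function that starts a Glue job. The sketch below assumes a hypothetical job name and argument; it only illustrates the trigger mechanics.

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Invoked by an S3 object-created event; starts a (hypothetical) Glue job."""
    record = event["Records"][0]["s3"]
    source_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    glue.start_job_run(
        JobName="nightly-sales-transform",
        Arguments={"--source_path": source_path},
    )
    return {"started_for": source_path}
```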
Streaming pipelines use Kinesis Data Streams for continuous ingestion. Use this when you need analytics within minutes or seconds, not hours. Streaming introduces operational complexity, so evaluate whether batch actually meets your requirements.
Optimization Strategies for Query Performance
Partition data in S3 by date, region, or other frequently filtered dimensions. Partitioning lets Athena skip irrelevant data before processing, which speeds up queries and reduces scanning costs.
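The sketch below shows what a Hive-style partition layout looks like and a query that benefits from it; the bucket, table, and column names are hypothetical.

```python
# Hive-style partition layout (hypothetical bucket and prefix):
#   s3://example-data-lake/curated/sales/sale_date=2024-06-01/part-0000.parquet
#   s3://example-data-lake/curated/sales/sale_date=2024-06-02/part-0000.parquet
#
# A query filtering on the partition column only scans matching prefixes:
query = """
SELECT region, SUM(amount) AS revenue
FROM sales
WHERE sale_date BETWEEN '2024-06-01' AND '2024-06-07'
GROUP BY region
"""
```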
Data format choices significantly impact analytics. Columnar formats such as Parquet and ORC typically reduce storage by 80-90 percent compared with raw CSV and let query engines read only the columns a query needs. Prefer these formats for analytical workloads.
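As a minimal conversion sketch, pandas can rewrite a CSV extract as compressed Parquet. The paths are hypothetical, and reading or writing S3 URIs assumes s3fs and pyarrow (or fastparquet) are installed.

```python
import pandas as pd

# Convert a raw CSV extract to compressed Parquet (hypothetical paths).
df = pd.read_csv("s3://example-data-lake/raw/sales/2024-06-01.csv")
df.to_parquet(
    "s3://example-data-lake/curated/sales/sale_date=2024-06-01/part-0000.parquet",
    compression="snappy",
    index=False,
)
```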
When to Use Glue vs. Custom Solutions
Glue handles infrastructure scaling automatically, making it ideal for variable workloads. Custom Lambda-based ETL gives you flexibility but requires manual scaling management. The Solutions Architect exam tests whether you understand these operational trade-offs when recommending architectures.
Real-Time Analytics and Streaming Data Patterns
Real-time analytics addresses scenarios requiring insights within seconds or minutes instead of hours. This capability comes with trade-offs in operational complexity and cost.
Amazon Kinesis Data Streams Fundamentals
Kinesis Data Streams is the core service for ingesting high-volume continuous data with millisecond latencies. Streams consist of shards, where each shard supports:
- 1 MB per second (or 1,000 records per second) of writes
- 2 MB per second of reads
- Provisioned capacity billed per shard-hour
Understanding shard scaling and cost implications is essential for the exam. Right-sizing shards prevents overpaying for unused capacity or undersizing for actual demand.
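A minimal sizing sketch follows, using illustrative throughput numbers: provision enough shards to satisfy the tightest of the write-volume, record-rate, and read-volume limits.

```python
import math

# Rough shard-count estimate from expected throughput (illustrative numbers).
write_mb_per_sec = 12.0    # aggregate producer throughput
records_per_sec = 8000     # aggregate record rate
read_mb_per_sec = 20.0     # aggregate consumer throughput

shards_for_write_mb = math.ceil(write_mb_per_sec / 1.0)   # 1 MB/s writes per shard
shards_for_records = math.ceil(records_per_sec / 1000)    # 1,000 records/s per shard
shards_for_read_mb = math.ceil(read_mb_per_sec / 2.0)     # 2 MB/s reads per shard

shard_count = max(shards_for_write_mb, shards_for_records, shards_for_read_mb)
print(f"Provision at least {shard_count} shards")  # -> 12
```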
Kinesis Firehose for Delivery
Kinesis Data Firehose automatically buffers and delivers stream data to S3, Redshift, Amazon OpenSearch Service (formerly Elasticsearch), or Splunk. You don't write custom delivery code. Firehose batches data before delivery based on configurable size and time thresholds (buffering intervals are typically 60 seconds or more), making it unsuitable for true real-time applications but excellent for preparing data for analysis.
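From the producer's side, writing to Firehose is a single API call; the delivery stream name and payload below are hypothetical.

```python
import boto3
import json

firehose = boto3.client("firehose")

# Send one record to a (hypothetical) delivery stream that buffers to S3.
# The trailing newline keeps records line-delimited in the delivered objects.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": json.dumps({"user_id": "u-123", "page": "/checkout"}).encode() + b"\n"},
)
```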
Real-Time SQL with Kinesis Data Analytics
Kinesis Data Analytics runs SQL queries directly on streaming data and also supports Apache Flink applications for more complex processing. This enables real-time aggregations, anomaly detection, and windowed calculations without building custom consumer applications.
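The managed service expresses these calculations declaratively in streaming SQL or Flink; the plain-Python sketch below only illustrates the tumbling-window idea itself, with invented sensor events.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per key in fixed, non-overlapping time windows --
    the same idea a windowed streaming query expresses declaratively."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in windows.items()}

events = [(0, "sensor-a"), (30, "sensor-a"), (61, "sensor-b"), (75, "sensor-a")]
print(tumbling_window_counts(events))
# {0: {'sensor-a': 2}, 60: {'sensor-b': 1, 'sensor-a': 1}}
```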
Common Real-Time Architecture Patterns
A typical architecture uses Kinesis Streams as the ingestion point. Firehose delivers the same stream to S3 for historical analysis. Kinesis Data Analytics provides real-time dashboards. This pattern solves multiple use cases:
- Financial transaction monitoring for fraud detection
- IoT sensor data processing for equipment alerts
- Application log analysis for real-time monitoring
The exam tests whether you understand that real-time complexity isn't always necessary. Batch processing suffices for many scenarios and introduces less operational overhead.
Data Warehousing and OLAP Workloads with Amazon Redshift
Amazon Redshift is AWS's massively parallel processing data warehouse designed for analytical queries on large datasets. Unlike transactional databases, Redshift optimizes for read-heavy analytical workloads.
Columnar Storage and Distribution
Columnar storage compresses similar data types together, reducing storage footprint and improving query speed for analytical queries that access specific columns. This differs fundamentally from traditional row-oriented databases optimized for transactions.
Distribution strategies determine how data is split across compute nodes. Choosing the right distribution key ensures even data distribution and minimizes network traffic during joins. Poor distribution choices directly impact query performance.
Sort keys provide similar benefits for frequently filtered columns. They enable efficient range scans and reduce data scanning costs.
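A minimal DDL sketch ties these ideas together: distribute on the join key and sort on the column most queries filter by. The cluster, database, and table names are placeholders, and the statement is submitted through the Redshift Data API.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical cluster, database, and table.
ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="admin",
    Sql=ddl,
)
```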
Redshift Cluster Architecture
Clusters consist of a leader node coordinating queries and compute nodes processing data in parallel. Node types determine compute capacity and storage:
- Dense Compute (DC) nodes prioritize performance with SSDs
- Dense Storage (DS) nodes optimize for capacity with HDDs
- RA3 nodes provide flexible compute and storage scaling
Single-node clusters suit development and testing. Multi-node deployments replicate data across nodes and recover automatically from node failures, making them the choice for production workloads.
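For reference, a minimal multi-node cluster can be provisioned with boto3 as sketched below; the identifier, node type, and credentials are placeholders (store real credentials in Secrets Manager).

```python
import boto3

redshift = boto3.client("redshift")

# Minimal multi-node RA3 cluster for illustration; all names are placeholders.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.xlplus",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME_Str0ngPassw0rd",  # use Secrets Manager in practice
    DBName="analytics",
)
```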
Redshift Spectrum for S3 Analytics
Redshift Spectrum extends Redshift queries to data stored in S3 in open formats without loading it into the warehouse. This enables cost-effective analysis of archived data. Instead of loading multi-terabyte files, query them directly through Spectrum.
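Spectrum typically works by mapping a Glue Data Catalog database into Redshift as an external schema. The sketch below does this through the Redshift Data API; the role ARN, database, and schema names are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Expose the Glue catalog database as an external schema, then query S3 in place.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="""
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'analytics_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
)
# Afterwards, queries such as SELECT ... FROM spectrum.sales scan S3 directly.
```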
Cost Optimization and Operational Considerations
Use reserved nodes for predictable baseline capacity and on-demand for variable workloads. Pause clusters during off-hours using automated schedules to reduce costs during non-business periods.
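One simple way to automate this is an EventBridge-scheduled Lambda that pauses and resumes the cluster; the cluster identifier and event shape below are hypothetical.

```python
import boto3

redshift = boto3.client("redshift")

def handler(event, context):
    """Invoked by an EventBridge schedule; pauses or resumes the cluster
    outside business hours (the action key in the event is an assumption)."""
    cluster = "analytics-cluster"
    if event.get("action") == "pause":
        redshift.pause_cluster(ClusterIdentifier=cluster)
    else:
        redshift.resume_cluster(ClusterIdentifier=cluster)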
Understand Redshift's limitations: lack of built-in unstructured data support and operational overhead of cluster management. When Athena or data lakes provide better solutions, recommend those instead. The exam tests cost awareness and architectural trade-offs.
Data Visualization, BI Tools, and Actionable Insights
Data visualization completes the analytics pipeline by translating processed data into business intelligence and dashboards. This is where data becomes actionable for decision-makers.
Amazon QuickSight Capabilities
Amazon QuickSight is AWS's native business intelligence tool. It connects to multiple data sources including Redshift, Athena, S3, RDS, and third-party systems like Salesforce.
QuickSight uses SPICE (Super-fast, Parallel, In-memory Calculation Engine) to cache data in memory for sub-second dashboard performance. Interactive filtering doesn't require querying source systems repeatedly, enabling smooth user experiences.
SPICE Capacity and Cost Models
SPICE capacity is allocated per user and per region, and additional capacity can be purchased as datasets grow. Understanding how to manage datasets within these limits is relevant for the exam. The service offers author and reader pricing:
- Authors create analyses and dashboards
- Readers only view published dashboards
- Reader pricing reduces costs for large viewing audiences
QuickSight vs. Third-Party Tools
The Solutions Architect exam tests when to recommend QuickSight versus Tableau or Looker. QuickSight strengths include:
- Tight AWS integration (direct access to Redshift, Athena)
- Lower cost for read-heavy scenarios
- Simplified deployment without external tools
Third-party tools may provide advanced analytics capabilities QuickSight lacks. Choose based on your organization's requirements and team expertise.
Security and Data Governance
Row-level security restricts data visibility by user roles. Column-level permissions hide sensitive columns from specific audiences. These features are essential for designing secure analytics solutions.
The Glue Data Catalog provides centralized, consistent metadata about tables and columns, helping analysts understand where data lives and how it is structured. This supports compliance and data quality efforts.
Dashboard Design for Different Audiences
Executive dashboards show high-level metrics and trends. Operational dashboards provide detailed information for day-to-day decisions. Designing for different audiences demonstrates architect-level thinking.
Optimize dashboard performance through appropriate aggregations, caching strategies, and automated refresh schedules. Poor dashboard performance frustrates users and wastes computational resources.
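For automated refreshes, a scheduled job can trigger a SPICE ingestion after the nightly ETL completes. The account ID and dataset ID below are placeholders.

```python
import uuid

import boto3

quicksight = boto3.client("quicksight")

# Trigger a SPICE refresh for a (hypothetical) dataset, e.g. from a scheduled Lambda.
quicksight.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="sales-dashboard-dataset",
    IngestionId=str(uuid.uuid4()),
)
```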
