AWS Unpacked #4: Storage

Categories: AWS

A Tour of AWS Storage Services: What to Use, When, and Why

TL;DR
AWS offers a variety of storage and database services, each optimized for different workloads. The key to mastering this topic is understanding which service to use, when, and why. Services can be logically grouped into:

  1. Block, File, and Object Storage (EBS, EFS, Instance Store, S3, FSx)
  2. Relational and NoSQL Databases (RDS, Aurora, DynamoDB, ElastiCache)
  3. Hybrid & Transfer Services (Snow Family, Storage Gateway, DataSync, AWS Transfer Family)
  4. Big Data & Analytics (Athena, EMR, Lake Formation)

Each category serves different use cases—from virtual machine volumes, web app backends, and caching, to massive data lakes and hybrid cloud migrations.

Let’s Set the Scene: A Map of AWS Storage & Database Services

To really get the most out of AWS’s storage services, it’s important to understand the big picture. Think of AWS storage and databases as falling into a few broad categories based on the type of data access, performance needs, and deployment context.

1. Block, File, and Object Storage (the “Disk Drives”)

These are the building blocks of data storage for applications:

  • Amazon EBS (Elastic Block Store) – Think of this as your EC2 instance’s hard drive. It’s fast, persistent, and tied to a single AZ.
  • Amazon EFS (Elastic File System) – A shared network drive for EC2 instances. Supports concurrent access from multiple instances across AZs.
  • Amazon S3 (Simple Storage Service) – The Swiss Army knife of storage. It’s highly durable, infinitely scalable, and ideal for storing objects like files, backups, media, and static websites.
  • Instance Store – Ephemeral disk physically attached to the host. Super fast, but data vanishes when the instance stops or terminates.

Mental model:

  • Use EBS for boot volumes and databases.
  • Use EFS when multiple EC2s need access to the same data.
  • Use S3 for nearly everything else—data lakes, backups, logs, you name it.

2. Managed Databases (Relational & Non-Relational)

AWS offers fully managed database services that take care of backups, patching, and failover:

  • Amazon RDS – A managed service for traditional relational databases (MySQL, Postgres, SQL Server, Oracle).
  • Amazon Aurora – AWS’s proprietary relational database, compatible with MySQL/Postgres but with enhanced performance and HA features.
  • Amazon DynamoDB – A NoSQL key-value/document database for millisecond performance at scale.
  • Amazon ElastiCache – In-memory cache layer for faster data retrieval (supports Redis and Memcached).

Mental model:

  • RDS/Aurora = structured data with complex queries.
  • DynamoDB = super fast, serverless NoSQL.
  • ElastiCache = cache it before you query it — for read-heavy workloads.

3. Big Data & Analytics

Big data workloads and ad-hoc querying live here:

  • Amazon Redshift: Managed data warehouse for structured, analytical queries (SQL-based).
  • Amazon Athena: Serverless SQL querying on S3 — great for ad-hoc queries over raw data.
  • Amazon EMR: Managed Hadoop/Spark for big data processing at scale.
  • AWS Lake Formation: Service for building secure data lakes on S3.
  • Amazon OpenSearch Service: Search and analytics engine — useful for logs and real-time dashboards.

🧠 Mental model:

  • Athena = SQL on S3.
  • EMR = big crunching engine.
  • Lake Formation = S3 + security + organization for analytical access.

4. Data Archiving and Edge Storage

Designed for cold data or edge cases:

  • S3 Glacier & Glacier Deep Archive: Very low-cost archival storage (with slower retrieval).
  • AWS Snow Family (Snowcone, Snowball, Snowmobile): Physical devices for moving huge amounts of data into or out of AWS — when bandwidth is too slow.

5. Hybrid and Data Transfer Services

Get your data into or out of AWS (especially at scale):

  • AWS DataSync – Automates data transfer between on-prem storage and AWS.
  • AWS Transfer Family – Fully managed SFTP/FTP/FTPS servers that move data directly into S3, great for legacy systems or partners.
  • AWS Glue: Serverless ETL (Extract, Transform, Load) — for transforming and cataloging data.

6. Specialty Storage Services

Depending on the workload, these may be relevant:

  • Amazon FSx: Managed file systems — FSx for Windows, Lustre, NetApp ONTAP, etc.
  • Amazon QLDB / Timestream: Niche use cases — ledger-based and time-series DBs, respectively.

As you can see, there’s a lot to explore here. To keep this article focused and digestible, I’ll cover Amazon S3 and managed database services (like RDS) in separate, dedicated posts.


EBS

Amazon EBS provides raw block-level storage volumes that can be attached to EC2 instances. Think of it as a virtual hard disk: just like a physical hard drive that you plug into your desktop, you attach EBS to an EC2 instance to store data persistently.

Features:

  • EBS volumes persist independently from EC2 instances, so even if you stop or terminate the instance, the data on the EBS volume can remain (depending on configuration).
  • EBS volumes are bound to a single EC2 instance at a time (except for Multi-Attach, which is available on io1/io2 volumes).
  • A single EC2 instance can have multiple EBS volumes attached.
  • EBS volumes are AZ-specific. You cannot attach an EBS volume created in us-east-1a to an EC2 instance in us-east-1b. If you need to move an EBS volume to another AZ:
    • Take a snapshot of the volume.
    • Copy the snapshot to the desired AZ or region.
    • Create a new EBS volume from the snapshot in the target AZ.
  • On termination of EC2:
    • Root EBS Volume: By default, it is set to delete on instance termination.
    • Additional Volumes: By default, they are not deleted on termination.
    • You can modify this behavior when launching the instance or after creation.
  • You provision EBS volumes in GBs. Billing is based on the provisioned capacity, not actual usage. Example: If you provision 100 GB but only use 5 GB, you still pay for 100 GB.
  • You can increase the size, change volume type, or modify IOPS of an existing EBS volume after creation. Some changes require a reboot or re-mount, others can be done live.
  • Amazon EBS offers multiple volume types tailored to different performance and cost needs. They fall into two broad categories: SSD-backed and HDD-backed volumes.

    | Volume Type | Storage Type | Older Name (if any) | Use Case | Size Range | Max IOPS | Max Throughput | Root Volume? | Cost Notes |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | gp2 | SSD | General Purpose SSD | General-purpose workloads | 1 GiB–16 TiB | Up to 16,000 (burst-based) | 250 MiB/s | ✅ Yes | Priced per GB; IO is burst-based |
    | gp3 | SSD | General Purpose SSD (improved) | Better general-purpose performance | 1 GiB–16 TiB | 16,000 (baseline 3,000) | 1,000 MiB/s | ✅ Yes | Cheaper than gp2; IOPS/throughput configurable separately |
    | io1 | SSD | Provisioned IOPS SSD | High-performance DBs, latency-sensitive apps | 4 GiB–16 TiB | Up to 64,000 | 1,000 MiB/s | ✅ Yes | Pay for GB + IOPS separately |
    | io2 | SSD | Provisioned IOPS SSD (High Durability) | Tier-1 DBs, mission-critical apps | 4 GiB–64 TiB | Up to 256,000* | 4,000 MiB/s* | ✅ Yes | Most expensive; highest SLA and durability |
    | st1 | HDD | Throughput Optimized HDD | Big data, logs, streaming workloads | 500 GiB–16 TiB | Up to 500 | 500 MiB/s | ❌ No | Much cheaper per GB; not suited to random access |
    | sc1 | HDD | Cold HDD (Magnetic) | Archive, rarely accessed data | 500 GiB–16 TiB | Up to 250 | 250 MiB/s | ❌ No | Cheapest option, not for performance workloads |

    *Note: The highest io2 IOPS and throughput figures require io2 Block Express.

  • EBS Multi-Attach lets you attach a single io1 or io2 volume to multiple EC2 instances at the same time. It’s like having a shared hard drive accessible by multiple servers simultaneously, but with block-level storage performance. Use cases include: high-availability clustered applications (e.g., shared storage for a database cluster), workloads where multiple instances must read/write the same data concurrently with low latency, and avoiding downtime when failing over between instances, since all of them can access the same volume. Points to note:
    • Supported volume types: io1 and io2 only (not gp2/gp3 or HDD types).
    • Limited to instances within the same Availability Zone (AZ). You can’t attach across AZs.
    • Limited to 16 EC2 instances at a time.
    • EC2 instances must be Nitro-based (modern generation) to support Multi-Attach.
    • Your application or file system must handle concurrent access properly to avoid data corruption (think clustered file systems or specialized databases, not plain XFS or ext4).
  • A snapshot is a backup of your EBS volume stored in S3 (although you can’t see or manage them directly in S3 - it happens under the hood). Some points to note on snapshots:
    • You don’t need to detach the volume to create a snapshot (live snapshots are supported), though pausing writes is recommended for a consistent backup.
    • Snapshots are incremental (only changes since last snapshot are saved). First snapshot is a full backup.
    • Use cases for snapshots: Create new volumes from a snapshot; copy snapshots to other AZs or regions for disaster recovery
    • Store long-term backups using EBS Snapshot Archive, which offers cheaper storage with longer retrieval times.
    • Fast Snapshot Restore (FSR) lets you enable instant full-performance access to restored volumes (extra cost).
    • Deleted snapshots can be retained in a Recycle Bin for a defined retention period, which helps with accidental deletions.
  • Amazon EBS encrypts your data at rest on the volume and in transit between the volume and the EC2 instance. This means your data is protected both when stored and when being read/written.
    • Encryption is fully managed and transparent to you. You don’t have to change your applications or manually encrypt/decrypt data—AWS handles it seamlessly.
    • Encryption has minimal impact on latency or performance, thanks to AWS’s hardware acceleration for AES-256 encryption.
    • EBS encryption uses AWS Key Management Service (KMS) keys, typically with AES-256 encryption under the hood. You can use either AWS-managed keys or your own customer-managed keys.
    • Snapshots created from an encrypted volume are also encrypted automatically.
    • Any volume created from an encrypted snapshot will inherit the encryption.
    • You cannot directly encrypt an existing unencrypted EBS volume. Instead, you:
      1. Take a snapshot of the unencrypted volume.
      2. Copy the snapshot and enable encryption on the copy.
      3. Create a new encrypted EBS volume from the encrypted snapshot.
      4. Attach the new encrypted volume to your instance, replacing the unencrypted one.
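
For illustration, here is a minimal boto3 sketch of that snapshot-copy-recreate flow for encrypting an existing unencrypted volume. The volume ID, instance ID, AZ, and device name are placeholders; without a KmsKeyId the copy uses the default EBS KMS key.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Snapshot the existing unencrypted volume (placeholder ID).
snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0",
                           Description="Pre-encryption snapshot")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2. Copy the snapshot with encryption enabled.
copy = ec2.copy_snapshot(SourceSnapshotId=snap["SnapshotId"],
                         SourceRegion="us-east-1",
                         Encrypted=True)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])

# 3. Create an encrypted volume from the encrypted snapshot in the target AZ.
vol = ec2.create_volume(SnapshotId=copy["SnapshotId"],
                        AvailabilityZone="us-east-1a",
                        VolumeType="gp3")

# 4. Attach it to the instance (after detaching the old volume).
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",
                  Device="/dev/sdf")
```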

EC2 Instance Store

Instance Store is a type of temporary block-level storage physically attached to the host server running your EC2 instance. Unlike EBS, the data on instance store does not persist after the instance is stopped, terminated, or fails. It’s often referred to as ephemeral storage.

Key Characteristics:

  • Extremely fast I/O performance (ideal for temp-heavy operations).
  • No additional cost—included with certain instance types (e.g., i3, d2).
  • Data is lost when instance stops or terminates—not suitable for persistent data.
  • Cannot be detached or moved like EBS.
  • Not all instance types support instance store—check before launching.
  • You must manually manage data durability (e.g., by copying important output to S3 or EBS).

Use Cases:

  • Temporary scratch space or buffers (e.g., during video processing or batch jobs).
  • High-speed cache or swap space.
  • Temporary storage for data replicated elsewhere (e.g., part of a Hadoop or Spark cluster).

EFS

Amazon EFS is a fully managed, elastic, shared file storage service for use with AWS compute like EC2, containers (ECS/EKS), and Lambda. It offers POSIX-compliant file storage, making it ideal for Linux-based applications needing shared access.

Features:

  • EFS can be mounted to multiple EC2 instances at the same time, even across multiple Availability Zones. This makes it ideal for scalable, distributed applications.
  • EFS is a regional service by default — meaning it automatically stores data redundantly across multiple AZs. This provides high availability and durability without manual configuration.
  • Highly available & durable (99.999999999% durability).
  • Scales automatically — no capacity planning needed.
  • Expensive compared to EBS or S3, especially if not optimized (e.g., using EFS-IA can cut costs).
  • Pay-per-use: Charged per GB stored per month, and per GB transferred (in some cases).
  • EFS uses NFS v4.1/v4.2 (Network File System), which is well supported by most Linux systems. It does not support Windows file systems (for that, consider FSx for Windows File Server).
  • You control network access via security groups (just like EC2).
  • Must allow NFS port 2049.
  • IAM policies can also control access for mounting and managing file systems.
  • Encryption at rest using AWS KMS (enabled by default). In-transit uses TLS.

Performance modes:

At creation time, choose between:

  • General Purpose (default) - low latency - best for most apps: web servers, content management, etc.
  • Max I/O - Higher latency - Use for high-throughput workloads: big data, analytics, media processing.

Throughput modes:

  • Bursting (default) - scales with storage size. Good for unpredictable workloads. Limited base throughput; bursts up based on credits.
  • Provisioned: Set a fixed throughput regardless of size. Use for stable, high-performance apps (e.g., ML training).
  • Elastic (new): Auto-scales throughput based on workload in real time. Best for apps with variable or unknown performance patterns.

Storage classes and Lifecycle Management

  • Standard: For frequently accessed files.
  • EFS-IA (Infrequent Access): Up to 92% cheaper than Standard for data that isn’t accessed often.
  • Lifecycle Policies: Automatically move files to EFS-IA after N days of inactivity (e.g., 30 days). Easy cost optimization — can be set per file system.
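
As a quick sketch, lifecycle policies can be set with a single boto3 call; the file system ID below is a placeholder.

```python
import boto3

efs = boto3.client("efs")

# Move files to EFS-IA after 30 days without access, and pull them back to
# Standard storage on their first access.
efs.put_lifecycle_configuration(
    FileSystemId="fs-0123456789abcdef0",
    LifecyclePolicies=[
        {"TransitionToIA": "AFTER_30_DAYS"},
        {"TransitionToPrimaryStorageClass": "AFTER_1_ACCESS"},
    ],
)
```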

Availability & Durability Options

  • Standard (Regional): Multi-AZ. High durability (11 9’s). Use for production.
  • One Zone: Lower durability, cheaper (~47% less). Good for dev/test or apps with their own replication.

Elasticache

Amazon ElastiCache is a fully managed, in-memory data store and cache service from AWS. It supports two popular open-source caching engines:

  • Redis: Offers advanced data structures like lists, sets, sorted sets, and more. Supports persistence, pub/sub, and replication.
  • Memcached: A simpler, in-memory key-value store designed for horizontal scalability and high performance.

Why Do You Need It?

Caching is all about speed. Instead of querying a database every time you need data (which is slow and resource-intensive), you keep frequently accessed data in memory. This reduces database load, improves response times, and improves app scalability. It also helps make apps stateless, meaning the application doesn’t store any session data or user information locally. Instead, that data is stored externally, for example in ElastiCache. Any instance of your app can then handle any user request—making your architecture much more scalable.

With ElastiCache, AWS takes care of:

  • Operating system maintenance and patching
  • Monitoring and failover
  • Hardware provisioning
  • Automatic recovery

Any downsides to using ElastiCache?

Yes, to use caching effectively, you typically need to update your application code to check the cache before hitting the database. This may involve integrating Redis/Memcached libraries into your app.

How It Works in a Typical Architecture

  1. Your app checks ElastiCache for the data it needs.
  2. If the data exists in the cache (a “cache hit”), it’s returned immediately.
  3. If the data is not in the cache (a “cache miss”), the app fetches it from the RDS database, stores it in ElastiCache, and returns it.

This pattern dramatically reduces load on RDS.
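
Here is a minimal cache-aside sketch using the redis-py client. The cluster endpoint is a placeholder, and query_rds() is a hypothetical stand-in for your real database call.

```python
import json
import redis

r = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com",
                port=6379, ssl=True, decode_responses=True)

def get_product(product_id, ttl_seconds=300):
    key = f"product:{product_id}"
    cached = r.get(key)                 # 1. check the cache first
    if cached is not None:              # 2. cache hit: return immediately
        return json.loads(cached)
    product = query_rds(product_id)     # 3. cache miss: hit the database
    r.setex(key, ttl_seconds, json.dumps(product))  # store with a TTL
    return product
```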

Use Case: User Session Store Redis is ideal for storing user sessions due to its in-memory speed and support for data expiration. This helps maintain performance even under high traffic.

Cache Invalidation Strategy

Keeping your cache in sync with your database is key. Common strategies:

  • Time-based expiration: Set a time-to-live (TTL) for cached data.
  • Manual invalidation: Delete or refresh cache when underlying data changes.
  • Write-through: Automatically update the cache whenever the database is updated.

Redis vs. Memcached in ElastiCache

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data Types | Rich (strings, sets, lists, etc.) | Simple key-value only |
| Persistence | Yes | No |
| Multi-AZ | Yes | No |
| Read Replicas | Yes | No |
| Backups | Yes | No |
| Multi-threaded | No (single-threaded) | Yes |
| Use Case | Complex data caching, session store | Simple high-speed caching |

Security in ElastiCache

  • Redis: Supports IAM authentication, username/password, and encryption in-transit and at-rest.
  • Memcached: Only supports username/password; no built-in encryption.
  • IAM policies can control access to ElastiCache API operations, such as creating or deleting clusters, but not direct access to the cache data. That access must be controlled through Redis/Memcached authentication mechanisms.

Caching Patterns

  • Lazy Loading: Only load data into the cache on demand (after a cache miss).
  • Write-Through: Data is written to both the cache and the database at the same time.
  • Session Store: Store session data in Redis so your app can remain stateless and scalable.

Real-life Use Cases

  • Gaming leaderboards using Redis Sorted Sets
  • Real-time analytics
  • Session storage
  • Application state caching
  • Content management systems (CMS)

AWS FSx

While Amazon S3 is great for object storage, some workloads — especially legacy applications — need file systems with POSIX compliance, SMB or NFS protocols, Windows-style permissions, or high IOPS. That’s where Amazon FSx comes in. Amazon FSx is a fully managed service that provides several types of file systems, optimized for different use cases.

There are four main FSx options, each built for a specific workload:

| FSx Type | Backing Technology | Use Case |
| --- | --- | --- |
| FSx for Windows File Server | Windows Server (NTFS) | Windows-based apps needing SMB access, AD integration |
| FSx for Lustre | Lustre (open-source) | High-performance computing (HPC), ML, large-scale analytics |
| FSx for NetApp ONTAP | NetApp ONTAP | Enterprise workloads, hybrid cloud, advanced storage features |
| FSx for OpenZFS | OpenZFS | Linux-based apps needing ZFS features (snapshots, clones, etc.) |

FSx for Windows File Server

  • Uses SMB protocol
  • Integrated with Microsoft Active Directory (self-managed or AWS Managed AD)
  • Supports NTFS permissions
  • Great for lift-and-shift of Windows apps to the cloud
  • Data is stored across multiple AZs (Multi-AZ option available)

Use case example: An on-prem Windows app that needs file shares using SMB.

FSx for Lustre

  • High-speed, low-latency file system for HPC workloads
  • Can be linked to S3 buckets (use S3 as durable storage, FSx for processing)
  • Sub-millisecond latencies, hundreds of GBps throughput
  • Ideal for ML, genomics, video rendering, financial risk modeling

Use case example: Training a machine learning model on a massive image dataset stored in S3.
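
As a rough sketch, an S3-linked Lustre file system can be created with boto3. The subnet, security group, bucket, and capacity values are placeholders, and the ImportPath/ExportPath style of S3 linking shown here is one of the available options.

```python
import boto3

fsx = boto3.client("fsx")

# Create a scratch Lustre file system linked to an S3 bucket so the bucket's
# objects appear as files for high-speed processing.
fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,  # GiB
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",
        "ImportPath": "s3://my-training-dataset",
        "ExportPath": "s3://my-training-dataset/results",
    },
)
```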

FSx for NetApp ONTAP

  • Fully managed NetApp file system
  • Supports SMB and NFS, multiprotocol access
  • Built-in deduplication, compression, snapshots, cloning
  • Great for enterprise apps and hybrid cloud architectures

Use case example: You need multiprotocol support and advanced data management for a shared app environment.

FSx for OpenZFS

  • POSIX-compliant file system with ZFS features (snapshots, copy-on-write, cloning)
  • Supports NFS protocol
  • Ideal for Linux-based workloads that need advanced storage features

Use case example: A Linux-based analytics app that needs versioned data and fast cloning.

Security and Access

  • IAM is used to manage who can create and manage file systems
  • Access to the file system is controlled via VPC security groups + file system-level permissions (e.g. NTFS, NFS)
  • FSx file systems live inside a VPC, and use ENIs (Elastic Network Interfaces) to be accessed from EC2 or on-prem via VPN/Direct Connect

Summary

| Feature | FSx for Windows | FSx for Lustre | FSx for NetApp ONTAP | FSx for OpenZFS |
| --- | --- | --- | --- | --- |
| Protocol | SMB | Lustre | SMB + NFS | NFS |
| AD integration | Yes | No | Yes | No |
| Use case | Windows apps | HPC, ML | Enterprise hybrid cloud | Linux apps |
| S3 integration | No | Yes | Indirect | No |
| Advanced features | NTFS, AD | High IOPS | Snapshots, clones | ZFS snapshots, clones |

AWS Snow Family

Sometimes, uploading terabytes or petabytes of data to AWS over the internet is simply too slow, too expensive, or not even possible due to bandwidth constraints or security needs. That’s where the AWS Snow Family comes in — a set of physical devices you can order from AWS to transfer data offline.

The Snow Family is part of the AWS edge computing and data migration suite, and consists of:

  1. Snowcone
  2. Snowball Edge
  3. Snowmobile

Snowcone

  • The smallest Snow device — portable, rugged, and about the size of a tissue box
  • 8 TB of usable storage
  • Can run edge compute tasks using AWS IoT Greengrass or EC2-compatible AMIs
  • Often used in edge environments (e.g., disconnected, mobile, or rugged deployments)

Use cases: Remote sensors, mobile clinics, ships, field deployments, temporary edge compute

Snowball Edge

Comes in two options:

| Variant | Description |
| --- | --- |
| Snowball Edge Storage Optimized | Up to 80 TB of usable storage, designed for large-scale data transfer |
| Snowball Edge Compute Optimized | Less storage (~42 TB) but includes GPU/CPU power for local compute |

  • Supports encryption, tamper-resistant enclosures, and E Ink shipping labels that auto-update for return.
  • You can even run EC2 instances and Lambda functions on these devices for pre-processing data at the edge, before sending to AWS.
  • Use cases: Large-scale offline data migration, video analytics, industrial environments, military/remote compute

Snowmobile

  • A literal shipping container (45-foot long) loaded onto a tractor-trailer truck
  • Offers up to 100 petabytes of storage
  • Requires onsite coordination with AWS personnel for installation and security
  • Use cases: Massive data center migrations to AWS (think: genomics, film studios, legacy backup vaults)

Security

  • All Snow devices encrypt data at rest using AWS KMS with 256-bit encryption keys
  • Devices are tamper-resistant, and data is erased securely when AWS receives the device back
  • Chain-of-custody is enforced and monitored

How it Works (for Snowcone/Snowball)

  1. You create a job in the AWS Console or via CLI
  2. AWS ships the device to you with the job preloaded
  3. You connect it to your local network and copy the data
  4. You ship the device back, and AWS automatically uploads the data to S3
  5. For export jobs, the reverse applies — AWS loads data onto the device from your S3 buckets and ships it to you.

Summary

| Device | Storage | Compute | Use Cases |
| --- | --- | --- | --- |
| Snowcone | 8 TB | Basic (IoT, edge compute) | Portable edge, disconnected field ops |
| Snowball Edge Storage Optimized | ~80 TB | Limited | Bulk data transfer |
| Snowball Edge Compute Optimized | ~42 TB | Advanced (EC2, GPU, Lambda) | Edge analytics, ML at the edge |
| Snowmobile | 100 PB | None | Entire data center lift-and-shift |

Integrating Snowball with Glacier for Long-Term Archival

For organizations archiving massive amounts of cold data — like compliance records, raw scientific data, or historical logs — Snowball can be used as an on-ramp to Amazon S3 Glacier. Here’s how you can architect that flow:

  1. Initiate a Snowball import job via the AWS Console or CLI.
  2. Specify an S3 bucket as the import destination.
  3. Preconfigure the destination bucket or the folder path (prefix) with a lifecycle policy that transitions data to:
    • S3 Glacier Flexible Retrieval (for infrequent access, cheaper storage)
    • or S3 Glacier Deep Archive (for compliance-grade, ultra-low-cost storage)
  4. When AWS receives the device, your data lands in S3, and the lifecycle policy kicks in, transitioning the data to Glacier.
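
Here is a minimal sketch of step 3, setting a lifecycle rule on the import prefix so Snowball-imported objects transition straight to Glacier Deep Archive. The bucket name and prefix are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Transition everything under the Snowball import prefix to Glacier Deep
# Archive as soon as it lands in S3.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "snowball-import-to-deep-archive",
                "Filter": {"Prefix": "snowball-import/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 0, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)
```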

AWS Storage Gateway

AWS Storage Gateway is a hybrid cloud storage service that connects your on-premises applications with AWS cloud storage — offering a smooth and secure bridge between local infrastructure and scalable cloud-based storage. It’s ideal when you want to use AWS storage services like S3, Glacier, or EBS, but your apps are still running in a local data center or on-premises servers. Think of Storage Gateway as a smart courier that shows your on-premises systems a local door to cloud storage. Behind the scenes, it handles the heavy lifting of syncing, caching, and format conversion to match the cloud.

There are three different types of gateways, each designed for a specific kind of use case:

1. File Gateway

  • Use case: Replacing or extending on-prem file servers (like NAS) with cloud-backed storage.
  • How it works: Files are stored as objects in Amazon S3 (excluding Glacier)
  • Access protocol: NFS (Linux) or SMB (Windows, with AD Integration)
  • Key features:
    • Frequently accessed files are cached locally for low-latency access.
    • Ideal for backups, archiving, and machine learning workloads needing file access to S3.
  • Use cases:
    • Backing up file shares to S3.
    • ML workloads needing S3 access in file format.

2. Volume Gateway

  • Use case: Block storage volumes for applications like databases or virtual machines.
  • Modes:
    • Cached Volumes: Frequently accessed data is stored locally, full dataset is in AWS.
    • Stored Volumes: Entire dataset is stored on-prem, and async snapshots are backed up to AWS.
  • Backups: EBS snapshots in the cloud, providing point-in-time recovery and off-site protection.
  • Access protocol: iSCSI for block-level access
  • Use cases:
    • Running on-prem apps with cloud backup.

3. Tape Gateway

  • Use case: Replacing physical tape libraries with virtual tapes stored in AWS.
  • How it works: Emulates a tape library using existing backup software, and stores backups in Amazon S3 or Glacier.
  • Access protocol: iSCSI VTL
  • Great for: Long-term archival and compliance-driven backup environments.
  • Use cases:
    • Replacing off-site tape storage with a cheaper, simpler AWS-based model.

Security and Integration

  • Encryption: All data is encrypted in transit and at rest using AWS KMS.
  • Active Directory integration: File Gateway can use Microsoft AD for authentication and SMB access control.
  • Monitoring: CloudWatch metrics and logs, plus AWS CloudTrail for API-level activity.

Deployment and Architecture Notes

  • Can run as a virtual machine (VM) on VMware, Hyper-V, or as a hardware appliance from AWS.
  • File Gateway can also run as an Amazon EC2 instance, if you’re already inside AWS.
  • Local caching is crucial — you need disk space on-prem for fast access and buffering.
  • Compatible backup software (like Veeam or Veritas) can work with Tape Gateway seamlessly.

AWS Transfer Family

AWS Transfer Family is a fully managed service that allows you to transfer files directly into and out of Amazon S3 or Amazon EFS using common file transfer protocols — SFTP, FTPS, and FTP — without needing to manage your own file transfer servers. It’s designed to make migration or integration with legacy systems easier, especially for organizations that rely on traditional file transfer workflows but want the benefits of AWS storage.

Key Features

  • Protocol support:
    • SFTP (Secure File Transfer Protocol – SSH-based)
    • FTPS (FTP Secure – SSL/TLS based)
    • FTP (unencrypted – not recommended for production)
  • Backend storage options:
    • Amazon S3 (object storage)
    • Amazon EFS (file system for Linux-based workloads)
  • Fully managed: AWS handles high availability, patching, scaling, and monitoring.
  • Integrates with:
    • IAM for access control
    • CloudWatch for monitoring
    • CloudTrail for auditing
    • KMS for data encryption
    • Route 53 for custom domain routing

How It Works

When a user uploads a file via SFTP, FTP, or FTPS, AWS Transfer Family securely accepts the connection and writes the file to your Amazon S3 bucket or EFS file system, depending on your configuration.

Access can be controlled via:

  • Service-managed identities (simple, good for fewer users)
  • Custom identity providers via Amazon API Gateway and Lambda (for LDAP or Active Directory-backed user bases)
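
For the service-managed case, here is a boto3 sketch of creating an S3-backed SFTP endpoint and one user. The IAM role ARN, bucket path, username, and public key are placeholders.

```python
import boto3

transfer = boto3.client("transfer")

# Create a service-managed SFTP endpoint backed by S3.
server = transfer.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
    Domain="S3",
    EndpointType="PUBLIC",
)

# Add a user whose uploads land in a specific S3 prefix.
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="partner-upload",
    Role="arn:aws:iam::123456789012:role/transfer-s3-access",
    HomeDirectory="/my-ingest-bucket/partner-upload",
    SshPublicKeyBody="ssh-rsa AAAA... partner@example.com",
)
```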

Common Use Cases

  1. Legacy system migration

    Move your traditional SFTP-based file transfers to AWS without changing existing workflows.

  2. EDI (Electronic Data Interchange)

    Enterprises exchanging files with partners using SFTP or FTPS can now store and process data in AWS.

  3. Partner data exchange

    Allow third parties (clients, vendors) to upload/download files securely into S3 or EFS.

  4. Data ingestion pipelines

    Automate file intake for processing by AWS services (e.g. trigger Lambda, move to S3 Glacier, etc.)

Security

  • Data in transit is secured with SFTP/FTPS encryption.
  • Data at rest is stored in S3 or EFS with optional KMS encryption.
  • Access is controlled via IAM roles, scoped-down policies, and per-user home directories.
  • You can’t use Transfer Family directly to transfer files between S3 and EFS — it’s either/or per server setup.
  • You can set up DNS aliases (via Route 53 or external DNS) for custom domain access.
  • CloudWatch metrics let you monitor upload/download activity, connection attempts, and errors.
  • Billing is based on:
    • Protocol enabled (SFTP, FTPS, FTP)
    • Number of hours the endpoint is active
    • Data uploaded/downloaded

AWS DataSync

AWS DataSync is a managed service that helps you move large amounts of data between on-premises storage and AWS, or between AWS services. It’s ideal for migrations, data archiving, disaster recovery, and syncing data for analytics — all with automation, encryption, monitoring, and fine-tuned control.

What Can DataSync Do?

  • Transfer data from:
    • On-premises NFS or SMB file shares
    • Amazon S3 buckets
    • Amazon EFS
    • Amazon FSx (Windows File Server or Lustre)
  • Transfer data to:
    • Amazon S3 (including Glacier)
    • Amazon EFS
    • Amazon FSx
    • On-prem NFS/SMB

It’s bidirectional and handles millions of files or petabytes of data with ease.

How It Works

At a high level:

  1. You install a DataSync Agent (VM) on-prem if you’re moving data from on-prem to AWS.
  2. You create a source location and a destination location.
  3. You create and configure a task that runs manually or on a schedule.
  4. DataSync handles the transfer — securely, efficiently, and with automatic retry logic.

Offline Transfers with Snowcone

If you don’t have network capacity:

  1. Order a Snowcone device with the DataSync agent pre-installed.
  2. Copy data from your local NFS/SMB file share onto Snowcone via DataSync.
  3. Ship the device back to AWS.
  4. AWS uploads the data into your target S3 bucket or file system.
  5. You can continue syncing incrementally over the network later if needed.

Key Features

  • Encryption in transit using TLS.
  • Data validation after transfer to ensure integrity.
  • Incremental syncs — only changed files are re-transferred.
  • Task scheduling for automated transfers.
  • Bandwidth throttling and exclude/include filters for more control.
  • CloudWatch integration for monitoring task progress and troubleshooting.

Use Cases

  • One-time migrations of large datasets to AWS.
  • Recurring backups from file shares to Amazon S3.
  • DR replication from on-prem storage to FSx for Windows.
  • Syncing data lakes or analytics datasets across regions or accounts.
  • Moving data to S3 Glacier or Glacier Deep Archive (via S3 lifecycle after transfer).

Protocol & Storage Support

| Source/Destination | Supported Protocol |
| --- | --- |
| On-prem NFS | NFS v3/v4 |
| On-prem SMB | SMB v2/v3 |
| Amazon EFS | NFS |
| Amazon FSx for Windows | SMB |
| Amazon S3 | S3 API |

Pricing (High-Level)

  • Charged per GB transferred, depending on region.
  • No charge for the agent itself.
  • CloudWatch monitoring and logs may incur additional charges.

Pro Tips

  • You must use the agent for on-prem transfers. AWS provides prebuilt VMs for VMware, Hyper-V, and EC2.
  • File metadata (timestamps, sizes, permissions) is generally preserved when source and destination support it.
  • POSIX permissions, user IDs, group IDs, and ACLs are preserved when transferring between compatible file systems (e.g. EFS ↔ NFS).
  • S3 does not support traditional file permissions, so metadata preservation is limited to timestamps, sizes, and custom metadata — not POSIX attributes.
  • For transfers within AWS (e.g. S3 to FSx), no agent is needed.
  • DataSync is not real-time — but can be scheduled frequently.

Typical Architecture Example

Let’s say you have a file share on-prem and want to back it up to Amazon S3 daily:

  1. Install the DataSync agent on your local hypervisor.
  2. Register the agent in the AWS console.
  3. Define the SMB/NFS share as the source, and the S3 bucket as the destination.
  4. Create a DataSync task and schedule it to run nightly.
  5. DataSync handles the rest — securely and efficiently.
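
As a sketch of steps 3 and 4 in code, assuming the agent is already registered: the hostnames, ARNs, credentials, and schedule below are placeholders (in practice, pull the share password from Secrets Manager rather than hard-coding it).

```python
import boto3

datasync = boto3.client("datasync")

# Source: the on-prem SMB share, reached through the registered agent.
src = datasync.create_location_smb(
    ServerHostname="fileserver.corp.local",
    Subdirectory="/backups",
    User="svc-datasync",
    Password="********",  # placeholder
    AgentArns=["arn:aws:datasync:us-east-1:123456789012:agent/agent-0123456789abcdef0"],
)

# Destination: an S3 bucket, accessed via a bucket-access IAM role.
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-backup-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/datasync-s3-access"},
)

# Nightly task at 01:00 UTC.
datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="nightly-smb-to-s3",
    Schedule={"ScheduleExpression": "cron(0 1 * * ? *)"},
)
```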

Amazon Athena

Athena is a serverless, interactive query service that lets you analyze data directly in S3 using standard SQL. There’s no infrastructure to provision or manage — you just point Athena at your data in S3, write SQL queries, and get results within seconds. Think of S3 as a giant warehouse full of documents (your data files), and Athena as a librarian with a notepad. You can walk in, ask the librarian a question like: “Can you find me all sales transactions over R10,000 from last year?” And within seconds, the librarian (Athena) brings you a clean list — without having to copy or rearrange the documents. That’s what Athena does with your data in S3.

Supported File Formats

Athena supports a wide variety of structured and semi-structured data formats:

  • CSV
  • JSON
  • Parquet
  • ORC
  • Avro

Using columnar formats like Parquet or ORC is highly recommended — it improves performance and reduces cost.

How It Works – Under the Hood

  1. Data stays in S3 — no movement, no ingestion.
  2. You use SQL to query it (Athena uses the Presto engine under the hood).
  3. You define tables and schemas using a Glue Data Catalog (or directly in Athena).
  4. You pay per query, based on the amount of data scanned.

Tip: Use partitions and columnar formats to minimize the amount of data scanned, saving money and improving speed.
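
Here is a minimal boto3 sketch of running a query and collecting the results. The database, table, and results bucket are placeholders.

```python
import time
import boto3

athena = boto3.client("athena")

# Run a SQL query against a table defined in the Glue Data Catalog; Athena
# writes the results to the S3 location you specify.
query = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM sales WHERE amount > 10000",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```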

Use Cases

  • Ad-hoc querying of logs or large datasets (e.g., CloudTrail logs, ELB logs, clickstream data)
  • Quick exploration of CSV exports from third-party tools
  • Business analytics over static S3 data lakes
  • Security analysis and reporting
  • Federated Query allows you to run SQL queries across data stored in relational, non-relational, object and custom data sources (on-prem or AWS). It uses Data Source Connectors that run on AWS Lambda and stores the results back in S3.

Pricing

  • You pay per query: specifically, $5 per TB scanned.
  • Compress your data, use Parquet/ORC, and partition by date/type to save costs.

Integration with Glue

Athena can use AWS Glue as a central data catalog. Think of it as your “metadata phone book” — it keeps track of what data you have and how it’s structured.

Security

  • Data is never moved, and stays in S3.
  • Supports encryption at rest and in transit.
  • Integrates with IAM for access control.
  • You can limit users to specific queries or databases.

AWS Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, based on PostgreSQL. It is designed for Online Analytical Processing (OLAP), which means it’s optimized for running complex queries across large volumes of data, typically used for business intelligence, analytics, and reporting. Unlike traditional relational databases optimized for transactions (OLTP), Redshift is meant for reading and analyzing data rather than writing lots of small transactions. Think of Redshift as a giant Excel spreadsheet in the cloud, but instead of rows and rows of data stored in a table, it’s organized in columns to make it much faster when you’re only interested in a few columns at a time. And instead of one person filtering and calculating, it’s hundreds of computers doing it all at once.

Key Concepts

  • Columnar Storage: Redshift stores data in columns rather than rows. This allows it to read only the relevant columns needed for a query, which speeds up performance dramatically for analytical workloads.
  • Massively Parallel Processing (MPP): Redshift distributes queries across multiple nodes in a cluster. Each node processes part of the data simultaneously, significantly improving query speed on large datasets.
  • Redshift Clusters: A Redshift cluster consists of a leader node (handles query planning and coordination) and one or more compute nodes (perform the actual work).

How Data Gets Into Redshift

You can load data into Redshift from:

  • Amazon S3 (most common, often via AWS Glue or COPY command)
  • Amazon RDS
  • Amazon DynamoDB
  • Other sources via AWS Data Migration Service (DMS)
  • Third-party ETL tools

You often stage your data in S3 (e.g., using AWS Glue, DataSync, or even Snowball for massive loads) and then load it into Redshift.
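
As a sketch of that staged load, assuming the Redshift Data API (the boto3 "redshift-data" client): the cluster name, database, user, table, bucket, and IAM role ARN are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Load staged Parquet files from S3 into a Redshift table with COPY.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="loader",
    Sql="""
        COPY sales
        FROM 's3://my-staging-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS PARQUET;
    """,
)
```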

Redshift Spectrum

This allows you to run SQL queries directly on data stored in S3 without loading it into Redshift, as long as it’s in a compatible format (e.g., Parquet, ORC, CSV). It’s ideal for analyzing “cold” data stored cheaply in S3, alongside “hot” data already in Redshift. You define an external schema (usually via AWS Glue Data Catalog) and query it just like a normal Redshift table.

Use Cases

  • Business intelligence and dashboarding
  • Customer analytics
  • Operational reporting
  • Data lake queries (via Redshift Spectrum)

Snapshots

Amazon Redshift automatically takes snapshots of your data on a regular schedule (by default roughly every eight hours or after every 5 GB per node of data changes), and you can also take manual snapshots at any time. These snapshots are incremental and stored in S3.

  • Automated snapshots: Retention period can be set (default is 1 day, max 35 days).
  • Manual snapshots: Retained until you explicitly delete them.
  • Storage costs: You only pay for the data that has changed since the last snapshot (incremental).

Snapshots can be used to restore your cluster to a specific point in time—either to the same AWS Region or to another Region for disaster recovery.

Disaster Recovery (DR)

For DR, you can:

  • Copy snapshots across Regions to have a backup in a separate AWS Region.
  • Restore a cluster from a snapshot in another Region if the original Region becomes unavailable.
  • Combine with cross-region snapshot copy and scheduled snapshot automation to create a basic DR strategy.

This setup allows you to spin up a new Redshift cluster in another Region using your latest snapshot—this is a cold standby DR solution, suitable for analytics where immediate failover is not critical.

Security

  • Data is encrypted at rest using KMS.
  • Encryption in transit is supported using SSL.
  • VPC support means you can run Redshift in your private network.
  • IAM policies, resource-based policies, and security groups help control access.

Pricing and Performance Tips

  • You pay for the size and number of nodes in your cluster.
  • Reserved Instances can reduce cost.
  • Use compression (automatically applied when you use the COPY command from S3).
  • Use sort keys and distribution keys wisely to optimize query performance.

AWS OpenSearch Service

Amazon OpenSearch Service is a fully managed service that makes it easy to deploy, operate, and scale OpenSearch (formerly Elasticsearch) clusters in the cloud. It’s used for searching, analyzing, and visualizing large volumes of data in near real-time.

Common use cases include:

  • Log and event analytics (e.g. app or infrastructure logs)
  • Full-text search for websites or applications
  • Security information and event management (SIEM)
  • Observability dashboards

Core Components

  • OpenSearch Cluster: A collection of nodes (instances) that store and index your data.
  • Index: Similar to a table in a database, where documents (JSON format) are stored.
  • Document: A single unit of searchable data in JSON.
  • Domain: An OpenSearch deployment managed by AWS.

Ingesting Data

Data is ingested via:

  • APIs (RESTful)
  • AWS services like Kinesis Data Firehose, Logstash, AWS Lambda
  • Integration with CloudWatch Logs

Searching and Querying

Once indexed, data can be queried using the OpenSearch Query DSL, or explored visually using OpenSearch Dashboards (previously Kibana). You can search structured and unstructured data—text, logs, metrics—very efficiently.
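
For a feel of the API, here is a sketch using the opensearch-py client: index a log document, then run a match query. The domain endpoint and credentials are placeholders (production setups typically use fine-grained access control or SigV4 request signing).

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "search-my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "********"),  # placeholder credentials
    use_ssl=True,
)

# Index a log document, then run a full-text match query against it.
client.index(index="app-logs", body={"level": "ERROR", "message": "payment timeout"})
client.indices.refresh(index="app-logs")

results = client.search(
    index="app-logs",
    body={"query": {"match": {"message": "timeout"}}},
)
```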

Integration with AWS Services

OpenSearch Service integrates with:

  • CloudWatch (monitoring)
  • IAM (fine-grained access control)
  • Kinesis Firehose (data streaming into OpenSearch)
  • Lambda (real-time log enrichment or transformation)
  • S3 (via tools like Logstash or Data Prep workflows)

High Availability and Durability

You can configure:

  • Multiple Availability Zones (AZs) for high availability
  • Dedicated master nodes to improve cluster stability
  • Snapshots to S3 for backup and restore (automated or manual)

Security

OpenSearch Service provides:

  • Fine-grained access control (based on roles)
  • Encryption at rest (via AWS KMS)
  • TLS encryption in transit
  • VPC support (so you can run the service entirely inside your private network)

Monitoring

You can monitor cluster health and performance using:

  • OpenSearch Dashboards
  • CloudWatch metrics
  • Slow logs for indexing and querying

AWS EMR

Amazon EMR (Elastic MapReduce) is a managed big data platform that allows you to process and analyze huge amounts of data using open-source frameworks like Apache Spark, Hadoop, Hive, HBase, Flink, and Presto. It abstracts the complexity of provisioning and managing the compute cluster so you can focus on your data processing logic.

You typically use EMR for:

  • Data transformation (ETL jobs)
  • Log analysis
  • Machine learning workloads
  • Batch processing
  • Ad hoc querying of large datasets

How It Works

At its core, EMR launches a cluster of EC2 instances where each node plays a specific role:

  • Master node: Coordinates the cluster and tracks job progress.
  • Core nodes: Handle data processing and store data using HDFS (Hadoop Distributed File System).
  • Task nodes (optional): Only process data but don’t store it.

You can also use EMR Serverless, where you don’t manage any infrastructure and simply submit Spark or Hive jobs—AWS handles scaling and provisioning.
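
Here is a boto3 sketch of launching a small transient Spark cluster that runs one PySpark script from S3 and then terminates. The script path, instance types/counts, and the default EMR roles are placeholders or assumed to already exist.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="clickstream-etl",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster once the step finishes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "transform-to-parquet",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-scripts/transform.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```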

Data Sources

EMR supports a range of data sources:

  • S3 is the most common—used as the main storage layer (known as data lake architecture).
  • Can also read from DynamoDB, JDBC, RDS, Redshift, and Kinesis.

Because it works well with S3, EMR lets you separate compute from storage, allowing you to stop the cluster without losing your data.

Pricing

Pricing is based on:

  • The number and type of EC2 instances used
  • Duration of use
  • EMR adds a small fee on top of EC2 pricing

You can save significantly using Spot Instances for non-critical or fault-tolerant jobs.

Use Case Examples

  • Transforming raw clickstream data into Parquet or ORC format stored in S3
  • Running Spark-based machine learning pipelines on large volumes of text or sensor data
  • Querying data in S3 using Presto or Hive on a temporary cluster

AWS QuickSight

Amazon QuickSight is a fully managed Business Intelligence (BI) service that lets you create and share interactive dashboards and visualizations. It’s designed for quick setup, integration with AWS data sources, and fast rendering using a memory-optimized engine called SPICE.

Key Features

  • SPICE Engine: Super-fast, in-memory engine for performing quick queries and rendering dashboards. It scales automatically behind the scenes.
  • Data Sources: You can connect to a variety of data sources, including S3 (via Athena), RDS, Redshift, Aurora, Snowflake, and even on-premises databases using a QuickSight data connector/agent.
  • Visualizations: Supports a wide range of charts like bar, line, pie, maps, KPIs, pivot tables, and more.
  • Sharing: Dashboards can be shared with users via email links or embedded into web apps (with row-level security if needed).
  • Pay-per-session pricing: You don’t pay for user seats. Instead, you can pay only for actual usage if you choose the reader-based model.

SPICE vs Direct Query

  • SPICE: Cached data stored in-memory, fast performance, good for repeated queries.
  • Direct Query: Data is queried live from the source. Useful for up-to-the-minute data but can be slower and limited by data source throughput.

Security and Permissions

  • Uses IAM for access control and can integrate with Active Directory via AWS SSO.
  • Can enforce column-level security to show users only the data they are permitted to view.

Use Cases

  • Creating interactive dashboards over data in Athena or Redshift
  • Giving business users insights into app metrics without exposing raw data
  • Providing embedded analytics in customer-facing portals

Architecture Example

You might store data in S3, query it with Athena, and use QuickSight to visualize that data. It pulls from SPICE if cached or fetches live via Direct Query. Dashboards can be embedded in internal tools or customer portals.


AWS Glue

AWS Glue is a fully managed ETL (Extract, Transform, Load) service that helps you prepare and move data for analytics, machine learning, and application development. It’s serverless, meaning you don’t need to manage any infrastructure, and it’s tightly integrated with other AWS services like S3, Redshift, Athena, and Lake Formation.

In short, Glue helps you clean, prepare, transform, and catalog your data—without having to write everything from scratch or spin up your own servers.

Key Components of AWS Glue

Data Catalog

The Glue Data Catalog is a central metadata repository where your table definitions, schemas, partitions, and job metadata are stored. It’s like the index of a massive library. Services like Athena and Redshift Spectrum use the Glue Catalog to understand your data stored in S3.

Example: You have a bunch of CSV files in S3. Glue can crawl those files, detect their structure, and store that info in the Data Catalog as a table. Then Athena can query that data using SQL as if it were in a database.

Crawlers

A crawler connects to your data source, detects the schema automatically, and registers tables and partitions in the Data Catalog. You can schedule crawlers to run on a recurring basis to keep metadata up to date.

Use case: You drop new JSON files into S3 every day. A crawler keeps the table schema current so your queries keep working without manual updates.

Jobs

Glue Jobs are the ETL workflows. They extract data from one or more sources, apply transformations, and then write the data to a target. Jobs can be written in Python or Scala using Apache Spark behind the scenes.

You can build jobs in:

  • Glue Studio (visual drag-and-drop interface)
  • Notebooks (interactive development)
  • Script editor (write code directly)

You can also use Glue Python Shell Jobs, which run lightweight scripts that don’t need Spark—ideal for things like file format conversions or moving metadata around.
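
To make the Jobs idea concrete, here is a minimal Glue (PySpark) job script that reads a catalogued CSV table and writes it back to S3 as Parquet. The database, table, and bucket names are placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: the job name is passed in by Glue at runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog...
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_csv"
)

# ...and write it back to S3 as Parquet for cheaper, faster queries.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/sales/"},
    format="parquet",
)

job.commit()
```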

Triggers

Triggers are used to start Glue jobs based on a schedule, on-demand call, or in response to an event (like the completion of another job). They help automate your data pipeline.

Workflows

A workflow lets you chain multiple jobs and triggers into a directed acyclic graph (DAG). You can visualize, monitor, and track the execution status of complex pipelines.

Use Cases

  • Preparing data stored in S3 for querying with Athena or Redshift Spectrum
  • Cleaning and transforming logs, clickstream, or transactional data for analytics
  • Migrating data between databases
  • Building a data lake or modern data warehouse architecture

Glue and Lake Formation

Glue’s Data Catalog is also used by AWS Lake Formation, which adds access control, governance, and fine-grained permissions on top of Glue. If you’re building a secure data lake, Lake Formation and Glue work hand in hand.

Important Glue Concepts for Practice

  • Serverless Spark: Glue jobs are Spark-based but managed completely by AWS.
  • Partitioning: Glue handles partitioned data efficiently—for example, splitting data by date in S3 folders.
  • Job bookmarks: Prevent reprocessing of old data. Useful for incremental ETL.
  • Glue version: Determines which Spark/Python versions your job uses.
  • Glue DataBrew: Clean and normalize data using pre-built transformations.
  • Glue Studio vs Notebooks: Studio is best for visual ETL; Notebooks are better for custom, interactive work.

AWS Lake Formation

AWS Lake Formation is a managed service that helps you build, secure, and manage a data lake on AWS quickly. A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale, typically in Amazon S3.

Lake Formation focuses on simplifying and securing your data lake setup. It uses the Glue Data Catalog under the hood and works with services like Athena, Redshift Spectrum, and Amazon EMR.

Key Benefits

  • Centralized access control and security for S3 data
  • Automated data ingestion, transformation, and cataloging
  • Fine-grained permissions down to row, column, and table level
  • Unified governance across analytics services

How It Works

  1. Register a Data Lake Location

    You define the S3 buckets/folders where your data lake lives. Lake Formation takes control of permissions to enforce access control.

  2. Crawl and Catalog Data

    Use AWS Glue crawlers to scan your data and populate the Glue Data Catalog with tables and metadata.

  3. Set Permissions

    Use Lake Formation to assign permissions to users, groups, or roles on databases, tables, and even individual columns. This is done centrally, rather than manually assigning S3 bucket policies or IAM policies.

  4. Query with Analytics Services

    Once permissions are set, users can access the data securely using Athena, Redshift Spectrum, or EMR—all while respecting Lake Formation’s access controls.

Fine-Grained Access Control

Lake Formation allows you to control:

  • Who can see what columns (e.g. hide PII)
  • What rows a user can access (e.g. only their region’s data)
  • What tables or databases a user can query

Example: You have a sales dataset across multiple regions. With row-level access control, you can allow a user to only see data from their own region without creating multiple copies of the dataset.
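
Column-level grants like the PII example above can be expressed in a single boto3 call; here is a sketch where the role ARN, database, table, and column names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on only the non-sensitive columns of a table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/regional-analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "sales",
            "ColumnNames": ["order_id", "region", "amount"],  # PII columns omitted
        }
    },
    Permissions=["SELECT"],
)
```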

Lake Formation Permissions vs IAM

IAM controls what services a user can use, while Lake Formation controls what data they can see inside the data lake. This allows for separation of duties—you don’t have to modify IAM policies every time you want to change data access.

Use Cases

  • Create secure self-service data lakes for analysts and data scientists
  • Implement fine-grained data access policies without duplicating datasets
  • Centralized governance for large teams and organizations
  • Enforce compliance requirements like GDPR by masking or blocking access to sensitive data

Lake Formation vs Glue

| Feature | Lake Formation | Glue |
| --- | --- | --- |
| Metadata Catalog | ✅ (uses Glue) | ✅ |
| Crawling | ✅ (via Glue crawlers) | ✅ |
| ETL Jobs | ❌ | ✅ |
| Data Security / Access Control | ✅ | Limited (IAM-level only) |
| Row/Column-Level Security | ✅ | ❌ |
| Centralized Permissions | ✅ | ❌ |

Amazon Managed Service for Apache Flink

Amazon Managed Service for Apache Flink is a fully managed service that lets you process streaming data in real time using Apache Flink, an open-source framework and engine for stateful stream processing. It’s designed to handle large-scale, low-latency stream processing workloads with minimal operational overhead.

What It Does

Apache Flink is powerful for use cases where you need to respond to data as it arrives — think dashboards, fraud detection, alerts, or live metrics. With this managed service, AWS handles the provisioning, scaling, patching, and fault tolerance, allowing you to focus on writing Flink applications (using Java, Scala, or SQL).

It integrates natively with AWS data streams such as:

  • Amazon Kinesis Data Streams
  • Amazon MSK (Managed Streaming for Apache Kafka)
  • Amazon S3 (for input/output or state backups)

Note: Flink cannot consume data from Kinesis Data Firehose. Firehose is designed to deliver data to storage and analytics services (e.g. S3, Redshift, OpenSearch), not to act as a source for real-time stream processing. If you need real-time stream processing with Flink, use Kinesis Data Streams or MSK.

Managed Flink can also send processed results to destinations like:

  • Amazon S3
  • Amazon Redshift
  • Amazon OpenSearch Service
  • Amazon Kinesis Data Firehose

Why Use It

Use cases include:

  • Real-time analytics dashboards
  • Fraud detection systems
  • Continuous metric evaluation
  • Event-driven applications

Unlike batch jobs (like Glue or EMR), Flink applications run continuously, processing data as it flows in.

Key Concepts

  • Streaming Applications: Applications are long-running and stateful.
  • Checkpointing: Built-in mechanism to periodically save application state to Amazon S3 for fault tolerance.
  • Autoscaling: The service can scale Flink applications automatically based on usage patterns (Amazon Managed Service for Apache Flink was formerly known as Kinesis Data Analytics for Apache Flink).
  • Application Versions: Flink apps can be versioned and rolled back.
  • Metrics & Logs: Integrated with CloudWatch for monitoring.

Real Life Analogy

Think of Flink like a traffic controller at a busy airport (your data stream). Instead of waiting for the day to end (batch processing), it’s constantly routing and making decisions in real-time as planes (data events) arrive — redirecting, alerting, and logging as needed.

SQL Support

You don’t have to write complex Java code. Flink supports Flink SQL, which makes it easier to define simple stream processing logic using familiar SQL syntax. Example: You can filter a stream of events to show only where event_type = 'purchase' and aggregate the number of purchases per minute.
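
As a rough sketch of that example using PyFlink’s Table API: the stream name, field names, and especially the Kinesis connector options are assumptions and depend on the connector version bundled with your application.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a source table over a Kinesis stream (connector options are illustrative).
t_env.execute_sql("""
    CREATE TABLE events (
        event_type STRING,
        amount DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'clickstream',
        'aws.region' = 'us-east-1',
        'format' = 'json'
    )
""")

# Count purchases per one-minute tumbling window.
purchases_per_minute = t_env.sql_query("""
    SELECT TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```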


AWS MSK

Amazon MSK is a fully managed service for Apache Kafka, a popular open-source platform used to build real-time data pipelines and streaming applications. MSK handles the provisioning, setup, and ongoing maintenance of Kafka clusters, allowing you to focus on building your applications instead of managing infrastructure.

What is Apache Kafka?

Apache Kafka is a distributed streaming platform that lets you:

  • Publish and subscribe to streams of records (like a messaging system)
  • Store data reliably (Kafka keeps messages on disk for a configurable period)
  • Process streams of data in real time

Kafka organizes data into topics, and each topic can be broken down into partitions that scale horizontally.

What MSK Does

MSK provides a fully managed, highly available Kafka environment:

  • Handles provisioning, patching, monitoring, and failover
  • Offers integrated security with IAM, TLS encryption in transit, and encryption at rest
  • Supports Kafka-native APIs, so existing Kafka clients and tools just work
  • Integrates with other AWS services like Lambda, Flink, Kinesis Data Analytics, and OpenSearch

Use Cases

  • Real-time application and user analytics
  • Log aggregation from distributed systems
  • Event sourcing architectures
  • Streaming ETL pipelines
  • Ingesting high-throughput data for AI/ML models

Example: You can use MSK to ingest clickstream data from a website, process it in real time with Flink, and then store the output in S3 for analytics.

Key Concepts

  • Broker: Kafka nodes that receive and store messages.
  • Producer: Sends messages to a Kafka topic.
  • Consumer: Reads messages from a topic.
  • Partition: A horizontal division of data in a topic for scalability.
  • Replication: Ensures fault tolerance by duplicating data across brokers.

MSK supports replication across availability zones, ensuring high availability and durability.
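
Because MSK is Kafka-native, standard Kafka clients work unchanged. Here is a sketch using the kafka-python library; the broker hostnames are placeholders (TLS endpoints listen on port 9094), and you would get the real list from the cluster’s bootstrap brokers.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

brokers = ["b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094"]

# Producer: publish JSON events to a topic over TLS.
producer = KafkaProducer(
    bootstrap_servers=brokers,
    security_protocol="SSL",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "event_type": "purchase"})
producer.flush()

# Consumer: read events back as part of a consumer group.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers=brokers,
    security_protocol="SSL",
    group_id="analytics",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)
```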

Integration and Security

  • Can integrate with VPC, IAM, CloudWatch, CloudTrail
  • Supports PrivateLink for secure VPC access
  • Data is encrypted in transit (TLS) and at rest (KMS)
  • IAM authentication and SASL/SCRAM supported for secure client access

MSK vs MSK Serverless

| Feature | MSK | MSK Serverless |
| --- | --- | --- |
| Provisioning | You manage brokers | No provisioning required |
| Scaling | Manual or auto-scaling | Automatic |
| Cost Model | Pay per broker/hour | Pay per throughput usage |
| Use Case Fit | Consistent throughput | Spiky or unpredictable workloads |

MSK Serverless is great for teams that want Kafka without managing capacity at all.

Real-Life Analogy

Imagine Kafka as a postal sorting facility. Producers are customers dropping off packages (events), Kafka topics are the labeled conveyor belts, and consumers are delivery drivers who collect the sorted packages. MSK gives you the entire facility — running, secure, and ready — without needing to hire staff or fix machines.

AWS MSK vs Kinesis Data Streams

Kinesis Data Streams (KDS) and Amazon MSK (Managed Kafka) are both streaming data platforms on AWS, but they cater to slightly different needs:

| Feature | Kinesis Data Streams (KDS) | Amazon MSK (Kafka) |
| --- | --- | --- |
| Service Type | AWS-native, fully managed | Managed Kafka, open-source compatible |
| Client Compatibility | AWS SDK, CLI, Kinesis Agent | Kafka-native tools and APIs |
| Scaling Model | Shards (manual or On-Demand mode) | Brokers and partitions |
| Ordering | Guaranteed within a shard | Guaranteed within a partition |
| Retention | Up to 365 days (default 24h) | Configurable retention |
| Integration | Tight with AWS services (Lambda, Firehose, etc.) | Best for Kafka-based ecosystems |
| Use Cases | Simple AWS-native streaming apps | Migration from Kafka or need for Kafka ecosystem |

When to Use Which?

  • Use Kinesis when you want quick integration with AWS services, need simple ingestion, and don’t want to manage partitions or brokers.
  • Use MSK when you’re migrating from or already using Kafka, need compatibility with Kafka tools, or want more control over the streaming architecture.

About the Author

Dawie Loots is a data scientist with a keen interest in using technology to solve real-world problems. He combines data science expertise with his background as a Chartered Accountant and executive leader to help organizations build scalable, value-driven solutions.
