DynamoDB Performance Tuning in High-Traffic E-commerce Applications
Apr 30, 2025
Building a fast and scalable e-commerce platform on AWS often means leveraging Amazon DynamoDB for its single-digit millisecond response times at any scale (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL). However, achieving consistent performance under real-world conditions requires careful tuning. This exploration uses a narrative problem-solution approach, recounting real e-commerce scenarios where DynamoDB performance issues arose and how they were resolved. We’ll dive into partition management, indexing strategies (GSIs and LSIs), and best practices for high-throughput workloads. Each case study illustrates the problem, the investigation process, the scientific solution (with explicit tuning steps), and the performance outcomes.
Case Study 1: The Flash Sale Hot Partition Mystery
Problem & Symptoms: An online retailer (“MegaMart”) ran a flash sale that caused certain product pages to become extremely slow. Most shoppers experienced snappy responses, but a few popular items (flash deals) had high latency and occasional DynamoDB ProvisionedThroughputExceededException errors at checkout time. The DynamoDB table backing the cart service had a partition key design that inadvertently funneled heavy activity to a single partition. In DynamoDB, if a single partition key receives too many requests (more than about 3,000 read units or 1,000 write units per second), that key becomes a hot partition and can get throttled (Choosing the Right DynamoDB Partition Key | AWS Database Blog). In MegaMart’s case, a “Deals” item became a hot key during the sale, saturating one partition’s capacity and slowing down requests for that key.
Investigation: The engineering team first noticed a spike in DynamoDB latency in CloudWatch metrics and error logs showing ThroughputExceeded exceptions. By instrumenting the code, they logged the processing time of each DynamoDB query, which helped pinpoint that whenever a certain DealID was queried repeatedly in a short span, response times shot up (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL). This indicated one partition was overwhelmed. They also reviewed the access patterns and realized that the partition key (DealID) had low cardinality during the sale – effectively all users were hitting the same key. DynamoDB’s adaptive capacity was helping, but not enough: adaptive capacity can automatically boost a hot partition’s share of throughput beyond its normal allocation, allowing uneven traffic to be served indefinitely without errors as long as overall table throughput isn’t exceeded (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog). However, in this flash sale the single hot key was exceeding even those boosted limits. The root cause was a partition key design issue leading to an extreme hot spot.
Solution (Partition Management & Caching): The team addressed the hot partition in two ways. First, they decided to introduce a partition key shard for new events: by appending a random digit (0–9) to the DealID, they would spread writes and reads across 10 logical partitions instead of one. This write sharding technique is recommended for hot keys – for example, if one key needs ~5,000 writes/sec, using a range of 5–10 suffixes spreads the load and avoids hitting the 1,000 WCU per-partition limit (Choosing the Right DynamoDB Partition Key | AWS Database Blog). In practice, MegaMart’s developers updated their code so that each cart item entry used a key like DealID#<random_suffix>. They planned to query all suffixes in parallel when reading the cart (an acceptable trade-off for far greater throughput (Choosing the Right DynamoDB Partition Key | AWS Database Blog)).
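To make the sharding pattern concrete, here is a minimal sketch in Python with boto3. The table name ("CartItems"), the key attribute ("PK"), and the shard count are illustrative assumptions, not MegaMart’s actual schema:

```python
import random
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 10  # suffixes 0-9, as described above
table = boto3.resource("dynamodb").Table("CartItems")  # hypothetical table name

def put_cart_entry(deal_id: str, item: dict) -> None:
    """Write under a sharded key such as 'DEAL42#7' to spread load across partitions."""
    shard = random.randint(0, NUM_SHARDS - 1)
    table.put_item(Item={**item, "PK": f"{deal_id}#{shard}"})

def get_cart_entries(deal_id: str) -> list:
    """Read back by querying every shard suffix in parallel and merging the results."""
    def query_shard(shard: int):
        resp = table.query(KeyConditionExpression=Key("PK").eq(f"{deal_id}#{shard}"))
        return resp["Items"]  # pagination via LastEvaluatedKey omitted for brevity

    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        return [it for items in pool.map(query_shard, range(NUM_SHARDS)) for it in items]
```

The write path stays cheap (one PutItem per entry); the read path pays for a 10-way fan-out, which is the trade-off the team accepted in exchange for roughly tenfold write headroom on the hot key.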
Secondly, to immediately alleviate the pressure on the hot item during the sale, the team deployed an in-memory cache in front of DynamoDB. Given that the flash deal data didn’t change frequently, they used Amazon DynamoDB Accelerator (DAX) as a write-through cache for that item. DAX is a fully managed cache that is API-compatible with DynamoDB, requiring minimal code changes (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL). It acts as a “low-pass filter” for reads, intercepting requests for extremely popular items so they don’t all hit the database (Choosing the Right DynamoDB Partition Key | AWS Database Blog). In this case, DAX (with a 5-minute TTL) served repeat reads of the hot deal, preventing DynamoDB partitions from being swamped by repetitive reads (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
Tuning steps implemented:
- Identified the hot partition key via application logs and CloudWatch (saw one DealID causing high latency and throttles) (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL).
- Enabled caching for that key using DynamoDB Accelerator (DAX), drastically reducing the repeated reads hitting DynamoDB (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
- Refactored the partition key schema to add a random suffix for write sharding on new entries (Choosing the Right DynamoDB Partition Key | AWS Database Blog). This required updating the application to write and read across multiple partition key variants.
- Validated the fix by load-testing another flash sale scenario in a staging environment, ensuring no single partition key exceeded the throughput limits.
Outcome: The impact was immediate. After enabling DAX, cache hit rates of ~95% on the hot item meant DynamoDB itself handled far fewer requests, and the user-facing latency for that deal dropped from ~500 ms back to ~30 ms (cache responses). The ProvisionedThroughputExceeded errors disappeared. Longer term, the partition key redesign ensured that even if the same deal became popular again, its load would be split across 10 partitions. Subsequent sales events ran without incident – DynamoDB sustained the high traffic with no throttling and consistent single-digit millisecond latency, keeping the flash sale experience smooth. The retailer avoided what could have been a costly outage. This case underscores the importance of choosing a high-cardinality partition key (or manually sharding it) to distribute load evenly (Choosing the Right DynamoDB Partition Key | AWS Database Blog). It also shows how adaptive capacity and caching can complement good design: DynamoDB adaptive capacity automatically boosted the hot partition’s throughput allocation during the spike (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog), and with DAX the team offloaded enough traffic to ride out the surge. By the end, MegaMart’s DynamoDB usage was battle-tested for extreme peaks, much like Amazon.com’s own DynamoDB-backed systems that handle Black Friday loads with ease (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
Case Study 2: The Throttled Index in the Catalog Service
Problem & Symptoms: An e-commerce fashion site ran into a puzzling issue: their product catalog page was usually fast, but sometimes it became extremely slow to load item listings. This was surprising because the site was still in beta with low traffic. Upon checking the AWS console during a slowdown, the engineers saw errors stating “the level of configured provisioned throughput for one or more global secondary indexes was exceeded.” In DynamoDB, Global Secondary Indexes (GSIs) have their own read/write capacity, separate from the base table (amazon web services - Why sometimes the DynamoDB is extremely slow? - Stack Overflow). The error indicated a GSI was under-provisioned, causing queries on that index to throttle and crawl. In this case, the catalog table had a GSI for querying products by category, which was misconfigured with very low capacity. When a certain employee ran a category-wide scan for testing, it overwhelmed the GSI’s throughput and led to queries taking 5–10 seconds (or timing out).
Investigation: The team first reproduced the issue outside the app by querying the DynamoDB table directly in the AWS Console’s PartiQL editor. The UI confirmed that queries on the “CategoryIndex” GSI were extremely slow and eventually hit the throughput error. CloudWatch metrics for DynamoDB revealed that the GSI’s consumed capacity spiked above its tiny provisioned limit (just 1 read capacity unit). They realized that no application bug was needed to trigger this – even a manual query in the console would be throttled by such a low-capacity index. The DynamoDB documentation highlights this scenario: for example, if a GSI’s read capacity is set to 1, you can only read ~1 item per second from that index. A query that needs to return 10 items could take ~10 seconds to complete under that limit (amazon web services - Why sometimes the DynamoDB is extremely slow? - Stack Overflow). This exactly matched the symptom. The root cause was simply a mis-provisioned GSI: the team had created the index in provisioned mode with insufficient RCUs (perhaps a copy-paste error or oversight during deployment). Because the base table was on-demand mode, it wasn’t throttling, but the GSI still enforced its provisioned cap, becoming the bottleneck.
Solution (Index Tuning): Fixing this was straightforward: adjust the GSI’s capacity to match the workload. The team updated the “CategoryIndex” to use on-demand capacity (so it would scale automatically with the base table’s traffic) and as a safeguard set an appropriate autoscaling policy for provisioned mode in case they switched back. As AWS notes, for a table in provisioned mode you must explicitly set GSI throughput; but in on-demand mode, GSIs also bill per request which simplifies capacity management (amazon web services - Why sometimes the DynamoDB is extremely slow? - Stack Overflow). After switching to on-demand, the throttling ceased immediately. They also optimized the index’s projections to include all attributes needed by the query (product name, price, etc.) so that the application would not need to do an extra fetch from the base table for each result. This projection tuning improved efficiency because now each query result could be served entirely from the index itself – avoiding the extra read cost and latency of fetching from the main table (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance).
To prevent similar issues, the engineers implemented a couple of best practices: (1) CloudWatch Alarms on GSI throttling metrics, so they would be alerted if any index approaches its capacity limit. (2) Added the GSI’s throughput configuration to their infrastructure-as-code, treating it with the same attention as the base table’s settings. This way, an oversight like a default of 1 RCU would be caught in code review.
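The capacity change and the guardrails can be expressed in a few boto3 calls. This is a sketch only – the “Products” table, the “CategoryIndex” name, and the capacity numbers are assumptions for illustration, and in practice these settings would live in the team’s infrastructure-as-code:

```python
import boto3

dynamodb = boto3.client("dynamodb")
autoscaling = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

# 1) Give the GSI realistic provisioned throughput (if staying in provisioned mode).
dynamodb.update_table(
    TableName="Products",
    GlobalSecondaryIndexUpdates=[{
        "Update": {
            "IndexName": "CategoryIndex",
            "ProvisionedThroughput": {"ReadCapacityUnits": 500, "WriteCapacityUnits": 50},
        }
    }],
)

# 2) Let the index read capacity track utilization (~70% target) automatically.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/Products/index/CategoryIndex",
    ScalableDimension="dynamodb:index:ReadCapacityUnits",
    MinCapacity=50,
    MaxCapacity=2000,
)
autoscaling.put_scaling_policy(
    PolicyName="CategoryIndexReadScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/Products/index/CategoryIndex",
    ScalableDimension="dynamodb:index:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)

# 3) Alarm on read throttling at the index level so the next bottleneck is caught early.
cloudwatch.put_metric_alarm(
    AlarmName="Products-CategoryIndex-ReadThrottles",
    Namespace="AWS/DynamoDB",
    MetricName="ReadThrottleEvents",
    Dimensions=[
        {"Name": "TableName", "Value": "Products"},
        {"Name": "GlobalSecondaryIndexName", "Value": "CategoryIndex"},
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
)
```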
Tuning steps implemented:
- Increased the GSI throughput: Switched the problematic GSI to on-demand mode, allowing it to scale up to meet spikes (e.g. double the previous peak traffic instantly) (Demystifying Amazon DynamoDB on-demand capacity mode - AWS). In provisioned terms, they gave it enough RCUs to handle the largest category query (~100 items per query, 5 queries/sec => ~500 RCUs).
- Enabled autoscaling for the GSI (with target utilization ~70%) in case they return to provisioned mode, to dynamically adjust capacity with usage.
- Optimized index projections: The “CategoryIndex” was redefined to project only the necessary attributes (reducing index storage size) but include all data needed by the listing page so no additional GetItem calls were required (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance).
- Monitoring: Set up CloudWatch alarms for high ConsumedReadCapacityUnits on the GSI relative to its provisioned capacity, to catch any future bottlenecks early.
Outcome: After these changes, the product category pages consistently loaded within ~50–100 ms (down from several seconds). In one test, a scan that previously took 10+ seconds completed in under 1 second once the GSI had proper capacity. The error messages disappeared, and the team gained confidence that the index would scale automatically with traffic. This case highlights the importance of index capacity planning: a GSI can become a hidden bottleneck if forgotten. The separation of throughput for GSIs means you must monitor them just like tables. Best practices from this incident include using on-demand mode for unpredictable workloads and keeping the number of indexes to a minimum. The team realized that each additional GSI not only adds write cost (every table write also writes to the index) but also requires careful provisioning (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). By consolidating some queries, they could avoid creating more GSIs – for example, they considered using a single “overloaded” GSI to serve both category and brand lookups by prefixing the partition key with a type (Category#Shoes vs Brand#Nike). DynamoDB’s flexible schema allows such GSI overloading, where one index can support multiple access patterns (Overloading Global Secondary Indexes in DynamoDB - Amazon DynamoDB). This technique reduces the total indexes needed (well under DynamoDB’s default limit of 20 GSIs per table) and can cut costs by not duplicating write overhead across many indexes. With the “CategoryIndex” tuned and these practices in place, the catalog service was ready for production traffic.
Case Study 3: Order Pipeline Throughput – Scaling Writes with Shards
Problem & Symptoms: A growing e-commerce platform faced a challenge in their order processing microservice. During peak periods (like holiday sales), order throughput spiked to tens of thousands of writes per second as customers checked out rapidly. The Orders table used a partition key of OrderDate (daily) and a sort key of OrderID. This design was chosen to group orders by date. However, it meant all orders for a given day shared the same partition key, which on high-traffic days became a huge write hotspot. DynamoDB automatically partitions data and can scale out when a partition exceeds 10 GB or sustains high throughput, but if all writes target one logical partition key, they can still bottleneck on a single partition until a split occurs (Choosing the Right DynamoDB Partition Key | AWS Database Blog). The symptom was that by mid-day, orders for “today” started getting throttled – the system saw elevated write latency and some orders would briefly fail to persist (triggering retries and slowing the pipeline). The team observed that they were nearing the 1,000 WCU/sec per-partition limit for the hot partition key (the current date) (Choosing the Right DynamoDB Partition Key | AWS Database Blog). Even though DynamoDB’s adaptive capacity tried to boost this partition, and autoscaling had doubled the table’s provisioned WCU, the single-key bottleneck remained.
Investigation: The team used CloudWatch metrics (ConsumedWriteCapacityUnits and throttle counts) together with CloudWatch Contributor Insights for DynamoDB to confirm the issue: almost all writes were going to one partition key, consuming ~100% of that partition’s share. They recalled that DynamoDB partitions data by hashing the partition key, so one partition key maps to one hash bucket. Unless they introduced additional diversity in the key, all writes for that day stayed in one partition until an automatic split might occur. But waiting for DynamoDB to split on its own (which generally happens only when a partition grows beyond 10 GB or sustains heavy throughput) wasn’t a timely solution. They needed to proactively spread the load. The lesson was clear: a timestamp as a partition key can concentrate activity, especially for real-time events. This violates the best practice of using a high-cardinality key that naturally spreads traffic (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
During the diagnosis, they also considered DynamoDB’s burst capacity and adaptive behavior. DynamoDB tables can accumulate unused capacity credits (up to 5 minutes’ worth) which can help absorb sudden bursts temporarily. In this case, the traffic wasn’t just a spike; it was sustained, so burst credits were exhausted and couldn’t mask the problem. Adaptive capacity did kick in to redistribute throughput to the busy partition (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog). In fact, the team noticed that after some minutes of sustained imbalance, DynamoDB successfully allowed that hot partition to consume more than its normal share (“boosting” it as described in AWS docs (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog)). However, as traffic kept climbing, they were approaching fundamental limits – one partition can only scale so far without splitting. If they exceeded ~1,000 writes/sec on one key continuously, they’d see throttle errors again. Thus, the investigation concluded they had to change the data model to avoid a single partition key for all hot writes.
Solution (High-Throughput Write Design): The team implemented write sharding on the OrderDate key. Instead of using a plain date (e.g. 2025-03-12) as the partition key for all orders of the day, they introduced a suffix shard based on a hash of the OrderID. For example, an order for Mar 12 might get a partition key of 2025-03-12#A or 2025-03-12#B, etc., where the suffix ranged from A–J (10 shards). This immediately multiplies the write throughput capacity for “today” by roughly the number of shards, since writes will be distributed across 10 partition key values instead of one. The DynamoDB Developer Guide recommends this approach for high-volume partitions: “add a random suffix (for example 0–9) to the partition key” to distribute load (Choosing the Right DynamoDB Partition Key | AWS Database Blog). In their case, they used a deterministic hash to assign orders to shards (so that the same order ID always went to the same suffix, ensuring idempotency on retries). They updated the application logic so that any process writing a new order would compute the shard key on the fly.
Another step was to adjust read patterns accordingly. Downstream systems (like a shipping service that queried orders by date) now needed to read from all shards for a given date. The team provided a utility in their code to abstract this: when reading orders for 2025-03-12, the code would perform 10 parallel Query requests (one for each shard 2025-03-12#<shard>) and merge the results. This added a bit of complexity, but it was acceptable given the throughput gain. They also considered whether the increased read fan-out would be an issue. In DynamoDB, 10 smaller queries issued concurrently are still very fast, so the overall latency remained low (single-digit milliseconds per shard, with the results aggregated in perhaps 20–30 ms total). The trade-off was clearly in favor of sharding: writes are much harder to scale than parallel reads.
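A condensed sketch of this data-access layer, assuming an Orders table whose partition key attribute is named OrderDateShard and whose items carry OrderDate and OrderID; the table and attribute names are assumptions for illustration:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

SHARD_SUFFIXES = "ABCDEFGHIJ"  # 10 shards, A-J, as in the narrative
orders = boto3.resource("dynamodb").Table("Orders")  # assumed table name

def shard_for(order_id: str) -> str:
    """Deterministic hash so the same OrderID always lands on the same shard (idempotent retries)."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return SHARD_SUFFIXES[digest[0] % len(SHARD_SUFFIXES)]

def write_order(order: dict) -> None:
    shard_key = f"{order['OrderDate']}#{shard_for(order['OrderID'])}"  # e.g. '2025-03-12#C'
    orders.put_item(Item={**order, "OrderDateShard": shard_key})

def orders_for_date(order_date: str) -> list:
    """Fan out one Query per shard in parallel and merge, hiding the sharding from callers."""
    def query_shard(suffix: str):
        resp = orders.query(
            KeyConditionExpression=Key("OrderDateShard").eq(f"{order_date}#{suffix}")
        )
        return resp["Items"]  # pagination omitted for brevity

    with ThreadPoolExecutor(max_workers=len(SHARD_SUFFIXES)) as pool:
        return [o for items in pool.map(query_shard, SHARD_SUFFIXES) for o in items]
```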
To further ensure smooth writes, the team enabled DynamoDB Auto Scaling on the table’s write capacity with aggressive target utilization. They also explored an alternative: using an Amazon Kinesis stream to buffer order writes (the queue buffering pattern). In fact, one suggestion was to have the order service post events to Kinesis or SQS, and have a consumer drain that queue into DynamoDB at a steady rate, smoothing out spikes (Five Ways to Deal With AWS DynamoDB GSI Throttling - Vlad Holubiev). This is a write-offloading strategy that introduces eventual consistency (orders might be delayed by a few seconds in the database) but can absorb extreme burstiness (Five Ways to Deal With AWS DynamoDB GSI Throttling - Vlad Holubiev). They prototyped this, but given that sharding the keys solved the problem within DynamoDB itself, they stayed with the simpler solution of direct writes with sharded keys.
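For completeness, this is roughly what the buffered-write variant they prototyped could look like with SQS; the queue URL, table name, and message shape are assumptions, and a scheduled worker drains the queue into DynamoDB at a controlled pace:

```python
import json

import boto3

sqs = boto3.client("sqs")
orders = boto3.resource("dynamodb").Table("Orders")  # assumed table name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/order-events"  # placeholder

def drain_once() -> int:
    """Pull up to 10 buffered order events and batch-write them to DynamoDB."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        return 0

    with orders.batch_writer() as batch:  # boto3 groups these into BatchWriteItem calls
        for msg in messages:
            batch.put_item(Item=json.loads(msg["Body"]))

    sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]} for m in messages],
    )
    return len(messages)
```

The cost of this decoupling is a few seconds of eventual consistency for newly placed orders, which is why the team kept it as a fallback rather than the default path.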
Tuning steps implemented:
- Data model change for writes: Implemented a sharded partition key by appending a hash-based suffix to the date key (10 shards). This spreads write load across 10 partitions (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
- Application logic update: Modified order insertion code to compute the shard suffix, and adjusted any reads by date to query all suffixes in parallel and combine results (Choosing the Right DynamoDB Partition Key | AWS Database Blog).
- Throughput adjustments: Verified and adjusted DynamoDB auto scaling policies to handle the new combined throughput (the table can now utilize 10× the per-key throughput for the hot date across shards). Ensured the overall table provisioned capacity (or on-demand limits) was high enough for the sum of all shards.
- Optional buffering: (Evaluated but optional) Tested a Kinesis stream as a buffer for peak write bursts, which could be turned on in extreme scenarios to protect DynamoDB by decoupling incoming orders from immediate writes (Five Ways to Deal With AWS DynamoDB GSI Throttling - Vlad Holubiev).
Outcome: The results were dramatic. With 10 shards, the table seamlessly handled the peak of ~12,000 writes per second (which would have been impossible on a single partition key). Throttling dropped to zero, and order writes maintained ~5 ms latency even at peak load. The team measured that before sharding, the “orders per second” graph would plateau around ~1,100 and show throttle events; after sharding, it scaled linearly and the only limit became the overall provisioned throughput of the table, which they could manage via auto scaling. They also observed DynamoDB’s adaptive capacity working even better now – minor imbalances between shards were automatically smoothed out. For instance, if one shard got slightly more traffic than others, DynamoDB adaptive capacity instantly boosted that shard’s allotment so it didn’t throttle (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog). The system proved robust in production: during a Black Friday event, the order service processed a record volume without any downtime, something that would have been at risk before.
This case reinforces a key DynamoDB design tenet: design your keys for uniform distribution. If a partition key could become a hot spot, introduce a strategy (like sharding or including a user-specific component) to keep traffic even. Also, it showed that adaptive capacity is powerful but not magic – you still must avoid single-item extremes. Finally, by solving the issue in DynamoDB, the team kept their architecture simpler (avoiding extra queue systems) and achieved massive scale within a single table. This validated DynamoDB’s capability of handling high-throughput e-commerce workloads when used with the right patterns.
Case Study 4: Optimizing Read Patterns and Costs in Product Search
Problem & Symptoms: Another scenario arose with an e-commerce startup’s product search feature. They stored all products in a single DynamoDB table and wanted to allow customers to filter products by various attributes (category, price range, brand, etc.). Initially, the team implemented filtering on the application side by fetching broad sets of products and then filtering in memory. For example, to get “all electronics under $100”, the app might Query the “Product” table by category (using a GSI) and then filter the results by price range in code. In worse cases, if no suitable index existed, they did a scan of the whole table and filtered afterwards. This approach worked for a small dataset, but as the number of products grew to tens of thousands, read costs and latency skyrocketed. The symptoms were high DynamoDB read capacity consumption (and thus high AWS bills) and slow response times for filtered searches (several seconds). The DynamoDB usage report showed that certain API calls were doing large Scan operations – a red flag, since Scans read the entire table or index, incurring a lot of read units and scaling with table size.
Investigation: The team analyzed the access patterns and identified which queries were most costly. They found that lack of proper indexes for specific queries was the root cause. For instance, filtering by price range was expensive because the application had to retrieve all items of a category and then discard most. They realized DynamoDB can handle these patterns efficiently if the data model is designed to support them. They brainstormed using Global Secondary Indexes or a more denormalized data model. One insight was the concept of a sparse index: a GSI that only includes items that meet a certain criterion, by using a projected attribute that only some items have (Best practices for using secondary indexes in DynamoDB - Amazon DynamoDB). For example, they could create a GSI on OnSalePrice that only items on sale would have – then a query for “on-sale items under $100” would hit a much smaller index. Another idea was GSI overloading: they noticed they already had a GSI for category, and they considered overloading it with additional sort key data to support price filtering. In practice, this meant designing the GSI’s sort key to be something like Price#ProductID. That way, a query for a price range could be done with a sort key condition (e.g. between Price#000 and Price#100). However, since DynamoDB queries within an index partition are ordered by sort key, they would need the partition key to also group items appropriately. They decided instead to make a dedicated GSI for price-range queries to keep things simple (because not all categories needed price filtering).
They also reviewed cost metrics. By switching from scans to proper queries, they stood to save a lot. For context, scanning 10,000 items of ~1 KB each (about 10 MB) with eventually consistent reads consumes roughly 1,250 RCUs (0.5 RCU per 4 KB of data read), whereas a targeted Query returning 100 of those items (~100 KB) consumes only about 13 RCUs. The difference was roughly two orders of magnitude, meaning each inefficient search paid for about a hundred times more read capacity than necessary – a cost that adds up quickly at production request volumes. On a monthly basis, the engineering manager projected they could save over 80% of DynamoDB read costs by eliminating inefficient access (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). Moreover, user experience would improve with faster responses.
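The back-of-the-envelope math, using the illustrative item size and result counts from above (eventually consistent reads cost 0.5 RCU per 4 KB read):

```python
ITEM_KB = 1          # average item size used in the estimate
RCU_PER_4KB = 0.5    # eventually consistent read

scan_rcus = 10_000 * ITEM_KB / 4 * RCU_PER_4KB   # scan the whole 10,000-item set
query_rcus = 100 * ITEM_KB / 4 * RCU_PER_4KB     # targeted Query returning 100 items
print(scan_rcus, query_rcus, scan_rcus / query_rcus)   # 1250.0 12.5 100.0
```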
Solution (Indexing & Query Optimization): The team refactored their data model with query patterns in mind. They created two new GSIs: one for Category+Price and one sparse index for Brand. The CategoryPriceIndex had a partition key of Category and a sort key of PriceRange (where PriceRange was a value like 0-100, 100-500, etc., assigned to each product based on its price). This allowed efficient queries like “Electronics in the $0–100 range” by querying the index with Category = Electronics AND PriceRange = 0-100. Under the hood, this returned only the items in that category and price bucket, so the read cost scaled with the result size rather than the table size. They decided on bucketing the price into ranges to avoid overly granular sort keys, and because equality on the sort key was sufficient for their filtering needs. For a true range query, they could instead have used a numeric sort key and the BETWEEN operator to define a minimum and maximum. DynamoDB’s flexibility here allowed them to choose what made sense. The sparse index they built was on Brand, but only for premium brands that had many products. They achieved this by adding an attribute PremiumBrand to items from top brands, and creating a GSI keyed on that attribute. Items without this attribute don’t appear in the index (Best practices for using secondary indexes in DynamoDB - Amazon DynamoDB), so the index stays small and efficient. Now queries for those brands (which were a common access pattern) hit a much smaller dataset.
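A sketch of both query shapes against the new index, using the boto3 Table resource and the index/attribute names described above; the numeric-price variant assumes a hypothetical second GSI whose sort key is the numeric Price attribute:

```python
import boto3
from boto3.dynamodb.conditions import Key

products = boto3.resource("dynamodb").Table("Products")  # assumed table name

# Bucketed filter: "Electronics in the $0-100 range" via the CategoryPriceIndex.
bucketed = products.query(
    IndexName="CategoryPriceIndex",
    KeyConditionExpression=Key("Category").eq("Electronics") & Key("PriceRange").eq("0-100"),
)

# True range filter: a hypothetical GSI with a numeric Price sort key and BETWEEN.
ranged = products.query(
    IndexName="CategoryPriceNumericIndex",
    KeyConditionExpression=Key("Category").eq("Electronics") & Key("Price").between(0, 100),
)
print(len(bucketed["Items"]), len(ranged["Items"]))
```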
Additionally, they revisited Local Secondary Indexes (LSIs). One use case was retrieving products within a category sorted by popularity. Since all products in a category shared the same partition key in the main table (they used a composite primary key: partition = Category, sort = ProductID), they could use an LSI to provide an alternate sort key of “PopularityScore”. This would let them query the item collection (all products in a category) ordered by popularity without scanning. They implemented one LSI on the main table for this purpose (because LSIs can only be defined at table creation, this meant recreating the table and migrating the data – a one-time cost they accepted). In doing so, they kept in mind LSI limitations: the item collection (all items of a category) must remain under 10 GB (Which flavor of DynamoDB secondary index should you pick? - Momento), which was reasonable for their categories. They also noted that strongly consistent reads on that LSI were possible if needed (a benefit over the eventual consistency of GSIs) (Which flavor of DynamoDB secondary index should you pick? - Momento). By carefully choosing projections on the GSIs (only projecting attributes absolutely needed, like product title, price, and rating), they minimized the storage overhead and kept query performance high (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance).
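Because the LSI is baked into the table definition, the popularity view is declared up front at creation time. A sketch with assumed names and an INCLUDE projection of just the listing attributes:

```python
import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client("dynamodb")

client.create_table(
    TableName="Products",  # assumed name
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "Category", "AttributeType": "S"},
        {"AttributeName": "ProductID", "AttributeType": "S"},
        {"AttributeName": "PopularityScore", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "Category", "KeyType": "HASH"},
        {"AttributeName": "ProductID", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "PopularityIndex",
        "KeySchema": [
            {"AttributeName": "Category", "KeyType": "HASH"},
            {"AttributeName": "PopularityScore", "KeyType": "RANGE"},
        ],
        # Project only what the listing page renders, keeping the index small.
        "Projection": {"ProjectionType": "INCLUDE",
                       "NonKeyAttributes": ["Title", "Price", "Rating"]},
    }],
)
# (In real code, wait for the table to become ACTIVE before querying.)

# "Top 20 most popular electronics", served straight from the LSI in descending order.
products = boto3.resource("dynamodb").Table("Products")
top = products.query(
    IndexName="PopularityIndex",
    KeyConditionExpression=Key("Category").eq("Electronics"),
    ScanIndexForward=False,  # descending PopularityScore
    Limit=20,
)
```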
Finally, they turned on DynamoDB On-Demand mode during development and testing to auto-tune capacity as they tried these new indexes. This allowed them to validate the performance in a POC environment without worrying about provisioning throughput. Once patterns were confirmed, they planned to switch back to provisioned capacity with autoscaling for cost savings (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). This approach – start on-demand, then optimize and switch to provisioned – is a recommended cost strategy for discovering usage patterns before committing to capacity settings (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance).
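Switching modes is a single UpdateTable call in either direction (DynamoDB allows one billing-mode switch per table per 24 hours); the names and capacity numbers below are assumptions:

```python
import boto3

client = boto3.client("dynamodb")

# POC phase: pay per request, no capacity planning for the table or its GSIs.
client.update_table(TableName="Products", BillingMode="PAY_PER_REQUEST")

# Production phase: back to provisioned capacity sized from the observed steady state,
# supplying throughput for the table and every GSI in the same call.
client.update_table(
    TableName="Products",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 200, "WriteCapacityUnits": 100},
    GlobalSecondaryIndexUpdates=[{
        "Update": {
            "IndexName": "CategoryPriceIndex",
            "ProvisionedThroughput": {"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
        }
    }],
)
```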
Tuning steps implemented:
- Data model re-design: Added a CategoryPriceIndex GSI (Partition key: Category, Sort key: PriceRange) to efficiently support price filtering. Added a sparse GSI for PremiumBrand to handle expensive brand-specific queries (Best practices for using secondary indexes in DynamoDB - Amazon DynamoDB). Created one LSI on the main table for PopularityScore to support sorted queries within a category.
- Application query changes: Replaced any Scan or client-side filtering logic with DynamoDB Query operations on the appropriate index. For example, a search API call now directly queries CategoryPriceIndex if both category and price filters are present, or queries the PremiumBrandIndex if filtering by a top brand.
- Capacity planning and cost management: Used on-demand mode in testing to automatically handle throughput and observed the steady-state RCU/WCU usage for the new indexes. With that data, configured autoscaling on the new GSIs with a reasonable minimum capacity. Eliminated unnecessary indexes (they removed one GSI that was rarely used after confirming it through metrics, to save write costs).
- Validation and metrics: Measured the average RCUs consumed per search query before vs after. Verified a reduction in RCUs per query by an order of magnitude (e.g. from 500 RCUs for a broad scan down to 5–10 RCUs for a targeted query). Also checked that p95 query latency dropped accordingly.
Outcome: The changes paid off significantly. Search and filter operations that formerly read tens of thousands of items now read only the few hundred relevant items (or fewer), reducing DynamoDB read cost for those operations by ~90%. In one example, a query for “Books between $10-$20” went from consuming ~1200 RCUs and taking ~3 seconds, to consuming just 12 RCUs and completing in 50 ms after the CategoryPriceIndex was introduced (figures hypothetical but in line with expectations). The user experience improved as pages of filtered results loaded almost instantaneously. From a cost perspective, the team observed their DynamoDB bill for reads drop by around 70% the next month, as the expensive scans were eliminated. They also noticed secondary benefits: by minimizing scans, they reduced the impact on DynamoDB’s adaptive capacity and caching. (Large scans can blow out DAX caches or interfere with other traffic; those were no longer needed.) The new indexes did incur some additional write cost (each product write now also writes to two GSIs, roughly tripling the WCUs consumed per write – one base-table write plus two index writes), but this was a known trade-off. Thanks to GSI overloading techniques, they managed to avoid creating a separate index for every possible filter combination. For instance, they piggybacked the price filter onto the Category index rather than making a standalone Price index, thereby keeping the total GSIs manageable. Each write of a product item now triggers at most 2 index writes instead of, say, 5 or 6 for multiple disparate indexes.
Through this case, the startup learned the importance of modeling your DynamoDB schema for your query patterns up front. DynamoDB is schema-flexible, but not schema-less when it comes to access patterns – you need to plan your secondary indexes to match the queries your application will make. Using sparse indexes and composite keys can elegantly handle common filters without resorting to full table scans. Moreover, the exercise of measuring RCU/WCU usage per operation gave them a deeper understanding of throughput cost management. They now routinely use on-demand capacity for initial POCs to see how the database behaves, then switch to provisioned with autoscaling for production to get the best of cost and performance (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). By applying these optimizations, the product search feature became both fast and cost-efficient, exemplifying DynamoDB’s ability to deliver at scale when tuned correctly.
Best Practices and Lessons Learned
The above case studies highlight several actionable best practices for DynamoDB performance tuning in e-commerce applications:
- Design for Even Partition Key Distribution: Pick a partition key that naturally distributes traffic (e.g. user IDs, order IDs) or incorporate a strategy to avoid hot keys. High-cardinality keys (many distinct values) prevent any single partition from overloading (Choosing the Right DynamoDB Partition Key | AWS Database Blog). If you anticipate a hot key (like a popular item or a high-volume time bucket), consider sharding it with a random or computed suffix (Choosing the Right DynamoDB Partition Key | AWS Database Blog). This ensures no single partition key exceeds DynamoDB’s per-partition limits (≈3,000 RCUs or 1,000 WCUs/sec) (Choosing the Right DynamoDB Partition Key | AWS Database Blog). Adaptive capacity will help handle minor skews by boosting hot partitions (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog), but it’s not a substitute for good key design.
- Leverage Adaptive Capacity but Know Its Limits: Since 2019, DynamoDB adaptive capacity is instant and on by default (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog). It will automatically allocate more throughput to hot partitions to avoid throttling, as long as your overall table capacity allows (How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (or, why what you know about DynamoDB might be outdated) | AWS Database Blog). This means DynamoDB can handle uneven workloads more gracefully than in the past, often eliminating the need to manually repartition data. However, adaptive capacity cannot defy hard limits: if one partition key is an extreme outlier (e.g. millions of requests per second on one key), you still need to redesign or shard. Use CloudWatch’s ConsumedCapacity metrics to detect when one key is consuming a large share, and use techniques like those in Cases 1 and 3 to mitigate.
- Use GSIs and LSIs Strategically: Secondary indexes are essential for rich query capabilities, but use them wisely. Global Secondary Indexes (GSIs) allow different partition/sort keys and have their own throughput. Keep the number of GSIs to a minimum – each additional GSI means additional writes (and cost) on every table update (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). Instead of one index per query pattern, consider GSI overloading, where a single GSI’s keys encode multiple access patterns (Overloading Global Secondary Indexes in DynamoDB - Amazon DynamoDB). For example, use a sort key with a prefix like TYPE#value so that you can query TYPE=A vs TYPE=B in the same index to serve two different queries (see the sketch after this list). This reduces index count and write overhead, as shown when consolidating category and price queries. Local Secondary Indexes (LSIs) share the table’s partition key and are ideal for providing alternate sort orders or filtering within the same item collection. They draw on the base table’s throughput rather than requiring separately provisioned capacity, and they support strong consistency (Which flavor of DynamoDB secondary index should you pick? - Momento). But remember the constraints: all items for a given partition key (the item collection) across the table plus its LSIs must fit in 10 GB, and the collection’s throughput is limited to the per-partition maximum of 1,000 WCU / 3,000 RCU per second (Which flavor of DynamoDB secondary index should you pick? - Momento). LSIs are great for scenarios like “different views of a customer’s orders” (where the partition is the customer and won’t exceed 10 GB). For broad queries across many partitions, GSIs are the way to go.
- Optimize Index Projections and Avoid Fetches: When creating an index, project only the attributes needed by your query results (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). This keeps the index small and speeds up queries. Ensure that a query on the index can get all the data it needs without having to do a follow-up GetItem on the main table (which doubles the read cost and adds latency) (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). In practice, this means if your GSI is used to display product name and price, project those attributes into the GSI so the query returns them directly. This was implemented in Case 2 to avoid fetching from the base table for each catalog item. Smaller index items mean less RCU per read and lower storage cost.
- Use Sparse Indexes for Filtered Data: A sparse index is an index that only includes items that have a certain attribute (the index partition or sort key). This is a powerful way to reduce the volume of data an index stores and reads (Best practices for using secondary indexes in DynamoDB - Amazon DynamoDB). Case 4’s premium brand index is an example: only items with the PremiumBrand flag appear in that GSI, so queries on that index naturally filter out all other brands at zero cost. If you have a frequent query that only applies to a subset of items (e.g. “items on sale”, “users with status=active”), you can make an index on that attribute. Items without it won’t bloat your index. The result is faster, cheaper queries.
- Throughput Capacity Management: For unpredictable workloads, On-Demand capacity mode is a lifesaver – it requires no capacity planning and will scale your table up automatically (billing per request). Several case studies (ZOZOTOWN’s cart migration, etc.) have used on-demand to survive flash sales without manual intervention (How Amazon DynamoDB supported ZOZOTOWN’s shopping cart migration project | AWS Database Blog). For example, ZOZOTOWN could handle sudden traffic spikes during sales with zero ops effort by using on-demand, at the cost of a higher per-request fee (How Amazon DynamoDB supported ZOZOTOWN’s shopping cart migration project | AWS Database Blog). A good practice is to start with on-demand in development and early production to observe traffic patterns (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). Once you understand the typical throughput, you can switch to Provisioned mode with autoscaling to save money (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). Provisioned capacity lets you pay a steady rate and scale gradually; with auto scaling in place, the table will increase/decrease capacity within set bounds as traffic grows or ebbs. In Case 3, autoscaling was used to adjust capacity for the orders table through the day. Just ensure your auto scaling settings allow scaling up fast enough for your peaks, and set a reasonable maximum to avoid surprise bills.
- Monitor and Test with Realistic Loads: It’s critical to set up monitoring on key DynamoDB metrics: look at ThrottleEvents, ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits (especially per GSI), and SuccessfulRequestLatency for any spikes. In e-commerce, you might also use CloudWatch Contributor Insights on DynamoDB to find your most accessed keys (this can identify a hot partition key in a live system). All the teams in our case studies instrumented their systems – from logging slow DynamoDB calls (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL) to creating CloudWatch alarms – to catch issues early. When designing a new feature, run load tests that simulate peak traffic or worst-case access patterns. For instance, test what happens if everyone hits the same item or same category page. This will reveal if you need additional indexes or a better key strategy before real customers are affected.
- Caching and DAX: For read-heavy workloads, caching can dramatically improve performance and reduce cost. Amazon DynamoDB Accelerator (DAX) is an easy drop-in cache that can cut read latency by up to 10x and handle millions of reads per second (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL). In Case 1, DAX helped absorb a flood of repeated reads on a popular item. AWS reports that caching frequently accessed items can reduce direct DynamoDB reads (RCUs) by 80% or more (Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance). Use caching for hot keys or expensive queries that don’t need absolutely up-to-the-second freshness. DAX is write-through, so it won’t serve stale data after updates, which is great for carts, product info, etc. In addition, consider edge caching strategies (like CloudFront) for public data – as seen in the hot partition case where an API was cached at the CDN instead of hitting DynamoDB each time (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL). The fewer calls to DynamoDB, the less chance of overload and the lower your cost.
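As promised in the GSI-overloading bullet above, here is a minimal sketch of one overloaded index serving two access patterns. The generic GSI1PK/GSI1SK attribute names, the zero-padded price encoding, and the table name are illustrative assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

products = boto3.resource("dynamodb").Table("Products")  # assumed single-table design

# Two item types share one GSI by encoding the access pattern into generic key attributes.
products.put_item(Item={
    "PK": "PRODUCT#123", "SK": "METADATA",
    "GSI1PK": "CATEGORY#Shoes", "GSI1SK": "PRICE#0079",  # zero-padded so strings sort numerically
})
products.put_item(Item={
    "PK": "PRODUCT#456", "SK": "METADATA",
    "GSI1PK": "BRAND#Nike", "GSI1SK": "PRODUCT#456",
})

# Access pattern 1: shoes under $100, via a lexicographic range on the padded price.
shoes_under_100 = products.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("CATEGORY#Shoes") & Key("GSI1SK").lt("PRICE#0100"),
)

# Access pattern 2: everything from one brand, served by the very same index.
nike_products = products.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("BRAND#Nike"),
)
```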
In conclusion, AWS DynamoDB can power planet-scale e-commerce systems (Amazon.com itself relies on it (DynamoDB Hot Partition Use Case - The Amazonian's NoSQL)), but getting optimal performance requires aligning your data model with your access patterns. Partition your data to avoid hotspots, index wisely to support queries without full scans, and take advantage of features like adaptive capacity, autoscaling, and DAX. The real-world cases above demonstrate that with careful tuning, DynamoDB can effortlessly handle high throughput: from flash sales handling hundreds of thousands of checkouts to search services retrieving products with millisecond latency. By following these best practices and learning from these scenarios, your e-commerce application’s backend will be prepared to deliver a smooth, fast customer experience – even under the most demanding workloads.
Sources:
- Hanzawa, S. & Narita, T. (2022). How Amazon DynamoDB supported ZOZOTOWN’s shopping cart migration project. AWS Database Blog.
- Balasubramanian, G. & Shriver, S. (2017, updated 2022). Choosing the Right DynamoDB Partition Key. AWS Database Blog.
- Blazeclan (2022). The Amazonian’s NoSQL – A DynamoDB Hot Partition Use Case. Blazeclan Tech Blog.
- Stack Overflow (2022). Why sometimes the DynamoDB is extremely slow? (Discussion of GSI throttling.)
- Holubiev, V. (2023). Five Ways to Deal With AWS DynamoDB GSI Throttling. Online article – GSI design tips.
- Simform Engineering (2021). Amazon DynamoDB Best Practices: 10 Tips to Maximize Performance.
- Momento (2023). Which flavor of DynamoDB secondary index should you pick? Discussion of GSI vs. LSI trade-offs.
- AWS DynamoDB Developer Guide. Best practices for using secondary indexes – sparse indexes and GSI overloading.
- AWS Database Blog (2018, 2019). How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns.
- Commerce Architects (2023). DynamoDB Case Study – ClickBank. Online case study – benefits of DynamoDB in microservices (50% processing time reduction).