Blog article here: https://serverlessfirst.com/dynamodb-denormalization-strategies/

## Why denormalize at all?

When working with [[DynamoDB]] (and indeed with most [[NoSQL]] databases), you may find that you need to copy certain data fields into different locations within the same table, or even into a different table. This duplication is often referred to as denormalization.

The reason this is needed is that, unlike [[RDBMS]] databases, DynamoDB doesn't support performing joins in a single query. Therefore, in order to get all the data you require in an efficient read operation, you need to store it in the same place as the other data being fetched with it.

## Example use case

Our application is using a [[DynamoDB single-table design|Single-table design]] data model and has User and Organization entities, each stored in separate partitions. The "master copy" of the User data is stored under a `USER-{userId}` partition. But we also need to store the User details of the Organization owner under an `ORGANIZATION-{orgId}` partition. So whenever the user updates their `displayName` (say through an [[AWS AppSync|AppSync]] API mutation), there are two separate DynamoDB items we need to update.

## Implementation strategies

There are a few different ways we can perform this denormalization, each with different pros and cons. Three of these strategies are listed below.

*The trigger for all of these strategies would be some form of mutative event (usually from an API request) that will cause a change (create, update, delete) to a data field which is being stored in more than one DynamoDB item.*

### Strategy 1: In a single atomic transaction

This involves using the [`TransactWriteItems` operation](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_TransactWriteItems.html) to perform all the necessary changes (Put, Update, Delete) within a single atomic action (see the sketch after this list).

Pros:

- Copies of data are always consistent
- No need for a separate Lambda function
- Easy to reason about in the codebase as all updates are kept together in a single function

Cons:

- Increased user latency. Writing transactions will be marginally slower than a standard put/update, and further latency is added if you need to perform a Get/Query before the TransactWrite in order to fetch the primary keys of all the items to be updated
- Can't be (easily) performed in "constrained code" environments such as [[AppSync VTL resolver]]s and [[AWS Step Functions]], so you'll probably need to do it in a Lambda function
- If a particular data field can be updated from multiple sources (say different API endpoints), then this transactional logic will need to be carried out in each handler. This can be mitigated by keeping all the transactional updates in a shared module, but developers still need to know to use this.
- The `TransactWriteItems` API operation has a limit of 25 items that can be written in a single transaction. So if you have more copies than this (e.g. when copying parent root data into several child entities), you'll lose the consistency benefit and have to split the API requests into batches.
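As a rough sketch of strategy 1 using our `displayName` example, assuming the AWS SDK v3 Document Client, a single table named `AppTable` with generic `PK`/`SK` key attributes, and that the handler already knows the `orgId` of the Organization the user owns (otherwise a Get/Query would be needed first, adding the latency noted above), the transaction might look something like this. The table name, key layout and `ownerDisplayName` attribute are illustrative, not prescriptive:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, TransactWriteCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Update the user's displayName in both the master USER item and the
// denormalized copy under the owning ORGANIZATION partition, atomically.
// Table name and key schema below are assumptions for illustration only.
export async function updateUserDisplayName(
  userId: string,
  orgId: string,
  displayName: string,
): Promise<void> {
  await docClient.send(
    new TransactWriteCommand({
      TransactItems: [
        {
          // Master copy of the User
          Update: {
            TableName: "AppTable",
            Key: { PK: `USER-${userId}`, SK: `USER-${userId}` },
            UpdateExpression: "SET displayName = :dn",
            ExpressionAttributeValues: { ":dn": displayName },
          },
        },
        {
          // Denormalized copy of the owner's details on the Organization item
          Update: {
            TableName: "AppTable",
            Key: { PK: `ORGANIZATION-${orgId}`, SK: `ORGANIZATION-${orgId}` },
            UpdateExpression: "SET ownerDisplayName = :dn",
            ExpressionAttributeValues: { ":dn": displayName },
          },
        },
      ],
    }),
  );
}
```

If either update fails, neither is applied, which is what keeps the master copy and the duplicate consistent.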
### Strategy 2: Asynchronously in a [[DynamoDB streams]] handler

This involves the API handler code simply updating the master copy of the User item in the DynamoDB table, with a separate Lambda function triggered off a DynamoDB stream to perform the required "copies".

Pros:

- Low latency for the user
- Guaranteed to run irrespective of what source triggers the update of the master copy item

Cons:

- Slight delay between updates to the master and duplicate copies
- DynamoDB Streams are noisy and don't allow filtering (see [[Pros and cons of DynamoDB streams#Limitations and risks]]). This can result in complex handler logic in the same function if you have several different denormalized data items. This is particularly an issue for [[DynamoDB single-table design|Single-table design]] data models.
- Risk of an infinite recursion bug with stream handlers if you accidentally update the master copy again
- Harder to reason about in the codebase as the master copy changes are separate from the duplicates

### Strategy 3: Asynchronously via an [[AWS EventBridge|EventBridge]] handler

This involves the API handler code updating the master copy of the item in DynamoDB and then publishing a `USER_UPDATED` event to EventBridge, which a separate Lambda handler would hook into and perform the required "copies".

Pros:

- Doesn't require DynamoDB reads before performing the write
- Can maintain single-purpose Lambda functions

Cons:

- Slightly slower user-facing latency given the extra network call to EventBridge
- Slight delay between updates to the master and duplicate copies
- Can't be (easily) performed in "constrained code" environments such as [[AppSync VTL resolver]]s and [[AWS Step Functions]], so you'll probably need to do it in a Lambda function
- If a particular data field can be updated from multiple sources (say different API endpoints), then this EventBridge publishing logic will need to be carried out in each handler. This can be mitigated by keeping all the denormalized updates in a shared module, but developers still need to know to use this.
- Rollback code may be required, since this is effectively performing a [[Distributed transaction]] (a write to DynamoDB and EventBridge) within a single Lambda function. We would need to implement a try-catch block when writing to EventBridge and (in the situation that a transient error occurs in EventBridge) add code to roll back the DynamoDB update and then return an error to the user. It's highly unlikely to fail, but if it does and there is no rollback code, the data in DynamoDB will be inconsistent.
- Harder to reason about in the codebase as the master copy changes are separate from the duplicates

## Deciding between these strategies

The pros and cons of each strategy will have different weights depending on your use case. My default approach would be strategy 1 as it has the fewest moving parts, and when all other factors are (almost) equal, I like to optimise for greater code maintainability. But if your context requires a very fast user response and you need to perform several reads to gather the data items to be updated, you may opt for 2 or 3.

---

## References

- [Should Your DynamoDB Table Be Normalized or Denormalized?](https://aws.amazon.com/blogs/database/should-your-dynamodb-table-be-normalized-or-denormalized/) (AWS Blogs)

---

tags: #DynamoDB, #DataModelling, #Articles