Batch Process in AWS - Part 2 - 500's Galore!
If you haven't already read the previous blog, start here.
AH! I totally forgot the golden rule of AWS architecture:
Everything fails, all the time
- Werner Vogels
If you'd rather watch a video, you can check it out on YouTube
Where I Went Wrong
When I was testing my atomic counter solution from the previous article, I ran some light testing using batches on the order of 1,000 records. However, that wasn't a large enough scale to expose a problem that DynamoDB (like any service) can run into: service failures.
Let's see what happens when I take the previous solution and bump it up to 100,000 records. Once complete, I see my atomic counter is a little off:
Oops, that's not good!
So what happened here? Well, behind the scenes DynamoDB threw some 500 errors. In a highly distributed system, this happens: perhaps a node was replaced, perhaps hardware failed. I didn't see any of these errors because the SDK automatically retries the call, and the SDK has no way of knowing whether the previous Put command made it to the database or not. At a high enough scale, this is exacerbated and we see the 'over-counting' problem.
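To make that concrete, here's a minimal sketch of the kind of non-idempotent counter update that gets applied twice on a retry. This uses boto3, and the table name, key schema, and attribute names are purely illustrative, not my client's actual schema:

```python
import boto3

# Illustrative table and key schema, not the real one.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BatchProcessing")

def increment_processed(batch_id: str) -> None:
    # ADD is an atomic counter, but it is not idempotent: if the first attempt
    # succeeds and the response is lost to a 500, the SDK retries and the
    # counter is incremented twice -- hence the over-counting.
    table.update_item(
        Key={"pk": f"BATCH#{batch_id}", "sk": "batchState"},
        UpdateExpression="ADD processedCount :inc",
        ExpressionAttributeValues={":inc": 1},
    )
```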
So, what now? Well, as it turns out, AWS has already published a blog on what to do with atomic counters.
Let's walk through each of the proposed solutions and see if any will work.
Spec Change
The solution I had seemed to work. But that's only because the rate of failure was low compared to the number of attempts I was making. However, it would have still failed eventually, regardless of the scale.
A solution that doesn't work at scale is not a solution that works.
The spec needs some updating, specifically that the number of items in the batch needs to be increased to 100,000. This should sufficiently expose any problems like the 500s seen previously. Additionally, this aligns more closely with the problem my client was facing, as they needed to process all the accounts in their system, and they currently have well over 100k.
From now on, all tests will be with 100k items in the batch. Additionally, the items will be processed with as much parallelism as possible so the overall runtime is minimized.
Option 1: Atomic Counter
Well, this is what I already tried and it doesn't work. It's only going to work at low scale (low counter values) and when over-counting is acceptable.
Option 2: Optimistic concurrency control
Optimistic concurrency means that we read the batchState record before we try to write it. The write is conditional on an eTag attribute that changes with each write. We could use the itemId as the eTag.
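Here's a rough sketch of what that conditional write could look like (boto3 again, with hypothetical table and attribute names, and assuming the eTag attribute is initialized when the batch is created):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("BatchProcessing")

def decrement_remaining(batch_id: str, item_id: str) -> bool:
    key = {"pk": f"BATCH#{batch_id}", "sk": "batchState"}
    # Read the current state, then write it back conditioned on the eTag we read.
    current = table.get_item(Key=key)["Item"]
    try:
        table.update_item(
            Key=key,
            UpdateExpression="SET remaining = :remaining, eTag = :newETag",
            ConditionExpression="eTag = :expectedETag",
            ExpressionAttributeValues={
                ":remaining": current["remaining"] - 1,
                ":expectedETag": current["eTag"],  # the value we just read
                ":newETag": item_id,               # this item's id becomes the new eTag
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker wrote first; re-read and retry
        raise
```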
However, with a highly parallelized batch like this process is trying to achieve, this would cause a LOT of reprocessing and churn as it's extremely likely that the conditional write would fail at a very high rate, since so many items would be processing at the same time.
I think this solution would work well when you're talking about an inventory system, where an item being ordered (and its inventory count reduced by X) would rarely happen so close together that the conditional write would fail.
Option 3: Optimistic concurrency control with history
Same issue here as with the previous option.
Option 4: Transaction with a client request token
Transactions and a client request token seemed like a possible solution. However, the high concurrency of the batch processing results in most transactions failing, which causes churn.
I tried this solution out and nearly every update of the batchState item resulted in a failure and caused reprocessing.
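For reference, a minimal sketch of what this approach looks like, using the low-level boto3 client (table, key, and attribute names are again hypothetical). The ClientRequestToken makes retries of the same transaction idempotent, but it does nothing for two different transactions racing on the same batchState item:

```python
import boto3

client = boto3.client("dynamodb")

def process_item_transactionally(batch_id: str, item_id: str) -> None:
    # Retrying this exact call reuses the token, so a 500-triggered retry won't
    # double-decrement. Concurrent workers still collide on batchState, though,
    # and the losers raise TransactionCanceledException.
    client.transact_write_items(
        ClientRequestToken=item_id,  # idempotency token; using the item id for simplicity
        TransactItems=[
            {
                "Update": {
                    "TableName": "BatchProcessing",
                    "Key": {"pk": {"S": f"BATCH#{batch_id}"}, "sk": {"S": "batchState"}},
                    "UpdateExpression": "ADD remaining :dec",
                    "ExpressionAttributeValues": {":dec": {"N": "-1"}},
                }
            },
            {
                "Put": {
                    "TableName": "BatchProcessing",
                    "Item": {
                        "pk": {"S": f"BATCH#{batch_id}"},
                        "sk": {"S": f"ITEM#{item_id}"},
                        "status": {"S": "PROCESSED"},
                    },
                }
            },
        ],
    )
```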
Option 5: Transaction with a marker item
Same problem as the previous solution.
Option 6: Counting with an item collection
This solution meets the needs of the high concurrency and high volume, since the batchState item is not needed. In fact, the itemState record already serves this purpose.
However, it'd be nearly impossible to write an efficient check for whether all the items have been processed. If the batch is 100k items, we'd have to poll repeatedly to see if all 100k had been written, and that is going to be incredibly expensive.
This would be less expensive if the batch size were smaller.
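To make that cost concrete, here's roughly what the polling check could look like if the itemState records live in the batch's item collection (boto3, hypothetical key schema). Even with Select='COUNT', the query still pages through the collection 1 MB at a time, and we'd be running it over and over:

```python
import boto3

table = boto3.resource("dynamodb").Table("BatchProcessing")

def count_processed(batch_id: str) -> int:
    # Query the item collection for itemState records and count them. Each page
    # still consumes read capacity for everything it scans, so polling this for
    # a 100k-item batch gets expensive fast.
    total = 0
    kwargs = {
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": f"BATCH#{batch_id}",
            ":prefix": "ITEM#",
        },
        "Select": "COUNT",
    }
    while True:
        resp = table.query(**kwargs)
        total += resp["Count"]
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```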
I think one good solution would be to use the atomic counter from option 1 combined with this method, plus a nightly reconciliation process.
But for a large-scale batch, this is also a no-go. However, I think it's close, and it gave me an idea. More on that in part 3.
Option 7: Counting with a set
This method involves using an attribute on the batchState record to know whether a particular item has already been accounted for in the remaining attribute. However, with a batchSize of 100k, we'd hit DynamoDB's item size limit of 400 KB.
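For completeness, the set-based update would look something like this (boto3, hypothetical names). The nice part is that adding to a set is idempotent, so retries can't over-count; the deal-breaker is the item size:

```python
import boto3

table = boto3.resource("dynamodb").Table("BatchProcessing")

def record_item_in_set(batch_id: str, item_id: str) -> None:
    # ADD on a string set is idempotent: re-adding an existing member is a no-op,
    # so a retried request can't be counted twice. But 100k item ids in a single
    # attribute blows well past DynamoDB's 400 KB item size limit.
    table.update_item(
        Key={"pk": f"BATCH#{batch_id}", "sk": "batchState"},
        UpdateExpression="ADD processedItems :item",
        ExpressionAttributeValues={":item": {item_id}},  # Python set -> DynamoDB string set
    )
```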
Step Functions
As Yan Cui mentioned in his article, I could use Step Functions.
I blew this solution off prematurely, as I wanted to stick with an SQS-based solution so as not to force any rewrites on my client. I was also scared off by the execution costs.
However, based on what I'm seeing now, this might be the only way to get it done. Stay tuned for part 3 where I try this out and see how it works.
Conclusion
Well, this is embarrassing. Here I thought I'd found a really cheap solution (in both implementation time and runtime costs), only to have it blow up pretty magnificently (and publicly) in my face.
Rather than get discouraged, I'm going to take this as a learning opportunity and I'm going back to the drawing board.
Please see the (maybe) conclusion in part 3.