Batch Process in AWS - Part 2 - 500's Galore!
If you haven't already read the previous blog, start here.
AH! I totally forgot the golden rule of AWS architecture:
Everything fails, all the time
- Werner Vogels
If you'd rather watch a video, you can check it out on YouTube
Where I Went Wrong
When I was testing my atomic counter solution from the previous article, I ran some light testing using batches on the order of 1,000 records. However, that wasn't a large enough scale to expose a problem that DynamoDB (like any service) can run into: service failures.
Let's see what happens when I take the previous solution and bump it up to 100,000 records. Once complete, I see my atomic counter is a little off:
Oops, that's not good!
So what happened here? Well, behind the scenes DynamoDB threw some 500 errors. In a highly distributed system, this happens: perhaps a node was replaced, perhaps hardware failed. I didn't see any of these errors because the SDK automatically retries the call, and the SDK has no way of knowing whether the previous Put command made it to the database or not. At a high enough scale, this is exacerbated and we see the 'over-counting' problem.
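To make that concrete, here's a minimal sketch of the kind of non-idempotent counter update that gets applied twice on a retry. This uses boto3, and the table name, key schema, and attribute names are purely illustrative, not my client's actual schema:

```python
import boto3

# Illustrative table and key schema, not the real one.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BatchProcessing")

def increment_processed(batch_id: str) -> None:
    # ADD is an atomic counter, but it is not idempotent: if the first attempt
    # succeeds and the response is lost to a 500, the SDK retries and the
    # counter is incremented twice -- hence the over-counting.
    table.update_item(
        Key={"pk": f"BATCH#{batch_id}", "sk": "batchState"},
        UpdateExpression="ADD processedCount :inc",
        ExpressionAttributeValues={":inc": 1},
    )
```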
So, what now? Well, as it turns out, AWS has already published a blog on what to do with atomic counters.
Let's walk through each of the proposed solutions and see if any will work.
Spec Change
The solution I had seemed to work. But that's only because the rate of failure was low compared to the number of attempts I was making. However, it would have still failed eventually, regardless of the scale.
A solution that doesn't work at scale is not a solution that works.
The spec needs some updating, specifically that the number of items in the batch needs to be increased to 100,000. This should sufficiently expose any problems like the 500s seen previously. Additionally, this aligns more closely with the problem my client was facing, as they needed to process all the accounts in their system, and they currently have well over 100k.
From now on, all tests will be with 100k items in the batch. Additionally, the items will be processed with as much parallelism as possible so the overall runtime is minimized.
Option 1: Atomic Counter
Well, this is what I already tried and it doesn't work. It's only going to work at low scale (low counter values) and when over-counting is acceptable.
Option 2: Optimistic concurrency control
Optimistic concurrency means that we read the batchState record before we try to write it. The write is conditional on an eTag attribute that changes with each write. We could use the itemId as the eTag.
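Here's a rough sketch of what that conditional write could look like (boto3 again, with hypothetical table and attribute names, and assuming the eTag attribute is initialized when the batch is created):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("BatchProcessing")

def decrement_remaining(batch_id: str, item_id: str) -> bool:
    key = {"pk": f"BATCH#{batch_id}", "sk": "batchState"}
    # Read the current state, then write it back conditioned on the eTag we read.
    current = table.get_item(Key=key)["Item"]
    try:
        table.update_item(
            Key=key,
            UpdateExpression="SET remaining = :remaining, eTag = :newETag",
            ConditionExpression="eTag = :expectedETag",
            ExpressionAttributeValues={
                ":remaining": current["remaining"] - 1,
                ":expectedETag": current["eTag"],  # the value we just read
                ":newETag": item_id,               # this item's id becomes the new eTag
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another worker wrote first; re-read and retry
        raise
```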
However, with a highly parallelized batch like this process is trying to achieve, this would cause a LOT of reprocessing and churn as it's extremely likely that the conditional write would fail at a very high rate, since so many items would be processing at the same time.
I think this solution would work well when you're talking about an inventory system, where an item being ordered (and its inventory count reduced by X) would rarely happen so close together that the conditional write would fail.
Option 3: Optimistic concurrency control with history
Same issue here as with the previous option.
Option 4: Transaction with a client request token
Transactions and a client request token seemed like a possible solution. However, the high concurrency of the batch processing results in most transactions failing, which causes churn.
I tried this solution out and nearly every update of the batchState item resulted in a failure and caused reprocessing.
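For reference, a minimal sketch of what this approach looks like, using the low-level boto3 client (table, key, and attribute names are again hypothetical). The ClientRequestToken makes retries of the same transaction idempotent, but it does nothing for two different transactions racing on the same batchState item:

```python
import boto3

client = boto3.client("dynamodb")

def process_item_transactionally(batch_id: str, item_id: str) -> None:
    # Retrying this exact call reuses the token, so a 500-triggered retry won't
    # double-decrement. Concurrent workers still collide on batchState, though,
    # and the losers raise TransactionCanceledException.
    client.transact_write_items(
        ClientRequestToken=item_id,  # idempotency token; using the item id for simplicity
        TransactItems=[
            {
                "Update": {
                    "TableName": "BatchProcessing",
                    "Key": {"pk": {"S": f"BATCH#{batch_id}"}, "sk": {"S": "batchState"}},
                    "UpdateExpression": "ADD remaining :dec",
                    "ExpressionAttributeValues": {":dec": {"N": "-1"}},
                }
            },
            {
                "Put": {
                    "TableName": "BatchProcessing",
                    "Item": {
                        "pk": {"S": f"BATCH#{batch_id}"},
                        "sk": {"S": f"ITEM#{item_id}"},
                        "status": {"S": "PROCESSED"},
                    },
                }
            },
        ],
    )
```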
Option 5: Transaction with a marker item
Same problem as the previous solution.
Option 6: Counting with an item collection
This solution meets the needs of the high concurrency and high volume, since the batchState item is not needed. In fact, the itemState record already serves this purpose.
However, it'd be nearly impossible to write an efficient check for whether all the items have been processed. If the batch is 100k items, we'd have to poll repeatedly to see if all 100k had been written, and that is going to be incredibly expensive.
This would be less expensive if the batch size were smaller.
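To make that cost concrete, here's roughly what the polling check could look like if the itemState records live in the batch's item collection (boto3, hypothetical key schema). Even with Select='COUNT', the query still pages through the collection 1 MB at a time, and we'd be running it over and over:

```python
import boto3

table = boto3.resource("dynamodb").Table("BatchProcessing")

def count_processed(batch_id: str) -> int:
    # Query the item collection for itemState records and count them. Each page
    # still consumes read capacity for everything it scans, so polling this for
    # a 100k-item batch gets expensive fast.
    total = 0
    kwargs = {
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": f"BATCH#{batch_id}",
            ":prefix": "ITEM#",
        },
        "Select": "COUNT",
    }
    while True:
        resp = table.query(**kwargs)
        total += resp["Count"]
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```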
I think one good solution would be to use the atomic counter from option 1 combined with this method, plus a nightly reconciliation process.
But for a large-scale batch, this is also a no-go. However, I think it's close, and it gave me an idea. More on that in part 3.
Option 7: Counting with a set
This method involves using an attribute on the batchState record to know whether a particular item has already been accounted for in the remaining attribute. However, with a batchSize of 100k, we'd hit DynamoDB's item size limit of 400 KB.
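For completeness, the set-based update would look something like this (boto3, hypothetical names). The nice part is that adding to a set is idempotent, so retries can't over-count; the deal-breaker is the item size:

```python
import boto3

table = boto3.resource("dynamodb").Table("BatchProcessing")

def record_item_in_set(batch_id: str, item_id: str) -> None:
    # ADD on a string set is idempotent: re-adding an existing member is a no-op,
    # so a retried request can't be counted twice. But 100k item ids in a single
    # attribute blows well past DynamoDB's 400 KB item size limit.
    table.update_item(
        Key={"pk": f"BATCH#{batch_id}", "sk": "batchState"},
        UpdateExpression="ADD processedItems :item",
        ExpressionAttributeValues={":item": {item_id}},  # Python set -> DynamoDB string set
    )
```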
Step Functions
As Yan Cui mentioned in his article, I could use Step Functions.
I blew this solution off prematurely, as I wanted to stick with an SQS-based solution so as not to force any rewrites on my client. I was also scared off by the execution costs.
However, based on what I'm seeing now, this might be the only way to get it done. Stay tuned for part 3 where I try this out and see how it works.
Conclusion
Well, this is embarrassing. Here I thought I'd found a really cheap solution (in both implementation time and runtime costs), only to have it blow up pretty magnificently (and publicly) in my face.
Rather than get discouraged, I'm going to take this as a learning opportunity and I'm going back to the drawing board.
Please see the (maybe) conclusion in part 3.