I’ve been doing a fair amount of self-education into how to design and build data-intensive systems, and one of the key concepts that keeps cropping up is that of concurrency control. The topic of concurrency is super-important in the context of data management and processing, so here’s a little deep dive.
In its simplest form, concurrent computing (as opposed to sequential computing or parallel computing) is a scheduling technique in which multiple processes execute during overlapping time periods instead of strictly one after another. If these processes are accessing the same data, there is a possibility that one process may alter something and interfere with or disrupt another. In the world of transactional databases, this could mean that two transactions access the same data at the same time and compromise the consistency and correctness of the overall system.
How, you may ask? There are a few common scenarios:
- The Lost Update: A transaction writes an initial value that other transactions need in order to operate correctly. However, a second transaction goes in and writes a second value on top of the first value. This means that when other concurrent transactions go in to read the first value, they read the wrong value and end up with incorrect results.
- The Dirty Read: A transaction writes a value but is later aborted. In this case, the value should disappear upon abort, but without concurrency control, other transactions may come in and read the value that should have been discarded (i.e. a “dirty read”).
- The Incorrect Summary: Suppose one transaction is calculating some value or summary data over all of the values in some data set. If a second transaction updates some of that data, this can lead to incorrect results depending on the timing of the update and whether or not the updated value has already been included in the summary.
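To make the first scenario a bit more concrete, here is a minimal sketch of a lost update. It interleaves two "transactions" by hand so the outcome is deterministic. The `db` dictionary and the `read`/`write` helpers are hypothetical stand-ins for a real database, not any particular product's API.

```python
# Hypothetical in-memory "database" holding a single account balance.
db = {"balance": 100}

def read(key):
    return db[key]

def write(key, value):
    db[key] = value

# Interleave two transactions by hand so the result is reproducible.
t1_seen = read("balance")        # T1 reads 100
t2_seen = read("balance")        # T2 reads 100, before T1 has written
write("balance", t1_seen + 50)   # T1 deposits 50 -> 150
write("balance", t2_seen - 30)   # T2 withdraws 30 from its stale read -> 70

final_balance = db["balance"]
expected_balance = 100 + 50 - 30
print(final_balance, expected_balance)  # 70 120
```

T1's deposit has silently vanished: the balance ends at 70 instead of 120 because T2 based its write on a value that was already out of date.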
As you can imagine, this is Really Bad in the context of financial systems, healthcare, and other critical environments.
Enter concurrency control. The idea is that systems must be designed with rules and methods in place to maintain the consistency of components operating concurrently and thus the consistency and correctness of the whole system. I’ll save the technical specifics of race conditions, deadlocks, and resource starvation for another post. For now, just know that there are a few approaches.
Optimistic Concurrency
Optimistic concurrency is a strategy where we defer checking whether a transaction meets the isolation and integrity rules until the end, without blocking on any of its read/write operations. It is optimistic, after all! It assumes that the rules are being met!
However, if there is a violation (i.e. the record is dirty), then the transaction is aborted and restarted. This obviously incurs additional overhead, but if there aren’t a ton of transactions being aborted, then being optimistic is usually a good strategy.
(Being optimistic in general is a good strategy but I digress…)
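Here is a rough sketch of the optimistic idea, assuming each record carries a version number that is checked at commit time. The `VersionedStore` class and `update_with_retry` helper are hypothetical names made up for illustration. The example is single-threaded, so a real system would also need to make the check-and-write step atomic.

```python
class VersionedStore:
    """Hypothetical store where every record carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def put_initial(self, key, value):
        self._data[key] = (value, 0)

    def read(self, key):
        return self._data[key]  # returns (value, version)

    def commit(self, key, new_value, expected_version):
        """Validate at commit time: succeed only if nobody committed first."""
        _, current_version = self._data[key]
        if current_version != expected_version:
            return False  # conflict: the caller must abort and retry
        self._data[key] = (new_value, current_version + 1)
        return True


def update_with_retry(store, key, fn, max_attempts=5):
    """Read, compute, and try to commit; restart from a fresh read on conflict."""
    for attempt in range(1, max_attempts + 1):
        value, version = store.read(key)
        if store.commit(key, fn(value), version):
            return attempt
    raise RuntimeError("too many conflicts")


store = VersionedStore()
store.put_initial("balance", 100)

# T1 and T2 both read version 0 without blocking each other.
_, v1 = store.read("balance")
_, v2 = store.read("balance")
t1_ok = store.commit("balance", 150, v1)  # T1 commits first
t2_ok = store.commit("balance", 70, v2)   # T2 fails validation: stale version

# T2 restarts from a fresh read and succeeds this time.
attempts = update_with_retry(store, "balance", lambda b: b - 30)
print(t1_ok, t2_ok, store.read("balance"))  # True False (120, 2)
```

Nothing blocks along the way. The cost only shows up when T2 has to throw away its work and start over, which is why this approach pays off when conflicts are rare.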
Pessimistic Concurrency
You can probably guess how this one works. Like the overly dramatic, angsty Daria Morgendorffer from the late-90s MTV animated sitcom Daria, pessimistic concurrency assumes (just assumes!) there will be an integrity violation. In this case, the entire operation will be blocked until the possibility of the violation disappears. This approach prevents conflicts up front instead of detecting them after the fact as optimistic concurrency does, but it requires you to be careful with your application design to avoid deadlocks.
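As a small illustration of the pessimistic approach, the sketch below guards a shared counter with a lock from Python's standard `threading` module, so each read-modify-write blocks until no other thread is in the critical section. The `balance` variable and `deposit` function are hypothetical examples, and a real database would lock rows or tables, not a Python variable.

```python
import threading

balance = 0
balance_lock = threading.Lock()  # hypothetical lock guarding `balance`

def deposit(amount, times):
    global balance
    for _ in range(times):
        # Block until no other thread holds the lock, then update safely.
        with balance_lock:
            current = balance
            balance = current + amount

threads = [threading.Thread(target=deposit, args=(1, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 40000
```

Every update waits its turn, so none are lost. If an operation needed several locks at once, acquiring them in one fixed order is a common way to avoid the deadlocks mentioned above.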
Semi-Optimistic Concurrency
Can’t decide? You’re in luck! With semi-optimistic concurrency, operations are blocked only in situations where they are deemed likely to violate some of the rules. Operations are not blocked in other situations, but rules checking is still done at the end as in the optimistic scenario.
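One possible way to picture the hybrid, purely as a sketch: lock up front only for keys flagged as risky, and run a commit-time version check for everything. The `HybridStore` class and the `hot_keys` parameter are hypothetical, and how a real system decides what counts as "risky" is an assumption here.

```python
import threading

class HybridStore:
    """Hypothetical store: lock 'hot' keys up front, validate every commit."""

    def __init__(self, hot_keys):
        self._data = {}  # key -> (value, version)
        self._locks = {k: threading.Lock() for k in hot_keys}
        self._commit_lock = threading.Lock()  # keeps validate-and-write atomic

    def put_initial(self, key, value):
        self._data[key] = (value, 0)

    def get(self, key):
        return self._data[key][0]

    def update(self, key, fn):
        lock = self._locks.get(key)
        if lock is not None:
            lock.acquire()  # block only for keys deemed risky
        try:
            value, version = self._data[key]
            new_value = fn(value)
            # The commit-time check still runs, as in the optimistic scheme.
            with self._commit_lock:
                if self._data[key][1] != version:
                    return False  # conflict: the caller should retry
                self._data[key] = (new_value, version + 1)
                return True
        finally:
            if lock is not None:
                lock.release()


store = HybridStore(hot_keys={"inventory"})
store.put_initial("inventory", 10)
store.put_initial("page_views", 0)

hot_ok = store.update("inventory", lambda n: n - 1)    # takes the key's lock
cold_ok = store.update("page_views", lambda n: n + 1)  # no blocking, validate only
print(hot_ok, cold_ok, store.get("inventory"), store.get("page_views"))
```

The contended `inventory` key pays the blocking cost, while `page_views` runs lock-free through its computation and relies on the final check alone.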
So which is best? It depends. Different concurrency models provide different performance depending on the transaction type, computing parallelism, and a host of other factors. A general rule of thumb is to use the optimistic model in an environment where low contention for data is expected (e.g. ingest pipelines, high-volume systems, multi-tier distributed architectures, etc.) and stick to pessimistic concurrency in situations where conflicts happen frequently.