Thursday, 9 April 2026

Can a Distributed Lock Be the Source of Truth?

Introduction
The question is as follows:
You have a distributed lock to prevent two users from booking the same hotel room.

Lock expires in 5 seconds. Your DB write takes 6 seconds under load.

Two users got confirmed bookings for the same room. How? What is the process to fix this issue?
What really needs attention here is this:
Lock ≠ correctness.
If your DB allows duplicates, your system will eventually produce them.
The real fix lives in atomic writes + constraints, not just distributed locks.
In other words, the lock is really there to avoid starting the operation in the first place. If two operations do start anyway, one of them must fail.

The explanation is as follows:
This is a correctness question. And at the Senior to Principal level, this is exactly what interviewers are testing for: do you understand the difference between coordination and actual data integrity?

If you are preparing for system design interviews right now, this is the kind of failure-mode thinking that matters a lot in strong loops.

Now, let us break this one down properly.

[1] How did both users get confirmed bookings?

The timeline usually looks like this:

- User A acquires the distributed lock for Room 101
- Lock lease is valid for 5 seconds
- User A starts the DB write to mark the room as booked
- Under load, that DB write takes 6 seconds
- At second 5, the lock expires before User A finishes
- User B now acquires the same lock because the lock service thinks it is free
- User B also starts a booking write
- Both flows eventually return success, and both users get confirmations

So what actually failed here? The system assumed the distributed lock was the source of truth. A lease-based lock only gives you temporary coordination.

If the critical section takes longer than the lease, another actor can enter while the first one is still working.
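
The timeline above can be replayed with a toy lease lock driven by an explicit clock. This is only a sketch: `LeaseLock`, `simulate_race`, and the user names are hypothetical, and the plain list stands in for a bookings table that has no constraint on it.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Toy lease lock; time is passed in explicitly (seconds) for determinism."""
    ttl: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, who: str, now: float) -> bool:
        # The lock looks free if nobody holds it or the lease has run out.
        if self.holder is None or now >= self.expires_at:
            self.holder = who
            self.expires_at = now + self.ttl
            return True
        return False


def simulate_race() -> list[str]:
    lock = LeaseLock(ttl=5.0)
    bookings: list[str] = []  # stand-in for a table with no uniqueness rule

    assert lock.try_acquire("user_a", now=0.0)      # A gets the lease
    assert not lock.try_acquire("user_b", now=4.0)  # still held by A
    assert lock.try_acquire("user_b", now=5.0)      # lease expired, B gets in

    bookings.append("user_a")  # A's 6-second write lands at t=6, lock already lost
    bookings.append("user_b")  # B's write lands afterwards
    return bookings
```

Nothing in this sketch stops the second append, which is exactly the gap: the lock has already moved on, and the storage layer accepts both writes.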

I cover fundamentals like locking, transactions, consistency, retries, idempotency, and failure handling in much more depth inside my System Design Fundamentals Guide for Senior to Principal engineers.

You can check it out here: puneetpatwari.in

[2] The deeper bug is usually not the lock itself

A lot of candidates stop at “increase the lock timeout.” That is not the real fix. The deeper issue is that your final correctness guarantee is missing at the database layer.

Because even if the lock expires, the database should still protect the invariant: “Only one valid booking can exist for this room for this date range.”

If both writes succeeded, it usually means one of these is true:
- no proper uniqueness or exclusion constraint existed
- booking availability was checked outside the final transaction
- writes were not serialized with row-level locking
- confirmation was sent before durable conflict detection finished

The lock helped reduce contention.
But the DB failed to enforce correctness.
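
A minimal sketch of the database enforcing the invariant, using Python's built-in `sqlite3` module. The schema is an assumption for illustration: it models a booking as one row per room per night, so a plain `UNIQUE (room_id, night)` constraint is enough. A model that stores arbitrary date ranges would need something like an exclusion constraint instead (PostgreSQL offers these), which SQLite does not have.

```python
import sqlite3


def make_bookings_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE bookings (
            room_id INTEGER NOT NULL,
            night   TEXT    NOT NULL,
            user_id TEXT    NOT NULL,
            UNIQUE (room_id, night)  -- the invariant lives here, not in the lock
        )
        """
    )
    return conn


def insert_booking(conn: sqlite3.Connection, room_id: int, night: str, user_id: str) -> bool:
    """Return True if the booking was committed, False if the room was taken."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO bookings (room_id, night, user_id) VALUES (?, ?, ?)",
                (room_id, night, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The second writer loses here, regardless of what the lock service thought.
        return False
```

With this in place, an expired lease can still let two writers reach the database, but only one of them can commit.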

[3] What is the right process to fix it

I would fix this in 4 steps.

1. Reconstruct the exact race
Check lock acquire time, lock expiry time, DB commit time, and confirmation event time for both users.
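
One way to make that reconstruction concrete is to put the four timestamps for each user into a small record and ask two questions of it. The `BookingTrace` type and its field names are hypothetical; in practice the values would come from lock-service logs, database logs, and the notification pipeline.

```python
from dataclasses import dataclass


@dataclass
class BookingTrace:
    user: str
    lock_acquired_at: float  # seconds on a shared clock
    lock_ttl: float
    db_commit_at: float
    confirmation_at: float

    @property
    def lock_expired_at(self) -> float:
        return self.lock_acquired_at + self.lock_ttl

    def committed_without_lock(self) -> bool:
        # The write finished after the lease had already run out.
        return self.db_commit_at > self.lock_expired_at


def critical_sections_overlapped(first: BookingTrace, second: BookingTrace) -> bool:
    # The second holder got the lock before the first holder's write committed.
    return second.lock_acquired_at < first.db_commit_at
```

For the scenario in the question (5-second TTL, 6-second write), the first user's trace reports `committed_without_lock()` as true, and the two critical sections overlap.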

2. Move the invariant to the database

For hotel booking, correctness should be enforced with transactional logic such as:
- row-level locking on the inventory row
- atomic reserve-if-available update
- or exclusion/uniqueness constraints depending on data model
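
As a sketch of the "atomic reserve-if-available update" option, the availability check and the write can be folded into a single conditional `UPDATE`. The `room_nights` table and the `reserve` helper are assumptions made for this example, again using `sqlite3` and a one-row-per-night inventory model.

```python
import sqlite3


def make_inventory_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE room_nights (
            room_id   INTEGER NOT NULL,
            night     TEXT    NOT NULL,
            booked_by TEXT,               -- NULL means the night is still free
            PRIMARY KEY (room_id, night)
        )
        """
    )
    conn.execute(
        "INSERT INTO room_nights (room_id, night, booked_by) VALUES (101, '2026-04-09', NULL)"
    )
    conn.commit()
    return conn


def reserve(conn: sqlite3.Connection, room_id: int, night: str, user_id: str) -> bool:
    """Reserve the night only if it is still free; check and write are one statement."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE room_nights SET booked_by = ? "
            "WHERE room_id = ? AND night = ? AND booked_by IS NULL",
            (user_id, room_id, night),
        )
        # Exactly one row changes for the winner; zero rows for everyone else.
        return cur.rowcount == 1
```

Because there is no separate "is it available?" read, there is no window between the check and the write for a second booking to slip through.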

3. Treat the distributed lock as an optimization
It can reduce hot contention, but it should never be the only thing preventing double booking.

4. Fix the confirmation path
Only send “booking confirmed” after the transaction commits successfully and conflict checks have passed.

[4] If you still want to use distributed locks, do it safely

If a distributed lock stays in the design, I would add:
- lease renewal or heartbeats for long critical sections
- fencing tokens so stale lock holders cannot keep writing
- alerts when p99 DB latency gets too close to lock TTL
- idempotency keys so retries do not create duplicate booking flows
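
Of the items above, fencing tokens are the least obvious, so here is a minimal in-memory sketch. `FencedLockService` and `FencedStore` are hypothetical names: the lock service hands out a strictly increasing token on every acquisition, and the storage side refuses any write that carries a token older than the newest one it has seen.

```python
from typing import Optional


class FencedLockService:
    """Hands out a strictly increasing token with every lock acquisition."""

    def __init__(self) -> None:
        self._last_token = 0

    def acquire(self) -> int:
        self._last_token += 1
        return self._last_token


class FencedStore:
    """Storage that remembers the highest token seen and rejects older ones."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.booked_by: Optional[str] = None

    def write(self, token: int, user: str) -> bool:
        if token < self.highest_token:
            return False  # stale lock holder: its lease was superseded
        self.highest_token = token
        self.booked_by = user
        return True
```

Note what this does and does not cover. It stops a stale holder from writing after a newer holder has written. It does not, on its own, stop the newer holder from overwriting a write that the older holder managed to land first, which is why the database invariant is still needed.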

A good rule of thumb is simple: If your lock TTL is 5 seconds and your write path can take 6 seconds under load, your design is already telling you it is unsafe.
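
That rule of thumb, together with the p99 alert mentioned earlier, reduces to a one-line check. The function name and the default 50% headroom are assumptions chosen for illustration, not recommended values.

```python
def lock_ttl_is_unsafe(p99_write_seconds: float, ttl_seconds: float, headroom: float = 0.5) -> bool:
    """True when p99 write latency uses more than `headroom` of the lock TTL."""
    return p99_write_seconds > ttl_seconds * headroom
```

With the numbers from the question, `lock_ttl_is_unsafe(6.0, 5.0)` is true, while a 1-second p99 against the same 5-second TTL stays within the assumed headroom.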
