This is a correctness question. And at the Senior to Principal level, this is exactly what interviewers are testing for: do you understand the difference between coordination and actual data integrity?
If you are preparing for system design interviews right now, this is the kind of failure-mode thinking that strong interview loops look for.
Now, let us break this one down properly.
[1] How did both users get confirmed bookings?
The timeline usually looks like this:
- User A acquires the distributed lock for Room 101
- Lock lease is valid for 5 seconds
- User A starts the DB write to mark the room as booked
- Under load, that DB write takes 6 seconds
- At second 5, the lock expires before User A finishes
- User B now acquires the same lock because the lock service thinks it is free
- User B also starts a booking write
- Both flows eventually return success, and both users get confirmations
So what actually failed here? The system assumed the distributed lock was the source of truth. A lease-based lock only gives you temporary coordination.
If the critical section takes longer than the lease, another actor can enter while the first one is still working.
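Here is a minimal sketch of that timeline, using a toy in-memory lease lock and an explicit clock instead of real time. The names LeaseLock, simulate, and b_arrives_at are hypothetical, made up for illustration. This is not how any particular lock service is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LeaseLock:
    """Toy in-memory lease lock driven by an explicit clock (in seconds)."""
    ttl: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def acquire(self, who: str, now: float) -> bool:
        # Once the lease has expired the lock looks free, even if the
        # previous holder is still in the middle of its write.
        if self.holder is None or now >= self.expires_at:
            self.holder = who
            self.expires_at = now + self.ttl
            return True
        return False


def simulate(ttl: float, write_seconds: float, b_arrives_at: float = 5.0) -> List[str]:
    """Return the users who end up with a confirmation."""
    lock = LeaseLock(ttl=ttl)
    lock.acquire("A", now=0.0)      # User A takes the lock at t=0
    a_commits_at = write_seconds    # A's write lands this many seconds later
    confirmed = ["A"]               # nothing re-checks A's lease at commit time

    # User B gets in only if A's lease has expired, and sees the room as
    # free only if A's write has not committed yet.
    if lock.acquire("B", now=b_arrives_at) and b_arrives_at < a_commits_at:
        confirmed.append("B")
    return confirmed
```

With a 5-second lease and a 6-second write, simulate(5.0, 6.0) confirms both users. With a 10-second lease, User B is turned away at the lock and only User A is confirmed.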
I cover fundamentals like locking, transactions, consistency, retries, idempotency, and failure handling in much more depth inside my System Design Fundamentals Guide for Senior to Principal engineers.
You can check it out here: puneetpatwari.in
[2] The deeper bug is usually not the lock itself
A lot of candidates stop at “increase the lock timeout.” That is not the real fix. The deeper issue is that your final correctness guarantee is missing at the database layer.
Because even if the lock expires, the database should still protect the invariant: “Only one valid booking can exist for this room for this date range.”
If both writes succeeded, it usually means one of these is true:
- no proper uniqueness or exclusion constraint existed
- booking availability was checked outside the final transaction
- writes were not serialized with row-level locking
- confirmation was sent before durable conflict detection finished
The lock helped reduce contention.
But the DB failed to enforce correctness.
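To make the missing guarantee concrete, here is a rough sketch using Python's built-in sqlite3 module. The booking_nights table and the one-row-per-night data model are my assumptions for illustration. In a database with range exclusion constraints (PostgreSQL, for example), a constraint over the date range could play the same role.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Assumed model: one row per room per night. The UNIQUE constraint is
    # the invariant: a night of a room can be booked at most once.
    conn.execute(
        "CREATE TABLE booking_nights ("
        " room_id INTEGER NOT NULL,"
        " night TEXT NOT NULL,"
        " user_id TEXT NOT NULL,"
        " UNIQUE (room_id, night))"
    )
    return conn


def book(conn: sqlite3.Connection, room_id: int, nights: list, user_id: str) -> bool:
    """Insert every night in one transaction; return False on any conflict."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO booking_nights (room_id, night, user_id)"
                " VALUES (?, ?, ?)",
                [(room_id, night, user_id) for night in nights],
            )
        return True
    except sqlite3.IntegrityError:
        # Someone else already holds at least one of these nights.
        return False
```

Even if two writers get past an expired lock, the second overlapping insert fails and its whole transaction rolls back, so only one booking survives.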
[3] What is the right process to fix it?
I would fix this in 4 steps.
1. Reconstruct the exact race
Check lock acquire time, lock expiry time, DB commit time, and confirmation event time for both users.
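A small sketch of what that check could look like once the four timestamps are pulled from logs. The BookingTrace record and its field names are hypothetical. Real log schemas will differ.

```python
from dataclasses import dataclass


@dataclass
class BookingTrace:
    """Timestamps (in seconds) for one booking attempt; fields are assumed."""
    user: str
    lock_acquired_at: float
    lock_expires_at: float
    db_commit_at: float
    confirmed_at: float


def committed_after_expiry(trace: BookingTrace) -> bool:
    # The write landed when this user no longer held a valid lease.
    return trace.db_commit_at > trace.lock_expires_at


def entered_during_write(first: BookingTrace, second: BookingTrace) -> bool:
    # The second user acquired the lock after the first lease expired
    # but before the first write had committed.
    return first.lock_expires_at <= second.lock_acquired_at < first.db_commit_at
```

If committed_after_expiry is true for the first user and entered_during_write is true for the pair, you have reconstructed the lease-expiry race rather than some other bug.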
2. Move the invariant to the database
For hotel booking, correctness should be enforced with transactional logic such as:
- row-level locking on the inventory row
- atomic reserve-if-available update
- exclusion or uniqueness constraints, depending on the data model
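As one example of the atomic reserve-if-available option, here is a sketch with Python's built-in sqlite3 module. The room_inventory table, the booked_by column, and the helper names are assumptions for illustration. The idea is that the availability check and the write happen in a single UPDATE, so there is no gap between "check" and "book" for another writer to slip into.

```python
import sqlite3


def make_inventory(room_id: int, nights: list) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE room_inventory ("
        " room_id INTEGER NOT NULL,"
        " night TEXT NOT NULL,"
        " booked_by TEXT,"  # NULL means the night is still available
        " PRIMARY KEY (room_id, night))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO room_inventory (room_id, night) VALUES (?, ?)",
            [(room_id, night) for night in nights],
        )
    return conn


def reserve(conn: sqlite3.Connection, room_id: int, night: str, user_id: str) -> bool:
    """Claim the night only if nobody holds it; True means we won."""
    with conn:  # the UPDATE is committed when this block exits
        cur = conn.execute(
            "UPDATE room_inventory SET booked_by = ?"
            " WHERE room_id = ? AND night = ? AND booked_by IS NULL",
            (user_id, room_id, night),
        )
    # Exactly one row changed means the reservation is ours.
    return cur.rowcount == 1
```

The caller sends a confirmation only when reserve returns True, which happens after the commit. A second caller for the same night matches zero rows and gets False.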
3. Treat the distributed lock as an optimization
It can reduce hot contention, but it should never be the only thing preventing double booking.
4. Fix the confirmation path
Only send “booking confirmed” after the transaction commits successfully and conflict checks have passed.
[4] If you still want to use distributed locks, do it safely
If a distributed lock stays in the design, I would add:
- lease renewal or heartbeats for long critical sections
- fencing tokens so stale lock holders cannot keep writing
- alerts when p99 DB latency gets too close to lock TTL
- idempotency keys so retries do not create duplicate booking flows
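Fencing tokens are the least familiar item on that list, so here is a toy sketch. LockService and FencedStore are hypothetical in-memory stand-ins, not a real lock service or database API. The lock hands out an increasing token on every acquire, and the resource refuses any write carrying a token older than the newest one it has seen.

```python
from typing import Optional


class LockService:
    """Toy lock service: each successful acquire returns a larger token."""

    def __init__(self) -> None:
        self._counter = 0

    def acquire(self) -> int:
        self._counter += 1
        return self._counter


class FencedStore:
    """Toy resource that rejects writes from stale lock holders."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        # A token lower than one already seen belongs to an expired lease.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True
```

Note the limit: the store can reject a stale holder only after it has seen a newer token. If the stale write arrives first, it is accepted, which is one more reason the database invariant still has to exist.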
A good rule of thumb is simple: If your lock TTL is 5 seconds and your write path can take 6 seconds under load, your design is already telling you it is unsafe.
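That rule of thumb can be turned into a simple alert condition. This is only a sketch: the function names and the default min_headroom of 2.0 are assumptions I picked for illustration, and the right threshold depends on your latency profile.

```python
def lease_headroom(ttl_seconds: float, p99_write_seconds: float) -> float:
    """How many p99 writes fit inside one lock lease."""
    return ttl_seconds / p99_write_seconds


def should_alert(
    ttl_seconds: float,
    p99_write_seconds: float,
    min_headroom: float = 2.0,  # assumed threshold, tune for your system
) -> bool:
    # Alert when the lease no longer comfortably covers the slow path.
    return lease_headroom(ttl_seconds, p99_write_seconds) < min_headroom
```

For the scenario in this post, should_alert(5.0, 6.0) is True: the lease does not even cover a single p99 write.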