Monday, April 13, 2026

Troubleshooting Production Issues

Some of the problems are as follows
Here are 15 real production scenario-based questions:

1. Your Spring Boot service CPU suddenly spikes to 90% in production. How will you investigate and fix it?

2. After deployment, your service starts throwing intermittent 500 errors. How will you debug this issue?

3. One microservice goes down and causes a chain failure in other services. How will you prevent this in future?

4. Your API response time increased from 200ms to 3 seconds after a new release. How will you identify the root cause?

5. Database connections are getting exhausted under load. What steps will you take to fix this?

6. A third-party service you depend on is timing out frequently. How will you handle this in your system?

7. You observe duplicate transactions happening in your system. How will you prevent this?

8. Logs are too large and distributed, making debugging difficult. How will you improve observability?

9. Memory usage keeps increasing and your service crashes after some time. How will you detect and fix memory leaks?

10. Your microservice works fine locally but fails in production. How will you approach debugging?

11. A new deployment breaks one feature but works for others. How will you safely roll back?

12. Traffic suddenly spikes 5x during peak hours and your service becomes slow. How will you scale?

13. Inter-service communication is failing due to network latency. How will you optimize it?

14. You need to trace a single request across multiple services during a failure. How will you implement tracing?

15. A bug in one service causes inconsistent data across multiple services. How will you handle data consistency?
CPU Spike
Another example is here

Database connections are getting exhausted under load
An example is as follows
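import java.time.LocalDateTime;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class UserService {
  @Autowired
  private JdbcTemplate jdbcTemplate; // OK

  // One transaction and one connection, but one round trip per user
  @Transactional
  public void updateUsers(List<User> users) {
    users.forEach(user ->
      jdbcTemplate.update(
        "UPDATE users SET last_login = ? WHERE id = ?",
        LocalDateTime.now(), user.getId()
      )
    );
  }

  // Each async call runs on its own thread and holds its own connection
  @Async
  @Transactional
  public void asyncUpdateUser(User user) {
    jdbcTemplate.update(
      "UPDATE users SET last_login = ? WHERE id = ?",
      LocalDateTime.now(), user.getId()
    );
  }
}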
The explanation is as follows
Async threads can scale independently, but database connections cannot. This quickly overwhelms the connection pool.
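One way to keep async work from outrunning the pool is to cap the number of concurrent database calls at the pool size. The sketch below is a plain-Java simulation, not Spring configuration: `BoundedDbWork`, `POOL_SIZE`, and the short sleep standing in for a query are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedDbWork {
    // Hypothetical value; in a real service it would match the connection pool's maximum size.
    static final int POOL_SIZE = 4;

    /** Runs `tasks` simulated queries on `threads` workers and returns the peak number in flight. */
    static int peakConcurrentQueries(int threads, int tasks) {
        Semaphore connections = new Semaphore(POOL_SIZE); // stand-in for the connection pool
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    connections.acquireUninterruptibly(); // wait for a free "connection"
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // simulated query time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        connections.release();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface failures
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("simulated workload failed", e);
        } finally {
            executor.shutdown();
        }
        return peak.get();
    }
}
```

Even with 16 worker threads, `peakConcurrentQueries(16, 40)` never reports more than `POOL_SIZE` queries in flight, because threads queue on the semaphore instead of on the database.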

Thursday, April 9, 2026

Can a Distributed Lock Be the Source of Truth?

Introduction
The question is as follows
You have a distributed lock to prevent two users from booking the same hotel room.

Lock expires in 5 seconds. Your DB write takes 6 seconds under load.

Two users got confirmed bookings for the same room. How? What is the process to fix this issue?
The key thing to keep in mind is this:
Lock ≠ correctness.
If your DB allows duplicates, your system will eventually produce them.
The real fix lives in atomic writes + constraints, not just distributed locks.
In other words, the lock exists to avoid starting the operation in the first place. If two operations do start, one of them must fail.

The explanation is as follows
This is a correctness question. And at the Senior to Principal level, this is exactly what interviewers are testing for: do you understand the difference between coordination and actual data integrity?

If you are preparing for system design interviews right now, this is the kind of failure-mode thinking that matters a lot in strong loops.

Now, let us break this one down properly.

[1] How did both users get confirmed bookings?

The timeline usually looks like this:

- User A acquires the distributed lock for Room 101
- Lock lease is valid for 5 seconds
- User A starts the DB write to mark the room as booked
- Under load, that DB write takes 6 seconds
- At second 5, the lock expires before User A finishes
- User B now acquires the same lock because the lock service thinks it is free
- User B also starts a booking write
- Both flows eventually return success, and both users get confirmations

So what actually failed here? The system assumed the distributed lock was the source of truth. A lease-based lock only gives you temporary coordination.

If the critical section takes longer than the lease, another actor can enter while the first one is still working.

[2] The deeper bug is usually not the lock itself

A lot of candidates stop at “increase the lock timeout.” That is not the real fix. The deeper issue is that your final correctness guarantee is missing at the database layer.

Because even if the lock expires, the database should still protect the invariant: “Only one valid booking can exist for this room for this date range.”

If both writes succeeded, it usually means one of these is true:
- no proper uniqueness or exclusion constraint existed
- booking availability was checked outside the final transaction
- writes were not serialized with row-level locking
- confirmation was sent before durable conflict detection finished

The lock helped reduce contention.
But the DB failed to enforce correctness.

[3] What is the right process to fix it

I would fix this in 4 steps.

1. Reconstruct the exact race
Check lock acquire time, lock expiry time, DB commit time, and confirmation event time for both users.

2. Move the invariant to the database

For hotel booking, correctness should be enforced with transactional logic such as:
- row-level locking on the inventory row
- atomic reserve-if-available update
- or exclusion/uniqueness constraints depending on data model

3. Treat the distributed lock as an optimization.
It can reduce hot contention, but it should never be the only thing preventing double booking.

4. Fix the confirmation path
Only send “booking confirmed” after the transaction commits successfully and conflict checks have passed.
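The "atomic reserve-if-available" idea from step 2 can be sketched in memory as follows. This is an illustration only: `RoomInventory` and its key format are hypothetical, and the `ConcurrentHashMap` merely stands in for a table with a uniqueness (or exclusion) constraint, which is what would enforce the invariant in a real system.

```java
import java.util.concurrent.ConcurrentHashMap;

final class RoomInventory {
    // Key: "roomId|date". Stand-in for a table with a unique constraint on (room_id, date).
    private final ConcurrentHashMap<String, String> bookings = new ConcurrentHashMap<>();

    /** Atomically reserves the room for the date; returns true only for the first caller. */
    boolean reserveIfAvailable(String roomId, String date, String userId) {
        // putIfAbsent is a single atomic step: there is no separate check that can go stale.
        return bookings.putIfAbsent(roomId + "|" + date, userId) == null;
    }

    /** Returns the user holding the booking, or null if the room is free on that date. */
    String bookedBy(String roomId, String date) {
        return bookings.get(roomId + "|" + date);
    }
}
```

With this shape, it does not matter whether User A's lock expired: User B's write fails at the storage layer, and the confirmation is sent only to the caller for whom `reserveIfAvailable` returned true.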

[4] If you still want to use distributed locks, do it safely

If a distributed lock stays in the design, I would add:
- lease renewal or heartbeats for long critical sections
- fencing tokens so stale lock holders cannot keep writing
- alerts when p99 DB latency gets too close to lock TTL
- idempotency keys so retries do not create duplicate booking flows

A good rule of thumb is simple: If your lock TTL is 5 seconds and your write path can take 6 seconds under load, your design is already telling you it is unsafe.

Wednesday, April 8, 2026

Correlation Id vs Trace Id

Introduction
The explanation is as follows
I often noticed that some developers do not really understand the difference between traceId and correlationId. I saw this so often that I decided to write this post.

At first they look similar.
Both are IDs.
Both appear in logs.
Both help during incidents.

But they answer different questions.

traceId answers:
"How did this specific execution path go through the system?"

correlationId answers:
"Which logs and events belong to the same business story?"

That difference becomes obvious once async enters the picture.

Example:

A user places an order.

The system does this:

1. Order Service creates the order
2. Payment Service charges the card
3. Kafka event is published
4. Billing Worker creates invoice
5. Email Service sends confirmation

Now imagine the logs:

Order created
correlationId=ORDER-8472
traceId=T1

Payment charged
correlationId=ORDER-8472
traceId=T1

Billing started from Kafka consumer
correlationId=ORDER-8472
traceId=T2

Email sending failed
correlationId=ORDER-8472
traceId=T3

This is the key point:

One correlationId
Multiple traceIds

Why?

Because the business flow is one.
But the technical executions are split.

The HTTP request is one execution.
Kafka consumer is another.
Retry later can be another.
Email worker can be another too.

So:

correlationId helps you reconstruct the whole story.
traceId helps you inspect one exact path in detail.

That is why using correlationId instead of tracing is a mistake.
You may connect logs, but you still do not get spans, timing hierarchy, or where exactly latency exploded.

And using only traceId is also not enough.
In distributed async systems, tracing often shows fragments. Correlation is what lets you stitch them back together 🧩

How I usually use them during incidents:

1. Start with correlationId
Find everything related to the same order, job, or user flow.

2. Then drill into traceId
Open the exact failing execution and inspect where it slowed down or broke.

Simple version:

traceId = the path
correlationId = the story
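A minimal sketch of how the two IDs might travel together: the correlationId is copied across every hop, while each new technical execution gets a fresh traceId. `RequestIds`, `LogContext`, and the use of random UUIDs as trace IDs are assumptions for illustration; a real tracing library would generate and propagate trace context itself.

```java
import java.util.UUID;

final class RequestIds {
    /** Immutable pair of identifiers attached to every log line. */
    static final class LogContext {
        final String correlationId;
        final String traceId;

        LogContext(String correlationId, String traceId) {
            this.correlationId = correlationId;
            this.traceId = traceId;
        }

        /** Starts a new technical execution (e.g. a Kafka consumer) within the same business flow. */
        LogContext newExecution() {
            return new LogContext(correlationId, UUID.randomUUID().toString());
        }
    }

    /** Starts a new business flow, e.g. when the order is created. */
    static LogContext start(String correlationId) {
        return new LogContext(correlationId, UUID.randomUUID().toString());
    }

    /** Formats a log line carrying both identifiers. */
    static String format(LogContext ctx, String message) {
        return message + " correlationId=" + ctx.correlationId + " traceId=" + ctx.traceId;
    }
}
```

In the order example, the HTTP request would call `start("ORDER-8472")`, and the billing consumer would call `newExecution()` on the context it rebuilt from the message, so both log lines share the correlationId but differ in traceId.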

Have you seen teams mix these two and then realize the difference only during a production incident? 

Fencing Tokens

Introduction
The explanation is as follows
Distributed systems concept: Fencing Tokens
You designed a fancy distributed locking algorithm just to find that an old primary is able to overwrite data!

The problem:
- Node A holds the lock, and is doing some work.
- Node A gets disconnected, becomes unresponsive, or crashes, and resumes execution after its lease has expired (in "true" time).
- Node B, in the meantime, acquired the lock and wrote some data.
- Node A resumes execution, thinking its lock is still valid.
- Node A overwrites the data written by Node B, even though it no longer holds the lock.

That's where fencing tokens come in: when a node acquires the lock, it gets a token with a monotonically increasing number. When the node tries to write data, it must include the token. If the token is outdated (i.e., lower than the current token), the write is rejected, preventing stale nodes from overwriting newer data.

Fencing tokens are used in a variety of systems, like etcd.

The big takeaway is that you can't rely on just the client to know whether they are in their right. The target resource must have a gating mechanism to verify that the request makes sense.
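The gating mechanism on the target resource can be sketched as below. `FencedResource` is a hypothetical in-memory stand-in for the storage layer; the token values would come from the lock service, which is not modeled here.

```java
final class FencedResource {
    private long highestTokenSeen = 0;
    private String value;

    /** Accepts the write only if the token is not older than the newest token seen so far. */
    synchronized boolean write(long fencingToken, String newValue) {
        if (fencingToken < highestTokenSeen) {
            return false; // stale lock holder: reject the write
        }
        highestTokenSeen = fencingToken;
        value = newValue;
        return true;
    }

    /** Returns the last accepted value, or null if nothing has been written yet. */
    synchronized String read() {
        return value;
    }
}
```

If Node A acquired the lock with token 33 and Node B later acquired it with token 34, then once B has written, A's delayed `write(33, ...)` returns false and B's data stays intact.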


JSON Web Token - JWT and Immediate Logout

Introduction
If we operate in a fully stateless way, immediate logout is not possible. However, if we add a little state on the server side, some solutions become available.

1. Short-lived access tokens
- Keep access tokens valid for 5 to 15 minutes
- This limits the damage window
- Very common and simple

2. Refresh token revocation
- Store refresh tokens in DB or Redis
- On logout, delete or mark them revoked
- This is the most common real-world pattern

3. Token blacklist / denylist
- Store revoked JWT IDs or token hashes until they expire
- Check this list on every request
- Useful for high-risk logout or compromised accounts
- But now auth is no longer fully stateless

4. Token versioning
- Store a tokenVersion or sessionVersion on the user record
- Include that version in the JWT
- On logout-all-devices or password reset, increment the version
- Old tokens stop working once the version mismatches
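Option 4 (token versioning) can be sketched without any JWT library, since the check itself is just a comparison between a claim and a stored number. `TokenVersionCheck` and its in-memory map are hypothetical; signing, parsing, and expiry of the JWT are assumed to happen elsewhere.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TokenVersionCheck {
    // userId -> current tokenVersion; stand-in for a column on the user record.
    private final Map<String, Integer> currentVersion = new ConcurrentHashMap<>();

    /** Version to embed as a claim when a token is issued for this user. */
    int versionForNewToken(String userId) {
        return currentVersion.computeIfAbsent(userId, id -> 1);
    }

    /** Called on logout-all-devices or password reset: every older token stops matching. */
    void invalidateAllTokens(String userId) {
        currentVersion.merge(userId, 2, (old, ignored) -> old + 1);
    }

    /** A token is accepted only if its version claim equals the stored version. */
    boolean isAccepted(String userId, int versionClaim) {
        Integer stored = currentVersion.get(userId);
        return stored != null && stored == versionClaim;
    }
}
```

The trade-off is the same as with a denylist: every request now needs one lookup of the user's current version, so the check is no longer fully stateless.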

Thursday, March 26, 2026

Software Architecture - Idempotency and Phantom Write

Introduction
The explanation is as follows
You typically implement idempotency like this:
  1. Check if request already processed (via key / timestamp / PK)
  2. If not → write data
  3. If yes → skip
If the check is not atomic, problems arise.
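The difference between a separate check and an atomic claim can be sketched in memory. `IdempotencyClaims` is hypothetical, and the concurrent set stands in for a table with a unique key; the point is only that the "check" and the "write" must be one operation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotencyClaims {
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    /** Racy: two threads can both pass the check before either one records the key. */
    boolean claimWithSeparateCheck(String key) {
        if (processedKeys.contains(key)) {
            return false;
        }
        processedKeys.add(key); // another thread may have added the key in between
        return true;
    }

    /** Atomic: add() reports whether this caller inserted the key, so exactly one caller wins. */
    boolean claimAtomically(String key) {
        return processedKeys.add(key);
    }

    /** Starts `threads` concurrent claimers for one key and counts how many of them won. */
    static int winnersAmong(int threads, String key) {
        IdempotencyClaims claims = new IdempotencyClaims();
        AtomicInteger winners = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                if (claims.claimAtomically(key)) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return winners.get();
    }
}
```

With `claimAtomically`, `winnersAmong` yields exactly one winner regardless of thread count; the same cannot be guaranteed for `claimWithSeparateCheck`.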

Failure Mode 1: The TTL Expiry Trap
The explanation is as follows
The most common idempotency implementation stores a request key with a time-to-live (TTL) — typically 24 or 48 hours. The assumption is that any duplicate will arrive within that window. In practice, this assumption frequently breaks.
The explanation is as follows
The fix: Never use TTL-only idempotency for operations with unbounded retry windows. Instead, use a database-backed idempotency store with a three-state model (IN_PROGRESS, COMPLETED, FAILED) where the expires_at column drives a cleanup job for storage management — not correctness. The cleanup window should be set significantly longer than your worst-case replay window (7 days minimum for Kafka-based systems).
Failure Mode 2: The Partial Execution Ghost
The explanation is as follows
A request arrives, the system writes the idempotency key with status IN_PROGRESS, begins processing, writes half the data, and crashes — JVM OOM, container eviction, network partition. The idempotency key is now in IN_PROGRESS state. When the retry arrives, the system faces an impossible decision: did the original operation complete or not?
The explanation is as follows
The fix: Wrap both the business logic and the idempotency state transition in a single database transaction. If the transaction rolls back, both the business data and the idempotency status roll back together. For stale IN_PROGRESS keys (where the original processor is likely dead), use a configurable timeout threshold to reclaim and re-execute safely.
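The "configurable timeout threshold" for stale IN_PROGRESS keys could be expressed as a small decision function like the one below. `StaleClaimPolicy`, the `Decision` names, and the `updatedAt` timestamp are assumptions for illustration; the source only specifies the three statuses and the idea of a threshold.

```java
import java.time.Duration;
import java.time.Instant;

final class StaleClaimPolicy {
    enum Status { IN_PROGRESS, COMPLETED, FAILED }
    enum Decision { RETURN_CACHED, REJECT_IN_PROGRESS, RECLAIM_AND_RETRY }

    private final Duration staleAfter;

    StaleClaimPolicy(Duration staleAfter) {
        this.staleAfter = staleAfter;
    }

    /** Decides what a retry should do, given the existing key's status and last update time. */
    Decision decide(Status status, Instant updatedAt, Instant now) {
        switch (status) {
            case COMPLETED:
                return Decision.RETURN_CACHED;
            case FAILED:
                return Decision.RECLAIM_AND_RETRY;
            case IN_PROGRESS:
            default:
                // Older than the threshold: the original processor is assumed to be dead.
                boolean stale = Duration.between(updatedAt, now).compareTo(staleAfter) > 0;
                return stale ? Decision.RECLAIM_AND_RETRY : Decision.REJECT_IN_PROGRESS;
        }
    }
}
```

The threshold should sit comfortably above the slowest legitimate execution; otherwise a key could be reclaimed while its original processor is still working.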
Failure Mode 3: The Concurrent Check Race
Here the check condition is not atomic. The explanation is as follows
The fix: Use INSERT ... ON CONFLICT DO NOTHING (PostgreSQL 9.5+) to make the check-and-claim atomic. If the RETURNING clause yields no rows, the key already existed — fetch its status with SELECT ... FOR UPDATE. For non-blocking behavior, SELECT ... FOR UPDATE SKIP LOCKED lets the second instance return 409 Conflict immediately rather than waiting.
Failure Mode 4: The Layer Mismatch
The explanation is as follows
The fix: Propagate a correlation ID from the original request as a Kafka header, and have every downstream consumer enforce its own idempotency barrier using that ID as the deduplication key.
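A consumer-side idempotency barrier keyed on the propagated ID might look like the sketch below. No Kafka client is used: headers are modeled as a plain map, and `DedupingConsumer`, the `correlation-id` header name, and the in-memory set (standing in for a durable deduplication table) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class DedupingConsumer {
    static final String CORRELATION_HEADER = "correlation-id"; // hypothetical header name

    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // stand-in for a durable table
    private final List<String> applied = new ArrayList<>();

    /** Applies the message once per correlation ID; returns false for redeliveries. */
    synchronized boolean handle(Map<String, String> headers, String payload) {
        String correlationId = headers.get(CORRELATION_HEADER);
        if (correlationId == null) {
            throw new IllegalArgumentException("missing " + CORRELATION_HEADER + " header");
        }
        if (!seen.add(correlationId)) {
            return false; // duplicate delivery: skip the side effects
        }
        applied.add(payload);
        return true;
    }

    /** Returns a snapshot of the payloads that were actually applied. */
    synchronized List<String> applied() {
        return List.copyOf(applied);
    }
}
```

In a real consumer, recording the ID and applying the side effect would need to happen in the same transaction, for the same reason given under Failure Mode 2.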
Spring Boot + SQL Server
The code is as follows. Here:
- Partial Execution is solved with a single transaction.
- The Concurrent Check Race is solved with DuplicateKeyException. If we were using Postgres, we would check how many rows the SQL statement changed instead of relying on an exception.
- The Layer Mismatch problem is solved with the outbox pattern.
@Service
@RequiredArgsConstructor
public class IdempotentService {
  private final JdbcTemplate jdbc;
  public record Response(String result) {}

  @Transactional
  public Response handleRequest(String idempotencyKey, String payload) {
    try {
      // Attempt barrier insert (atomic)
      // SQL Server:
      // INSERT INTO idempotency_table (idempotency_key, status)
      // VALUES (?, 'IN_PROGRESS')
      jdbc.update(
        "INSERT INTO idempotency_table (idempotency_key, status) VALUES (?, 'IN_PROGRESS')",
        idempotencyKey
      );

      // First request owns the key → perform business logic
      String result = doBusinessLogic(payload);

      // Insert into outbox for async processing
      // SQL Server:
      // INSERT INTO outbox_table (idempotency_key, payload) VALUES (?, ?)
      jdbc.update(
        "INSERT INTO outbox_table (idempotency_key, payload) VALUES (?, ?)",
        idempotencyKey, result
      );

      // Mark barrier as completed and store result
      // SQL Server:
      // UPDATE idempotency_table SET status='COMPLETED', response=? WHERE idempotency_key=?
      jdbc.update(
        "UPDATE idempotency_table SET status='COMPLETED', response=? WHERE idempotency_key=?",
        result, idempotencyKey
      );
      return new Response(result);
    } catch (DuplicateKeyException ex) {
      // Barrier row already exists → handle duplicate
      // SQL Server:
      // SELECT * FROM idempotency_table WITH (UPDLOCK, ROWLOCK) WHERE idempotency_key=?
      IdempotencyRecord record = jdbc.queryForObject(
        "SELECT status, response FROM idempotency_table WITH (UPDLOCK, ROWLOCK) WHERE idempotency_key=?",
        (rs, rowNum) -> new IdempotencyRecord(rs.getString("status"), rs.getString("response")),
        idempotencyKey
      );

      switch (record.status) {
        case "COMPLETED":
          // Return cached result
          return new Response(record.response);
        case "IN_PROGRESS":
          // Someone else is working → can wait or throw 409
          throw new IllegalStateException("Request is already in progress");
        case "FAILED":
          // Previous attempt failed → allow retry
          throw new IllegalStateException("Previous attempt failed, safe to retry");
        default:
          throw new IllegalStateException("Unknown barrier state: " + record.status);
      }
    }
  }

  private String doBusinessLogic(String payload) {
    // your domain logic here
    return "processed:" + payload;
  }

  private static class IdempotencyRecord {
      final String status;
      final String response;
      IdempotencyRecord(String status, String response) {
        this.status = status;
        this.response = response;
      }
  }
}
If we want it to work for both SQL Server and Postgres, we do it as follows
@Service
@RequiredArgsConstructor
public class IdempotentService {

    private final JdbcTemplate jdbc;

    public record Response(String result) {}

    @Transactional
    public Response handleRequest(String idempotencyKey, String payload) {
        boolean isWinner = false;

        try {
            // --------------------------
            // Attempt atomic barrier insert
            // --------------------------
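            // Postgres:
            // INSERT INTO idempotency_table (idempotency_key, status)
            // VALUES (?, 'IN_PROGRESS')
            // ON CONFLICT DO NOTHING
            //
            // SQL Server:
            // INSERT INTO idempotency_table (idempotency_key, status)
            // VALUES (?, 'IN_PROGRESS')
            //
            // The ON CONFLICT clause must actually be sent to Postgres; otherwise the
            // duplicate insert raises an error and aborts the surrounding transaction.
            int rows = jdbc.update(
                    "INSERT INTO idempotency_table (idempotency_key, status) VALUES (?, 'IN_PROGRESS')" +
                            (isPostgres() ? " ON CONFLICT DO NOTHING" : ""),
                    idempotencyKey
            );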

            // Postgres: rows == 1 → winner
            // SQL Server: INSERT succeeded → winner
            isWinner = rows == 1;

        } catch (DuplicateKeyException ex) {
            // SQL Server only: duplicate → loser
            isWinner = false;
        }

        if (isWinner) {
            // --------------------------
            // Winner executes business logic
            // --------------------------
            String result = doBusinessLogic(payload);

            // Insert into outbox (side effect)
            // INSERT INTO outbox_table (idempotency_key, payload) VALUES (?, ?)
            jdbc.update(
                    "INSERT INTO outbox_table (idempotency_key, payload) VALUES (?, ?)",
                    idempotencyKey, result
            );

            // Mark barrier as completed + store response
            // UPDATE idempotency_table SET status='COMPLETED', response=? WHERE idempotency_key=?
            jdbc.update(
                    "UPDATE idempotency_table SET status='COMPLETED', response=? WHERE idempotency_key=?",
                    result, idempotencyKey
            );

            return new Response(result);
        } else {
            // --------------------------
            // Loser reads existing row safely
            // --------------------------
            // SQL Server: SELECT ... WITH (UPDLOCK, ROWLOCK) WHERE idempotency_key=?
            // Postgres: SELECT * FROM idempotency_table WHERE idempotency_key=?
            IdempotencyRecord record = jdbc.queryForObject(
                    "SELECT status, response FROM idempotency_table " +
                            (isPostgres() ? "" : "WITH (UPDLOCK, ROWLOCK) ") +
                            "WHERE idempotency_key=?",
                    (rs, rowNum) -> new IdempotencyRecord(rs.getString("status"), rs.getString("response")),
                    idempotencyKey
            );

            switch (record.status) {
                case "COMPLETED":
                    return new Response(record.response);
                case "IN_PROGRESS":
                    throw new IllegalStateException("Request already in progress");
                case "FAILED":
                    throw new IllegalStateException("Previous attempt failed, safe to retry");
                default:
                    throw new IllegalStateException("Unknown barrier state: " + record.status);
            }
        }
    }

    private boolean isPostgres() {
        // Detect DB type from DataSource or JdbcTemplate if needed
        return true; // placeholder, implement detection
    }

    private String doBusinessLogic(String payload) {
        return "processed:" + payload;
    }

    private static class IdempotencyRecord {
        final String status;
        final String response;

        IdempotencyRecord(String status, String response) {
            this.status = status;
            this.response = response;
        }
    }
}


Wednesday, March 25, 2026

Claude

Introduction
An example is here. Schematically, it looks like this



1. The CLAUDE.md File
The main control file. For example:
- Never use the main branch

2. The CLAUDE.local.md File
The explanation is as follows.
CLAUDE.local.md is useful for notes you do not want to commit but still want to apply in the current project.

3. Subdirectories
The explanation is as follows
- CLAUDE.md files inside subdirectories are not all loaded up front, but only when Claude Code actually reads content from those directories
- When multiple CLAUDE.md files are active at the same time, a nearest-scope rule usually applies, meaning instructions closer to the current task and narrower in scope take priority
- Within the same layer, rules that are more explicit and more specific are also more likely to be followed consistently than vague general statements
4. The .claude Directory

4.1 .claude/commands
Automates repetitive tasks

4.2 .claude/rules
Project rules (testing, naming, etc.)

Commands
/init
Creates the initial CLAUDE.md file.

/reflection for Regular Retrospectives
The explanation is as follows
At the end of each session, you can ask Claude Code to summarize what from that round of collaboration is worth adding to CLAUDE.md, and then turn those points into more stable project rules.
/skill-creator
The explanation is as follows.
A skill isn't a prompt. You don't type it. You build it once, describe what it does and when to use it, and Claude recognises when to fire it on its own. The right context appears, the skill runs. You do nothing.
We use this command to configure a custom skill. The explanation is as follows.
You describe what you need, it helps you draft the skill, then runs a test (one session with the skill, one without) and opens a browser window so you can compare the results. Then it optimises automatically based on your feedback so the skill triggers when it should.