Monday, April 13, 2026

Troubleshooting Production Issues

Some of the problems are as follows.
Here are 15 real production scenario-based questions:

1. Your Spring Boot service CPU suddenly spikes to 90% in production. How will you investigate and fix it?

2. After deployment, your service starts throwing intermittent 500 errors. How will you debug this issue?

3. One microservice goes down and causes a chain failure in other services. How will you prevent this in future?

4. Your API response time increased from 200ms to 3 seconds after a new release. How will you identify the root cause?

5. Database connections are getting exhausted under load. What steps will you take to fix this?

6. A third-party service you depend on is timing out frequently. How will you handle this in your system?

7. You observe duplicate transactions happening in your system. How will you prevent this?

8. Logs are too large and distributed, making debugging difficult. How will you improve observability?

9. Memory usage keeps increasing and your service crashes after some time. How will you detect and fix memory leaks?

10. Your microservice works fine locally but fails in production. How will you approach debugging?

11. A new deployment breaks one feature but works for others. How will you safely roll back?

12. Traffic suddenly spikes 5x during peak hours and your service becomes slow. How will you scale?

13. Inter-service communication is failing due to network latency. How will you optimize it?

14. You need to trace a single request across multiple services during a failure. How will you implement tracing?

15. A bug in one service causes inconsistent data across multiple services. How will you handle data consistency?
Here is one such problem:
"Your Spring Boot service runs flawlessly in development, but crashes every night at 2am in production. Walk me through your debugging approach."

Most candidates respond:
‣ I would check the logs.
‣ I would restart the service.
‣ I would increase memory?
‣ Interview over.

Here is what interviewers are actually evaluating:

Step 1: Identify the pattern
2am is consistent. Not random. Not traffic-driven. This indicates a scheduled trigger or resource exhaustion. First question: what executes at 2am? Batch jobs? Scheduled tasks? Cron jobs?

Step 2: Analyze memory behavior before failure
Inspect JVM metrics and heap usage trends. If memory steadily increases from 10pm to 2am before crashing, it signals a memory leak, not a functional bug or an infrastructure issue.

Step 3: Diagnose the leak
Enable GC logs. Capture heap dumps. Identify objects with abnormal growth: unclosed connections, static collections, or uncleared ThreadLocal variables. Even a single unclosed DB connection inside a loop can bring down the service.

Step 4: Validate connection pool utilization
HikariCP default pool size is 10. If a batch process consumes all connections without releasing them, subsequent requests block. By 2am, the pool is exhausted and the service becomes unresponsive.

Solution: enforce connection timeouts and use proper try-with-resources patterns.
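To make the leak and its fix concrete, here is a minimal, self-contained sketch. The FakeConnection class is a hypothetical stand-in for a pooled JDBC connection (it only counts open handles); with a real pool you would additionally set a borrow timeout, e.g. HikariConfig.setConnectionTimeout. A loop that catches an exception without closing the connection leaks one handle per iteration, while try-with-resources always releases:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TryWithResourcesDemo {
  static final AtomicInteger OPEN = new AtomicInteger(); // simulated pool occupancy

  // Hypothetical stand-in for java.sql.Connection borrowed from a pool
  static class FakeConnection implements AutoCloseable {
    FakeConnection() { OPEN.incrementAndGet(); }
    void update() { throw new RuntimeException("query failed"); }
    @Override public void close() { OPEN.decrementAndGet(); }
  }

  // Leaky pattern: close() is never reached when update() throws
  static void leakyLoop(int n) {
    for (int i = 0; i < n; i++) {
      FakeConnection c = new FakeConnection();
      try { c.update(); } catch (RuntimeException ignored) { }
    }
  }

  // Safe pattern: try-with-resources guarantees close() even on exception
  static void safeLoop(int n) {
    for (int i = 0; i < n; i++) {
      try (FakeConnection c = new FakeConnection()) {
        try { c.update(); } catch (RuntimeException ignored) { }
      }
    }
  }

  public static void main(String[] args) {
    leakyLoop(10);
    System.out.println("open after leakyLoop: " + OPEN.get()); // every iteration leaks
    OPEN.set(0);
    safeLoop(10);
    System.out.println("open after safeLoop: " + OPEN.get());  // everything released
  }
}
```

Run inside a loop of thousands of batch records, the leaky variant is exactly the pattern that drains a 10-connection pool by 2am.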

Step 5: Monitor with APM tools
Use Prometheus & Grafana, New Relic, or Datadog. Configure proactive alerts instead of reactive fixes. If heap usage exceeds 80% at 1am, alerts should trigger before failure occurs. That is production-grade engineering.
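The 80%-heap alert could be expressed as a Prometheus alerting rule along these lines. This is a sketch: the metric names (jvm_memory_used_bytes, jvm_memory_max_bytes with an area="heap" label) are what Micrometer's Prometheus registry exposes via Spring Boot Actuator; adjust them to whatever your metrics stack actually publishes.

```yaml
groups:
  - name: jvm-heap
    rules:
      - alert: HeapUsageHigh
        # fraction of max heap in use, across all heap memory areas
        expr: sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Heap usage above 80% for 5 minutes"
```

With this in place, the 1am heap climb pages someone an hour before the 2am crash.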

The gap between 12 LPA and 35 LPA is not defined by frameworks. It is defined by understanding what breaks at 3am and why.
CPU Spike
Another example is here.

Database connections are getting exhausted under load
Here is an example:
import java.time.LocalDateTime;
import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class UserService {
  @Autowired
  private JdbcTemplate jdbcTemplate; // OK

  @Transactional
  public void updateUsers(List<User> users) {
    users.forEach(user ->
      jdbcTemplate.update(
        "UPDATE users SET last_login = ? WHERE id = ?",
        LocalDateTime.now(), user.getId()
      )
    );
  }

  @Async
  @Transactional
  public void asyncUpdateUser(User user) {
    jdbcTemplate.update(
      "UPDATE users SET last_login = ? WHERE id = ?",
      LocalDateTime.now(), user.getId()
    );
  }
}
The explanation is as follows:
Async threads can scale independently, but database connections cannot. This quickly overwhelms the connection pool.
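The mismatch above can be demonstrated with plain java.util.concurrent, no database needed. This is a stand-in simulation, not HikariCP itself: a Semaphore of 10 permits plays the connection pool, tryAcquire with a deadline plays the connection timeout, and a thread pool plays the unbounded @Async side.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolExhaustionDemo {

  // Count how many "borrow a connection" attempts time out.
  static int countTimeouts(int poolSize, int workers, long holdMs, long waitMs)
      throws InterruptedException {
    Semaphore pool = new Semaphore(poolSize);          // stand-in for the connection pool
    AtomicInteger timedOut = new AtomicInteger();
    ExecutorService exec = Executors.newFixedThreadPool(workers); // async side scales freely
    for (int i = 0; i < workers; i++) {
      exec.submit(() -> {
        try {
          // tryAcquire ~ borrowing a connection under a connection timeout
          if (pool.tryAcquire(waitMs, TimeUnit.MILLISECONDS)) {
            try {
              Thread.sleep(holdMs);                    // the "query" holds the connection
            } finally {
              pool.release();
            }
          } else {
            // HikariCP would throw SQLTransientConnectionException here
            timedOut.incrementAndGet();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    exec.shutdown();
    exec.awaitTermination(30, TimeUnit.SECONDS);
    return timedOut.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // 50 async workers against 10 connections: most borrow attempts time out
    System.out.println("timeouts, 50 workers: " + countTimeouts(10, 50, 500, 50));
    // bound the async side to the pool size and the timeouts disappear
    System.out.println("timeouts, 10 workers: " + countTimeouts(10, 10, 500, 50));
  }
}
```

In a real Spring application the equivalent fix is to give @Async a bounded ThreadPoolTaskExecutor sized in line with the Hikari pool (and a bounded queue), so async concurrency can never exceed what the database side can serve.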
