Yazılım Çorbası: fault tolerance


Thursday, 10 November 2022

Dead-Letter Queue Pattern for Fault Tolerance and Resiliency

Introduction

If a message still cannot be processed after several attempts (the Retry pattern), human intervention is probably needed. The Dead Letter Channel is also known as the Dead Letter Queue.

The flow is described as follows:

1. Under normal circumstances, the application processes each event in the source topic and publishes the result to the target topic
2. Events that cannot be processed, for example, those that don’t have the expected format or are missing required attributes, are routed to the error topic
3. Events for which dependent data is not available are routed to a retry topic where a retry instance of your application periodically attempts to process the events

If some messages depend on one another, take care to send them to the same queue so that their order is preserved.
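The three routes in the quoted description can be sketched with in-memory queues. This is a minimal illustration, not a broker client: the `EventRouter` class, the `customerId` attribute, and the `knownCustomers` set (standing in for "dependent data") are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory router; a real system would publish to broker topics. */
final class EventRouter {
    final Queue<Map<String, String>> targetTopic = new ArrayDeque<>();
    final Queue<Map<String, String>> retryTopic = new ArrayDeque<>();
    final Queue<Map<String, String>> errorTopic = new ArrayDeque<>();

    // Assumption: stands in for the "dependent data" that may not be available yet.
    private final Set<String> knownCustomers;

    EventRouter(Set<String> knownCustomers) {
        this.knownCustomers = knownCustomers;
    }

    /** Routes one event to exactly one of the three queues. */
    void route(Map<String, String> event) {
        String customerId = event.get("customerId");
        if (customerId == null || customerId.isBlank()) {
            // Missing required attribute: a retry cannot help, so use the error topic.
            errorTopic.add(event);
        } else if (!knownCustomers.contains(customerId)) {
            // Dependent data is not available yet: park the event for a later retry.
            retryTopic.add(event);
        } else {
            // Normal path: publish the processed event.
            targetTopic.add(event);
        }
    }
}
```

Keeping a single retry queue per group of dependent events is one way to preserve their order, as noted above.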

What Does "Dead Letter" Mean?

The explanation is as follows.

What Is a Dead Letter Queue?
In English vocabulary, dead letter mail is undeliverable mail that cannot be delivered to the addressee. A dead-letter queue (DLQ), sometimes known as an undelivered-message queue, is a holding queue for messages that cannot be delivered to their destinations due to something.

According to Wikipedia — In message queueing the dead letter queue is a service implementation to store messages that meet one or more of the following failure criteria:

- Message that is sent to a queue that does not exist
- Queue length limit exceeded
- Message length limit exceeded
- Message is rejected by another queue exchange
- Message reaches a threshold read counter number because it is not consumed. Sometimes this is called a “back out queue”

Some Examples of Human Intervention

Example

One example is as follows:

A message arriving in the error queue can trigger an alert and the support team can decide what to do. And this is important: You don't need to automate all edge cases in your business process. What's the point in spending a sprint to automate this case, if it only happens once every two years? The costs will definitely outweigh the benefits. Instead, we can define a manual business process for handling these edge cases.

In our example, if Bob from IT sees a message in the error queue, he can inspect it and see that it failed with a CannotShipOrderException. In this case, he can notify the Shipping department and they can use another shipment provider. But all of this happens outside of the system, so the system is less complex and easier to build.

Example

An example for a corrupt field is as follows:

However, if the error is not ever possible to solve via a retry process (such as a never ever unhandled case, maybe a corrupt field value, e.g.), you should create an “error topic” which is called a “dead-letter queue”.

Monday, 25 January 2021

Timeout Pattern for Fault Tolerance and Resiliency

Introduction

The Turkish term for timeout is "zaman aşımı". With a timeout, even if part of the system is faulty, the subsystem that calls the faulty part does not hang forever. After waiting a set period for a response, it returns an error code.

The timeout is what triggers the mitigating action, that is, plan B.

How Is the Timeout Value Calculated?

One approach is as follows. Measure the response times of 1000 requests, then compute a value that covers 95% of all responses. That value is the timeout.
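The 95% rule can be sketched as a nearest-rank percentile over the measured response times. The `TimeoutEstimator` class and its method name are hypothetical, and nearest-rank is only one of several percentile definitions.

```java
import java.util.Arrays;

/** Hypothetical helper: derives a timeout from measured response times. */
final class TimeoutEstimator {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * `percent` percent of all samples are less than or equal to it.
     */
    static long percentileMillis(long[] responseTimesMillis, int percent) {
        if (responseTimesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = responseTimesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        // Integer ceiling of percent * n / 100 gives the 1-based rank.
        int rank = (int) (((long) percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```

With 1000 samples and `percent = 95`, the method returns the 950th smallest response time, which would then be used as the timeout.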

Example - Resilience4j

We add the following dependency.

<dependency>
  <groupId>io.github.resilience4j</groupId>
  <artifactId>resilience4j-spring-boot2</artifactId>
  <version>1.1.0</version>
</dependency>

Suppose we have a configuration file like this.

resilience4j.timelimiter:
  instances:
    ratingService:
      timeoutDuration: 3s
      cancelRunningFuture: true
    someOtherService:
      timeoutDuration: 1s
      cancelRunningFuture: false
---
rating:
  service:
    endpoint: http://localhost:7070/ratings/

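We do the following:

import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class RatingServiceClient {

  private final RestTemplate restTemplate = new RestTemplate();

  private String ratingService = ...;

  // If the call exceeds the timeoutDuration configured for "ratingService", getDefault is used
  @TimeLimiter(name = "ratingService", fallbackMethod = "getDefault")
  public CompletionStage<ProductRatingDto> getProductRatingDto(int productId) {
    Supplier<ProductRatingDto> supplier = () ->
      this.restTemplate.getForEntity(this.ratingService + productId,
        ProductRatingDto.class)
        .getBody();
    return CompletableFuture.supplyAsync(supplier);
  }

  // Fallback: same parameters as the protected method, plus the Throwable
  private CompletionStage<ProductRatingDto> getDefault(int productId, Throwable throwable) {
    return CompletableFuture.supplyAsync(() ->
      ProductRatingDto.of(0, Collections.emptyList()));
  }
}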

Tuesday, 29 December 2020

Circuit Breaker Pattern for Fault Tolerance and Resiliency

Introduction

This post is part of the Some Software Solutions for Fault Tolerance and Resiliency series.

This technique is used in microservice architectures. If a system is not responding, instead of repeatedly trying to reach it, the circuit breaker kicks in and returns an error message/code directly. The explanation is as follows.

In a nutshell — Do not hammer a service with additional requests which is already down. Please give the service time to recover.

A further explanation is as follows.

Circuit Breaker monitors API calls. When everything is working as expected, it is in the state closed. When the number of fails, like timeout, reaches a specified threshold, Circuit Breaker will stop processing further requests. We call it the open state. As a result, API clients will receive instant information that something went wrong without waiting for the timeout to come.

The Circuit is opened for a specified period of time. After timeout occurs, the circuit breaker goes into the half-opened state. Next, the API call will hit the external system/API. After that, the circuit will decide whether to close or open itself.

The circuit breaker can be in one of three states.

CLOSED State - Everything Is Fine

The explanation is as follows. After a certain number of failures, the breaker moves to the OPEN state.

The circuit breaker being in a CLOSED state means that everything is working fine and all calls pass through to the remote services. Once the number of failures exceeds a predetermined threshold, the circuit breaker trips and enters into the open state.

OPEN State - The Breaker Has Tripped

The explanation is as follows. An error code is returned directly. After a certain period, the breaker moves to the HALF-OPEN state.

Once the number of timeouts reaches a predetermined threshold in the circuit breaker, it trips the circuit breaker to the OPEN state. In the OPEN state, the circuit breaker returns an error for all calls to the service without making the calls to the remote service.

HALF-OPEN State

The explanation is as follows. The breaker still returns an error code, but it makes trial calls to check the services.

After a certain duration, the circuit switches to a HALF-OPEN state to test if the underlying problem still exists. The circuit breaker uses a mechanism to make a trial call to the remote service periodically to check if it has recovered. If the call to the Remote service fails, the circuit breaker remains in the OPEN state. If the call returns success, then the circuit switches to the CLOSED state.
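The three states can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not how Hystrix or Resilience4j implement it: the `SimpleCircuitBreaker` class is hypothetical, it counts consecutive failures, it takes the current time as a parameter to stay deterministic, and it holds a lock during the remote call, which a production breaker would avoid.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker for illustration only. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long to stay OPEN before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    /** Runs the call, or returns the fallback immediately while the breaker is OPEN. */
    synchronized <T> T call(Supplier<T> remoteCall, T fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback; // fail fast without touching the remote service
            }
            state = State.HALF_OPEN; // wait period is over: allow one trial call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN; // trip, or re-open after a failed trial call
                openedAt = nowMillis;
            }
            return fallback;
        }
    }
}
```

A caller would typically pass `System.currentTimeMillis()` as `nowMillis`.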

State Transitions

The explanation is as follows:

There are 2 types of circuit breaker patterns, Count-based and Time-based.

1. Count-based: the circuit breaker switches from a closed state to an open state when the last N requests have failed or timeout.
2. Time-based: the circuit breaker switches from a closed state to an open state when the last N time unit has failed or timeout.

In both types of circuit breakers, we can determine what the threshold for failure or timeout is. Suppose we specify that the circuit breaker will trip and go to the Open state when 50% of the last 20 requests took more than 2s, or for a time-based, we can specify that 50% of the last 60 seconds of requests took more than 5s.
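The count-based rule (for example, "50% of the last 20 requests") can be sketched with a ring buffer of recent outcomes. The `CountBasedWindow` class is hypothetical, and treating a slow call as a failed call is an assumption made to keep the example short.

```java
/** Hypothetical count-based sliding window over the last N call outcomes. */
final class CountBasedWindow {
    private final boolean[] failed; // ring buffer: true means the call failed or was too slow
    private final int failureRatePercent;
    private int next = 0;      // slot that will be overwritten next
    private int recorded = 0;  // number of outcomes currently in the window
    private int failures = 0;  // number of failed outcomes currently in the window

    CountBasedWindow(int size, int failureRatePercent) {
        this.failed = new boolean[size];
        this.failureRatePercent = failureRatePercent;
    }

    /** Records one outcome and reports whether the breaker should trip. */
    boolean recordAndShouldTrip(boolean callFailed) {
        if (recorded == failed.length) {
            if (failed[next]) {
                failures--; // the oldest outcome leaves the window
            }
        } else {
            recorded++;
        }
        failed[next] = callFailed;
        if (callFailed) {
            failures++;
        }
        next = (next + 1) % failed.length;
        // Evaluate only once the window is full, to avoid tripping on the first few calls.
        return recorded == failed.length && failures * 100 >= failureRatePercent * recorded;
    }
}
```

A time-based window would follow the same idea but evict outcomes by age instead of by count.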

After we know how the circuit breaker works, then we will try to implement it in the spring boot project.

Circuit Breaker Libraries

- Netflix’s Hystrix

- Resilience4j @CircuitBreaker annotation

Monday, 28 December 2020

Retry Pattern for Fault Tolerance and Resiliency

Introduction

This post is part of the Some Software Solutions for Fault Tolerance and Resiliency series.

The Retry-After Parameter

In REST calls, the server can set the Retry-After header on the HTTP response. See the post on HTTP response parameters.
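On the client side, a Retry-After value can be turned into a wait duration. The sketch below assumes the two forms the header allows, a number of seconds or an HTTP date; the `RetryAfterParser` class is hypothetical, and malformed values simply raise `DateTimeParseException`.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that converts a Retry-After header value into a wait time. */
final class RetryAfterParser {

    /** Returns how long to wait from `now`; never negative. */
    static Duration parse(String headerValue, Instant now) {
        String value = headerValue.trim();
        Duration wait;
        try {
            // Form 1: delay in seconds, e.g. "120"
            wait = Duration.ofSeconds(Long.parseLong(value));
        } catch (NumberFormatException notANumber) {
            // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            ZonedDateTime retryAt =
                ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
            wait = Duration.between(now, retryAt.toInstant());
        }
        return wait.isNegative() ? Duration.ZERO : wait;
    }
}
```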

Who Should Perform the Retry

1. Each service can perform its own retries

2. There can be a shared retry service. The explanation is as follows:

... it removes the retry complexity from all microservices, and places it in a single retry microservice. The retry microservice’s job is to track and action all retries. This microservice receives an event, writing it to its own topics with both the event to retry and the timestamp to retry that event. It then pushes out these retry events once their timestamp has been reached.

Where Should Retry Data Be Stored

1. It can be stored in a queue such as Kafka or an AMQP broker

2. It can be stored in a database

Retry Details

Even if there is a fault in the system, the operation is tried again after a certain period. The points to watch with the Retry approach are:

1. Number of retries

2. Retry interval (back-off duration)

3. What to do if all attempts fail

4. Whether the retry is stateless or stateful

5. The operation invoked by the retry must be an Idempotent Receiver. See the post What Is Idempotency.

Sometimes we have to code the retry infrastructure ourselves; if we are lucky, the infrastructure we use already provides this capability, fully or partially.

Note: A post worth reading on this topic is here

Retry Storms

Suppose we have a chain of service calls

A -> B -> C -> D

and suppose D is down. This situation causes a retry storm. The explanation is as follows:

Now imagine that each service has a retry policy installed which performs up to 3 retries on failed calls (a total of 3+1 requests). Think about what happens if D goes down, and this failure is propagated all the way back up the call chain:
C calls D 4x more
B calls C 4x more
A calls B 4x more

The traffic D receives might be as much as 64 times (!!) that of normal levels. And this is happening when D is already unhealthy — likely impairing it further.

Sounds bad, but the real problem is that these traffic increases grow exponentially. If K is the number of attempts made per call at each node, then the magnitude of the increase in call volume at depth N in the call graph is K^N. This behavior is known as a retry storm, and it can compound minor outages into major cascading failures.
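The K^N figure from the quote can be checked with a few lines of arithmetic. The `RetryStorm` class is hypothetical; `attemptsPerCall` is K (retries plus the original call) and `depth` is N (the number of hops between the first caller and the failing service).

```java
/** Hypothetical helper that computes the worst-case retry amplification K^N. */
final class RetryStorm {

    static long amplification(int attemptsPerCall, int depth) {
        long factor = 1;
        for (int hop = 0; hop < depth; hop++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            factor = Math.multiplyExact(factor, (long) attemptsPerCall);
        }
        return factor;
    }
}
```

For the A -> B -> C -> D chain with 3 retries per call, `RetryStorm.amplification(4, 3)` gives 4 * 4 * 4 = 64, matching the number in the quote.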

1. Number of Retries

It is wise to set a sensible upper limit. Otherwise the operation/message is retried in an endless loop and keeps consuming resources.

2. Retry Interval

The explanation is as follows. Fixed, random, or increasing intervals can be used.

You retry few times hoping to get a response, with a fixed, random or exponential wait period in between each attempt and eventually give up to try in the next poll cycle.
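The three wait strategies mentioned in the quote can be sketched as small functions that return the delay before the next attempt. The `Backoff` class and its parameter names are hypothetical, and the cap on the exponential delay is an added assumption that keeps waits bounded.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical delay calculators for fixed, random and exponential back-off. */
final class Backoff {

    /** Fixed: the same delay before every attempt. */
    static long fixed(long delayMillis) {
        return delayMillis;
    }

    /** Random: a delay picked uniformly from [minMillis, maxMillis]. */
    static long random(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Exponential: base, 2*base, 4*base, ... for attempt 1, 2, 3, ..., capped at capMillis. */
    static long exponential(long baseMillis, int attempt, long capMillis) {
        long delay = baseMillis;
        for (int i = 1; i < attempt; i++) {
            if (delay > capMillis / 2) {
                return capMillis; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }
}
```

Random and exponential delays are often combined so that many clients do not retry at the same instant.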

Example
If we are using Java, suppose we have the following code.

ScheduledExecutorService service = Executors.newScheduledThreadPool(5);

We do the following.

service.execute(task);
...
service.schedule(task, 20, TimeUnit.SECONDS); // retry after 20 seconds

Example
If we are using Java, suppose we have the following code. We use java.util.Timer.

Timer timer = new Timer();

We do the following.

timer.schedule(task, 20); // the delay is given in milliseconds

3. What to Do If All Attempts Fail

There are really two options

1. A message can be sent to a Dead Letter Queue or a similar place.

2. The operation/message is discarded entirely

4. Whether the Retry Is Stateless or Stateful

- In stateless retries, the retry mechanism attempts the operation N times within the same thread. The thread does no other work during these N attempts.

- In stateful retries, the retry mechanism attempts the operation within the same thread. If it fails, the thread goes back to other work, so it does not block. When the retry time comes, it tries again.
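The difference between the two styles can be sketched as follows. This is one possible reading of the description: the `RetryStyles` class is hypothetical, the stateless variant sleeps on the calling thread, and the stateful variant hands each later attempt to a scheduler so the calling thread is free in between.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helpers contrasting blocking and scheduled retries. */
final class RetryStyles {

    /** Stateless: the calling thread loops and sleeps until success or the last attempt. */
    static <T> T retryBlocking(int maxAttempts, long waitMillis, Supplier<T> operation) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    sleep(waitMillis);
                }
            }
        }
        throw lastFailure;
    }

    /** Stateful: a failed attempt schedules the next one and returns without blocking. */
    static <T> CompletableFuture<T> retryScheduled(ScheduledExecutorService scheduler,
            int maxAttempts, long waitMillis, Supplier<T> operation) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(scheduler, 1, maxAttempts, waitMillis, operation, result);
        return result;
    }

    private static <T> void attempt(ScheduledExecutorService scheduler, int attemptNo,
            int maxAttempts, long waitMillis, Supplier<T> operation,
            CompletableFuture<T> result) {
        try {
            result.complete(operation.get());
        } catch (RuntimeException e) {
            if (attemptNo >= maxAttempts) {
                result.completeExceptionally(e); // give up after the last attempt
            } else {
                // The attempt number travels with the scheduled task: this is the retry state.
                scheduler.schedule(
                    () -> attempt(scheduler, attemptNo + 1, maxAttempts, waitMillis,
                        operation, result),
                    waitMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```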

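Example - Resilience4j

We do the following:

import static java.time.temporal.ChronoUnit.SECONDS;

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.retry.RetryRegistry;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Assumes companyRepository.findAllByName returns List<Company>
public List<CompanyDto> searchCompanyByName(String name) {
  // At most 4 attempts in total, waiting 2 seconds between attempts
  RetryConfig retryConfig =
    RetryConfig.custom().maxAttempts(4).waitDuration(Duration.of(2, SECONDS)).build();

  RetryRegistry retryRegistry = RetryRegistry.of(retryConfig);

  Retry retryConfiguration = retryRegistry.retry("companySearchService", retryConfig);

  Supplier<List<Company>> companiesSupplier = () -> companyRepository.findAllByName(name);

  // Wrap the repository call so that a failure triggers a retry
  Supplier<List<Company>> retryingCompaniesSearch =
    Retry.decorateSupplier(retryConfiguration, companiesSupplier);

  List<CompanyDto> companyDtos = new ArrayList<>();
  List<Company> companies = retryingCompaniesSearch.get();

  for (Company company : companies) {
    CompanyDto companyDto = new CompanyDto(company.getName(), company.getType(),
      company.getCity(), company.getState(), company.getDescription());
    companyDtos.add(companyDto);
  }

  return companyDtos;
}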

The explanation is as follows:

While using resilience4j-retry library, you can register a custom global RetryConfig with a RetryRegistry builder. Use this registry to build a Retry.

In the above method, we first create RetryConfig. We create a RetryRegistry and add RetryConfig in this registry. Then when we create our call to fetch a list of companies. We decorate this call with retryConfiguration.

RetryConfig is described as follows:

Customizations with Resilience4j-Retry
RetryConfig offers different customization:
1. maxAttempts — 3 is the default number of attempts for retries.
2. waitDuration — a fixed wait duration between each retry attempt.
3. intervalFunction — a function to modify the waiting interval after a failure.
4. retryOnResultPredicate — configures a predicate that evaluates if a result should be retried.
5. retryExceptions — Configures a list of throwable classes that are used for retrying
6. ignoreExceptions — Configures a list of throwable classes that are ignored
7. failAfterMaxRetries — A boolean to enable or disable throwing of MaxRetriesExceededException when the Retry has reached the configured maxAttempts

Retry Design Pattern vs Circuit Breaker Pattern

The explanation is as follows. If the Circuit Breaker detects that a system is down, it returns a response directly without entering the retry process. This makes it somewhat different from the Retry pattern.

I would like to mention, a subtle difference with "Circuit Breaker" pattern, which is actually one level up. It has an implicit Retry but also prevents further communication until the remote service is available and responds with a test call. This strategy is recommended when you expect services to be unavailable for a longer duration whereas "Retry" is recommended for transient failures (short duration or temporary failures).

Operations That Need Retry by Nature

Some systems have retry in their nature. If the remote system returns no response, or if the underlying infrastructure or communication network is unreliable, a retry-like solution inevitably ends up being used in the system.

Example

In one system, messages could get lost because UDP was used. A solution was developed in the "Transmission Queue" component in which each message is repeated N times with the same primary key. In effect, this implemented a form of the Retry pattern.
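A rough sketch of the idea: the sender queues the same message id N times, and the receiver keeps only the first copy that arrives. The receiver-side filter is an assumption, since the original only describes the sending side, and the `RedundantTransmission` class is hypothetical. The set of seen ids grows without bound here; a real system would expire old ids.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of N-fold transmission with duplicate suppression. */
final class RedundantTransmission {

    /** Sender side: the same message id is queued `copies` times. */
    static List<Long> expand(long messageId, int copies) {
        List<Long> sends = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            sends.add(messageId);
        }
        return sends;
    }

    /** Receiver side (assumed): ignores ids that were already processed. */
    static final class DuplicateFilter {
        private final Set<Long> seenIds = new HashSet<>();

        /** Returns true only the first time a given id arrives. */
        boolean firstTime(long messageId) {
            return seenIds.add(messageId);
        }
    }
}
```

Because every copy carries the same key, processing stays idempotent even when more than one copy gets through.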

Thursday, 14 May 2020

Some Software Solutions for Fault Tolerance and Resiliency (Design Pattern)

Introduction
The concepts in this post can be used for
1. Fault Tolerance
2. Resiliency

1. Cache Pattern

The explanation is as follows. To avoid slowing down an already slow system even further, some responses can be cached.

Slow backend systems are often misunderstood beasts, a bit like Ogres. The reason for their slowness or unavailability is often down to their business criticalness. These systems are bombarded with requests more than they are provisioned to handle, resulting in these systems being slow or unavailable. The solutions I’ve proposed here are based on two factors.

- One, reduce the number of requests hitting the backend.
- Two, avoid loading the backend service when it’s already overloaded (avoid peak times).

Although you might not be able to get rid of the beast, you might learn to live with it.

When adding a cache to a backend system, several different flows are possible.

1.1 The Requester Knows Only the Cache

The explanation is as follows.

The first option would be the simplest and has no dependency on the backend service. When the cache node is requested for information it does not have, it will fetch it from the slow backend, update the cache and return it to the requester.
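This first flow can be sketched as a read-through cache. The `ReadThroughCache` class is hypothetical and the slow backend is modeled as a plain function; expiry and invalidation are left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical read-through cache: callers only ever talk to the cache. */
final class ReadThroughCache<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> slowBackend;

    ReadThroughCache(Function<K, V> slowBackend) {
        this.slowBackend = slowBackend;
    }

    /** On a miss, fetches from the backend, stores the value, and returns it. */
    V get(K key) {
        return entries.computeIfAbsent(key, slowBackend);
    }
}
```

Repeated requests for the same key reach the backend only once, which is the reduction in backend load the quote is after.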

1.2 The Requester Knows Both the Cache and the Backend

In this flow, if the requester cannot find the answer in the cache, it asks the backend.

1.3 A Point to Watch in Both Flows

The explanation is as follows. In other words, there must be a link between the backend and the cache.

... the cache gets updated whenever the backend data gets updated. This requires an integration with the backend service, and the backend service should be equipped to either call an API on the cache node or should have logs that could be processed to extract the required information.

2. Retry Design Pattern
Moved to the Retry Design Pattern post

3. Timeout

Moved to the Timeout Pattern post

4. Load Balancing and Failover

Moved to the High Availability post

5. Circuit Breaker Pattern
Moved to the Circuit Breaker Pattern post

6. Dead Letter Channel

Moved to the Dead-Letter Channel Pattern post.

7. Saga

Moved to the Saga Pattern post.

8. Bulkhead

A ship is divided into compartments, and the walls between them are called bulkheads. Thanks to these compartments, the ship does not sink even if one compartment is breached.

An example that carries this idea from ships into software is as follows. Service A offers some APIs that work on their own and some that use B. If B slows down, A as a whole should not slow down. For this reason, the resources allocated to B are limited.

let's assume that there are 2 services A and B. Some of the APIs of A depends on B. For some reason, B is very slow. So, When we get multiple concurrent requests to A which depends on B, A’s performance will also get affected. It could block A’s threads. Due to that A might not be able to serve other requests which do NOT depend on B. So, the idea here is to isolate resources / allocate some threads in A for B. So that We do not consume all the threads of A and prevent A from hanging for all the requests!
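The isolation idea in the quote can be sketched with a semaphore that caps how many calls to B may be in flight at once. This is a simplified stand-in for what Hystrix or Resilience4j provide; the `SemaphoreBulkhead` class is hypothetical, and rejecting with a fallback instead of waiting is a design assumption.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead: limits concurrent calls to one dependency. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback at once. */
    <T> T call(Supplier<T> callToB, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // compartment is full: do not tie up another thread of A
        }
        try {
            return callToB.get();
        } finally {
            permits.release();
        }
    }
}
```

Requests to A that do not depend on B never touch this semaphore, so they keep being served even when every permit is held by a slow call to B.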

For bulkheads, the Resilience4j @Bulkhead annotation or Hystrix can be used.

Example

An example using Resilience4j is here

9. Rate Limiter

Moved to the Resilience4j @RateLimiter annotation post

Wednesday, 11 July 2018

Robustness

What Is Robustness?
The explanation is as follows. Robustness means that an application or method can cope with faulty input and keep working.

Robustness denotes the degree to which a system is able to withstand an unexpected internal or external event or change without degradation in system's performance. To put it differently, assuming two systems - A and B - of equal performance, the robustness of system A is greater than that of system B if the same unexpected impact on both systems leaves system A with greater performance than system B.

We stress the word unexpected because the concept of robustness focuses specifically on performance not only under ordinary, anticipated conditions (which a well designed system should be prepared to withstand) but also under unusual conditions that stress its designers' assumptions.

Another explanation is as follows.

According to Dr. Woods, "robustness" is the word we most commonly use interchangeably with "resilience." Robustness is about more than our system's ability to rebound from a specific shock; specifically, it's about how our systems absorb a wider and wider range of disturbances without breaking. So when we ask if our systems are "resilient" to particular kinds of failure, we're really asking if they're "robust" enough to handle those failures.

Think of a bridge. A good bridge can withstand stress from wind and weather; it can support constant traffic; it can lose a certain number of supports and still maintain its integrity, say, across a canyon. The builders of the bridge anticipated any number of disturbances and then engineered the system to be able to withstand those disturbances within certain tolerances.

1. Method-Level Robustness

It generally takes one of two forms

1. Errors are absorbed and a default behavior is applied

2. An exception is thrown for the faulty condition and execution does not continue.
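The two forms can be contrasted on a small parsing method. The `PortParser` class and the port-number scenario are hypothetical, chosen only because they make faulty input easy to show.

```java
/** Hypothetical example contrasting the two robustness styles. */
final class PortParser {

    /** Style 2: reject faulty input with an exception and do not continue. */
    static int parseStrict(String text) {
        if (text == null) {
            throw new IllegalArgumentException("port is required");
        }
        int port;
        try {
            port = Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a number: " + text, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return port;
    }

    /** Style 1: absorb the error and fall back to a default value. */
    static int parseOrDefault(String text, int defaultPort) {
        try {
            return parseStrict(text);
        } catch (IllegalArgumentException e) {
            return defaultPort; // faulty input is swallowed on purpose
        }
    }
}
```

Which style fits depends on whether a silent default is safe for the caller; the second style makes the problem visible immediately.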

2. Tests for Methods

Classic robustness test inputs include faulty values such as null pointers and out-of-range values.

For messaging, a failed CRC check can serve as an example.

Pushing capacity limits and attempting disallowed state transitions can also be tested.

3. Fail Fast Robustness

Moved to the Robustness and Fail Fast post

4. System-Level Robustness

Here robustness is considered at the system level rather than at the level of methods in the code.

Example - Global Exception Handler

If an exception was never caught, a global exception handler can display it properly, and it can even be extended to send/share the error automatically.
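In plain Java, one way to get such a handler is `Thread.setDefaultUncaughtExceptionHandler`. The sketch below is only an illustration: the `GlobalErrorHandler` class is hypothetical, and the `reported` list stands in for whatever reporting or sharing channel an application would really use.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical global handler for exceptions that nothing else caught. */
final class GlobalErrorHandler {
    // Assumption: stands in for a real reporting channel (log file, e-mail, issue tracker).
    static final List<String> reported = new CopyOnWriteArrayList<>();

    /** Installs a JVM-wide handler for uncaught exceptions on any thread. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            String summary = thread.getName() + ": " + error;
            reported.add(summary);                              // "send/share" step
            System.err.println("Unhandled error in " + summary); // "display" step
        });
    }
}
```

Calling `GlobalErrorHandler.install()` once at startup is enough; frameworks usually offer their own equivalent hook for request-level errors.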