Resiliency in Microservices

In today’s distributed architectures, microservices have become the backbone of modern applications, enabling scalability, flexibility, and rapid deployment. However, with this shift comes the inevitable challenge of handling system failures gracefully.

Resiliency in microservices is all about designing systems that can withstand unexpected disruptions, degrade gracefully, and recover quickly without impacting user experience.

By implementing resilient patterns and strategies, organizations can ensure their services remain stable and reliable, even in the face of unpredictable conditions or failures in dependent services.

There are several approaches to building resilient applications. In the Java ecosystem, Netflix’s Hystrix was once the go-to library for implementing such patterns. However, Hystrix was moved to maintenance mode in 2018 and is no longer actively developed. Enter Resilience4J, which has swiftly gained popularity as a modern alternative to Hystrix. Resilience4J offers a rich feature set for building resilient applications, making it the preferred choice for Java developers today. Its core patterns include:

  • Circuit Breaker

  • Fallback

  • Retry

  • Rate Limiter

Avoid Failures, Implement Fallbacks, Enable Self-Healing

How do we avoid cascading failures?

A failure in one service should not trigger a cascading effect across other microservices. In scenarios where multiple microservices are interconnected, it is essential to ensure that the failure of a single service does not compromise the entire system.

The goal is to isolate failures and prevent a breakdown in one service from disrupting the functionality of others, ensuring that the overall ecosystem remains resilient and operational despite individual service outages. A circuit breaker is an effective tool in achieving this isolation.

How do we handle failures gracefully with fallbacks?

In a network of interconnected microservices, implementing a fallback mechanism is essential to maintain system stability when one service fails. This can be achieved by returning default values, serving data from a cache, or redirecting the request to another service or database to retrieve the necessary information.

These fallback strategies ensure that even in the event of a service outage, the system continues to function smoothly, providing users with a seamless experience while minimizing the impact of failures.
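
As a simple illustration, a fallback can be as small as catching the failure and serving cached or default data. The sketch below is a minimal, framework-free example; the class, method, and default payload are hypothetical and only illustrate the idea.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal sketch of a fallback strategy: try the live call, fall back to cached data,
// and finally to a default value. All names here are illustrative.
public class FallbackExample {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String getUserDetails(String userId, Supplier<String> remoteCall) {
        try {
            String fresh = remoteCall.get();   // call the downstream service
            cache.put(userId, fresh);          // keep the cache warm for future failures
            return fresh;
        } catch (RuntimeException ex) {
            // Fallback: serve stale-but-usable data, or a default if nothing is cached.
            return cache.getOrDefault(userId, "{\"status\":\"details temporarily unavailable\"}");
        }
    }
}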

How do we make our services capable of self-healing?

It’s crucial to configure timeouts, retries, and implement mechanisms that allow a failed service to recover gracefully. Timeouts ensure that requests to sluggish services don’t hang indefinitely, preserving system responsiveness. Retries can be employed strategically, giving services a chance to recover by attempting the request again after a brief delay.

Additionally, employing techniques like exponential backoff or circuit breakers allows the system to temporarily halt further requests, giving the underperforming service time to recover, while maintaining the overall system’s stability and reliability.
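
Even without a dedicated library, a timeout combined with a simple exponential-backoff retry loop gives a caller basic self-healing behaviour. The JDK-only sketch below is illustrative (the URL, timeouts, and attempt counts are placeholder values); Resilience4j packages the same ideas as ready-made, configurable components.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Minimal sketch of a timeout plus retry with exponential backoff, using only the JDK.
public class SelfHealingCallExample {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))       // don't hang while connecting
            .build();

    public static String callWithRetry(String url, int maxAttempts) throws InterruptedException {
        long backoffMillis = 200;                        // initial delay between attempts
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                        .timeout(Duration.ofSeconds(2))  // per-request timeout
                        .GET()
                        .build();
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception ex) {
                if (attempt == maxAttempts) {
                    throw new IllegalStateException("Service did not recover after " + maxAttempts + " attempts", ex);
                }
                Thread.sleep(backoffMillis);             // give the service time to recover
                backoffMillis *= 2;                      // exponential backoff: 200ms, 400ms, 800ms, ...
            }
        }
        throw new IllegalStateException("unreachable");
    }
}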

Resilience4j

Resilience4j is a lightweight fault tolerance library designed for functional programming. It offers higher-order functions, known as decorators, that can seamlessly wrap any functional interface, lambda expression, or method reference with features like Circuit Breaker, Rate Limiter, Retry, or Bulkhead.

One of its key strengths is the flexibility it provides—you can apply multiple decorators to a single functional interface or lambda expression, tailoring the fault-tolerance mechanisms precisely to your needs without unnecessary overhead. This modular approach ensures you only use the components required, keeping the system streamlined and efficient.
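
For example, a minimal sketch of this decorator style, using Resilience4j's Decorators helper with placeholder names and default configurations, might look like the following.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

// Minimal sketch: wrap one lambda with both a retry and a circuit breaker.
public class DecoratorExample {

    public static void main(String[] args) {
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("userService");
        Retry retry = Retry.ofDefaults("userService");

        Supplier<String> remoteCall = () -> "response from user service";   // stands in for a real remote call

        Supplier<String> resilientCall = Decorators.ofSupplier(remoteCall)
                .withRetry(retry)                   // retry transient failures first
                .withCircuitBreaker(circuitBreaker) // then guard the call with a circuit breaker
                .decorate();

        System.out.println(resilientCall.get());
    }
}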

Suppose we have an API designed to return complete user details, with three services (say, Service A, Service B, and Service C) each providing part of the user’s information. These services are interdependent, meaning each service relies on the others to collate and return a complete response.

For example, if Service C fails to respond or experiences significant delays, this failure propagates back to Service B, which depends on Service C for data. As Service B waits or fails, the issue cascades further, eventually impacting the gateway server as well.

In this scenario, a problem with one microservice causes a ripple effect across the entire system. To mitigate this, the Circuit Breaker pattern can be employed.

In short, when a microservice becomes slow or unresponsive, it can consume critical resources like threads on both edge servers and intermediate services, ultimately degrading the overall performance of the microservice architecture. To prevent cascading failures and maintain system resilience, the Circuit Breaker pattern is an effective solution.

Circuit Breaker

The Circuit Breaker pattern is a fault-tolerance mechanism used to prevent cascading failures in a microservices architecture. It works by monitoring the interaction between services and temporarily halting requests to a failing service to avoid further strain on the system.

  1. Closed State (Normal Operation)

    Initially, the circuit breaker is in a closed state, meaning requests to the service proceed as usual. The system tracks failures and successes. If failures occur but are within a defined threshold, the requests continue.

  2. Open State (Failure Detected)

    When the number of failures exceeds the predefined threshold, the circuit breaker opens, and no further requests are sent to the failing service. Instead, an immediate failure is returned, allowing the service to recover without additional load.

  3. Half-Open State (Recovery Mode)

    After a predefined timeout, the circuit breaker moves to a half-open state, allowing a limited number of test requests to flow through. If the service responds successfully, the circuit breaker closes again, resuming normal operation. If failures persist, the circuit breaker reopens.

If requests to a service take too long, the circuit breaker will intervene and terminate the call. The circuit breaker continuously monitors calls to a remote service, and if a sufficient number of failures occur, it will “trip” into an open state, preventing further requests to the failing service and allowing the system to fail fast. This mechanism not only helps conserve resources but also prevents overwhelming the problematic service.

Additionally, the circuit breaker pattern allows the application to periodically check whether the issue has been resolved. If the problem seems fixed, the circuit breaker will allow a limited number of test calls to determine if normal operations can resume.
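
To make the state machine concrete, here is a minimal programmatic Resilience4j sketch; the name and threshold values are illustrative and mirror the application.yml settings shown later in this article.

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

// Minimal sketch of a circuit breaker whose settings map to the three states above.
public class CircuitBreakerExample {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowSize(10)                           // observe the last 10 calls (closed state)
                .failureRateThreshold(50)                        // trip to open at 50% failures or more
                .waitDurationInOpenState(Duration.ofSeconds(10)) // stay open for 10s before probing
                .permittedNumberOfCallsInHalfOpenState(2)        // allow 2 trial calls in half-open
                .build();

        CircuitBreaker circuitBreaker = CircuitBreaker.of("userCircuitBreaker", config);

        Supplier<String> protectedCall =
                CircuitBreaker.decorateSupplier(circuitBreaker, () -> "user details");

        System.out.println(protectedCall.get());
        System.out.println("State: " + circuitBreaker.getState());   // CLOSED, OPEN, or HALF_OPEN
    }
}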

Advantages of the Circuit Breaker pattern:

  1. Fail fast

  2. Fail gracefully

  3. Recover seamlessly

Steps to build a circuit breaker pattern using a Spring Cloud Gateway filter:

1. Add Maven Dependency: Add the Maven dependency for the circuit breaker:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-circuitbreaker-reactor-resilience4j</artifactId>
</dependency>

2. Add Circuit Breaker Filter: In the method where a RouteLocator bean is created, add a filter for the circuit breaker and create a REST API to handle the fallback (e.g., /userSupportFallback; a sketch of this endpoint follows the route configuration below):

@Bean
public RouteLocator myCustomRoutes(RouteLocatorBuilder myRouteBuilder) {
    return myRouteBuilder.routes()
        .route(myRoute -> myRoute.path("/mybank/users/**")
            .filters(myFilter -> myFilter.rewritePath("/mybank/users/(?<id>.*)", "/${id}")
                .addResponseHeader("X-User-Response", new Date().toString())
                .circuitBreaker(myCircuit -> myCircuit.setName("userCircuitBreaker")
                    .setFallbackUri("forward:/userSupportFallback")))
            .uri("lb://USER-SERVICE"))
        .build();
}
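
The fallbackUri above points to an endpoint that must exist in the gateway application. A minimal sketch of such a fallback controller (the class name and response message are illustrative) could look like this:

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

// Minimal sketch of the endpoint the circuit breaker forwards to when the route is short-circuited.
@RestController
public class FallbackController {

    @RequestMapping("/userSupportFallback")
    public Mono<String> userSupportFallback() {
        return Mono.just("An error occurred. Please try again after some time or contact the support team.");
    }
}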

3. Add Properties: Add the following properties in application.yml for configuration:

resilience4j:
  circuitbreaker:
    configs:
      default:
        slidingWindowSize: 10
        permittedNumberOfCallsInHalfOpenState: 2
        failureRateThreshold: 50
        waitDurationInOpenState: 10000

  1. slidingWindowSize: The sliding window size determines how many calls the circuit breaker will observe to measure the success or failure rate. In this case, it will track the last 10 calls. This window is used to calculate the failure rate and decide whether to open or close the circuit.

  2. permittedNumberOfCallsInHalfOpenState: When the circuit breaker transitions from open to half-open (i.e., it’s checking if the service has recovered), this setting defines how many test calls are allowed through. If these 2 calls succeed, the circuit will fully close. If they fail, the circuit will re-open.

  3. failureRateThreshold: This defines the threshold percentage of failures allowed within the sliding window before the circuit breaker opens. In this case, if 50% or more of the calls in the sliding window fail, the circuit breaker will trip into the open state, blocking further requests to the service until it recovers.

  4. waitDurationInOpenState: This sets the duration (in milliseconds) that the circuit breaker will remain in the open state before transitioning to half-open. In this case, after 10,000 milliseconds (10 seconds), the circuit breaker will move to the half-open state, allowing a few test calls to determine if the service has recovered.

Steps to build a circuit breaker pattern using a normal Spring Boot service:

1. Add Maven Dependency: Add the following Maven dependency to your pom.xml file.

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
</dependency>

2. Add Circuit Breaker Related Changes in Feign Client Interfaces:

  1. Feign Client Interface

     @FeignClient(name = "services", fallback = ServicesFallback.class)
     public interface ServicesFeignClient {
    
         @GetMapping(value = "/api/details", consumes = "application/json")
         public ResponseEntity<ServiceDto> getServiceDetails(
             @RequestHeader("myapp-request-id") String requestId,
             @RequestParam String userNumber);
     }
    
  2. Fallback Class

     @Component
     public class ServicesFallback implements ServicesFeignClient {
    
         @Override
         public ResponseEntity<ServiceDto> getServiceDetails(String requestId, String userNumber) {
             // Return a safe default response instead of null so callers do not need null checks.
             return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).build();
         }
     }
    
  3. Add properties

     spring:
       cloud:
         openfeign:
           circuitbreaker:
             enabled: true
    
     resilience4j:
       circuitbreaker:
         configs:
           default:
             slidingWindowSize: 5
             failureRateThreshold: 50
             waitDurationInOpenState: 10000
             permittedNumberOfCallsInHalfOpenState: 2
    

Retry Pattern

The Retry pattern makes a configured number of retry attempts when a call to a service fails temporarily. It is especially helpful in scenarios such as transient network disruptions, where a client request may succeed after a retry attempt.

Retry Logic: Determine when and how many times to retry an operation. This can be based on factors such as error codes, exceptions, or response status.

Backoff Strategy: Define a strategy for delaying retries to avoid overwhelming the system or exacerbating the underlying issue. This strategy can involve gradually increasing the delay between each retry, known as exponential backoff.

Circuit Breaker Integration: Consider combining the Retry pattern with the Circuit Breaker pattern. If a certain number of retries fail consecutively, the circuit can be opened to prevent further attempts and preserve system resources.

Idempotent Operations: Ensure that the retried operation is idempotent, meaning it produces the same result regardless of how many times it is invoked. This prevents unintended side effects or duplicate operations.

Steps to build a Retry Pattern using Spring Cloud Gateway filter:

1. Add Retry Filter: Inside the method where the RouteLocator bean is created, add a retry filter as shown below:

@Bean
public RouteLocator customRoutes(RouteLocatorBuilder routeBuilder) {
    return routeBuilder.routes()
        .route(routeSpec -> routeSpec.path("/myapp/services/**")
            .filters(filterSpec -> filterSpec.rewritePath("/myapp/services/(?<id>.*)", "/${id}")
                .addResponseHeader("X-Service-Time", new Date().toString())
                .retry(retrySpec -> retrySpec.setRetries(4)
                    .setMethods(HttpMethod.POST)
                    .setBackoff(Duration.ofMillis(200), Duration.ofMillis(2000), 3, true)))
            .uri("lb://SERVICE"))
        .build();
}
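
Here setRetries(4) allows up to four retry attempts, and setBackoff(firstBackoff, maxBackoff, factor, basedOnPreviousValue) grows the delay between attempts from 200 ms up to a cap of 2000 ms using a factor of 3. In line with the idempotency guidance above, retries are usually restricted to safe methods such as GET, so adjust setMethods to match the semantics of your route.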

Steps to build a Retry Pattern using a normal Spring Boot service:

1. Add Retry Pattern Annotations: Choose a method and annotate it with @Retry along with the configuration below. Then create a fallback method with the same signature plus a trailing Throwable parameter.

@Retry(name = "fetchServiceData", fallbackMethod = "fetchServiceDataFallback")
@GetMapping("/service-data")
public ResponseEntity<String> fetchServiceData() {
    // Your custom logic here (placeholder response shown so the snippet compiles)
    return ResponseEntity.ok("service data");
}

private ResponseEntity<String> fetchServiceDataFallback(Throwable ex) {
    // Fallback logic
    return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Service is currently down. Please try again later.");
}
2. Add Properties: Add the following properties in the application.yml file:

resilience4j:
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 500
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - java.lang.NullPointerException

  1. maxAttempts: This sets the maximum number of attempts for the operation, including the initial call. In this case, the system will try up to 3 times before giving up.

  2. waitDuration: This specifies the time (in milliseconds) the system will wait between retry attempts. Here, it waits 500 milliseconds between each retry.

  3. enableExponentialBackoff: This enables exponential backoff, a retry strategy where the wait time between retries increases exponentially (rather than being constant). This reduces the chance of overwhelming the service with frequent retry attempts.

  4. exponentialBackoffMultiplier: When exponential backoff is enabled, this multiplier defines the rate at which the wait duration increases. In this case, the wait time will double with each retry (e.g., 500ms, 1000ms, 2000ms).

  5. retryExceptions: This section lists the exceptions that will trigger a retry. Here, the system will retry when a java.util.concurrent.TimeoutException occurs.

  6. ignoreExceptions: This lists the exceptions that should not trigger a retry. For example, if a java.lang.NullPointerException is thrown, the system will not retry the operation.

Rate Limiter Pattern

The Rate Limiter pattern in microservices regulates the flow of incoming requests to a service or API, ensuring fair usage, preventing abuse, and safeguarding system resources. In a microservices setup, where services communicate by sending requests, unchecked requests can cause issues like performance slowdowns, resource depletion, or even denial-of-service (DoS) attacks. The Rate Limiter pattern helps manage this by setting boundaries on request rates.

By applying this pattern, microservices are shielded from being overwhelmed by a flood of requests, whether intentional or not. It helps maintain system stability, performance, and availability while offering controlled access to resources. Rate limits ensure a fair experience for both providers and users.

When a user exceeds the allowed number of requests within a specific time, any further requests are blocked with an HTTP 429 - Too Many Requests response. Limits can be set based on criteria like session, IP address, user, or tenant. This pattern also supports resilience by prioritizing system availability during peak times or disruptions. Additionally, rate limits can vary based on service levels, with distinct rules for basic, premium, or enterprise users.

Steps to build a rate limiter pattern using Spring Cloud Gateway filter:

  1. Add Maven Dependency: Add the following Maven dependency in your pom.xml file to use Redis as the rate limiter storage. Also, ensure a Redis container is running, and mention Redis connection details inside the application.yml file.

     <dependency>
         <groupId>org.springframework.boot</groupId>
         <artifactId>spring-boot-starter-data-redis-reactive</artifactId>
     </dependency>
    
  2. Add Rate Limiter Filter: Inside the method where we are creating a bean of RouteLocator, add a filter for the rate limiter as shown below, and create supporting beans of RedisRateLimiter and KeyResolver.

    @Bean
    public RouteLocator customRoutes(RouteLocatorBuilder routeBuilder) {
        return routeBuilder.routes()
            .route(r -> r.path("/myapp/services/**")
                .filters(fltr -> fltr.rewritePath("/myapp/services/(?<id>.*)", "/${id}")
                    .addResponseHeader("X-Request-Timestamp", new Date().toString())
                    .requestRateLimiter(rlConfig -> rlConfig
                        .setRateLimiter(redisRequestLimiter())
                        .setKeyResolver(customKeyResolver())))
                .uri("lb://SERVICES"))
            .build();
    }

    @Bean
    public RedisRateLimiter redisRequestLimiter() {
        return new RedisRateLimiter(2, 2);
    }

    @Bean
    public KeyResolver customKeyResolver() {
        return exchange -> Mono.justOrEmpty(exchange.getRequest().getHeaders().getFirst("client-id"))
            .defaultIfEmpty("guest");
    }
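
    In this example, new RedisRateLimiter(2, 2) sets a replenishRate of 2 (tokens added to the bucket per second) and a burstCapacity of 2 (the maximum number of requests allowed in a single second), while the KeyResolver rate-limits per client-id header value, defaulting to "guest" when the header is absent.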

Steps to build a rate limiter pattern using a normal Spring Boot service:

  1. Add Rate Limiter Pattern Annotations: Choose a method and annotate it with the @RateLimiter annotation, including the fallback method for handling cases when the rate limit is exceeded. The fallback method should have the same signature as the original method.

     @RateLimiter(name = "getJavaVersion", fallbackMethod = "getJavaVersionFallback")
     @GetMapping("/java-version")
     public ResponseEntity<String> getJavaVersion() {
         // Logic for getting the Java version (placeholder shown so the snippet compiles)
         return ResponseEntity.ok(System.getProperty("java.version"));
     }
    
     private ResponseEntity<String> getJavaVersionFallback(Throwable t) {
         // Fallback logic when rate limit is exceeded
         return ResponseEntity.status(HttpStatus.TOO_MANY_REQUESTS).body("Rate limit exceeded. Please try again later.");
     }
    
  2. Add Properties: Add the following properties to the application.yml file for configuring the rate limiter.

     resilience4j:
       ratelimiter:
         configs:
           default:
             timeoutDuration: 5000
             limitRefreshPeriod: 5000
             limitForPeriod: 1
    
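  1. timeoutDuration: The maximum time a calling thread waits to acquire permission before the call is rejected and the fallback is invoked. Here it waits up to 5000 milliseconds (5 seconds).

  2. limitRefreshPeriod: The length of each refresh window, after which the available permissions are reset. Here the limit refreshes every 5000 milliseconds.

  3. limitForPeriod: The number of calls permitted within one refresh period. Here only 1 call is allowed per 5-second window.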

Conclusion

Building resilient microservices is essential to ensuring the stability and reliability of distributed systems, especially in environments prone to failures and unpredictable conditions.

By implementing patterns like Circuit Breaker, Retry, Rate Limiter, and timeouts, we can prevent cascading failures, manage load effectively, and keep services operational even in the face of disruptions.

As microservices continue to evolve and scale, prioritizing resiliency is key to delivering a seamless user experience, maintaining system performance, and safeguarding against the vulnerabilities inherent in distributed architectures. With the right strategies in place, you can design a microservices ecosystem that not only handles failures gracefully but also thrives in complex, dynamic environments.

Resources

Resilience4j