Better Engineers

Designing Reliable Communication in Micro-services

Better Engineering
Mar 23, 2025

System Design Refresher:

  1. Rate Limiting Algorithms Explained with Code

  2. System Design of Reddit/Quora

  3. Stateful vs Stateless Architecture

  4. Best Practices for Developing Microservices

  5. 10 Problems of Distributed Systems

  6. 20 System Design Concepts Every Developer Should Know - Part - I

  7. How Shopify handles 16,000 Request per second

  8. Software Architecture Pattern - Layered Architecture

  9. How Enterprise Applications Exchange Data Using Messaging

  10. Microservices Design Pattern - Event Sourcing Pattern

  11. Improve API Performance 🚀

  12. Distributed System Learning Roadmap


If failure is inevitable, you need to design and build your services to maximize availability, correct operation, and rapid recovery when failure does occur. This is fundamental to achieving resiliency. In this section, we’ll explore several techniques for ensuring that services keep behaving correctly, as far as possible, when their collaborators are unavailable:

  • Retries

  • Fallbacks, caching, and graceful degradation

  • Timeouts and deadlines

  • Circuit breakers

  • Communication brokers

Retries

If a service fails to retrieve data and returns an error, the calling service won't immediately know whether the failure is:

  • Isolated → A one-time issue, where retrying the call might succeed.

  • Systemic → A larger problem, where retrying will likely fail again.

Because fetching data doesn't change anything on the server (it is an idempotent operation), the call is safe to retry without causing unwanted side effects. The client below wraps the request to the market-data service in a Resilience4j retry:

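import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class MarketDataClient {

    private static final Logger logger = LoggerFactory.getLogger(MarketDataClient.class);
    private static final String BASE_URL = "http://market-data:8000";

    private final HttpClient httpClient;

    public MarketDataClient() {
        httpClient = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // Sends a GET request to the given path and parses the JSON response body.
    private JSONObject makeRequest(String path) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/" + path))
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());

        // HttpClient does not throw on error statuses, so surface 5xx responses
        // as exceptions; otherwise the retry would never be triggered.
        if (response.statusCode() >= 500) {
            throw new IOException("market-data returned HTTP " + response.statusCode());
        }

        return new JSONObject(response.body());
    }

    public JSONObject getAllPrices() throws Exception {
        // At most 3 attempts in total (the first call plus 2 retries), 1 second apart.
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofMillis(1000))
                .build();

        Retry retry = Retry.of("marketDataRetry", config);

        // executeCallable runs the call and repeats it whenever it throws,
        // rethrowing the last exception once the attempts are used up.
        return retry.executeCallable(() -> {
            logger.debug("Attempting to fetch prices...");
            return makeRequest("prices");
        });
    }

    public static void main(String[] args) {
        MarketDataClient client = new MarketDataClient();
        try {
            JSONObject prices = client.getAllPrices();
            System.out.println("Prices: " + prices.toString(2));
        } catch (Exception e) {
            logger.error("Failed to fetch prices after retries", e);
        }
    }
}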

If service B (here, market-data) fails and returns a 500 error, service A makes up to three attempts to fetch prices before giving up.

At this point, it’s hard to tell whether the failure is:

  • Temporary (isolated) → A minor glitch that could succeed on retry.

  • Persistent (systemic) → An ongoing issue, where retrying won’t help and may make things worse.

✅ If the failure is temporary, retrying is a good option. It reduces the impact on end users and avoids the need for manual intervention by the operations team.
⚠️ If the failure is persistent, retrying too many times can overload the market-data service. For example:

  • If you retry each failed request five times, every failing request turns into up to six calls (the original plus five retries), so the load on the struggling service can grow sixfold.

  • This creates extra load on the service, making the problem worse and slowing everything down.
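
The multiplication described above can be made concrete with a little arithmetic. The sketch below is only an illustration: the class name `RetryAmplification` and the figure of 1,000 incoming requests are assumptions made for the example, not numbers from a real system.

```java
// Hypothetical helper, named for this example only.
final class RetryAmplification {

    /**
     * Worst-case number of calls sent to a dependency when every attempt fails:
     * each incoming request produces the original call plus maxRetries retries.
     */
    static long worstCaseCalls(long incomingRequests, int maxRetries) {
        if (incomingRequests < 0 || maxRetries < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        return incomingRequests * (1L + maxRetries);
    }

    public static void main(String[] args) {
        // 1,000 failing requests, each retried five times.
        System.out.println(worstCaseCalls(1_000, 5)); // prints 6000
    }
}
```

With no retries the dependency sees 1,000 calls. With five retries per request, the same outage produces up to 6,000 calls against a service that is already struggling.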

💡 The key is to balance your retry budget. Every retry adds its own waiting time to the caller's latency, so if retries take too long, the calling service's response time becomes unreasonably high and its own callers suffer.
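
One possible reading of a retry budget is a time limit on the whole operation, not only a cap on the number of attempts. The sketch below assumes that reading. `BudgetedRetry` and its parameters are hypothetical names written in plain Java for illustration; this is not the Resilience4j API.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, named for this example only.
final class BudgetedRetry {

    /**
     * Calls the operation up to maxAttempts times, waiting `wait` between attempts,
     * but stops retrying once another wait would exceed the overall time budget.
     */
    static <T> T call(Callable<T> operation, int maxAttempts, Duration wait, Duration budget)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long deadline = System.nanoTime() + budget.toNanos();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                throw e; // never retry an interrupted call
            } catch (Exception e) {
                last = e;
            }
            boolean attemptsLeft = attempt < maxAttempts;
            boolean timeLeft = deadline - System.nanoTime() > wait.toNanos();
            if (!attemptsLeft || !timeLeft) {
                break;
            }
            Thread.sleep(wait.toMillis());
        }
        throw last; // the last failure, once attempts or budget run out
    }
}
```

For example, `BudgetedRetry.call(() -> fetchPrices(), 3, Duration.ofSeconds(1), Duration.ofSeconds(2))`, where `fetchPrices` stands in for any call to a dependency, stops retrying as soon as a further one-second wait no longer fits in the two-second budget, even if attempts remain.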

Retries are an effective strategy for tolerating intermittent dependency faults, but you need to use them carefully to avoid exacerbating the underlying issue or consuming unnecessary resources:

  • Always limit the total number of retries.

  • Use exponential back-off with jitter to smoothly distribute retry requests and avoid compounding load.

  • Consider which error conditions should trigger a retry and which indicate that a retry is unlikely to, or will never, succeed.
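
The last two guidelines can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions: `RetryPolicy` and its method names are hypothetical, the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling) is one of several common choices, and the set of retryable status codes is an assumption that should be adapted to the service being called.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, named for this example only.
final class RetryPolicy {

    /** Exponential ceiling for a 1-based attempt: base * 2^(attempt - 1), capped at maxMillis. */
    static long backoffCeilingMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 1 || baseMillis < 1 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid back-off parameters");
        }
        // Cap the shift so realistic base values cannot overflow a long.
        int shift = Math.min(attempt - 1, 30);
        return Math.min(maxMillis, baseMillis << shift);
    }

    /** Full jitter: a uniformly random delay in [0, ceiling], so clients do not retry in lockstep. */
    static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = backoffCeilingMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    /**
     * Assumed classification: server errors, 408 (request timeout) and 429 (too many requests)
     * may be transient; other 4xx responses will not succeed on retry.
     */
    static boolean isRetryableStatus(int status) {
        return status >= 500 || status == 408 || status == 429;
    }
}
```

With a base of 100 ms and a cap of 10 seconds, the ceilings for attempts 1 to 4 are 100, 200, 400, and 800 ms, and the jitter spreads each client's actual wait somewhere below that ceiling. Resilience4j can be configured along similar lines through an interval function on `RetryConfig`; check its documentation for the options available in your version.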

