System Design Refresher:
Because failure is inevitable, you need to design and build your services to maximize availability, correct operation, and rapid recovery when failure does occur. This is fundamental to achieving resiliency. In this section, we’ll explore several techniques for ensuring that services keep operating correctly when their collaborators are unavailable:
Retries
Fallbacks, caching, and graceful degradation
Timeouts and deadlines
Circuit breakers
Communication brokers
Retries
If service A calls service B to retrieve data and the call fails with an error, service A won't immediately know whether the failure is:
Isolated → A one-time issue, where retrying the call might succeed.
Systemic → A larger problem, where retrying will likely fail again.
Because fetching data doesn't change anything on the server (it is an idempotent operation), it is safe to retry the call without causing unwanted side effects.
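import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import org.json.JSONObject; // assumed to be org.json, based on the toString(2) call below
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MarketDataClient {
    private static final Logger logger = LoggerFactory.getLogger(MarketDataClient.class);
    private static final String BASE_URL = "http://market-data:8000";
    private final HttpClient httpClient;

    public MarketDataClient() {
        httpClient = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    private JSONObject makeRequest(String path) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/" + path))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        // Surface server errors as exceptions so the retry wrapper treats them as failures.
        if (response.statusCode() >= 500) {
            throw new IOException("market-data returned HTTP " + response.statusCode());
        }
        return new JSONObject(response.body());
    }

    public JSONObject getAllPrices() throws Exception {
        // Up to three attempts in total, waiting one second between attempts.
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofMillis(1000))
                .build();
        Retry retry = Retry.of("marketDataRetry", config);
        return Retry.decorateCallable(retry, () -> {
            logger.debug("Attempting to fetch prices...");
            return makeRequest("prices");
        }).call();
    }

    public static void main(String[] args) {
        MarketDataClient client = new MarketDataClient();
        try {
            JSONObject prices = client.getAllPrices();
            System.out.println("Prices: " + prices.toString(2));
        } catch (Exception e) {
            logger.error("Failed to fetch prices after retries", e);
        }
    }
}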
If service B (here, the market-data service) fails and returns a 500 error, service A will automatically attempt to fetch prices up to three times in total before giving up.
At this point, it’s hard to tell whether the failure is:
Temporary (isolated) → A minor glitch that could succeed on retry.
Persistent (systemic) → An ongoing issue, where retrying won’t help and may make things worse.
✅ If the failure is temporary, retrying is a good option. It reduces the impact on end users and avoids the need for manual intervention by the operations team.
⚠️ If the failure is persistent, retrying too many times can overload the market-data service. For example:
If you retry each failed request five times, every failing request turns into up to six calls (the original plus five retries).
This creates extra load on the service, making the problem worse and slowing everything down.
💡 The key is to balance your retry budget. If retries take too long, the calling service's response time becomes unreasonably high, affecting performance.
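To make the cost of retries concrete, here is a small back-of-the-envelope sketch. The class and method names are hypothetical, and the numbers (1,000 failing requests, a 2-second per-attempt timeout) are assumptions chosen for illustration rather than values from the client above.

```java
final class RetryCost {

    /** Worst-case number of calls sent downstream when every attempt fails. */
    static long worstCaseCalls(long failedRequests, int retriesPerRequest) {
        // Each failing request costs the original call plus all of its retries.
        return failedRequests * (1L + retriesPerRequest);
    }

    /** Worst-case time a caller waits when every attempt times out, with a fixed wait between attempts. */
    static long worstCaseLatencyMillis(int maxAttempts, long perAttemptTimeoutMillis, long waitMillis) {
        // There is one wait fewer than there are attempts.
        return maxAttempts * perAttemptTimeoutMillis + (maxAttempts - 1) * waitMillis;
    }

    public static void main(String[] args) {
        // 1,000 failing requests, five retries each.
        System.out.println(worstCaseCalls(1_000, 5));                 // prints 6000
        // Three attempts, assumed 2 s timeout per attempt, 1 s wait between attempts.
        System.out.println(worstCaseLatencyMillis(3, 2_000, 1_000));  // prints 8000
    }
}
```

Under these assumptions, a struggling service receives six times its normal traffic, and a caller can wait eight seconds before it sees an error. Both figures are worth checking against what your callers can tolerate before you pick a retry count.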
Retries are an effective strategy for tolerating intermittent dependency faults, but you need to use them carefully to avoid exacerbating the underlying issue or consuming unnecessary resources:
Always limit the total number of retries.
Use exponential back-off with jitter to smoothly distribute retry requests and avoid compounding load.
Consider which error conditions should trigger a retry, and don't retry errors that are unlikely to, or will never, succeed.
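The three guidelines above can be sketched in plain Java without any library. This is a minimal illustration, not a production implementation: the class `BackoffRetry`, its method names, and the choice of which HTTP status codes count as retryable (429 and 5xx) are all assumptions. It uses "full jitter", where each delay is picked at random between zero and an exponentially growing, capped ceiling.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

final class BackoffRetry {

    /** Ceiling for the delay after failed attempt `attempt` (1-based): base * 2^(attempt-1), capped. */
    static long backoffCeilingMillis(int attempt, long baseMillis, long capMillis) {
        int shift = Math.min(attempt - 1, 20); // bound the shift so large attempt numbers cannot overflow
        return Math.min(capMillis, baseMillis << shift);
    }

    /** Full jitter: a random delay between 0 and the ceiling (inclusive). */
    static long jitteredDelayMillis(int attempt, long baseMillis, long capMillis) {
        return ThreadLocalRandom.current().nextLong(backoffCeilingMillis(attempt, baseMillis, capMillis) + 1);
    }

    /** Assumed policy: retry throttling and server errors, never other client errors. */
    static boolean isRetryableStatus(int status) {
        return status == 429 || (status >= 500 && status <= 599);
    }

    /** Runs `call` up to maxAttempts times, sleeping a jittered delay between retryable failures. */
    static <T> T callWithRetry(Callable<T> call, Predicate<Exception> retryable,
                               int maxAttempts, long baseMillis, long capMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Give up when the retry limit is reached or the error will not improve on retry.
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(jitteredDelayMillis(attempt, baseMillis, capMillis));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated dependency that fails twice with an I/O error, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) throw new IOException("simulated transient failure");
            return "ok";
        }, e -> e instanceof IOException, 5, 10, 100);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The random delay matters when many clients fail at the same moment: without it they would all retry in lockstep and hit the recovering service in synchronized waves.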