Service-Level Objective-Aware Load-Adaptive Timeout: Balancing Failure Rate and Latency in Microservices Communication
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11113294/ |
| Summary: | Microservices architectures enable scalable and modular application design but introduce reliability challenges due to their distributed nature. Timeout configurations are critical for maintaining system reliability, as they directly impact latency and failure rate Service-Level Objective (SLO) compliance. However, current timeout settings are often based on best practices rather than systematic optimization, as determining the optimal timeout is challenging. The difficulty arises from the need to balance SLO constraints while adapting to dynamically changing load and system capacity, making static configurations inherently suboptimal. To address this, this study proposes a load-adaptive timeout mechanism that dynamically adjusts timeout values to optimize reliability across different load conditions. Under normal load, the method minimizes latency while maintaining failure rate SLO compliance. Under overload, where meeting both objectives becomes infeasible, it prioritizes failure rate reduction while ensuring latency SLO compliance. By allocating the initial portion of the timeout duration to transmission delay during downstream overload and failure, the method naturally exhibits load shedding and circuit-breaking behavior, preventing the bottleneck service from being overwhelmed. The proposed method was implemented as an open-source Go library and evaluated using the Online Boutique benchmark under various load conditions. Results show that it reduces average and tail latencies by 40% and 55%, respectively, under normal load and short-lived overload. Under prolonged overload, it minimizes failure rates, reducing deviations from the failure rate SLO by 18%. These findings demonstrate the effectiveness of adaptive timeout control in maintaining microservices reliability while dynamically responding to changing system conditions. |
| ISSN: | 2169-3536 |