NeIL: Intelligent Replica Selection for Distributed Applications
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2024-01-01 |
| Series: | IEEE Transactions on Machine Learning in Communications and Networking |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10714467/ |
| Summary: | Distributed applications such as cloud gaming and streaming increasingly use edge-to-cloud infrastructure for high availability and performance. While edge infrastructure brings services closer to the end user, it also increases the number of sites on which the services must be replicated. This makes replica selection challenging for clients of the replicated services. Traditional replica selection methods, including anycast-based methods and DNS redirection, are performance agnostic, and clients experience degraded network performance when network performance dynamics are not considered in replica selection. In this work, we present NeIL, a client-side replica selection framework that enables network-performance-aware replica selection. We propose using Multi-Armed Bandit (MAB) algorithms based on bandits with experts, and we adapt these algorithms for replica selection at individual clients without centralized coordination. We evaluate our approach using three setups: a distributed Mininet setup in which publicly available network performance data from the Measurement Lab (M-Lab) is used to emulate network conditions, a setup in which replica servers are deployed on AWS, and a global enterprise deployment. Our experimental results show that NeIL performs better than greedy selection 45% of the time and better than or equal to greedy selection 80% of the time, resulting in a net gain in end-to-end network performance. On AWS, we see similar results: NeIL performs better than or equal to greedy selection 75% of the time. We have successfully deployed NeIL in a global enterprise remote device management service with over 4000 client devices, and our analysis shows that NeIL achieves significantly better tail service quality, cutting the 99th percentile tail latency from 5.6 seconds to 1.7 seconds. |
| ISSN: | 2831-316X |
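
The summary describes client-side replica selection with bandits-with-experts MAB algorithms but gives no algorithmic detail. As a rough illustration only, the sketch below uses EXP4, a textbook bandits-with-experts algorithm, which is an assumption here and not necessarily what NeIL uses. The class name `Exp4Selector`, the three hypothetical experts in `ADVICE`, and the made-up per-replica rewards in `REWARDS` (standing in for normalized network performance) are all invented for the example.

```python
import math
import random

# Hypothetical setup: 3 replicas, 3 experts. Each expert gives a probability
# distribution over replicas (rows sum to 1). These values are illustrative.
ADVICE = [
    [1.0, 0.0, 0.0],        # expert 0: always recommends replica 0
    [0.0, 1.0, 0.0],        # expert 1: always recommends replica 1
    [1 / 3, 1 / 3, 1 / 3],  # expert 2: no preference
]
# Made-up rewards in [0, 1] per replica (higher means better performance).
REWARDS = [0.2, 0.9, 0.5]


class Exp4Selector:
    """EXP4-style selector: the client learns which expert to trust."""

    def __init__(self, n_replicas, n_experts, gamma=0.1, seed=None):
        self.k = n_replicas
        self.gamma = gamma                 # exploration rate
        self.weights = [1.0] * n_experts   # one weight per expert
        self.rng = random.Random(seed)

    def probabilities(self, advice):
        """Mix expert advice by weight, then add uniform exploration."""
        total = sum(self.weights)
        probs = []
        for j in range(self.k):
            mix = sum(w * a[j] for w, a in zip(self.weights, advice)) / total
            probs.append((1 - self.gamma) * mix + self.gamma / self.k)
        return probs

    def select(self, advice):
        """Sample a replica index from the mixed distribution."""
        probs = self.probabilities(advice)
        choice = self.rng.choices(range(self.k), weights=probs)[0]
        return choice, probs

    def update(self, advice, choice, probs, reward):
        """Reward experts in proportion to how much they backed the choice."""
        estimate = reward / probs[choice]  # importance-weighted reward
        for i, a in enumerate(advice):
            self.weights[i] *= math.exp(self.gamma * a[choice] * estimate / self.k)
        # Rescale so the largest weight is 1; ratios are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def simulate(rounds=2000, seed=0):
    """Run the selector against the fixed, made-up rewards above."""
    selector = Exp4Selector(n_replicas=3, n_experts=3, seed=seed)
    for _ in range(rounds):
        choice, probs = selector.select(ADVICE)
        selector.update(ADVICE, choice, probs, REWARDS[choice])
    return selector


if __name__ == "__main__":
    sel = simulate()
    print("expert weights:", sel.weights)
    print("replica probabilities:", sel.probabilities(ADVICE))
```

Each client keeps its own weights and updates them only from its own observations, which matches the summary's point about selection without centralized coordination. In a real client the reward would come from measured network performance, not a fixed table.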