Unpacking Performance Variability in Deep Reinforcement Learning: The Role of Observation Space Divergence
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Applied Sciences |
| Online Access: | https://www.mdpi.com/2076-3417/15/15/8247 |
| Summary: | Deep Reinforcement Learning (DRL) algorithms often exhibit significant performance variability across different training runs, even with identical settings. This paper investigates the hypothesis that a key contributor to this variability is the divergence in the observation spaces explored by individual learning agents. We conducted an empirical study using Proximal Policy Optimization (PPO) agents trained on eight Atari environments. We analyzed the collected agent trajectories by qualitatively visualizing and quantitatively measuring the divergence in their explored observation spaces. Furthermore, we cross-evaluated the learned actor and value networks, measuring the average absolute TD-error, the RMSE of value estimates, and the KL divergence between policies to assess their functional similarity. We also conducted experiments where agents were trained from identical network initializations to isolate the source of this divergence. Our findings reveal a strong correlation: environments with low performance variance (e.g., Freeway) showed high similarity in explored observation spaces and learned networks across agents. Conversely, environments with high performance variability (e.g., Boxing, Qbert) demonstrated significant divergence in both explored states and network functionalities. This pattern persisted even when agents started with identical network weights. These results suggest that differences in experiential trajectories, driven by the stochasticity of agent–environment interactions, lead to specialized agent policies and value functions, thereby contributing substantially to the observed inconsistencies in DRL performance. |
| ISSN: | 2076-3417 |
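
The summary above describes cross-evaluating the learned actor and value networks of different agents via the average absolute TD-error, the RMSE of value estimates, and the KL divergence between policies. The following is a minimal PyTorch sketch of such a comparison on a shared batch of transitions; the function and variable names (`cross_evaluate`, `actor_a`, `critic_b`, etc.) are illustrative assumptions, not the authors' code, and the paper's exact evaluation protocol may differ.

```python
# Hypothetical sketch: comparing two PPO agents' networks on a batch of
# transitions collected by one of them. Names are illustrative, not the
# paper's actual code. Assumes PyTorch and a discrete (Atari) action space.
import torch
import torch.nn.functional as F


def cross_evaluate(actor_a, critic_a, actor_b, critic_b,
                   obs, next_obs, rewards, dones, gamma=0.99):
    """Return policy KL, value-estimate RMSE, and mean |TD-error| when
    agent B's networks are evaluated on agent A's observations."""
    with torch.no_grad():
        # Policy divergence: KL(pi_A || pi_B), averaged over the batch.
        log_pa = F.log_softmax(actor_a(obs), dim=-1)   # (N, num_actions)
        log_pb = F.log_softmax(actor_b(obs), dim=-1)
        policy_kl = (log_pa.exp() * (log_pa - log_pb)).sum(-1).mean()

        # RMSE between the two critics' value estimates on the same states.
        v_a = critic_a(obs).squeeze(-1)                # (N,)
        v_b = critic_b(obs).squeeze(-1)
        value_rmse = torch.sqrt(((v_a - v_b) ** 2).mean())

        # Mean absolute one-step TD-error of agent B's critic on agent A's
        # transitions: delta = r + gamma * V_B(s') * (1 - done) - V_B(s).
        v_b_next = critic_b(next_obs).squeeze(-1)
        td_target = rewards + gamma * v_b_next * (1.0 - dones.float())
        mean_abs_td = (td_target - v_b).abs().mean()

    return {"policy_kl": policy_kl.item(),
            "value_rmse": value_rmse.item(),
            "mean_abs_td_error": mean_abs_td.item()}
```

Under these assumptions, low values of all three metrics would indicate that two independently trained agents ended up with functionally similar policies and value functions, which is the similarity pattern the abstract reports for low-variance environments such as Freeway.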