Abstract
In this paper, we propose a framework that models the non-stationary reward distribution of server nodes as an N-state Markov chain derived from the long-term probability distribution of the reward, in which state transitions occur at intervals governed by a non-homogeneous Poisson process. The framework is intended as a validation tool and as a test bench for comparing the performance of learning algorithms in non-stationary bandits and in similar applications where the computational reward distribution of the server nodes is dynamic.
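
As a rough illustration of the kind of process described above, the following minimal sketch simulates a single server node whose reward state follows a Markov chain, with transition opportunities arriving as a non-homogeneous Poisson process. It is not the paper's implementation. The 3-state transition matrix `P`, the per-state mean rewards `STATE_MEANS`, the sinusoidal `rate` function, and all function names are hypothetical choices made for the example. Event times are drawn by thinning (the Lewis-Shedler method), which is one standard way to sample such a process, and the step that derives the chain from the long-term reward distribution is not reproduced here.

```python
import math
import random


def nhpp_event_times(rate, rate_max, horizon, rng):
    """Sample event times on [0, horizon) by thinning.

    `rate(t)` is the time-varying intensity; `rate_max` must be an upper
    bound on it over the whole horizon.
    """
    times = []
    t = 0.0
    while True:
        # Candidate events come from a homogeneous process at rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep each candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max <= rate(t):
            times.append(t)


def simulate_reward_states(P, rate, rate_max, horizon, start_state=0, seed=0):
    """Return a list of (time, state) pairs for one simulated node."""
    rng = random.Random(seed)
    state = start_state
    path = [(0.0, state)]
    for t in nhpp_event_times(rate, rate_max, horizon, rng):
        # At each event the chain takes one step; self-transitions are allowed.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append((t, state))
    return path


# Hypothetical 3-state chain (each row sums to 1) and mean reward per state.
P = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
STATE_MEANS = [0.2, 0.5, 0.9]


def rate(t):
    """Hypothetical intensity with a 24-unit period; bounded above by 1.5."""
    return 1.0 + 0.5 * math.sin(2 * math.pi * t / 24.0)


path = simulate_reward_states(P, rate, rate_max=1.5, horizon=100.0, seed=42)
```

A bandit algorithm under test could then draw its reward at time t from a distribution centred on `STATE_MEANS[state]` for the state active at t. Fixing the seed makes runs reproducible, so different learning algorithms can be compared on the same trajectory.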