The multi-armed bandit (MAB) problem is usually framed as a trade-off between exploration and exploitation, and there are a number of strategies for dealing with it. Personally, I like to treat MAB as portfolio theory with unknown expected values and variances, where the weights in the portfolio correspond to the probability of choosing each option.
Usually we just want to maximize the return of a MAB, but the portfolio approach instead asks us to supply a lower bound on the expected return and then minimizes the variance subject to that bound. The lower bound can also be changed dynamically, which IMO is nice.
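To make the idea concrete, here is a minimal sketch in Python of how the weights could be computed. It rests on several assumptions of mine that are not part of the description above: the arms are treated as independent (so the covariance matrix is diagonal), the per-arm means and variances are estimates you already keep from observed rewards, and the solver is a simple nested bisection on the KKT conditions rather than any particular library routine. The names `min_variance_weights`, `choose_arm`, and `min_return` are hypothetical.

```python
import random


def _weights_for(gamma, means, variances, iters=100):
    # KKT form for independent arms: w_i = max(0, (lam + gamma * mu_i) / (2 * var_i)).
    # Bisect on lam until the weights sum to 1 (the sum is nondecreasing in lam).
    def raw(lam):
        return [max(0.0, (lam + gamma * m) / (2 * v)) for m, v in zip(means, variances)]

    lo = -gamma * max(means)                      # every weight is 0 at this lam
    hi = 2 * max(variances) - gamma * min(means)  # every weight is >= 1 at this lam
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(raw(mid)) < 1:
            lo = mid
        else:
            hi = mid
    w = raw(hi)
    total = sum(w)
    return [x / total for x in w]  # renormalize away the residual bisection error


def expected_return(weights, means):
    return sum(w * m for w, m in zip(weights, means))


def min_variance_weights(means, variances, min_return, iters=100):
    """Minimize sum(w_i^2 * var_i) subject to sum(w) == 1, w >= 0 and
    expected_return(w, means) >= min_return. Assumes independent arms."""
    if min(variances) <= 0:
        raise ValueError("variance estimates must be positive")
    if min_return > max(means):
        raise ValueError("min_return exceeds the best estimated mean")

    # gamma is the multiplier on the return constraint; gamma = 0 ignores it.
    w = _weights_for(0.0, means, variances)
    if expected_return(w, means) >= min_return:
        return w

    # Find a gamma large enough to satisfy the bound, then bisect down to it.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        if expected_return(_weights_for(hi, means, variances), means) >= min_return:
            break
        hi *= 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected_return(_weights_for(mid, means, variances), means) >= min_return:
            hi = mid
        else:
            lo = mid
    return _weights_for(hi, means, variances)


def choose_arm(weights, rng=random):
    # The portfolio weights are used directly as the probability of pulling each arm.
    return rng.choices(range(len(weights)), weights=weights)[0]


means = [0.2, 0.5, 0.8]         # made-up estimates of each arm's mean reward
variances = [0.01, 0.04, 0.25]  # made-up estimates of each arm's reward variance
weights = min_variance_weights(means, variances, min_return=0.5)
# weights is roughly [0.19, 0.62, 0.19]
arm = choose_arm(weights)
```

In this sketch, changing the bound dynamically just means calling `min_variance_weights` again with a different `min_return` each round. A bound below what the unconstrained solution already earns gives plain inverse-variance weights, while raising it toward the best estimated mean pushes more probability onto the high-mean arms.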