from agent import BaseAgent

import numpy


class EpsilonGreedyAgent(BaseAgent):
    """A greedy agent that occasionally explores.

    This agent will primarily exploit when deciding its actions. However, it will occasionally choose to explore at
    a rate of epsilon, which is provided at initialization. This gives it a chance to see if other actions are
    better.
    """

    def __init__(self, k: int, epsilon: float, start_value: float = 0.0) -> None:
        """Construct the agent.

        @param k The number of actions to consider. This must be an int greater than zero.
        @param epsilon The rate at which actions should randomly explore. As this is a probability, it should be
        between 0 and 1 (inclusive).
        @param start_value The initial value to use in the table. All actions start with the same value.
        @exception ValueError if epsilon is not a valid probability (between 0 and 1).
        """
        super().__init__(k, start_value=start_value)
        # Assign through the property so the value is validated.
        self.epsilon = epsilon
        self._rng = numpy.random.default_rng()

    def act(self) -> int:
        """Determine which action to take.

        This will explore randomly over the actions at a rate of epsilon and inversely will exploit based on table
        values at a rate of (1.0 - epsilon).
        @return The index of the selected action to take. Guaranteed to be an int in the range [0, k).
        """
        # Draw a single Bernoulli sample: 1 means explore, 0 means exploit.
        samples = self._rng.binomial(n=1, p=self.epsilon, size=1)
        should_explore = (samples[0] == 1)
        if should_explore:
            return self.explore()
        return self.exploit()

    @property
    def epsilon(self) -> float:
        """Return the exploration rate."""
        return self._epsilon

    @epsilon.setter
    def epsilon(self, value: float) -> None:
        if value < 0.0 or value > 1.0:
            raise ValueError(
                'Epsilon must be a valid probability, so between 0 and 1 (inclusive)!')
        self._epsilon = value

    def update(self, action: int, reward: float) -> None:
        """Update the Q-table based on the last action.

        This will use an incremental formulation of the mean of all rewards obtained so far as the values of the
        table.
        @param action An index representing which action on the table was selected. It must be between [0, k).
        @param reward The reward obtained from this action.
        """
        self._n += 1  # update count, assumed initialized by BaseAgent; the denominator of the incremental mean
        self.table[action] += (reward - self.table[action]) / self._n
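The rule in update() is the standard incremental form of a running mean: after rewards r_1, ..., r_n, applying
q += (r_n - q) / n at each step leaves q equal to the plain average (r_1 + ... + r_n) / n, without storing the
reward history. A quick stand-alone check of that identity (the names here are illustrative, not part of the agent):

import numpy

rewards = numpy.array([1.0, 0.5, 2.0, -0.25])

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n  # incremental form of the running mean

assert numpy.isclose(q, rewards.mean())  # both give the same average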
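For context, a minimal usage sketch of the class as reconstructed above. The 10-armed Gaussian testbed below is a
stand-in reward model for illustration only, and the loop assumes BaseAgent provides the explore, exploit, and table
members listed in the summary that follows:

import numpy

rng = numpy.random.default_rng(seed=0)
true_means = rng.normal(0.0, 1.0, size=10)  # hidden mean reward of each arm

agent = EpsilonGreedyAgent(k=10, epsilon=0.1)
for _ in range(1_000):
    action = agent.act()
    reward = rng.normal(true_means[action], 1.0)  # noisy reward for the chosen arm
    agent.update(action, reward)

print('Best arm by true mean:    ', int(numpy.argmax(true_means)))
print('Best arm by learned value:', int(numpy.argmax(agent.table)))

With epsilon = 0.1 the agent exploits about 90% of the time, so the learned table should concentrate on the
best arm while still sampling the others occasionally.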
BaseAgent: A base class used to create a variety of bandit solving agents.
    int explore(self): Explore a new action.
    int exploit(self): Select the best action.
    numpy.ndarray table(self): Return the Q-Table.

EpsilonGreedyAgent: A greedy agent that occasionally explores.
    int act(self): Determine which action to take.
    None __init__(self, int k, float epsilon, float start_value=0.0): Construct the agent.
    None update(self, int action, float reward): Update the Q-table based on the last action.
    None epsilon(self, float value): Set the exploration rate; raises ValueError for values outside [0, 1].
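BaseAgent itself is not shown on this page. A minimal sketch consistent with the member list above might look like
the following; only the listed signatures are grounded, and everything else (the _n counter, the uniform explore,
the argmax exploit, the internal attribute names) is an assumption:

import numpy


class BaseAgent:
    """A base class used to create a variety of bandit solving agents.

    Hypothetical reconstruction: internals beyond the documented member signatures are assumptions.
    """

    def __init__(self, k: int, start_value: float = 0.0) -> None:
        if k <= 0:
            raise ValueError('k must be an int greater than zero!')
        self._k = k
        self._table = numpy.full(k, start_value, dtype=float)
        self._n = 0  # update counter, assumed here so subclasses can maintain a running mean
        self._base_rng = numpy.random.default_rng()

    @property
    def table(self) -> numpy.ndarray:
        """Return the Q-Table."""
        return self._table

    def explore(self) -> int:
        """Explore a new action (assumed: uniform random choice over all k actions)."""
        return int(self._base_rng.integers(self._k))

    def exploit(self) -> int:
        """Select the best action (assumed: argmax over the Q-Table)."""
        return int(numpy.argmax(self._table))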