A base class used to create a variety of bandit-solving agents.

This class provides a table that can be used to store reward estimates. It also defines the interface that any
derived agent must implement. This ensures a consistent API across each agent type.
def __init__(self, k: int, start_value: float = 0.0) -> None:
Construct the agent.

@param k The number of possible actions the agent can pick from at any given time. Must be an int greater than 0.
@param start_value An initial value to use for each possible action. This assumes that each action is equally
likely at start, so all values in the Q-table are set to this value.
@exception ValueError if k is not an integer greater than 0.
if not isinstance(k, int) or k <= 0:
    raise ValueError('k must be an integer greater than zero.')
# One reward estimate per arm, all initialised to start_value. numpy.float was
# removed from NumPy, so the explicit numpy.float64 dtype is used instead.
self._table = start_value * numpy.ones(shape=(k,), dtype=numpy.float64)
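For illustration, the constructor yields a k-element Q-table filled with start_value; a large start_value gives
the familiar optimistic-initialization behaviour. A minimal sketch, where SomeAgent stands in for a hypothetical
concrete subclass (an assumption, not part of this interface):

import numpy

agent = SomeAgent(k=5, start_value=2.0)  # SomeAgent is a hypothetical concrete subclass.
print(agent.table)  # [2. 2. 2. 2. 2.] -- one optimistic estimate per arm.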
Use a specific algorithm to determine which action to take.
This method should define exactly how the agent selects an action. It is free to use @ref explore and @ref
exploit as needed.
@return An int representing which arm action to take. This int should be between [0, k).
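Since act is left to each implementation, here is a minimal sketch of one common choice, an epsilon-greedy
policy built from the explore and exploit helpers. The subclass name EpsilonGreedyAgent and its epsilon
parameter are illustrative assumptions, not part of this interface:

import numpy

class EpsilonGreedyAgent(Agent):  # Agent is assumed to be this base class.
    def __init__(self, k: int, epsilon: float = 0.1, start_value: float = 0.0) -> None:
        super().__init__(k=k, start_value=start_value)
        self._epsilon = epsilon  # Probability of exploring instead of exploiting.

    def act(self) -> int:
        # With probability epsilon pick a random arm; otherwise take the best-known arm.
        if numpy.random.random() < self._epsilon:
            return self.explore()
        return self.exploit()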
Select the best action.
This will use the Q-table to select the action with the highest estimated value. Ties are broken arbitrarily.
@return An int representing which arm action to take. This int will be between [0, k).
# numpy.argmax only returns the first maximizer, so collect every action whose
# estimate equals the maximum in order to break ties as documented.
possible_actions = numpy.flatnonzero(self.table == self.table.max())
if possible_actions.size == 1:
    selected_action = possible_actions[0]
else:
    # Break ties arbitrarily by sampling uniformly among the best actions.
    selected_action = numpy.random.choice(a=possible_actions)
return int(selected_action)
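To see the tie-breaking in action, suppose two arms share the best estimate; each should be returned about half
the time. A small sketch (writing _table directly is for illustration only, and EpsilonGreedyAgent is the
hypothetical subclass above):

import numpy
from collections import Counter

agent = EpsilonGreedyAgent(k=3, epsilon=0.0)
agent._table = numpy.array([1.0, 1.0, 0.5])  # Arms 0 and 1 tie for the best estimate.
counts = Counter(agent.exploit() for _ in range(1000))
print(counts)  # Roughly 500 each for arms 0 and 1; arm 2 is never selected.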
Explore a new action.

This will select a random action to take from the k arms, to explore the decision space more.
@return An int representing which arm action to take. This int will be between [0, k).
# Pick an arm uniformly at random; omitting size returns a scalar, cast to a plain int.
return int(numpy.random.choice(a=self.table.size))
@property
def table(self) -> numpy.ndarray:
Return the Q-Table.

@return A Numpy array of k elements. The i-th element holds the estimated value for the i-th action/arm.
def update(self, action: int, reward: float) -> None:
Update the Q-Table.
This takes the previous action and the resulting reward, and should update the Q-Table accordingly. How it
updates will depend on the specific implementation.
@param action An int representing which arm action was taken. This should be between [0, k).
@param reward A float representing the resulting reward obtained from the selected action.
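Because update is implementation-specific, here is a minimal sketch of one standard scheme, the incremental
sample-average rule Q_{n+1} = Q_n + (R_n - Q_n) / n. The subclass name SampleAverageAgent and its _counts helper
array are assumptions for illustration, not part of this base class:

import numpy

class SampleAverageAgent(Agent):  # Agent is assumed to be this base class.
    def __init__(self, k: int, start_value: float = 0.0) -> None:
        super().__init__(k=k, start_value=start_value)
        self._counts = numpy.zeros(shape=(k,), dtype=numpy.int64)  # Pulls per arm.

    def act(self) -> int:
        return self.exploit()  # Greedy for brevity; see the epsilon-greedy sketch above.

    def update(self, action: int, reward: float) -> None:
        # Incremental mean: Q_{n+1} = Q_n + (R_n - Q_n) / n.
        self._counts[action] += 1
        self._table[action] += (reward - self._table[action]) / self._counts[action]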