Q-Learning and Collection Agents
Tom O'Neill
Leland Aldridge
Harry Glaser
CSC242, Dept. of Computer Science, University of Rochester
{toneill, hglaser, la002k}@mail.rochester.edu
Abstract
Reinforcement learning strategies allow for the creation of agents that can
adapt to unknown, complex environments. We attempted to create an
agent that would learn to explore an environment and collect the trash
within it. While the agent successfully explored and collected trash many
times, the training simulator inevitably crashed as the Q-learning algorithm
eventually failed to maintain the Q-value table within computable bounds.
Introduction
In complex environments, it is almost impossible to anticipate and program
for every possible scenario that an agent may face. Reinforcement learning is
one solution to this problem because it allows for the creation of robust and
adaptive agents (Russell and Norvig(2)).
Our goal was to develop an agent that, when placed in an unknown
environment, would explore that environment, collect any trash it found, and
finally return to its starting location once all of the trash had been collected.
We chose to develop a learning agent so that we could train it in many
simulated environments, allowing it to learn the optimal behaviors of a trash
collection agent. Once trained, the agent was expected to be able to execute
the learned behaviors in any reasonably similar environment and perform well.
According to Russell and Norvig(2), two popular reinforcement learning
strategies are active and passive learning. In both cases the agent makes
decisions based on expected utilities, though the two differ in how the utility of
states is determined. We briefly explain the difference between these strategies
to motivate our decision to use active learning in our environment.
Passive Learning
With passive learning, agents require a model of the environment. This
model tells the agent what moves are legal and what the results of actions will
be. This is particularly useful because it allows the agent to look ahead and
make better choices about what actions should be taken. However, passive
learning agents have a fixed policy and this limits their ability to adapt to or
operate in unknown environments.
Active Learning
Unlike passive learning agents, active learning agents do not have a fixed
policy. An active learning agent must learn a complete model of its environment.
This means that the agent must determine which actions are possible in any
given state, since it is still building its model of the environment and does not yet
have an optimal policy. This allows active learning agents to learn how to
operate effectively in environments that are initially unknown. However, the lack
of a fixed policy slows the rate at which the agent learns the optimal behaviors
for its environment.
Motivation for Active Learning
We chose to explore the behavior of a collection agent in an unknown
environment. The agent was expected to learn how to optimally move around
the environment while collecting trash, and ultimately make its way back to the
starting point once all of the trash was collected.
Since the shape of the environment, the trash distribution, and the agent's
starting location are all unknown to the agent, we chose to use an active
reinforcement learning technique called Q-learning.
The Q-Learning Algorithm
Since Q-learning is an active reinforcement learning technique, it generates and
improves the agent's policy on the fly. The algorithm works by estimating the
values of state-action pairs. The purpose of Q-learning is to build the Q-table,
Q(s, a), which maps each state-action pair to a Q-value, the expected utility of
that pair. The Q-value is defined as the expected discounted future payoff of
taking action a in state s, assuming the agent thereafter follows the optimal
policy (Russell and Norvig(2)).
Q-learning generates the Q-table by performing as many actions in the
environment as possible. This initial Q-table generation is usually done offline in
a simulator to allow for many trials to be completed quickly. The update rule for
setting values in the table is given in Equation 1:

Q[s, a] ← Q[s, a] + α(N_sa[s, a]) · (r + γ · max_a′ Q[s′, a′] − Q[s, a])     (Equation 1)
In Equation 1, α is the learning factor (which Russell and Norvig treat as a
function of the visit count) and γ is the discount factor. Both are positive values
less than 1, set through experimentation, and they affect the rate at which the
agent attempts to learn the environment. N_sa[s, a] is a table of frequencies
indexed by the same state-action pairs; its value is the number of times the
agent has attempted that state-action pair. The variables s and a are the state
the agent was just in and the action it took there, r is the reward that resulted,
and s′ is the state the agent ended up in, with the maximum taken over the
actions a′ available in s′ (Russell and Norvig(2)).
The initial entries of N_sa[s, a] are 0 for all pairs, and the initial values of
Q[s, a] are arbitrary. The Q-values can start out arbitrary because, over many
trials, the update rule effectively averages the expected utilities of neighboring
state-action pairs, smoothing them into a near-optimal, dynamically generated
policy.
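As a concrete illustration (a minimal sketch, not code from our simulator, whose full source appears in the appendix), the update in Equation 1 can be written over dictionaries keyed by (state, action) pairs. Here α is treated as a constant, although Russell and Norvig(2) allow it to depend on the visit count N_sa[s, a]; all names below are our own:

    # Minimal sketch of the tabular Q-learning update in Equation 1 (illustrative only).
    # Q and Nsa are dictionaries keyed by (state, action); alpha and gamma are the
    # learning and discount factors described above.
    def q_update(Q, Nsa, s, a, r, s_next, actions, alpha=0.5, gamma=0.5):
        Nsa[(s, a)] = Nsa.get((s, a), 0) + 1                         # visit count N_sa[s, a]
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max over a' of Q[s', a']
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

The Qlearning function in our simulator (see the appendix) plays this role, using the previous state-action pair and the reward just received.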
Simulator Design
We developed a simulator to train the agent offline in many different
environments. The simulator generated randomized environments, maintained
the structures necessary for Q-learning, and provided appropriate feedback to
the agent such that it could learn.
The Agent and Environment
In every case, the agent explored and collected trash in an N-by-N,
obstacle-free grid world. The agent always started at position (0, 0) and could
move north, south, east, or west, or attempt to pick up trash. The agent did not
know where it was in the grid world, nor could it tell whether it was at a boundary
of the world. Additionally, the agent had no sight: it could not tell whether trash
was nearby, beyond receiving a reward when it successfully picked up trash in
its current grid square. Actions always executed with the expected outcome,
with two exceptions: moving into a wall at a boundary square did not move the
agent, and attempting a pickup in a clean square did not trigger a reward.
Trash was distributed randomly throughout the grid world at the start of each
trial, with a 20% chance of any cell containing trash. The results in this paper
were generated using a 4-by-4 grid (about three pieces of trash per trial on
average), though both the trash probability and the grid size are adjustable
parameters.
Rewards
The agent received rewards of 20 points for collecting trash and 50 points
for returning home after all trash had been collected. For each movement or
empty pickup attempt, the agent was penalized 1 point. These values were
chosen arbitrarily and varied during experimentation in an attempt to produce
optimal behavior.
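These values correspond directly to the reward constants defined in the appendix source:

    REWARD_SUCCESS = 50   # returning to (0, 0) after all trash has been collected
    REWARD_PICKUP = 20    # successfully picking up a piece of trash
    PENALTY_ALIVE = -1    # default reward for every other action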
Simulator Implementation
This section describes some of the important implementation details of the
simulator. The complete implementation, written in Python, can be found in the
appendix of this paper.
In this simulator, a state is defined as a 3-tuple of:
(currentPosition, trashAtLocation, numTrashCollected)
• currentPosition is a 2-tuple of integers (x, y) representing the agent's grid
location
• trashAtLocation is a Boolean: True if there is (or was) trash at the agent's
current position, False otherwise
• numTrashCollected is an integer representing the number of trash items
collected so far
We used this style of state description so that we could account for each
important dimension governing the agent's behavior. The agent needs to learn
the value of each grid square as it varies with the number of trash items
collected and with whether the agent has already found trash at that location.
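For example (an illustrative value, not taken from any particular trial), an agent standing on square (2, 1) that holds or held trash, having collected three items so far, would be in the state:

    state = ((2, 1), True, 3)   # (currentPosition, trashAtLocation, numTrashCollected)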
Main Loop
The entire simulator was driven by a short loop that evaluated the
previous action, chose the best available action, performed it, and repeated.
Main Loop of TrashMan.py
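The loop, reproduced from the appendix source (line numbers are local to this listing and are the ones referenced below):

     1  # main loop
     2  firststate = ((0, 0), grid[0][0], 0)
     3  action = Qlearning(Q_INIT, firststate)
     4  while True:
     5      actions += 1
     6      reward = takeAction(action)
     7      action = Qlearning(reward, (curPos, grid[curPos[0]][curPos[1]], picked))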
Lines 2 and 3 initialize the starting position and the first state so that lines 6
and 7 can repeatedly generate the agent's explore-and-collect policy. Line 5
acts as a measure of performance for each trial: fewer actions are expected on
each trial as the agent's policy gets closer to the optimal policy for this
environment.
Taking Actions
Taking an action must update the agent’s state, but it cannot update the
environment until after the Q-learning function has had a chance to analyze the
rewards associated with that action in the environment. Additionally, the only
chance to calculate rewards in the simulator is when the agent takes an action.
The two tasks are therefore handled by a single combined function that takes an
action and calculates its reward:
Action Taking and Reward Calculating Function
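The function, reproduced from the appendix source (line numbers are local to this listing and are the ones referenced below):

     1  # take the specified action: returns the reward for that action
     2  def takeAction(action):
     3      global curPos, trashes, grid, picked, actions
     4      x, y = curPos
     5      reward = PENALTY_ALIVE
     6      if action == ACTION_PICKUP:
     7          if grid[curPos[0]][curPos[1]]:
     8              reward = REWARD_PICKUP
     9      elif action == ACTION_NORTH:
    10          if y < GRID_LENGTH - 1:
    11              y += 1
    12      elif action == ACTION_EAST:
    13          if x < GRID_LENGTH - 1:
    14              x += 1
    15      elif action == ACTION_SOUTH:
    16          if y > 0:
    17              y -= 1
    18      elif action == ACTION_WEST:
    19          if x > 0:
    20              x -= 1
    21      curPos = (x, y)
    22
    23      if curPos == (0, 0) and trashes == 0 and reward != REWARD_PICKUP:
    24          reward = REWARD_SUCCESS
    25      return reward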
Line 5 shows that the default reward for any action is the penalty for being
alive. If the agent successfully collects trash (lines 6-8), or is at the starting
location after collecting all of the distributed trash (lines 23-24), then it is
rewarded instead of penalized. Finally, on each movement action the agent's
position is checked, and the action is ignored if it would move the agent outside
the environment.
Learning
By far the most complicated function in the simulator is the one that
implements the Q-learning portion of the agent's behavior. The function must
first update the value of the action just taken, based on the reward received, and
then calculate the next best action to take. Finally, the environment must be
updated to reflect the results of the agent's action before the next action can be
taken.
Agent’s Q-Learning Function (Part 1 of 2)
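The first half of the function, reproduced from the appendix source (line numbers are local to this two-part listing and are the ones referenced below):

     1  # Q-learning function
     2  def Qlearning(reward, state):
     3      global prevState, prevAction, prevReward, trashes, grid, actions, picked
     4      if prevState != None:
     5          if not Q.has_key((prevState, prevAction)):
     6              Q[(prevState, prevAction)] = Q_INIT
     7
     8          if not Nsa.has_key((prevState, prevAction)):
     9              Nsa[(prevState, prevAction)] = 0
    10
    11          # update visited states
    12          Nsa[(prevState, prevAction)] += 1
    13
    14          # Q-learning equation
    15          Q[(prevState, prevAction)] += ALPHA * Nsa[(prevState, prevAction)] * (reward + GAMMA * getMaxQ(state) - Q[(prevState, prevAction)])
    16          #debug:print prevState, prevAction, Q[(prevState, prevAction)]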
First, the function checks whether that state-action pair has ever been
tried before; if it has not, the pair is initialized with an arbitrary value in the
Q-table (lines 5-6). Likewise, if the frequency table has no value for the
state-action pair, that value is initialized to 0 (lines 8-9). Once the two tables are
set, the frequency of the state-action pair is incremented (line 12) before our
implementation of Equation 1 is executed.
Line 15 of this function is probably the most important line in the simulator:
it is where the learning takes place! All of it should look familiar from Equation
1, with the exception of getMaxQ(state) which merely returns the highest
possible Q-value for an action in the specified state.
Agent’s Q-Learning Function (Part 2 of 2)
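The second half of the function, reproduced from the appendix source and continuing the line numbering of the first half:

    18      # updates
    19      pos = state[0]
    20      if reward == REWARD_PICKUP:
    21          grid[curPos[0]][curPos[1]] = False
    22          trashes -= 1
    23          picked += 1
    24
    25      elif reward == REWARD_SUCCESS:
    26          prevState = None
    27          prevAction = None
    28          prevReward = None
    29          print "GRID CLEAN! actions:", actions
    30          actions = 0
    31          grid = makeGrid()
    32          picked = 0
    33      else:
    34          prevState = state
    35          prevAction = argmaxexplo(state)
    36          prevReward = reward
    37
    38      return prevAction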
After line 15 completes, the environment can be updated to reflect
the result of the action. If the agent picked up trash (lines 20-23), the trash
is removed from that location and the relevant accounting variables are
updated. If the agent completed its task, the world is reset with a new
trash grid and the agent's action history is cleared (lines 25-32).
Otherwise, the function sets up the previous-state and previous-action
variables so that it can be called again after the action it returns has been
executed.
Results
After running the simulator for many trials with many different values for
the constants (α, γ, N, reward scale values, exploration tendency, etc.), we could
not get the Q-value table to converge. In every configuration, the agent
eventually fell into oscillating between two state-action pairs, repeatedly
increasing each pair's Q-value. Since each member of the oscillation loop was
the other member's previous state-action pair, both Q-values grew
exponentially.
The following two graphs show the exponential growth of the Q-values;
the x-axis shows log10 of the Q-value. As shown, in both oscillating cases the
Q-values grew to over 10^250 in fewer than 350 action steps before crashing the
simulator.
[Graph: East/West loop - actions taken (y-axis, 0 to 350) versus Log(Q(s,a)) (x-axis, reaching roughly 286).]

[Graph: North/South loop - actions taken (y-axis, 0 to 350) versus Log(Q(s,a)) (x-axis, reaching roughly 248).]
Due to the diverging behavior of the Q-values in the Q-table, we were unable
to construct an agent that learned a usable collection strategy, and hence we
cannot report on the results of a meaningful policy.
Discussion and Future Work
It is important to note that the simulator's Q-table did not diverge
immediately. The agent was able to successfully complete its task many
times before the seemingly inevitable divergence. It is unclear whether or not
the agent had learned even part of an optimal policy before the Q-table
diverged.
The divergence of the Q-value table was not entirely unexpected.
Gordon(1) and Wiering(3) have separately discussed the nature of Q-learning and
the inability to guarantee its convergence in many circumstances. Gordon
specifically addresses the problem of oscillations leading to divergence, and
Wiering discusses how off-policy reinforcement learning methods often cause
the learned values to diverge.
One possible reason for the divergence in our system is that the
environment is too dynamic for Q-learning's standard update style. The standard
update can produce very fast swings in Q-values, quickly breaking the averaging
nature of Q-learning.
Given the diverging results we encountered, we believe that Q-learning
may not have been the right active reinforcement learning algorithm to use in
this environment.
Other active learning algorithms with stronger convergence properties, or
averaging update methods instead of the standard update, may fare better in
this environment. Additionally, it is quite possible to describe this environment
with a different set of state-action pairs that is better suited for learning than the
one currently used by our simulator.
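As one concrete direction (a sketch of our own, not something the simulator implements), the update of Equation 1 can be applied with a learning rate that shrinks as a state-action pair is revisited. With α = 1/N_sa[s, a], each Q-value becomes a running average of its update targets, the kind of decaying schedule under which tabular Q-learning's usual convergence guarantees are stated:

    # Sketch (not part of our simulator): Equation 1 with a learning rate that
    # decays with the visit count, so each Q-value averages its update targets.
    def q_update_averaging(Q, Nsa, s, a, r, s_next, actions, gamma=0.5):
        Nsa[(s, a)] = Nsa.get((s, a), 0) + 1
        alpha = 1.0 / Nsa[(s, a)]                                    # decaying learning rate
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)  # max over a' of Q[s', a']
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))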
Distribution of Work
• Tom O'Neill wrote this paper.
• Leland Aldridge was the mathematical brains behind understanding and
implementing Q-learning.
• Harry Glaser coded most of the simulator and read parables about
learning during debugging sessions.
Notes to TAs
CB suggested we deviate from the project spec and pursue Q-learning
applied to an agent in a garbage-collection environment, since the amount of
passive learning code already supplied made the "hook it up to quagents and
run it" option significantly less of a learning project than implementing something
like Q-learning. Thanks!
References
1. Gordon, G. J. Reinforcement learning with function approximation converges to a
   region. Advances in Neural Information Processing Systems. The MIT Press,
   2001. http://citeseer.ist.psu.edu/gordon01reinforcement.html
2. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, Second
   Edition. Prentice Hall, 2003.
3. Wiering, M. Convergence and Divergence in Standard and Averaging
   Reinforcement Learning. Intelligent Systems Group, Institute of Information
   and Computer Sciences, Utrecht University.
   http://www.cs.uu.nl/people/marco/GROUP/ARTICLES/ecml_rl_convergence.pdf
   (accessed 4/21/2006).
Appendix: TrashMan.py Source Code
# Q-learning agent for "trash world"
#
# by Leland Aldridge, Harry Glaser, Tom O'Neill
# for CSC 242 Project 4: Learning
# 4/21/2006

import sys, random

# useful utilities
def debug(s):
    if DEBUG:
        print s

def enum(*args):
    g = globals()
    i = 0
    for arg in args:
        g[arg] = i
        i += 1
    return i

# GLOBALS #####################################################################

# Macros
NUM_ACTIONS = enum('ACTION_PICKUP',
                   'ACTION_NORTH',
                   'ACTION_EAST',
                   'ACTION_SOUTH',
                   'ACTION_WEST')

DEBUG = False
GRID_LENGTH = 4
THRESH = .2
TRASH_CHAR = '1'
CLEAR_CHAR = '-'
Q_INIT = 0

# Learning tables
Q = {}      # indexed by (s, a): utility of taking action a in state s
Nsa = {}    # indexed by (s, a): number of times action a has been taken when in state s
trash = {}

# Learning parameters
ALPHA = 0.5  # learning factor
GAMMA = 0.5  # discount factor
N = 5        # number of times to try an action to get a good feel for it

# globals
trashes = 0      # num trashes left
picked = 0       # num trashes collected
curPos = (0, 0)  # current position
grid = []        # the world
actions = 0      # num actions taken so far

# remembered information
prevState = None
prevAction = None
prevReward = None

# Utilities
REWARD_SUCCESS = 50
REWARD_PICKUP = 20
PENALTY_ALIVE = -1

# read grid from file
def readGrid(filename = 'quagent.itemgrid'):
    line_num = 0
    itemGrid = []
    for i in range(GRID_LENGTH):
        itemGrid += [[]]
        for j in range(GRID_LENGTH):
            itemGrid[i] += [False]
    infile = open(filename)
    i = -1
    for line in infile:
        i += 1
        j = -1
        if i >= GRID_LENGTH:
            break
        for c in line:
            j += 1
            if j >= GRID_LENGTH:
                break
            if c == TRASH_CHAR:
                itemGrid[i][j] = True
            elif c != CLEAR_CHAR:
                sys.stderr.write('Error: invalid token, line ' + str(i) + ', col ' + str(j) + ': ' + c + '\n')
    return itemGrid

# construct random grid
def makeGrid():
    global trashes
    itemGrid = []
    for i in range(GRID_LENGTH):
        itemGrid += [[]]
        for j in range(GRID_LENGTH):
            if random.random() < THRESH:
                itemGrid[i] += [True]
                trashes += 1
            else:
                itemGrid[i] += [False]
    return itemGrid

# print grid to stdout
def printGrid(itemGrid):
    for i in itemGrid:
        for j in i:
            sys.stdout.write(str(j) + ' ')
        sys.stdout.write('\n')

# take the specified action: returns the reward for that action
def takeAction(action):
    global curPos, trashes, grid, picked, actions
    x, y = curPos
    reward = PENALTY_ALIVE
    if action == ACTION_PICKUP:
        if grid[curPos[0]][curPos[1]]:
            reward = REWARD_PICKUP
    elif action == ACTION_NORTH:
        if y < GRID_LENGTH - 1:
            y += 1
    elif action == ACTION_EAST:
        if x < GRID_LENGTH - 1:
            x += 1
    elif action == ACTION_SOUTH:
        if y > 0:
            y -= 1
    elif action == ACTION_WEST:
        if x > 0:
            x -= 1
    curPos = (x, y)
    if curPos == (0, 0) and trashes == 0 and reward != REWARD_PICKUP:
        reward = REWARD_SUCCESS
    return reward

# returns the highest possible Q-value for an action in the specified state
def getMaxQ(state):
    action = None
    max = 0
    for curAction in range(NUM_ACTIONS):
        key = (state, curAction)
        if not Q.has_key(key):
            Q[key] = Q_INIT
        if action == None:
            action = curAction
            max = Q[key]
        else:
            n = Q[key]
            if n >= max:
                action = curAction
                max = n
    return max

# returns the current reward for exploring further
def exploration(utility, frequency):
    if frequency < N:
        return REWARD_SUCCESS
    else:
        return utility

# determines the best action to take in the current state
def argmaxexplo(state):
    bestaction = -1
    oldmaxexplo = 0
    for action in range(NUM_ACTIONS):
        key = (state, action)
        if not Q.has_key(key):
            Q[key] = Q_INIT
        if not Nsa.has_key(key):
            Nsa[key] = 0
        e = exploration(Q[key], Nsa[key])
        if bestaction == -1:
            oldmaxexplo = e
            bestaction = action
        if e > oldmaxexplo:
            oldmaxexplo = e
            bestaction = action
    return bestaction

# Q-learning function
def Qlearning(reward, state):
    global prevState, prevAction, prevReward, trashes, grid, actions, picked
    if prevState != None:
        if not Q.has_key((prevState, prevAction)):
            Q[(prevState, prevAction)] = Q_INIT

        if not Nsa.has_key((prevState, prevAction)):
            Nsa[(prevState, prevAction)] = 0

        # update visited states
        Nsa[(prevState, prevAction)] += 1

        # Q-learning equation
        Q[(prevState, prevAction)] += ALPHA * Nsa[(prevState, prevAction)] * (reward + GAMMA * getMaxQ(state) - Q[(prevState, prevAction)])
        #debug:print prevState, prevAction, Q[(prevState, prevAction)]

    # updates
    pos = state[0]
    if reward == REWARD_PICKUP:
        grid[curPos[0]][curPos[1]] = False
        trashes -= 1
        picked += 1
    elif reward == REWARD_SUCCESS:
        prevState = None
        prevAction = None
        prevReward = None
        print "GRID CLEAN! actions:", actions
        actions = 0
        grid = makeGrid()
        picked = 0
    else:
        prevState = state
        prevAction = argmaxexplo(state)
        prevReward = reward

    return prevAction

# init and main loop
def run():
    global trashes, curPos, grid, picked, actions, ALPHA, GAMMA, N

    # init
    grid = makeGrid()
    printGrid(grid)
    ALPHA = float(sys.argv[1])
    GAMMA = float(sys.argv[2])
    N = int(sys.argv[3])
    print "ALPHA: ", ALPHA, "\nGAMMA: ", GAMMA, "\nN: ", N, "\nGRID_LENGTH: ", GRID_LENGTH

    # main loop
    firststate = ((0, 0), grid[0][0], 0)
    action = Qlearning(Q_INIT, firststate)
    while True:
        actions += 1
        reward = takeAction(action)
        action = Qlearning(reward, (curPos, grid[curPos[0]][curPos[1]], picked))

run()