Learning in Multiagent Systems

Why use Multi-agent Systems?
4 main reasons:
● Robustness
● Efficiency
● Reconfigurability
● Scalability
Robustness
While non-multiagent systems will fail as a whole upon failure of any specific part,
multi-agent systems can compensate.
Each agent is an independent piece of the system, so an agent’s failure does not
affect the rest of the agents very much.
Efficiency
Each agent runs in its own environment. This could easily be an entirely different
server than any of the other agents.
By giving each agent its own resources, we get parallelism as a side effect of the
paradigm we are using.
Reconfigurability
Because agents are independent, it is easy to reconfigure a system.
Each agent reacts only to its environment. Agents cannot directly see the actions of other
agents, so if an agent is removed, added, or changed, the remaining agents’ environments
stay the same.
Scalability
Multiagent systems benefit from the fact that each of them can be run
independently of the others.
Because of this, multiagent systems are very scalable, with the ability to add
agents at any time.
However, if all the agents are reading from the same environment (e.g., a database),
there could eventually be bottlenecking at the database level. This is no different
from an object-oriented system.
3 Big Questions About Learning
What data are we looking at?
How are we assigning rewards?
When have we learned enough? Is it time to start exploiting what we’ve learned?
Single Agent Learning
This is a model of a non-MAS learning system. The program performs actions on the
data and receives back the results of those actions (or the “correct” action) to
learn from. The program applies whatever learning algorithm it uses to that information
and continues on.
Single Agent Learning
Characteristics:
The program is given information that never changes.
This produces a consistent answer.
This converges to a solution over time.
UFO Game
In UFO Game, the program provides complete information to the learning
algorithm.
The learning works because the correct answer for a given set of data is the
same every time.
Multiagent Learning
This is a model of a multi-agent system.
Each agent can perform actions that
change the data.
Each agent has a task that it is trying to
perform.
Multiagent Learning
Characteristics:
Information is not static, because the data
can be changed by other agents.
Convergence is not guaranteed, because
the best solution may change based on
each agent’s actions.
Alternative Model
An Alternative Model:
If the data could be represented
this way, then the Agent could
have a total picture of the
information.
Changes made by other agents
could be directly observed.
Data would be effectively static,
because all changes would be
accounted for.
Alternative Model
Problems with this model:
Often, other agents’ actions cannot
be directly observed.
If they can be observed, the
observation eliminates the
robustness and reconfigurability of
the system, because each agent
would have to be altered for
different configurations of other
agents.
Alternative Model
Problems with this model:
For each agent:
    For each of that agent’s possible actions:
        dimensions = dimensions + 1
This raises our dimensionality
drastically. We want a method of
learning that keeps our dimensionality
close to where it is.
Challenges Posed by MAL
The Credit Assignment Problem:
If the actions of other agents cannot be observed, how can we tell whose action caused an effect?
Example:
Each of 4 agents has its own random byte, and there is a single shared random byte for all agents to act on.
Each agent must apply a logical operation (AND, OR, NOR, NAND, XOR) to the shared byte. Agent 1’s goal is to
make bit 1 of the shared byte a 1.
All agents decide on their actions, which are then executed in a random order, and the result is sent back
to the agents.
If Agent 1 observes that bit 1 is a 1, how can Agent 1 tell whether its operation made the bit a 1? If it had
picked a different action, would the bit still be a 1?
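A minimal Python sketch of this setup (the function names, the reward-free round structure, and the convention of treating “bit 1” as the least significant bit are assumptions for illustration):

import random

# Logical operations available to each agent (kept to 8-bit results).
OPS = {
    "AND":  lambda shared, own: shared & own,
    "OR":   lambda shared, own: shared | own,
    "XOR":  lambda shared, own: shared ^ own,
    "NAND": lambda shared, own: ~(shared & own) & 0xFF,
    "NOR":  lambda shared, own: ~(shared | own) & 0xFF,
}

def run_round(n_agents=4):
    shared = random.randrange(256)                           # the single shared byte
    own = [random.randrange(256) for _ in range(n_agents)]   # each agent's private byte
    chosen = [random.choice(list(OPS)) for _ in range(n_agents)]

    order = list(range(n_agents))
    random.shuffle(order)                                    # actions execute in a random order
    for i in order:
        shared = OPS[chosen[i]](shared, own[i])

    # Agent 1 only sees the final byte, not the execution order or the other
    # agents' operations, so it cannot tell whether its own choice set bit 1.
    return chosen[0], shared & 1

print(run_round())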
Challenges Posed by MAL
With instant feedback, the credit assignment problem is solved. The state can be
observed immediately before and after an action to determine the changes caused
by the action. However, there are few real-world problems where this is the case.
With delayed feedback and multiple agents, it is impossible to assign credit to any
agent about the results of their actions. In the time between acting and observing,
any or all of the other agents may change the state of the data.
Challenges Posed by MAL
Agents can only learn through the rewards they receive. If credit cannot be given
to an agent for the results of their action, they cannot gain rewards and cannot
learn.
How can we assign rewards?
Rewards in Cooperative Systems
Terms for Reward Systems
Alignment - the correlation between system-wide rewards and local rewards
Also called “Factoredness”
A reward system with high alignment gives high local rewards when high
systemwide rewards are given.
With high alignment, “When the system works, I get a reward”
Sensitivity - the correlation between an agent’s actions and its total reward
With high sensitivity, “When I do my part, I get a reward”
System Reward Only
This can work, but it works best with very few agents (best with 1, in fact).
Agents with delayed effects have a very hard time learning.
Local Reward Only
This is perfectly learnable for individual agents.
There is low alignment, so agents are not rewarded significantly for the finished
product.
Problems:
Let’s say the agents learn at different rates. The slow learners will bottleneck the
fast ones.
What if the problem were the bit-flipping example instead of the assembly line?
Agents may accidentally sabotage each other while trying to earn their local rewards.
Difference Reward
Agents are rewarded as follows:
Reward = (Reward after action) - (Reward after a simulated null action)
This reward allows each agent to be rewarded by their contribution to the total
system reward.
Problems: Simulations may be imperfect or impossible. If you can simulate the
effects of your actions, you may as well simulate all of your actions and pick the
highest rewarding action every time. In real life, this is impossible for most
problems.
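A minimal sketch of the difference reward, assuming a system-level reward function G and a way to re-evaluate it with one agent’s action replaced by a null action (G and the null action here are placeholders, not part of the original example):

def difference_reward(G, joint_actions, agent_i, null_action=None):
    """D_i = G(actual joint actions) - G(joint actions with agent i's action nulled).

    G: function mapping a list of actions to the system-wide reward.
    The second term must come from a simulation or model of the system,
    which is exactly the limitation described above.
    """
    actual = G(joint_actions)

    counterfactual = list(joint_actions)
    counterfactual[agent_i] = null_action        # the simulated "do nothing" action
    return actual - G(counterfactual)

# Toy usage: the system reward is the number of agents that chose action 1.
print(difference_reward(lambda acts: sum(a == 1 for a in acts), [1, 0, 1, 1], agent_i=0, null_action=0))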
Action-Value Learning
Agents can remember the rewards obtained for each of its actions.
Let’s say you have n slot machines, each with different odds. How can you learn
which is the best to use?
The easiest way to learn is to try each slot machine many times, and record the
results of every pull.
By taking the average rewards over a large number of pulls, you can determine
which slot machine has the highest payout.
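A small sketch of that sample-average approach (the payout odds below are made up):

import random

true_odds = [0.2, 0.5, 0.7]            # hidden payout probability of each machine (made up)
pulls = [0] * len(true_odds)           # how many times each machine was tried
totals = [0.0] * len(true_odds)        # total reward collected from each machine

for _ in range(30_000):
    arm = random.randrange(len(true_odds))                   # try every machine many times
    reward = 1.0 if random.random() < true_odds[arm] else 0.0
    pulls[arm] += 1
    totals[arm] += reward

averages = [totals[i] / pulls[i] for i in range(len(true_odds))]
print("estimated payouts:", [round(a, 2) for a in averages])
print("best machine:", averages.index(max(averages)))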
Multiple Agents in Action-Value Learning: The Bar Problem
There are b bars in a city, each with a different capacity c_b. There are n agents in
the city, who are thirsty from a hard day’s work.
Each must choose a bar to go to. If the bar is close to empty, the reward the
agents get is small. If the bar is too crowded, the reward is also small.
But if the bar has c_b people in it, the reward is high.
How can Action-Value Learning solve this problem?
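One possible sketch in Python: each agent keeps a running-average value per bar and usually picks its current best, with a little random exploration so the averages keep updating. The capacities, the reward shape, and ε are assumptions for illustration:

import random

N_AGENTS = 50
CAPACITY = [10, 25, 15]                      # c_b for each bar (assumed values)
EPSILON = 0.1                                # small chance of trying a random bar

values = [[0.0] * len(CAPACITY) for _ in range(N_AGENTS)]   # each agent's estimate per bar
counts = [[0] * len(CAPACITY) for _ in range(N_AGENTS)]

def bar_reward(attendance, capacity):
    # Highest when attendance is at capacity, low when nearly empty or overcrowded (assumed shape).
    return max(0.0, 1.0 - abs(attendance - capacity) / capacity)

for night in range(3000):
    choices = []
    for agent in range(N_AGENTS):
        if random.random() < EPSILON:
            choices.append(random.randrange(len(CAPACITY)))
        else:
            choices.append(max(range(len(CAPACITY)), key=lambda b: values[agent][b]))

    attendance = [choices.count(b) for b in range(len(CAPACITY))]

    for agent, b in enumerate(choices):
        r = bar_reward(attendance[b], CAPACITY[b])
        counts[agent][b] += 1
        values[agent][b] += (r - values[agent][b]) / counts[agent][b]   # running average

print("attendance on the last night:", attendance)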
Direct Policy Adjustment
Direct policy adjustment is an alternative to action-value learning.
Each agent has a “policy”: each of the agent’s actions paired with the probability of
that action being chosen. Agents remember their policy at every time step t.
Rewards are normalized to 0 <= reward <= 1.
For each time step t:
    action = choose an action based on Policy[t]
    reward = result of taking that action
    Policy[t+1][action] = reward + (1-reward)*Policy[t][action]
    For each unchosen action a_not:
        Policy[t+1][a_not] = (1-reward)*Policy[t][a_not]
Direct Policy Adjustment
While learning:
For each time step t:
    action = choose an action based on Policy[t]
    Policy[t+1][action] = reward + (1-reward)*Policy[t][action]
    For each unchosen action a_not:
        Policy[t+1][a_not] = (1-reward)*Policy[t][a_not]
So, for a reward of 1, the policy becomes 1 for that action and 0 for the rest.
For a reward of 0, the policy is left completely unchanged.
For a large reward, Policy[t+1][action] rises sharply and each Policy[t+1][a_not] shrinks sharply.
For a small reward, the probabilities shift only slightly in the same directions.
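A runnable sketch of that update, assuming a stateless problem and rewards already normalized to [0, 1] (the reward numbers are made up):

import random

actions = ["a0", "a1", "a2"]
policy = [1.0 / len(actions)] * len(actions)     # start from a uniform policy

def get_reward(action_index):
    # Placeholder environment: action a2 tends to pay the most (made-up numbers).
    return random.random() * [0.2, 0.5, 0.9][action_index]

for t in range(5000):
    chosen = random.choices(range(len(actions)), weights=policy)[0]
    r = get_reward(chosen)                       # assumed to lie in [0, 1]

    # Move probability toward the chosen action in proportion to the reward,
    # and shrink every other action's probability by the same (1 - r) factor.
    for a in range(len(actions)):
        if a == chosen:
            policy[a] = r + (1 - r) * policy[a]
        else:
            policy[a] = (1 - r) * policy[a]

print([round(p, 3) for p in policy])             # still sums to 1 after every update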
Flaws in Direct Policy Adjustment
Can easily get stuck in a local maximum, such as in the bar problem. It can
convince itself that there is some optimal sequence of bars to go to each night,
when really it depends on the actions of the other agents.
Neither technique shown here will necessarily converge to a perfect solution, nor should it.
This is because the rewards for actions depend largely on the actions of other agents.
This mirrors many real-world problems.
Markov Decision Processes - Notation
MDPs are a tuple <S, A, T, R>
S: a finite set of (S)tates
A: a finite set of (A)ctions
T: a set of transition probabilities, e.g., the probability that taking action a1 from state s1 leads to state s2
R: the rewards for taking each action from each state
MDPs - The Goal
Find a policy (π) that has the greatest possible long term rewards.
To do this, you need to be able to rank policies according to their possible
rewards.
We use value functions to rank policies.
Value Functions
This value function computes the expected discounted reward starting from state s, summed over t = 0 to inf:
Value of π for starting state s = Sum(from t=0 to t=inf)(Y^t * Reward(s at t))
0 <= Y < 1
Y is the discount factor, set by the user. If you set Y to .999, the sum converges only at a very high t
value, but if you set Y to .1, it converges at a low t value.
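A small numerical sketch of that sum, using a constant reward of 1 per step so the true value has the closed form 1/(1-Y); the horizon and Y values below are just for illustration:

def discounted_value(reward_per_step, Y, horizon):
    # Truncated version of: Sum over t of Y^t * Reward(s at t).
    return sum((Y ** t) * reward_per_step for t in range(horizon))

for Y in (0.1, 0.9, 0.999):
    approx = discounted_value(1.0, Y, horizon=10_000)
    exact = 1.0 / (1.0 - Y)                      # closed form for a constant reward of 1
    print(Y, round(approx, 3), round(exact, 3))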
When is Learning Finished?
If algorithms learn based purely on short term rewards (“Greedy learning”), they
can converge on a solution that’s bad in the long term.
Example: There is a split in the manufacturing line, because one robot can do its job
twice as fast as the next robot in the line. This robot passes off all of its work to
only one of these lines. Its local rewards remain the same, but the work stacks up
and long-term rewards are only half as good as they could be.
When is Learning Finished?
So that machines can continue to learn past a short term optimal solution, we can
use ε-learning.
In ε-learning, instead of always choosing the predicted “best” action, a random
action may be picked instead with probability ε.
Even if we are 99% sure an action is optimal, if ε is 10%, we will pick a random
option 10% of the time.
This allows our value functions to continue to be updated in the long term.
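A minimal sketch of such an ε choice (the dictionary of value estimates is a stand-in for whatever the learner has learned so far):

import random

def epsilon_choice(value_estimates, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the current best."""
    if random.random() < epsilon:
        return random.choice(list(value_estimates))          # explore
    return max(value_estimates, key=value_estimates.get)     # exploit

# Even if "right" looks clearly better, it is skipped about 10% of the time.
print(epsilon_choice({"left": 0.2, "right": 0.99}, epsilon=0.1))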
Q-Learning
Q-learning aims to find the long-term value of each action a at each individual
state s. To find these Q-values, it uses the following algorithm:
s = random_state
Q(s,a) = random() for every state-action pair
α = .01 # learning rate, usually small
for t in time:
    a = ε_choice(Q, s)
    s_prime, r = do_action(a)
    Q(s,a) = Q(s,a) + α*(r + y*max_a_prime(Q(s_prime, a_prime)) - Q(s,a))
    s = s_prime
In that update, y is the discount factor, which weights the relative value of future
rewards.
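A runnable toy version of that loop on a 5-state chain where reaching the right end pays 1; the environment, α, y, and ε values are all assumptions for illustration:

import random

N_STATES, ACTIONS = 5, ("left", "right")
alpha, y, epsilon = 0.1, 0.9, 0.1

Q = {(s, a): random.random() for s in range(N_STATES) for a in ACTIONS}

def do_action(s, a):
    s_prime = min(s + 1, N_STATES - 1) if a == "right" else max(s - 1, 0)
    r = 1.0 if s_prime == N_STATES - 1 else 0.0      # only the right end is rewarding
    return s_prime, r

s = random.randrange(N_STATES)
for t in range(20_000):
    if random.random() < epsilon:
        a = random.choice(ACTIONS)                               # ε exploration
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])            # greedy on current Q
    s_prime, r = do_action(s, a)
    best_next = max(Q[(s_prime, act)] for act in ACTIONS)
    Q[(s, a)] += alpha * (r + y * best_next - Q[(s, a)])         # the Q-learning update
    s = s_prime

print({k: round(v, 2) for k, v in Q.items()})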
Q-Learning
Why use Q-learning?
It is policy independent. This means that no matter what policy you begin with, it
will converge to the best solution.
Q-learning is not dependent on a model.
Why not use it?
Calculating max_a_prime(Q(s_prime, a_prime)) is difficult and time-intensive. In
problems with large state spaces, this may be impossible.
RTDP
RTDP stands for Real Time Dynamic Programming
RTDP is a learning technique that uses a model of the problem to converge
quickly to a solution.
By using a model, RTDP can estimate the value of its actions without interacting
with the actual problem. This can help learning in cases where interactions can be
difficult or costly.
Example: by using a model of the stock market, or a mirror of it, you can train a
learning model without wasting money on an untrained algorithm.
RTDP
v = [random values for each state s]
r(s, a) = random values for each action at each state
p = the probability that each action a at each state s leads to state s_prime
For each time t:
    a = ε_choose(s) # has access to v, r and p
    reward, s_prime = do_action(a)
    r(s,a) = reward
    p(s_prime given (s,a)) is adjusted according to the results so far
    v(s) = v(s) + α*(max_a(r(s,a) + y*(sum of v(other states) * their probability)) - v(s))
    s = s_prime
The v(s) = … line can be summarized as “the value of s is nudged, by the learning rate,
toward the best available estimate: the maximum over actions of the immediate reward plus the
probability-weighted values of the states that action can lead to.” Still a mouthful though.
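A small sketch of the model-updating step, where p(s_prime given (s, a)) is estimated from counts of the transitions seen so far (the names here are illustrative):

from collections import defaultdict

counts = defaultdict(int)     # counts[(s, a, s_prime)]: times that transition was observed
totals = defaultdict(int)     # totals[(s, a)]: times action a was tried in state s

def record_transition(s, a, s_prime):
    counts[(s, a, s_prime)] += 1
    totals[(s, a)] += 1

def p(s_prime, s, a):
    # Empirical estimate of P(s_prime | s, a); 0 if (s, a) has never been tried.
    return counts[(s, a, s_prime)] / totals[(s, a)] if totals[(s, a)] else 0.0

record_transition("s1", "go", "s2")
record_transition("s1", "go", "s2")
record_transition("s1", "go", "s3")
print(p("s2", "s1", "go"))    # 2/3 after three observed transitions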
Multiagent MDPs
The algorithms presented thus far have all been algorithms for single-agent
learning. Agents with these learning algorithms can be combined in many
ways. By sharing value functions v(s) or Q-value functions Q(s,a), they can
pool their resources to learn faster.
Examples:
Many joint-action learners that share a state and a reward. Their objective is to
find a policy mapping states to actions that maximizes their team reward.
Independent action learners with a team reward. Each agent finds its own
policy mapping the system state to its own action to maximize team rewards.
Independent learners with local rewards. Each agent maps s to a to maximize
its individual rewards.
Multiagent MDPs
Local state, independent action with local reward. Each agent maps s to a to
maximize their local rewards. Value functions are shared. Example: Google
arms https://youtu.be/V05SuCSRAtg
Local state, independent action with team reward. Each agent maps s to a to
maximize team rewards.
Joint Action Learning
Q-Learning agent that learns Q-values for joint actions rather than individual
actions.
For JAL, Q is a function of an a-vector holding the decisions of all agents.
For joint action learning to take place, all agents must be able to see the actions of
all other agents.
Learning Equation: Q(a) = Q(a) + α*(r-Q(a))
“The value of a is set to the value of a plus the learning rate times the difference
between the reward and the expected reward.”
Joint Action Learning
This works because the games are stateless.
However, instead of learning the value of a single agent’s action a, the value of the
joint action [a1, a2, a3, …] of all agents is being learned.
Consider matrix games:

     B0   B1
A0   r1   r2
A1   r3   r4

Each tuple (Ax, Bx) now has its own Q-value and probability given the action of B, and A can decide
based on the frequency that B plays 0 or 1 (without knowing B’s rewards).
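A sketch of a joint-action learner playing A in a 2x2 game like the one above: it keeps a Q-value for every joint action, counts how often B plays 0 or 1, and picks the action with the best expected value under that empirical model. The payoff numbers, B’s mixed strategy, and the exploration rate are assumptions:

import random

# A's payoffs r1..r4 for the joint actions (A0,B0), (A0,B1), (A1,B0), (A1,B1) - made-up numbers.
payoff_A = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
alpha = 0.1

Q = {joint: 0.0 for joint in payoff_A}       # one Q-value per joint action (a, b)
b_counts = [1, 1]                            # how often B has been seen playing 0 or 1

def choose_A():
    # Expected value of each of A's actions under the observed frequency of B's play.
    total = sum(b_counts)
    ev = [sum(Q[(a, b)] * b_counts[b] / total for b in (0, 1)) for a in (0, 1)]
    return max((0, 1), key=lambda a: ev[a])

for t in range(5000):
    a = choose_A() if random.random() > 0.1 else random.randrange(2)   # some exploration
    b = random.choices((0, 1), weights=(0.3, 0.7))[0]                  # B's (hidden) mixed strategy
    r = payoff_A[(a, b)]
    Q[(a, b)] += alpha * (r - Q[(a, b)])     # the JAL update: Q(a) = Q(a) + α*(r - Q(a))
    b_counts[b] += 1                         # A observes B's action, as JAL requires

print(Q, b_counts)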
Nash-Q Learning
In Nash-Q Learning, each agent observes not only the actions of the other agents,
but their rewards as well. By doing this, they can calculate the Q-values for all
agents.
Agents then update their Q-values based on the discounted rewards for the game
assuming that their strategy was permanent.
Gradient Ascent Algorithms
Up until now, we’ve considered α, the learning rate, to be a constant.
However, there are many add-ons to the algorithms we’ve learned in which α depends
directly on the difference in performance between the learned policy and some
simple baseline strategy. For example, a computer learning tic-tac-toe may have its results
compared to an algorithm that chooses a corner to start and then plays randomly.
If it does better than that algorithm, α will be small. If it does worse, α will be large.
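A tiny sketch of that idea, assuming we already have running scores for the learner and for the simple baseline strategy (the function name and the α values are placeholders):

def adaptive_learning_rate(learner_score, baseline_score, small=0.01, large=0.2):
    """Learn slowly while beating the baseline, quickly while losing to it."""
    return small if learner_score > baseline_score else large

print(adaptive_learning_rate(learner_score=0.8, baseline_score=0.6))   # doing better -> small α
print(adaptive_learning_rate(learner_score=0.4, baseline_score=0.6))   # doing worse  -> large α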
Real World Examples of MAL
Ants have very little individual intelligence, but with simple communication tools,
they can learn to form efficient societies:
https://www.youtube.com/watch?v=vG-QZOTc5_Q
Simulation of recruitment:
http://alexbelezjaks.com/works/ant-colony-simulation/
Ant-based Algorithms
Shortest Path:
To find the shortest path between two
points, use many agents who drop
pheromones at intervals as they traverse
their space. Once they find the second
point, they follow their path back, dropping
pheromones. Each agent can be
influenced to follow a path if it has more
pheromones. Because shorter paths are
completed more quickly, pheromone
accumulates on them faster, and they will
eventually dominate traffic.
Ant-based Algorithms
Traveling Salesman Problem:
The traveling salesman problem can be seen as an extension of the shortest-path
algorithm. Each agent’s purpose is to find a path visiting each significant node
on a graph. By marking each edge with pheromones, giving it a “weight”, the large
number of ants comes to prefer particular edges between points, eventually
converging to a solution.
https://www.youtube.com/watch?v=QZnvRXRsHHg
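A compact ant-colony sketch for a tiny traveling-salesman instance: each ant builds a tour biased toward short, heavily marked edges, pheromone partly evaporates each round, and every tour deposits pheromone inversely proportional to its length. The city coordinates and all parameters are made up:

import math
import random

cities = [(0, 0), (1, 5), (5, 2), (6, 6), (8, 3)]      # made-up coordinates
n = len(cities)

def dist(i, j):
    (x1, y1), (x2, y2) = cities[i], cities[j]
    return math.hypot(x1 - x2, y1 - y2)

pheromone = [[1.0] * n for _ in range(n)]
PERSISTENCE, N_ANTS, N_ROUNDS = 0.5, 20, 100           # fraction kept after evaporation, etc.

def build_tour():
    tour = [random.randrange(n)]
    unvisited = set(range(n)) - {tour[0]}
    while unvisited:
        here = tour[-1]
        candidates = list(unvisited)
        # Prefer edges with more pheromone and shorter length.
        weights = [pheromone[here][j] / dist(here, j) for j in candidates]
        nxt = random.choices(candidates, weights=weights)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(tour):
    return sum(dist(tour[i], tour[(i + 1) % n]) for i in range(n))

best = None
for _ in range(N_ROUNDS):
    tours = [build_tour() for _ in range(N_ANTS)]
    pheromone = [[p * PERSISTENCE for p in row] for row in pheromone]   # evaporation
    for tour in tours:
        deposit = 1.0 / tour_length(tour)              # shorter tours deposit more per edge
        for i in range(n):
            a, b = tour[i], tour[(i + 1) % n]
            pheromone[a][b] += deposit
            pheromone[b][a] += deposit
    round_best = min(tours, key=tour_length)
    if best is None or tour_length(round_best) < tour_length(best):
        best = round_best

print(best, round(tour_length(best), 2))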
Air Traffic Control
Air traffic control aims to maximize the movement of people from place to place
without allowing the number of airplanes in any given area to exceed the legal
air-traffic limit. The altitude of each individual airplane must be adjusted to avoid
collisions with other airplanes.
The system must be able to adjust in a reasonable amount of time to weather and
other circumstances.
Actions Available
These actions are available to control air traffic:
An airplane may be delayed.
An airplane’s altitude may be adjusted.
An airplane may be re-routed through another path.
An airplane’s speed may be changed.
Sites Available to house Agents
Any fixed geographic location, including individual
airports and individual “fixes”. The agents may
also be housed in the planes themselves.
A “fix”, in air traffic terminology, is an intersection
between two different routes between cities.
Information available to the Agents
Agents can see anything in range of their radar, and they can communicate via the
radio to learn more information. The more information they have, the longer it
takes to learn, so be selective.
How would we design this system?
Decide how many types of agents there are.
Decide where to house the agents.
Define the actions for an agent.
Define the rewards for an agent.
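One hypothetical way those decisions could be sketched as code; every class name, action, observation, and reward formula below is an assumption for illustration, not a prescribed design:

from dataclasses import dataclass, field
from enum import Enum, auto

class Action(Enum):
    # The four control actions listed above.
    DELAY = auto()
    CHANGE_ALTITUDE = auto()
    REROUTE = auto()
    CHANGE_SPEED = auto()

@dataclass
class FixAgent:
    """An agent housed at one fix or airport."""
    name: str
    q_values: dict = field(default_factory=dict)    # (observation, Action) -> learned value

    def observe(self, radar_contacts, radio_reports):
        # Keep the observation small: the more information, the longer learning takes.
        return (len(radar_contacts), len(radio_reports))

    def local_reward(self, planes_over_limit, total_delay_minutes):
        # Assumed local reward: penalize congestion above the legal limit and delay.
        return -(planes_over_limit + total_delay_minutes)

agent = FixAgent(name="FIX-A")
print(agent.observe(radar_contacts=["UA12", "DL34"], radio_reports=["weather"]))
print(agent.local_reward(planes_over_limit=0, total_delay_minutes=15))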