Learning in Multiagent Systems

Why use Multiagent Systems?
Four main reasons:
● Robustness
● Efficiency
● Reconfigurability
● Scalability

Robustness
While a non-multiagent system fails as a whole when any specific part fails, a multiagent system can compensate. Each agent is an independent piece of the system, so one agent's failure has little effect on the rest of the agents.

Efficiency
Each agent runs in its own environment, which could easily be an entirely different server from any of the other agents. By giving each agent its own resources, we get parallelism as a side effect of the paradigm we are using.

Reconfigurability
Because agents are independent, it is easy to reconfigure the system. Each agent reacts only to its environment and cannot directly see the actions of other agents, so if an agent is removed, added, or changed, the other agents' environments remain the same.

Scalability
Multiagent systems benefit from the fact that each agent can be run independently of the others, so the system is very scalable and agents can be added at any time. However, if all the agents read from the same environment (e.g., a database), that database can eventually become a bottleneck. This is no different from an object-oriented system.

3 Big Questions About Learning
● What data are we looking at?
● How are we assigning rewards?
● When have we learned enough? Is it time to start exploiting what we have learned?

Single Agent Learning
This is a model of a non-MAS learning system. The program performs actions on the data and receives back the results of those actions (or the "correct" action) to learn from. The program applies whatever learning algorithm it uses to that information and continues on.

Single Agent Learning
Characteristics: the program is given information that never changes, which produces a consistent answer, and the learner converges to a solution over time.

UFO Game
In the UFO Game, the program provides complete information to the learning algorithm. The learning works because the correct answer for a given set of data is the same every time.

Multiagent Learning
This is a model of a multiagent system. Each agent can perform actions that change the data, and each agent has a task that it is trying to perform.

Multiagent Learning
Characteristics: information is not static, because the data can be changed by other agents. Convergence is not guaranteed, because the best solution may change based on each agent's actions.

Alternative Model
An alternative model: if the data could be represented this way, then the agent could have a total picture of the information. Changes made by other agents could be directly observed, and the data would be effectively static, because all changes would be accounted for.

Alternative Model
Problems with this model: often, other agents' actions cannot be directly observed. If they can be observed, the observation eliminates the robustness and reconfigurability of the system, because each agent would have to be altered for different configurations of other agents.

Alternative Model
Problems with this model:

For each agent:
    For each possible action of that agent:
        dimensions = dimensions + 1

This raises our dimensionality drastically. We want a method of learning that keeps our dimensionality close to where it is.

Challenges Posed by MAL
The Credit Assignment Problem: if the actions of other agents cannot be observed, how can we tell whose action caused an effect?

Example: each of 4 agents holds its own random byte, and there is a single shared random byte for all agents to act on. All agents must apply a logical operation (AND, OR, NOR, NAND, XOR) to the shared byte. Agent 1's goal is to make bit 1 of the byte a 1. All agents decide on their actions, which are then executed in a random order, and the result is sent back to the agents. If Agent 1 observes that bit 1 is a 1, how can Agent 1 tell whether its own operation made the bit a 1? If it had picked a different action, would the bit still be a 1?
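To make the credit assignment problem concrete, here is a minimal Python simulation of the shared-byte example. It assumes each agent combines the shared byte with its own private operand byte (the slides do not spell out the operand) and that "bit 1" means the bit at position 1.

import random

OPS = {
    "AND":  lambda x, y: x & y,
    "OR":   lambda x, y: x | y,
    "XOR":  lambda x, y: x ^ y,
    "NAND": lambda x, y: ~(x & y) & 0xFF,
    "NOR":  lambda x, y: ~(x | y) & 0xFF,
}

def run_round(choices, operands, shared):
    """Apply each agent's chosen operation to the shared byte in a random order."""
    order = list(range(len(choices)))
    random.shuffle(order)
    for i in order:
        shared = OPS[choices[i]](shared, operands[i])
    return shared

shared = random.randrange(256)                          # the single byte all agents act on
operands = [random.randrange(256) for _ in range(4)]    # each agent's own random byte
choices = [random.choice(list(OPS)) for _ in range(4)]  # each agent's chosen operation

result = run_round(choices, operands, shared)
bit1 = (result >> 1) & 1
# Agent 1 sees only the final byte. Whether its own operation (choices[0]) actually
# caused bit 1 to be set depends on the unobserved choices of the other agents and
# on the random execution order: this is the credit assignment problem.
print(f"Agent 1's action: {choices[0]}, final bit 1 = {bit1}")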
Challenges Posed by MAL
With instant feedback, the credit assignment problem is solved: the state can be observed immediately before and after an action to determine the changes caused by that action. However, few real-world problems work this way. With delayed feedback and multiple agents, it is impossible to assign credit to any single agent for the results of its actions, because in the time between acting and observing, any or all of the other agents may change the state of the data.

Challenges Posed by MAL
Agents can only learn through the rewards they receive. If credit cannot be given to an agent for the results of its action, it cannot gain rewards and cannot learn. How can we assign rewards?

Rewards in Cooperative Systems
Terms for reward systems:
● Alignment - the correlation between system-wide rewards and local rewards. Also called "factoredness". A reward system with high alignment gives high local rewards when system-wide rewards are high. With high alignment, "When the system works, I get a reward."
● Sensitivity - the correlation between an agent's actions and its total reward. With high sensitivity, "When I do my part, I get a reward."

System Reward Only
This can work, but it works best with very few agents (best with 1, in fact). Agents whose actions have delayed effects have a very hard time learning.

Local Reward Only
This is perfectly learnable for individual agents, but alignment is low, so agents are not rewarded significantly for the finished product. Problems: suppose the agents learn at different rates; the slow learners will bottleneck the fast ones. And what if the problem were the bit-flipping example instead of an assembly line? Agents may accidentally sabotage each other while chasing local rewards.

Difference Reward
Agents are rewarded as follows:

    Reward = (system reward after the action) - (system reward after a simulated null action)

This reward lets each agent be rewarded for its contribution to the total system reward. Problems: simulations may be imperfect or impossible, and if you can simulate the effect of one action, you may as well simulate all of your actions and pick the highest-rewarding one every time. In real life, this is impossible for most problems.

Action-Value Learning
Agents can remember the rewards obtained for each of their actions. Suppose you have n slot machines, each with different odds. How can you learn which is the best to use? The easiest way is to try each slot machine many times and record the result of every pull. By averaging the rewards over a large number of pulls, you can determine which slot machine has the highest payout.

Multiple Agents in Action-Value Learning: The Bar Problem
There are b bars in a city, each with a different capacity c_b. There are n agents in the city, who are thirsty from a hard day's work, and each must choose a bar to go to. If the bar is close to empty, the reward the agents get is small. If the bar is too crowded, the reward is also small. But if the bar has about c_b people in it, the reward is high. How can action-value learning solve this problem?
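One way to attack the bar problem with action-value learning is to treat each bar like a slot machine arm: each agent keeps a running average of the reward it has received from each bar, usually picks the bar with the best average, and occasionally tries a random one. The capacities, the reward shape, and the value of ε below are illustrative assumptions, not values from the slides.

import math
import random

N_AGENTS, N_BARS, EPSILON = 20, 3, 0.1
CAPACITY = [4, 6, 10]          # assumed capacities c_b for each bar

def bar_reward(attendance, capacity):
    """Assumed reward shape: highest when attendance is near capacity,
    small when the bar is nearly empty or overcrowded."""
    return attendance * math.exp(-attendance / capacity)

# Each agent keeps a running average of the reward per bar (its action-value estimates).
values = [[0.0] * N_BARS for _ in range(N_AGENTS)]
counts = [[0] * N_BARS for _ in range(N_AGENTS)]

for week in range(5000):
    # Each agent picks a bar: usually its best estimate, sometimes a random one.
    choices = []
    for a in range(N_AGENTS):
        if random.random() < EPSILON:
            choices.append(random.randrange(N_BARS))
        else:
            choices.append(max(range(N_BARS), key=lambda b: values[a][b]))
    attendance = [choices.count(b) for b in range(N_BARS)]
    # Update each agent's running average for the bar it chose.
    for a, b in enumerate(choices):
        r = bar_reward(attendance[b], CAPACITY[b])
        counts[a][b] += 1
        values[a][b] += (r - values[a][b]) / counts[a][b]

print("Final attendance per bar:", attendance)

Note that each agent's reward depends on how many other agents chose the same bar, so the "best" bar for any one agent keeps shifting as the others learn: exactly the non-static data described above.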
Direct Policy Adjustment
Direct policy adjustment is an alternative to action-value learning. Each agent has a "policy": a list of the agent's actions, each with a probability of that action being chosen. Agents remember their policy at every time step t, and rewards are normalized to 0 <= reward <= 1.

While learning, for each time step t:
    action = choose an action according to Policy[t]
    Policy[t+1][action] = reward + (1-reward)*Policy[t][action]
    For each unchosen action a_not:
        Policy[t+1][a_not] = (1-reward)*Policy[t][a_not]

Direct Policy Adjustment
With this update rule, a reward of 1 makes the policy 1 for the chosen action and 0 for the rest, while a reward of 0 leaves the policy unchanged. In between, the larger the reward, the more probability shifts toward the chosen action (Policy[t+1][action] goes up) and away from the unchosen ones (Policy[t+1][a_not] goes down); a small reward barely moves the policy.

Flaws in Direct Policy Adjustment
It can easily get stuck in a local maximum, as in the bar problem: it can convince itself that there is some optimal sequence of bars to go to each night, when really the best choice depends on the actions of the other agents. Neither technique shown here will necessarily converge to a perfect solution, nor should it, because the rewards for actions are largely determined by the actions of other agents. This mirrors many real-world problems.

Markov Decision Processes - Notation
MDPs are a tuple <S, A, T, R>:
● S: a finite set of (S)tates
● A: a finite set of (A)ctions
● T: a set of transition probabilities, e.g. the probability that taking action a1 from state s1 leads to s2
● R: the reward for taking each action from each state

MDPs - The Goal
Find a policy (π) that has the greatest possible long-term reward. To do this, you need to be able to rank policies according to their possible rewards. We use value functions to rank policies.

Value Functions
This value function computes the expected discounted reward of starting in state s and following π over t = 0 to infinity:

    V_π(s) = Sum from t=0 to infinity of γ^t * R(s_t),  with 0 <= γ < 1

The discount factor γ is set by the user. If you set γ to 0.999, the sum converges only at a very high t; if you set γ to 0.1, it converges at a low t.

When is Learning Finished?
If algorithms learn based purely on short-term rewards ("greedy learning"), they can converge on a solution that is bad in the long term. Example: there is a split in a manufacturing line because one robot can do its job twice as fast as the next robot in the line. This robot passes off all of its work to only one of the two downstream lines. Its local rewards stay the same, but the work stacks up, and the long-term rewards are only half as good as they could be.

When is Learning Finished?
So that machines can continue to learn past a short-term optimal solution, we can use ε-learning. In ε-learning, instead of always choosing the predicted "best" action, a random action may be picked instead with probability ε. Even if we are 99% sure an action is optimal, if ε is 10% we will pick a random option 10% of the time. This allows our value functions to continue to be updated in the long term.
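A minimal sketch of ε-learning as an action-selection rule; this is the kind of ε_choice helper assumed by the Q-learning and RTDP pseudocode that follows. The action-value table here is made up for illustration.

import random

def epsilon_choice(action_values, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise pick the action
    with the highest estimated value. action_values maps action -> estimate."""
    if random.random() < epsilon:
        return random.choice(list(action_values))
    return max(action_values, key=action_values.get)

# Example: even though 'b' currently looks best, it is only chosen about 90% of
# the time, so the estimates for 'a' and 'c' keep being refreshed.
estimates = {"a": 0.2, "b": 0.9, "c": 0.5}
picks = [epsilon_choice(estimates, epsilon=0.1) for _ in range(1000)]
print({a: picks.count(a) for a in estimates})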
Q-Learning
Q-learning aims to find the long-term value of each action a at each individual state s. To find these Q-values, it uses the following algorithm:

s = random_state
Q(s,a) = random() for every state s and action a
α = 0.01  # learning rate, usually small
For each time step t:
    a = ε_choice(Q(s, ·))
    s_prime, r = do_action(a)
    Q(s,a) = Q(s,a) + α*(r + γ*max over a_prime of Q(s_prime, a_prime) - Q(s,a))
    s = s_prime

In that update, γ is the discount rate, which can be thought of as the relative value of future actions.

Q-Learning
Why use Q-learning? It is policy independent: no matter what policy you begin with, it will converge to the best solution. Q-learning is also not dependent on a model. Why not use it? Calculating max over a_prime of Q(s_prime, a_prime) is difficult and time-intensive, and in problems with large state spaces it may be impossible.

RTDP
RTDP stands for Real-Time Dynamic Programming. RTDP is a learning technique that uses a model of the problem to converge quickly to a solution. By using a model, RTDP can estimate the value of its actions without interacting with the actual problem, which helps in cases where interactions are difficult or costly. Example: by using a model of the stock market, or a mirror of it, you can train a learning model without wasting money on an untrained algorithm.

RTDP
v(s) = random value for each state s
r(s,a) = random value for each action at each state
p(s_prime | s,a) = the probability that action a at state s leads to state s_prime
For each time step t:
    a = ε_choice(s)  # has access to v, r, and p
    reward, s_prime = do_action(a)
    r(s,a) = reward
    p(s_prime | s,a) is adjusted according to the results so far
    v(s) = v(s) + α*(max over a of [r(s,a) + γ*Sum over s'' of p(s''|s,a)*v(s'')] - v(s))
    s = s_prime

The v(s) line can be summarized as: "The value of s is moved, by a step of size α, toward the best estimated one-step return, i.e. the immediate reward plus the discounted values of the possible next states weighted by their probabilities." Still a mouthful, though.

Multiagent MDPs
The algorithms presented so far are all single-agent learning algorithms, but agents using them can be combined in many ways. By sharing value functions v(s) or Q-value functions Q(s,a), agents can pool their resources to learn faster.
● Many joint-action learners that share a state and a reward: their objective is to find a policy mapping states to joint actions that maximizes their team reward.
● Independent action learners with a team reward: each agent finds its own policy mapping the system state to its own action to maximize the team reward.
● Independent learners with local rewards: each agent maps s to a to maximize its individual reward.

Multiagent MDPs
● Local state, independent action, local reward: each agent maps s to a to maximize its local reward. Value functions are shared. Example: Google arms https://youtu.be/V05SuCSRAtg
● Local state, independent action, team reward: each agent maps s to a to maximize the team reward.
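As a usage sketch, here is a small runnable version of the Q-learning loop above on a toy chain environment. The environment, its rewards, and the parameter values are assumptions made just for this illustration.

import random

# Toy chain environment: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 pays a reward of 1 and restarts the walk at state 0.
N_STATES, ACTIONS = 5, [0, 1]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def do_action(s, a):
    s_prime = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    if s_prime == N_STATES - 1:
        return 0, 1.0          # goal reached: reward 1, restart at state 0
    return s_prime, 0.0

Q = {(s, a): random.random() for s in range(N_STATES) for a in ACTIONS}

s = 0
for t in range(20000):
    # ε-greedy choice of action.
    if random.random() < EPSILON:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    s_prime, r = do_action(s, a)
    # Q-learning update: move Q(s,a) toward r + γ * best value of the next state.
    best_next = max(Q[(s_prime, act)] for act in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
    s = s_prime

# After training, "right" (action 1) should look better than "left" in states 0..3.
for state in range(N_STATES - 1):
    print(state, {a: round(Q[(state, a)], 2) for a in ACTIONS})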
Joint Action Learning
A joint action learner (JAL) is a Q-learning agent that learns Q-values for joint actions rather than individual actions: Q is a function of an action vector holding the decisions of all agents. For joint action learning to take place, every agent must be able to see the actions of all other agents.

Learning equation:

    Q(a) = Q(a) + α*(r - Q(a))

"The value of the joint action a is set to its current value plus the learning rate times the difference between the reward received and the expected reward."

Joint Action Learning
This works because the games are stateless. However, instead of learning the value of a single action, the values of all agents' actions are being learned: [a1, a2, a3, ...] rather than a. Consider a matrix game:

         B0   B1
    A0   r1   r2
    A1   r3   r4

Each tuple (Ax, Bx) now has its own Q-value and a probability given the action of B, so A can decide based on the frequency with which B plays 0 or 1 (without knowing B's rewards).

Nash-Q Learning
In Nash-Q learning, each agent observes not only the actions of the other agents but their rewards as well. By doing this, it can calculate the Q-values for all agents. Agents then update their Q-values based on the discounted rewards for the game, assuming that each agent's strategy is permanent.

Gradient Ascent Algorithms
Until now we have treated α, the learning rate, as a constant. However, there are many variants of the algorithms we have seen in which α is tied directly to the difference in performance between the learned policy and some simple strategy. For example, a computer learning tic-tac-toe may have its results compared to an algorithm that starts in a corner and then plays randomly. If the learner does better than that algorithm, α is kept small; if it does worse, α is made large.

Real World Examples of MAL
Ants have very little individual intelligence, but with simple communication tools they can learn to form efficient societies: https://www.youtube.com/watch?v=vG-QZOTc5_Q
Simulation of recruitment: http://alexbelezjaks.com/works/ant-colony-simulation/

Ant-based Algorithms
Shortest path: to find the shortest path between two points, use many agents that drop pheromones at intervals as they traverse their space. Once an agent finds the second point, it follows its path back, dropping pheromones. Each agent can be influenced to follow a path that carries more pheromone. The shortest paths are traversed and reinforced most quickly, so they eventually dominate traffic.

Ant-based Algorithms
Traveling Salesman Problem: the traveling salesman problem can be seen as an extension of the shortest-path algorithm. Each agent's purpose is to find a path that visits every significant node on a graph. By marking each edge with pheromone, giving it a "weight", the large population of ants comes to prefer the better individual paths between points, eventually converging to a solution. https://www.youtube.com/watch?v=QZnvRXRsHHg
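A minimal sketch of the pheromone mechanism for the shortest-path case, assuming just two candidate routes between the same pair of points; the route lengths, deposit rule, and evaporation rate are illustrative assumptions.

import random

# Two candidate routes between the same two points; the shorter one should win.
LENGTHS = {"short": 5, "long": 9}        # assumed route lengths
pheromone = {"short": 1.0, "long": 1.0}
EVAPORATION = 0.02

def choose_route():
    """Ants pick a route with probability proportional to its pheromone level."""
    total = sum(pheromone.values())
    r = random.uniform(0, total)
    return "short" if r < pheromone["short"] else "long"

for step in range(2000):
    route = choose_route()
    # Deposit is inversely proportional to route length, a standard proxy for the
    # fact that shorter routes are completed more quickly and reinforced more often.
    pheromone[route] += 1.0 / LENGTHS[route]
    # Pheromone evaporates everywhere, so routes that are not reinforced fade away.
    for k in pheromone:
        pheromone[k] *= (1 - EVAPORATION)

print(pheromone)   # the "short" route ends up holding most of the pheromone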
Air Traffic Control
Air traffic control aims to maximize the movement of people from place to place without letting the number of airplanes in any given area exceed the legal limit of air traffic. The altitude of each individual airplane must be adjusted to avoid collisions with other airplanes, and the system must be able to adjust in a reasonable amount of time to weather and other circumstances.

Actions Available
These actions are available to control air traffic:
● An airplane may be delayed.
● An airplane's altitude may be adjusted.
● An airplane may be re-routed through another path.
● An airplane's speed may be changed.

Sites Available to House Agents
Any fixed geographic location, including individual airports and individual "fixes" (a "fix", in air traffic terminology, is an intersection between two different routes between cities). The agents may also be housed in the planes themselves.

Information Available to the Agents
Agents can see anything in range of their radar, and they can communicate via radio to learn more information. The more information they have, the longer it takes to learn, so be selective.

How would we design this system?
● Decide how many types of agents there are.
● Decide where to house the agents.
● Define the actions for an agent.
● Define the rewards for an agent.
One possible set of answers is sketched below.
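A purely illustrative answer to these design questions, expressed as a code skeleton: a single agent type housed at each fix, choosing among the four actions listed above and learning from a difference-reward style signal. Every name, number, and structure here is an assumption rather than a prescribed design.

from dataclasses import dataclass, field
from typing import Dict
import random

# Assumed design: one agent type, housed at each fix, choosing one of the four
# actions from the slides for traffic passing through its fix.
ACTIONS = ["delay", "change_altitude", "reroute", "change_speed"]

@dataclass
class FixAgent:
    name: str
    values: Dict[str, float] = field(default_factory=lambda: {a: 0.0 for a in ACTIONS})
    counts: Dict[str, int] = field(default_factory=lambda: {a: 0 for a in ACTIONS})

    def choose(self, epsilon: float = 0.1) -> str:
        # ε-greedy choice over the agent's action-value estimates.
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(self.values, key=self.values.get)

    def learn(self, action: str, reward: float) -> None:
        # Running-average action-value update, as in the slot machine example.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def difference_reward(congestion_with_action: float, congestion_with_null: float) -> float:
    """Difference-reward style signal: how much less congestion there is than if
    this agent had done nothing (both numbers would come from a traffic simulator)."""
    return congestion_with_null - congestion_with_action

# Usage sketch (the congestion numbers are placeholders for a simulator's output):
agent = FixAgent("FIX_ALPHA")
action = agent.choose()
agent.learn(action, difference_reward(congestion_with_action=3.0, congestion_with_null=5.0))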