Humanity faces numerous problems of common-pool resource appropriation. This class of multi-agent social dilemma includes the problems of ensuring sustainable use of fresh water, common fisheries, grazing pastures, and irrigation systems. Abstract models of common-pool resource appropriation based on non-cooperative game theory predict that self-interested agents will generally fail to find socially positive equilibria---a phenomenon called the tragedy of the commons. However, in reality, human societies are sometimes able to discover and implement stable cooperative solutions. Decades of behavioral game theory research have sought to uncover aspects of human behavior that make this possible. Most of that work was based on laboratory experiments where participants only make a single choice: how much to appropriate. Recognizing the importance of spatial and temporal resource dynamics, a recent trend has been toward experiments in more complex real-time video game-like environments. However, standard methods of non-cooperative game theory can no longer be used to generate predictions for this case. Here we show that deep reinforcement learning can be used instead. To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation. Our experiments highlight the importance of trial-and-error learning in common-pool resource appropriation and shed light on the relationship between exclusion, sustainability, and inequality.
A significant amount of research in recent years has been dedicated towards single agent deep reinforcement learning. Much of the success of deep reinforcement learning can be attributed towards the use of experience replay memories within which state transitions are stored. Function approximation methods such as convolutional neural networks (referred to as deep Q-Networks, or DQNs, in this context) can subsequently be trained through sampling the stored transitions. However, considerations are required when using experience replay memories within multi-agent systems, as stored transitions can become outdated due to agents updating their respective policies in parallel . In this work we apply leniency  to multi-agent deep reinforcement learning (MA-DRL), acting as a control mechanism to determine which state-transitions sampled are allowed to update the DQN. Our resulting Lenient-DQN (LDQN) is evaluated using variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) outlined by Busoniu et al. . The LDQN significantly outperforms the existing hysteretic DQN (HDQN)  within environments that yield stochastic rewards. Based on results from experiments conducted using vanilla and double Q-learning versions of the lenient and hysteretic algorithms, we advocate a hybrid approach where learners initially use vanilla Q-learning before transitioning to double Q-learners upon converging on a cooperative joint policy.
In social dilemmas individuals face a temptation to increase their payoffs in the short run at a cost to the long run total welfare. Much is known about how cooperation can be stabilized in the simplest of such settings: repeated Prisoner's Dilemma games. However, there is relatively little work on generalizing these insights to more complex situations. We start to fill this gap by showing how to use modern reinforcement learning methods to generalize a highly successful Prisoner's Dilemma strategy: tit-for-tat. We construct artificial agents that act in ways that are simple to understand, nice (begin by cooperating), provokable (try to avoid being exploited), and forgiving (following a bad turn try to return to mutual cooperation). We show both theoretically and experimentally that generalized tit-for-tat agents can maintain cooperation in more complex environments. In contrast, we show that employing purely reactive training techniques can lead to agents whose behavior results in socially inefficient outcomes.
This paper deals with the problem of properly simulating the Internet of Things (IoT). Simulating an IoT allows evaluating strategies that can be employed to deploy smart services over different kinds of territories. However, the heterogeneity of scenarios seriously complicates this task. This imposes the use of sophisticated modeling and simulation techniques. We discuss novel approaches for the provision of scalable simulation scenarios, that enable the real-time execution of massively populated IoT environments. Attention is given to novel hybrid and multi-level simulation techniques that, when combined with agent-based, adaptive Parallel and Distributed Simulation (PADS) approaches, can provide means to perform highly detailed simulations on demand. To support this claim, we detail a use case concerned with the simulation of vehicular transportation systems.
Finding asymptotically-optimal paths in multi-robot motion planning problems could be achieved, in principle, using sampling-based planners in the composite configuration space of all of the robots in the space. The dimensionality of this space increases with the number of robots, rendering this approach impractical. This work focuses on a scalable sampling-based planner for coupled multi-robot problems that provides asymptotic optimality. It extends the dRRT approach, which proposed building roadmaps for each robot and searching an implicit roadmap in the composite configuration space. This work presents a new method, dRRT* , and develops theory for scalable convergence to optimal paths in multi-robot problems. Simulated experiments indicate dRRT* converges to high-quality paths while scaling to higher numbers of robots where the naive approach fails. Furthermore, dRRT* is applicable to high-dimensional problems, such as planning for robot manipulators
Trust models are widely used in various computer science disciplines. The main purpose of a trust model is to continuously measure trustworthiness of a set of entities based on their behaviors. In this article, the novel notion of "rational trust modeling" is introduced by bridging trust management and game theory. Note that trust models/reputation systems have been used in game theory (e.g., repeated games) for a long time, however, game theory has not been utilized in the process of trust model construction; this is where the novelty of our approach comes from. In our proposed setting, the designer of a trust model assumes that the players who intend to utilize the model are rational/selfish, i.e., they decide to become trustworthy or untrustworthy based on the utility that they can gain. In other words, the players are incentivized (or penalized) by the model itself to act properly. The problem of trust management can be then approached by game theoretical analyses and solution concepts such as Nash equilibrium. Although rationality might be built-in in some existing trust models, we intend to formalize the notion of rational trust modeling from the designer's perspective. This approach will result in two fascinating outcomes. First of all, the designer of a trust model can incentivise trustworthiness in the first place by incorporating proper parameters into the trust function, which can be later utilized among selfish players in strategic trust-based interactions (e.g., e-commerce scenarios). Furthermore, using a rational trust model, we can prevent many well-known attacks on trust models. These two prominent properties also help us to predict behavior of the players in subsequent steps by game theoretical analyses.
Before reaching full autonomy, vehicles will gradually be equipped with more and more advanced driver assistance systems (ADAS), effectively rendering them semi-autonomous. However, current ADAS technologies seem unable to handle complex traffic situations, notably when dealing with vehicles arriving from the sides, either at intersections or when merging on highways. The high rate of accidents in these settings prove that they constitute difficult driving situations. Moreover, intersections and merging lanes are often the source of important traffic congestion and, sometimes, deadlocks. In this article, we propose a cooperative framework to safely coordinate semi-autonomous vehicles in such settings, removing the risk of collision or deadlocks while remaining compatible with human driving. More specifically, we present a supervised coordination scheme that overrides control inputs from human drivers when they would result in an unsafe or blocked situation. To avoid unnecessary intervention and remain compatible with human driving, overriding only occurs when collisions or deadlocks are imminent. In this case, safe overriding controls are chosen while ensuring they deviate minimally from those originally requested by the drivers. Simulation results based on a realistic physics simulator show that our approach is scalable to real-world scenarios, and computations can be performed in real-time on a standard computer for up to a dozen simultaneous vehicles.
Blackboard systems are motivated by the popular view of task forces as brainstorming groups in which specialists write promising ideas to solve a problem in a central blackboard. Here we study a minimal model of blackboard system designed to solve cryptarithmetic puzzles, where hints are posted anonymously on a public display (standard blackboard) or are posted together with information about the reputations of the agents that posted them (reputation blackboard). We find that the reputation blackboard always outperforms the standard blackboard, which, in turn, always outperforms the independent search. The asymptotic distribution of the computational cost of the search, which is proportional to the total number of agent updates required to find the solution of the puzzle, is an exponential distribution for those three search heuristics. Only for the reputation blackboard we find a nontrivial dependence of the mean computational cost on the system size and, in that case, the optimal performance is achieved by a single agent working alone, indicating that, though the blackboard organization can produce impressive performance gains when compared with the independent search, it is not very supportive of cooperative work.
Jun 15 2017 cs.MA
We summarise the results of RoboCup 2D Soccer Simulation League in 2016 (Leipzig), including the main competition and the evaluation round. The evaluation round held in Leipzig confirmed the strength of RoboCup-2015 champion (WrightEagle, i.e. WE2015) in the League, with only eventual finalists of 2016 competition capable of defeating WE2015. An extended, post-Leipzig, round-robin tournament which included the top 8 teams of 2016, as well as WE2015, with over 1000 games played for each pair, placed WE2015 third behind the champion team (Gliders2016) and the runner-up (HELIOS2016). This establishes WE2015 as a stable benchmark for the 2D Simulation League. We then contrast two ranking methods and suggest two options for future evaluation challenges. The first one, "The Champions Simulation League", is proposed to include 6 previous champions, directly competing against each other in a round-robin tournament, with the view to systematically trace the advancements in the League. The second proposal, "The Global Challenge", is aimed to increase the realism of the environmental conditions during the simulated games, by simulating specific features of different participating countries.
In the context of solving large distributed constraint optimization problems (DCOP), belief-propagation and approximate inference algorithms are candidates of choice. However, in general, when the factor graph is very loopy (i.e. cyclic), these solution methods suffer from bad performance, due to non-convergence and many exchanged messages. As to improve performances of the Max-Sum inference algorithm when solving loopy constraint optimization problems, we propose here to take inspiration from the belief-propagation-guided dec-imation used to solve sparse random graphs (k-satisfiability). We propose the novel DeciMaxSum method, which is parameterized in terms of policies to decide when to trigger decimation, which variables to decimate, and which values to assign to decimated variables. Based on an empirical evaluation on a classical BP benchmark (the Ising model), some of these combinations of policies exhibit better performance than state-of-the-art competitors.
In this paper, we address the problem of controlling a network of mobile sensors so that a set of hidden states are estimated up to a user-specified accuracy. The sensors take measurements and fuse them online using an Information Consensus Filter (ICF). At the same time, the local estimates guide the sensors to their next best configuration. This leads to an LMI-constrained optimization problem that we solve by means of a new distributed random approximate projections method. The new method is robust to the state disagreement errors that exist among the robots as the ICF fuses the collected measurements. Assuming that the noise corrupting the measurements is zero-mean and Gaussian and that the robots are self localized in the environment, the integrated system converges to the next best positions from where new observations will be taken. This process is repeated with the robots taking a sequence of observations until the hidden states are estimated up to the desired user-specified accuracy. We present simulations of sparse landmark localization, where the robotic team achieves the desired estimation tolerances while exhibiting interesting emergent behavior.
We present a real-time algorithm, SocioSense, for socially-aware navigation of a robot amongst pedestrians. Our approach computes time-varying behaviors of each pedestrian using Bayesian learning and Personality Trait theory. These psychological characteristics are used for long-term path prediction and generating proximic characteristics for each pedestrian. We combine these psychological constraints with social constraints to perform human-aware robot navigation in low- to medium-density crowds. The estimation of time-varying behaviors and pedestrian personalities can improve the performance of long-term path prediction by 21%, as compared to prior interactive path prediction algorithms. We also demonstrate the benefits of our socially-aware navigation in simulated environments with tens of pedestrians.
Jun 06 2017 cs.MA
The efficient use of available resources is a key factor in achieving success on both personal and organizational levels. One of the crucial resources in knowledge economy is time. The ability to force others to adapt to our schedule even if it harms their efficiency can be seen as an outcome of social stratification. The principal objective of this paper is to use time allocation to model and study the global efficiency of social stratification, and to reveal whether hierarchy is an emergent property. A multi-agent model with an evolving social network is used to verify our hypotheses. The network's evolution is driven by the intensity of inter-agent communications, and the communications as such depend on the preferences and time resources of the communicating agents. The entire system is to be perceived as a metaphor of a social network of people regularly filling out agenda for their meetings for a period of time. The overall efficiency of the network of those scheduling agents is measured by the average utilization of the agent's preferences to speak on specific subjects. The simulation results shed light on the effects of different scheduling methods, resource availabilities, and network evolution mechanisms on communication system efficiency. The non-stratified systems show better long-term efficiency. Moreover, in the long term hierarchy disappears in overwhelming majority of cases. Some exceptions are observed for cases where privileges are granted on the basis of node degree weighted by relationship intensities but only in the short term.
In this paper, we develop a distributed intermittent communication and task planning framework for teams of mobile robots. The goal of the robots is to accomplish complex tasks, captured by local Linear Temporal Logic formulas, and share the collected information with all other robots and possibly also with a user. Specifically, we consider situations where the robot communication capabilities are not sufficient to form reliable and connected networks while the robots move to accomplish their tasks. In this case, intermittent communication protocols are necessary that allow the robots to temporarily disconnect from the network in order to accomplish their tasks free of communication constraints. We assume that the robots can only communicate with each other when they meet at common locations in space. Our distributed control framework jointly determines local plans that allow all robots fulfill their assigned temporal tasks, sequences of communication events that guarantee information exchange infinitely often, and optimal communication locations that minimize the total distance traveled by the robots. Simulation and experimental results verify the efficacy of the proposed distributed controllers.
We consider the following control problem on fair allocation of indivisible goods. Given a set $I$ of items and a set of agents, each having strict linear preference over the items, we ask for a minimum subset of the items whose deletion guarantees the existence of a proportional allocation in the remaining instance; we call this problem Proportionality by Item Deletion (PID). Our main result is a polynomial-time algorithm that solves PID for three agents. By contrast, we prove that PID is computationally intractable when the number of agents is unbounded, even if the number $k$ of item deletions allowed is small, since the problem turns out to be W-hard with respect to the parameter $k$. Additionally, we provide some tight lower and upper bounds on the complexity of PID when regarded as a function of $|I|$ and $k$.
The multi-agent path-finding (MAPF) problem has recently received a lot of attention. However, it does not capture important characteristics of many real-world domains, such as automated warehouses, where agents are constantly engaged with new tasks. In this paper, we therefore study a lifelong version of the MAPF problem, called the multi-agent pickup and delivery (MAPD) problem. In the MAPD problem, agents have to attend to a stream of delivery tasks in an online setting. One agent has to be assigned to each delivery task. This agent has to first move to a given pickup location and then to a given delivery location while avoiding collisions with other agents. We present two decoupled MAPD algorithms, Token Passing (TP) and Token Passing with Task Swaps (TPTS). Theoretically, we show that they solve all well-formed MAPD instances, a realistic subclass of MAPD instances. Experimentally, we compare them against a centralized strawman MAPD algorithm without this guarantee in a simulated warehouse system. TP can easily be extended to a fully distributed MAPD algorithm and is the best choice when real-time computation is of primary concern since it remains efficient for MAPD instances with hundreds of agents and tasks. TPTS requires limited communication among agents and balances well between TP and the centralized MAPD algorithm.
Inspired by previous work on emergent language in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal language significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
May 30 2017 cs.MA
Society has become more dependent on automated intelligent systems, at the same time, these systems have become more and more complicated. Society's expectation regarding the capabilities and intelligence of such systems has also grown. We have become a more complicated society with more complicated problems. As the expectation of intelligent systems rises, we discover many more applications for artificial intelligence. Additionally, as the difficulty level and computational requirements of such problems rise, there is a need to distribute the problem solving. Although the field of multiagent systems (MAS) and distributed artificial intelligence (DAI) is relatively young, the importance and applicability of this technology for solving today's problems continue to grow. In multiagent systems, the main goal is to provide fruitful cooperation among agents in order to enrich the support given to all user activities. This paper deals with the development of a multiagent system aimed at solving the reservation problems encountered in rural tourism. Due to their benefits over the last few years, online travel agencies have become a very useful instrument in planning vacations. A MAS concept (which is based on the Internet exploitation) can improve this activity and provide clients with a new, rapid and efficient way of making accommodation arrangements.
The existence of a coalition strategy to achieve a goal does not necessarily mean that the coalition has enough information to know how to follow the strategy. Neither does it mean that the coalition knows that such a strategy exists. The article studies an interplay between the distributed knowledge, coalition strategies, and coalition "know-how" strategies. The main technical result is a sound and complete trimodal logical system that describes the properties of this interplay.
Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
This paper explores the use of Answer Set Programming (ASP) in solving Distributed Constraint Optimization Problems (DCOPs). The paper provides the following novel contributions: (1) It shows how one can formulate DCOPs as logic programs; (2) It introduces ASP-DPOP, the first DCOP algorithm that is based on logic programming; (3) It experimentally shows that ASP-DPOP can be up to two orders of magnitude faster than DPOP (its imperative programming counterpart) as well as solve some problems that DPOP fails to solve, due to memory limitations; and (4) It demonstrates the applicability of ASP in a wide array of multi-agent problems currently modeled as DCOPs. Under consideration in Theory and Practice of Logic Programming (TPLP).
In this paper we describe a simulation platform that supports studies on the impact of crime on urban mobility. We present an example of how this can be achieved by seeking to understand the effect, on the transport system, if users of this system decide to choose optimal routes of time between origins and destinations that normally follow. Based on real data from a large Brazilian metropolis, we found that the percentage of users who follow this policy is small. Most prefer to follow less efficient routes by making bus exchanges at terminals. This can be understood as an indication that the users of the transport system privilege the security factor.
This paper explores the Coevolutionary Optional Prisoner's Dilemma (COPD) game, which is a simple model to coevolve game strategy and link weights of agents playing the Optional Prisoner's Dilemma game. We consider a population of agents placed in a lattice grid with boundary conditions. A number of Monte Carlo simulations are performed to investigate the impacts of the COPD game on the emergence of cooperation. Results show that the coevolutionary rules enable cooperators to survive and even dominate, with the presence of abstainers in the population playing a key role in the protection of cooperators against exploitation from defectors. We observe that in adverse conditions such as when the initial population of abstainers is too scarce/abundant, or when the temptation to defect is very high, cooperation has no chance of emerging. However, when the simple coevolutionary rules are applied, cooperators flourish.
The computability power of a distributed computing model is determined by the communication media available to the processes, the timing assumptions about processes and communication, and the nature of failures that processes can suffer. In a companion paper we showed how dynamic epistemic logic can be used to give a formal semantics to a given distributed computing model, to capture precisely the knowledge needed to solve a distributed task, such as consensus. Furthermore, by moving to a dual model of epistemic logic defined by simplicial complexes, topological invariants are exposed, which determine task solvability. In this paper we show how to extend the setting above to include in the knowledge of the processes, knowledge about the model of computation itself. The extension describes the knowledge processes gain about the current execution, in problems where processes have no input values at all.
The analysis in Part I revealed interesting properties for subgradient learning algorithms in the context of stochastic optimization when gradient noise is present. These algorithms are used when the risk functions are non-smooth and involve non-differentiable components. They have been long recognized as being slow converging methods. However, it was revealed in Part I that the rate of convergence becomes linear for stochastic optimization problems, with the error iterate converging at an exponential rate $\alpha^i$ to within an $O(\mu)-$neighborhood of the optimizer, for some $\alpha \in (0,1)$ and small step-size $\mu$. The conclusion was established under weaker assumptions than the prior literature and, moreover, several important problems (such as LASSO, SVM, and Total Variation) were shown to satisfy these weaker assumptions automatically (but not the previously used conditions from the literature). These results revealed that sub-gradient learning methods have more favorable behavior than originally thought when used to enable continuous adaptation and learning. The results of Part I were exclusive to single-agent adaptation. The purpose of the current Part II is to examine the implications of these discoveries when a collection of networked agents employs subgradient learning as their cooperative mechanism. The analysis will show that, despite the coupled dynamics that arises in a networked scenario, the agents are still able to attain linear convergence in the stochastic case; they are also able to reach agreement within $O(\mu)$ of the optimizer.
In large-scale natural disasters, humans are likely to fail when they attempt to reach high-risk sites or act in search and rescue operations. Robots, however, outdo their counterparts in surviving the hazards and handling the search and rescue missions due to their multiple and diverse sensing and actuation capabilities. The dynamic formation of optimal coalition of these heterogeneous robots for cost efficiency is very challenging and research in the area is gaining more and more attention. In this paper, we propose a novel heuristic. Since the population of robots in large-scale disaster settings is very large, we rely on Quantum Multi-Objective Particle Swarm Optimization (QMOPSO). The problem is modeled as a multi-objective optimization problem. Simulations with different test cases and metrics, and comparison with other algorithms such as NSGA-II and SPEA-II are carried out. The experimental results show that the proposed algorithm outperforms the existing algorithms not only in terms of convergence but also in terms of diversity and processing time.
In reinforcement learning, agents learn by performing actions and observing their outcomes. Sometimes, it is desirable for a human operator to \textitinterrupt an agent in order to prevent dangerous situations from happening. Yet, as part of their learning process, agents may link these interruptions, that impact their reward, to specific states and deliberately avoid them. The situation is particularly challenging in a multi-agent context because agents might not only learn from their own past interruptions, but also from those of other agents. Orseau and Armstrong defined \emphsafe interruptibility for one learner, but their work does not naturally extend to multi-agent systems. This paper introduces \textitdynamic safe interruptibility, an alternative definition more suited to decentralized learning problems, and studies this notion in two learning frameworks: \textitjoint action learners and \textitindependent learners. We give realistic sufficient conditions on the learning algorithm to enable dynamic safe interruptibility in the case of joint action learners, yet show that these conditions are not sufficient for independent learners. We show however that if agents can detect interruptions, it is possible to prune the observations to ensure dynamic safe interruptibility even for independent learners.
We study the problem of cooperative inference where a group of agents interact over a network and seek to estimate a joint parameter that best explains a set of observations. Agents do not know the network topology or the observations of other agents. We explore a variational interpretation of the Bayesian posterior density, and its relation to the stochastic mirror descent algorithm, to propose a new distributed learning algorithm. We show that, under appropriate assumptions, the beliefs generated by the proposed algorithm concentrate around the true parameter exponentially fast. We provide explicit non-asymptotic bounds for the convergence rate. Moreover, we develop explicit and computationally efficient algorithms for observation models belonging to exponential families.
In this paper, we argue that the future of Artificial Intelligence research resides in two keywords: integration and embodiment. We support this claim by analyzing the recent advances of the field. Regarding integration, we note that the most impactful recent contributions have been made possible through the integration of recent Machine Learning methods (based in particular on Deep Learning and Recurrent Neural Networks) with more traditional ones (e.g. Monte-Carlo tree search, goal babbling exploration or addressable memory systems). Regarding embodiment, we note that the traditional benchmark tasks (e.g. visual classification or board games) are becoming obsolete as state-of-the-art learning algorithms approach or even surpass human performance in most of them, having recently encouraged the development of first-person 3D game platforms embedding realistic physics. Building upon this analysis, we first propose an embodied cognitive architecture integrating heterogenous sub-fields of Artificial Intelligence into a unified framework. We demonstrate the utility of our approach by showing how major contributions of the field can be expressed within the proposed framework. We then claim that benchmarking environments need to reproduce ecologically-valid conditions for bootstrapping the acquisition of increasingly complex cognitive skills through the concept of a cognitive arms race between embodied agents.
The usual epistemic S5 model for multi-agent systems is a Kripke graph, whose edges are labeled with the agents that do not distinguish between two states. We propose to uncover the higher dimensional information implicit in the Kripke graph, by using as a model its dual, a chromatic simplicial complex. For each state of the Kripke model there is a facet in the complex, with one vertex per agent. If an edge (u,v) is labeled with a set of agents S, the facets corresponding to u and v intersect in a simplex consisting of one vertex for each agent of S. Then we use dynamic epistemic logic to study how the simplicial complex epistemic model changes after the agents communicate with each other. We show that there are topological invariants preserved from the initial epistemic complex to the epistemic complex after an action model is applied, that depend on how reliable the communication is. In turn these topological properties determine the knowledge that the agents may gain after the communication happens.
We present AutonoVi:, a novel algorithm for autonomous vehicle navigation that supports dynamic maneuvers and satisfies traffic constraints and norms. Our approach is based on optimization-based maneuver planning that supports dynamic lane-changes, swerving, and braking in all traffic scenarios and guides the vehicle to its goal position. We take into account various traffic constraints, including collision avoidance with other vehicles, pedestrians, and cyclists using control velocity obstacles. We use a data-driven approach to model the vehicle dynamics for control and collision avoidance. Furthermore, our trajectory computation algorithm takes into account traffic rules and behaviors, such as stopping at intersections and stoplights, based on an arc-spline representation. We have evaluated our algorithm in a simulated environment and tested its interactive performance in urban and highway driving scenarios with tens of vehicles, pedestrians, and cyclists. These scenarios include jaywalking pedestrians, sudden stops from high speeds, safely passing cyclists, a vehicle suddenly swerving into the roadway, and high-density traffic where the vehicle must change lanes to progress more effectively.
Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.
Robust environment perception is essential for decision-making on robots operating in complex domains. Intelligent task execution requires principled treatment of uncertainty sources in a robot's observation model. This is important not only for low-level observations (e.g., accelerometer data), but also for high-level observations such as semantic object labels. This paper formalizes the concept of macro-observations in Decentralized Partially Observable Semi-Markov Decision Processes (Dec-POSMDPs), allowing scalable semantic-level multi-robot decision making. A hierarchical Bayesian approach is used to model noise statistics of low-level classifier outputs, while simultaneously allowing sharing of domain noise characteristics between classes. Classification accuracy of the proposed macro-observation scheme, called Hierarchical Bayesian Noise Inference (HBNI), is shown to exceed existing methods. The macro-observation scheme is then integrated into a Dec-POSMDP planner, with hardware experiments running onboard a team of dynamic quadrotors in a challenging domain where noise-agnostic filtering fails. To the best of our knowledge, this is the first demonstration of a real-time, convolutional neural net-based classification framework running fully onboard a team of quadrotors in a multi-robot decision-making domain.
Voting systems typically treat all voters equally. We argue that perhaps they should not: Voters who have supported good choices in the past should be given higher weight than voters who have supported bad ones. To develop a formal framework for desirable weighting schemes, we draw on no-regret learning. Specifically, given a voting rule, we wish to design a weighting scheme such that applying the voting rule, with voters weighted by the scheme, leads to choices that are almost as good as those endorsed by the best voter in hindsight. We derive possibility and impossibility results for the existence of such weighting schemes, depending on whether the voting rule and the weighting scheme are deterministic or randomized, as well as on the social choice axioms satisfied by the voting rule.
Deep neural networks coupled with fast simulation and improved computation have led to recent successes in the field of reinforcement learning (RL). However, most current RL-based approaches fail to generalize since: (a) the gap between simulation and real world is so large that policy-learning approaches fail to transfer; (b) even if policy learning is done in real world, the data scarcity leads to failed generalization from training to test scenarios (e.g., due to different friction or object masses). Inspired from H-infinity control methods, we note that both modeling errors and differences in training and test scenarios can be viewed as extra forces/disturbances in the system. This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced -- that is, it learns an optimal destabilization policy. We formulate the policy learning as a zero-sum, minimax objective function. Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper and Walker2d) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and c) outperform the baseline even in the absence of the adversary.
Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micromanagement confirm that these methods enable the successful combination of experience replay with multi-agent RL.
Congestion problems are omnipresent in today's complex networks and represent a challenge in many research domains. In the context of Multi-agent Reinforcement Learning (MARL), approaches like difference rewards and resource abstraction have shown promising results in tackling such problems. Resource abstraction was shown to be an ideal candidate for solving large-scale resource allocation problems in a fully decentralized manner. However, its performance and applicability strongly depends on some, until now, undocumented assumptions. Two of the main congestion benchmark problems considered in the literature are: the Beach Problem Domain and the Traffic Lane Domain. In both settings the highest system utility is achieved when overcrowding one resource and keeping the rest at optimum capacity. We analyse how abstract grouping can promote this behaviour and how feasible it is to apply this approach in a real-world domain (i.e., what assumptions need to be satisfied and what knowledge is necessary). We introduce a new test problem, the Road Network Domain (RND), where the resources are no longer independent, but rather part of a network (e.g., road network), thus choosing one path will also impact the load on other paths having common road segments. We demonstrate the application of state-of-the-art MARL methods for this new congestion model and analyse their performance. RND allows us to highlight an important limitation of resource abstraction and show that the difference rewards approach manages to better capture and inform the agents about the dynamics of the environment.
Successful analysis of player skills in video games has important impacts on the process of enhancing player experience without undermining their continuous skill development. Moreover, player skill analysis becomes more intriguing in team-based video games because such form of study can help discover useful factors in effective team formation. In this paper, we consider the problem of skill decomposition in MOBA (MultiPlayer Online Battle Arena) games, with the goal to understand what player skill factors are essential for the outcome of a game match. To understand the construct of MOBA player skills, we utilize various skill-based predictive models to decompose player skills into interpretative parts, the impact of which are assessed in statistical terms. We apply this analysis approach on two widely known MOBAs, namely League of Legends (LoL) and Defense of the Ancients 2 (DOTA2). The finding is that base skills of in-game avatars, base skills of players, and players' champion-specific skills are three prominent skill components influencing LoL's match outcomes, while those of DOTA2 are mainly impacted by in-game avatars' base skills but not much by the other two.
This paper addresses tracking of a moving target in a multi-agent network. The target follows a linear dynamics corrupted by an adversarial noise, i.e., the noise is not generated from a statistical distribution. The location of the target at each time induces a global time-varying loss function, and the global loss is a sum of local losses, each of which is associated to one agent. Agents noisy observations could be nonlinear. We formulate this problem as a distributed online optimization where agents communicate with each other to track the minimizer of the global loss. We then propose a decentralized version of the Mirror Descent algorithm and provide the non-asymptotic analysis of the problem. Using the notion of dynamic regret, we measure the performance of our algorithm versus its offline counterpart in the centralized setting. We prove that the bound on dynamic regret scales inversely in the network spectral gap, and it represents the adversarial noise causing deviation with respect to the linear dynamics. Our result subsumes a number of results in the distributed optimization literature. Finally, in a numerical experiment, we verify that our algorithm can be simply implemented for multi-agent tracking with nonlinear observations.
In multi-robot systems where a central decision maker is specifying the movement of each individual robot, a communication failure can severely impair the performance of the system. This paper develops a motion strategy that allows robots to safely handle critical communication failures for such multi-robot architectures. For each robot, the proposed algorithm computes a time horizon over which collisions with other robots are guaranteed not to occur. These safe time horizons are included in the commands being transmitted to the individual robots. In the event of a communication failure, the robots execute the last received velocity commands for the corresponding safe time horizons leading to a provably safe open-loop motion strategy. The resulting algorithm is computationally effective and is agnostic to the task that the robots are performing. The efficacy of the strategy is verified in simulation as well as on a team of differential-drive mobile robots.
We consider a swarm of $n$ autonomous mobile robots, distributed on a 2-dimensional grid. A basic task for such a swarm is the gathering process: all robots have to gather at one (not predefined) place. The work in this paper is motivated by the following insight: On one side, for swarms of robots distributed in the 2-dimensional Euclidean space, several gathering algorithms are known for extremely simple robots that are oblivious, have bounded viewing radius, no compass, and no "flags" to communicate a status to others. On the other side, in case of the 2-dimensional grid, the only known gathering algorithms for robots with bounded viewing radius without compass, need to memorize a constant number of rounds and need flags. In this paper we contribute the, to the best of our knowledge, first gathering algorithm on the grid, which works for anonymous, oblivious robots with bounded viewing range, without any further means of communication and without any memory. We prove its correctness and an $O(n^2)$ time bound. This time bound matches those of the best known algorithms for the Euclidean plane mentioned above.
Matrix games like Prisoner's Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real world social dilemmas affects cooperation.
Interactions between vehicles and pedestrians have always been a major problem in traffic safety. Experienced human drivers are able to analyze the environment and choose driving strategies that will help them avoid crashes. What is not yet clear, however, is how automated vehicles will interact with pedestrians. This paper proposes a new method for evaluating the safety and feasibility of the driving strategy of automated vehicles when encountering unsignalized crossings. MobilEye sensors installed on buses in Ann Arbor, Michigan, collected data on 2,973 valid crossing events. A stochastic interaction model was then created using a multivariate Gaussian mixture model. This model allowed us to simulate the movements of pedestrians reacting to an oncoming vehicle when approaching unsignalized crossings, and to evaluate the passing strategies of automated vehicles. A simulation was then conducted to demonstrate the evaluation procedure.
Social conventions govern countless behaviors all of us engage in every day, from how we greet each other to the languages we speak. But how can shared conventions emerge spontaneously in the absence of a central coordinating authority? The Naming Game model shows that networks of locally interacting individuals can spontaneously self-organize to produce global coordination. Here, we provide a gentle introduction to the main features of the model, from the dynamics observed in homogeneously mixing populations to the role played by more complex social networks, and to how slight modifications of the basic interaction rules give origin to a richer phenomenology in which more conventions can co-exist indefinitely.
Opponent modeling consists in modeling the strategy or preferences of an agent thanks to the data it provides. In the context of automated negotiation and with machine learning, it can result in an advantage so overwhelming that it may restrain some casual agents to be part of the bargaining process. We qualify as "curious" an agent driven by the desire of negotiating in order to collect information and improve its opponent model. However, neither curiosity-based rational-ity nor curiosity-robust protocol have been studied in automatic negotiation. In this paper, we rely on mechanism design to propose three extensions of the standard bargaining protocol that limit information leak. Those extensions are supported by an enhanced rationality model, that considers the exchanged information. Also, they are theoretically analyzed and experimentally evaluated.
Algorithms for equilibrium computation generally make no attempt to ensure that the computed strategies are understandable by humans. For instance the strategies for the strongest poker agents are represented as massive binary files. In many situations, we would like to compute strategies that can actually be implemented by humans, who may have computational limitations and may only be able to remember a small number of features or components of the strategies that have been computed. We study poker games where private information distributions can be arbitrary. We create a large training set of game instances and solutions, by randomly selecting the information probabilities, and present algorithms that learn from the training instances in order to perform well in games with unseen information distributions. We are able to conclude several new fundamental rules about poker strategy that can be easily implemented by humans.
The paper studies the problem of achieving consensus in multi-agent systems in the case where the dependency digraph $\Gamma$ has no spanning in-tree. We consider the regularization protocol that amounts to the addition of a dummy agent (hub) uniformly connected to the agents. The presence of such a hub guarantees the achievement of an asymptotic consensus. For the "evaporation" of the dummy agent, the strength of its influences on the other agents vanishes, which leads to the concept of latent consensus. We obtain a closed-form expression for the consensus when the connections of the hub are symmetric, in this case, the impact of the hub upon the consensus remains fixed. On the other hand, if the hub is essentially influenced by the agents, whereas its influence on them tends to zero, then the consensus is expressed by the scalar product of the vector of column means of the Laplacian eigenprojection of $\Gamma$ and the initial state vector of the system. Another protocol, which assumes the presence of vanishingly weak uniform background links between the agents, leads to the same latent consensus.
We present a distributed (non-Bayesian) learning algorithm for the problem of parameter estimation with Gaussian noise. The algorithm is expressed as explicit updates on the parameters of the Gaussian beliefs (i.e. means and precision). We show a convergence rate of $O(1/k)$ with the constant term depending on the number of agents and the topology of the network. Moreover, we show almost sure convergence to the optimal solution of the estimation problem for the general case of time-varying directed graphs.
In this paper we extend the principle of proportional representation to rankings. We consider the setting where alternatives need to be ranked based on approval preferences. In this setting, proportional representation requires that cohesive groups of voters are represented proportionally in each initial segment of the ranking. Proportional rankings are desirable in situations where initial segments of different lengths may be relevant, e.g., hiring decisions (if it is unclear how many positions are to be filled), the presentation of competing proposals on a liquid democracy platform (if it is unclear how many proposals participants are taking into consideration), or recommender systems (if a ranking has to accommodate different user types). We study the proportional representation provided by several ranking methods and prove theoretical guarantees. Furthermore, we experimentally evaluate these methods and present preliminary evidence as to which methods are most suitable for producing proportional rankings.
In this paper, we discuss how to design the graph topology to reduce the communication complexity of certain algorithms for decentralized optimization. Our goal is to minimize the total communication needed to achieve a prescribed accuracy. We discover that the so-called expander graphs are near-optimal choices. We propose three approaches to construct expander graphs for different numbers of nodes and node degrees. Our numerical results show that the performance of decentralized optimization is significantly better on expander graphs than other regular graphs.