In this paper, we conduct an empirical study on discovering the ordered collective dynamics obtained by a population of artificial intelligence (AI) agents. Our intention is to put AI agents into a simulated natural context, and then to understand their induced dynamics at the population level. In particular, we aim to verify if the principles developed in the real world could also be used in understanding an artificially-created intelligent population. To achieve this, we simulate a large-scale predator-prey world, where the laws of the world are designed by only the findings or logical equivalence that have been discovered in nature. We endow the agents with the intelligence based on deep reinforcement learning, and scale the population size up to millions. Our results show that the population dynamics of AI agents, driven only by each agent's individual self interest, reveals an ordered pattern that is similar to the Lotka-Volterra model studied in population biology. We further discover the emergent behaviors of collective adaptations in studying how the agents' grouping behaviors will change with the environmental resources. Both of the two findings could be explained by the self-organization theory in nature.
We introduce an axiomatic approach to group recommendations, in line of previous work on the axiomatic treatment of trust-based recommendation systems, ranking systems, and other foundational work on the axiomatic approach to internet mechanisms in social choice settings. In group recommendations we wish to recommend to a group of agents, consisting of both opinionated and undecided members, a joint choice that would be acceptable to them. Such a system has many applications, such as choosing a movie or a restaurant to go to with a group of friends, recommending games for online game players, & other communal activities. Our method utilizes a given social graph to extract information on the undecided, relying on the agents influencing them. We first show that a set of fairly natural desired requirements (a.k.a axioms) leads to an impossibility, rendering mutual satisfaction of them unreachable. However, we also show a modified set of axioms that fully axiomatize a group variant of the random-walk recommendation system, expanding a previous result from the individual recommendation case.
Swarm systems constitute a challenging problem for reinforcement learning (RL) as the algorithm needs to learn decentralized control policies that can cope with limited local sensing and communication abilities of the agents. Although there have been recent advances of deep RL algorithms applied to multi-agent systems, learning communication protocols while simultaneously learning the behavior of the agents is still beyond the reach of deep RL algorithms. However, while it is often difficult to directly define the behavior of the agents, simple communication protocols can be defined more easily using prior knowledge about the given task. In this paper, we propose a number of simple communication protocols that can be exploited by deep reinforcement learning to find decentralized control policies in a multi-robot swarm environment. The protocols are based on histograms that encode the local neighborhood relations of the agents and can also transmit task-specific information, such as the shortest distance and direction to a desired target. In our framework, we use an adaptation of Trust Region Policy Optimization to learn complex collaborative tasks, such as formation building, building a communication link, and pushing an intruder. We evaluate our findings in a simulated 2D-physics environment, and compare the implications of different communication protocols.
The goal of this paper is to present an end-to-end, data-driven framework to control Autonomous Mobility-on-Demand systems (AMoD, i.e. fleets of self-driving vehicles). We first model the AMoD system using a time-expanded network, and present a formulation that computes the optimal rebalancing strategy (i.e., preemptive repositioning) and the minimum feasible fleet size for a given travel demand. Then, we adapt this formulation to devise a Model Predictive Control (MPC) algorithm that leverages short-term demand forecasts based on historical data to compute rebalancing strategies. We test the end-to-end performance of this controller with a state-of-the-art LSTM neural network to predict customer demand and real customer data from DiDi Chuxing: we show that this approach scales very well for large systems (indeed, the computational complexity of the MPC algorithm does not depend on the number of customers and of vehicles in the system) and outperforms state-of-the-art rebalancing strategies by reducing the mean customer wait time by up to to 89.6%.
The incorporation of macro-actions (temporally extended actions) into multi-agent decision problems has the potential to address the curse of dimensionality associated with such decision problems. Since macro-actions last for stochastic durations, multiple agents executing decentralized policies in cooperative environments must act asynchronously. We present an algorithm that modifies Generalized Advantage Estimation for temporally extended actions, allowing a state-of-the-art policy optimization algorithm to optimize policies in Dec-POMDPs in which agents act asynchronously. We show that our algorithm is capable of learning optimal policies in two cooperative domains, one involving real-time bus holding control and one involving wildfire fighting with unmanned aircraft. Our algorithm works by framing problems as "event-driven decision processes," which are scenarios where the sequence and timing of actions and events are random and governed by an underlying stochastic process. In addition to optimizing policies with continuous state and action spaces, our algorithm also facilitates the use of event-driven simulators, which do not require time to be discretized into time-steps. We demonstrate the benefit of using event-driven simulation in the context of multiple agents taking asynchronous actions. We show that fixed time-step simulation risks obfuscating the sequence in which closely-separated events occur, adversely affecting the policies learned. Additionally, we show that arbitrarily shrinking the time-step scales poorly with the number of agents.
In this paper, we investigate how to learn to control a group of cooperative agents with limited sensing capabilities such as robot swarms. The agents have only very basic sensor capabilities, yet in a group they can accomplish sophisticated tasks, such as distributed assembly or search and rescue tasks. Learning a policy for a group of agents is difficult due to distributed partial observability of the state. Here, we follow a guided approach where a critic has central access to the global state during learning, which simplifies the policy evaluation problem from a reinforcement learning point of view. For example, we can get the positions of all robots of the swarm using a camera image of a scene. This camera image is only available to the critic and not to the control policies of the robots. We follow an actor-critic approach, where the actors base their decisions only on locally sensed information. In contrast, the critic is learned based on the true global state. Our algorithm uses deep reinforcement learning to approximate both the Q-function and the policy. The performance of the algorithm is evaluated on two tasks with simple simulated 2D agents: 1) finding and maintaining a certain distance to each others and 2) locating a target.
Decentralized control of robots has attracted huge research interests. However, some of the research used unrealistic assumptions without collision avoidance. This report focuses on the collision-free control for multiple robots in both complete coverage and search tasks in 2D and 3D areas which are arbitrary unknown. All algorithms are decentralized as robots have limited abilities and they are mathematically proved. The report starts with the grid selection in the two tasks. Grid patterns simplify the representation of the area and robots only need to move straightly between neighbor vertices. For the 100% complete 2D coverage, the equilateral triangular grid is proposed. For the complete coverage ignoring the boundary effect, the grid with the fewest vertices is calculated in every situation for both 2D and 3D areas. The second part is for the complete coverage in 2D and 3D areas. A decentralized collision-free algorithm with the above selected grid is presented driving robots to sections which are furthest from the reference point. The area can be static or expanding, and the algorithm is simulated in MATLAB. Thirdly, three grid-based decentralized random algorithms with collision avoidance are provided to search targets in 2D or 3D areas. The number of targets can be known or unknown. In the first algorithm, robots choose vacant neighbors randomly with priorities on unvisited ones while the second one adds the repulsive force to disperse robots if they are close. In the third algorithm, if surrounded by visited vertices, the robot will use the breadth-first search algorithm to go to one of the nearest unvisited vertices via the grid. The second search algorithm is verified on Pioneer 3-DX robots. The general way to generate the formula to estimate the search time is demonstrated. Algorithms are compared with five other algorithms in MATLAB to show their effectiveness.
Cooperative motion planning is still a challenging task for robots. Recently, Value Iteration Networks (VINs) were proposed to model motion planning tasks as Neural Networks. In this work, we extend VINs to solve cooperative planning tasks under non-holonomic constraints. For this, we interconnect multiple VINs to pay respect to each other's outputs. Policies for cooperation are generated via iterative gradient descend. Validation in simulation shows that the resulting networks can resolve non-holonomic motion planning problems that require cooperation.
Due to the complexity of the natural world, a programmer cannot foresee all possible situations a connected and autonomous vehicle (CAV) will face during its operation, and hence, CAVs will need to learn to make decisions autonomously. Due to the sensing of its surroundings and information exchanged with other vehicles and road infrastructure a CAV will have access to large amounts of useful data. This paper investigates a data driven driving policy learning framework through an agent based learning. A reinforcement learning framework is presented in the paper, which simulates the self-evolution of a CAV over its lifetime. The results indicated that overtime the CAVs are able to learn useful policies to avoid crashes and achieve its objectives in more efficient ways. Vehicle to vehicle communication in particular, enables additional useful information to be acquired by CAVs, which in turn enables CAVs to learn driving policies more efficiently. The simulation results indicate that while a CAV can learn to make autonomous decision V2V communication of information improves this capability. The future work will investigate complex driving policies such as roundabout negotiations, cooperative learning between CAVs and deep reinforcement learning to traverse larger state spaces.
The development and deployment of Autonomous Vehicles (AVs) on our roads is not only realistic in the near future but can also bring significant benefits. In particular, it can potentially solve several problems relating to vehicles and traffic, for instance: (i) possible reduction of traffic congestion, with the consequence of improved fuel economy and reduced driver inactivity; (ii) possible reduction in the number of accidents, assuming that an AV can minimise the human errors that often cause traffic accidents; and (iii) increased ease of parking, especially when one considers the potential for shared AVs. In order to deploy an AV there are significant steps that must be completed in terms of hardware and software. As expected, software components play a key role in the complex AV system and so, at least for safety, we should assess the correctness of these components. In this paper, we are concerned with the high-level software component(s) responsible for the decisions in an AV. We intend to model an AV capable of navigation; obstacle avoidance; obstacle selection (when a crash is unavoidable) and vehicle recovery, etc, using a rational agent. To achieve this, we have established the following stages. First, the agent plans and actions have been implemented within the Gwendolen agent programming language. Second, we have built a simulated automotive environment in the Java language. Third, we have formally specified some of the required agent properties through LTL formulae, which are then formally verified with the AJPF verification tool. Finally, within the MCAPL framework (which comprises all the tools used in previous stages) we have obtained formal verification of our AV agent in terms of its specific behaviours. For example, the agent plans responsible for selecting an obstacle with low potential damage, instead of a higher damage obstacle (when possible) can be formally verified within MCAPL. We must emphasise that the major goal (of our present approach) lies in the formal verification of agent plans, rather than evaluating real-world applications. For this reason we utilised a simple matrix representation concerning the environment used by our agent.
We address a problem of area protection in graph-based scenarios with multiple mobile agents where connectivity is maintained among agents to ensure they can communicate. The problem consists of two adversarial teams of agents that move in an undirected graph shared by both teams. Agents are placed in vertices of the graph; at most one agent can occupy a vertex; and they can move into adjacent vertices in a conflict free way. Teams have asymmetric goals: the aim of one team - attackers - is to invade into given area while the aim of the opponent team - defenders - is to protect the area from being entered by attackers by occupying selected vertices. The team of defenders need to maintain connectivity of vertices occupied by its own agents in a visibility graph. The visibility graph models possibility of communication between pairs of vertices. We study strategies for allocating vertices to be occupied by the team of defenders to block attacking agents where connectivity is maintained at the same time. To do this we reserve a subset of defending agents that do not try to block the attackers but instead are placed to support connectivity of the team. The performance of strategies is tested in multiple benchmarks. The success of a strategy is heavily dependent on the type of the instance, and so one of the contributions of this work is that we identify suitable strategies for diverse instance types.
This paper presents a model-free reinforcement learning (RL) based distributed control protocol for leader-follower multi-agent systems. Although RL has been successfully used to learn optimal control protocols for multi-agent systems, the effects of adversarial inputs are ignored. It is shown in this paper, however, that their adverse effects can propagate across the network and impact the learning outcome of other intact agents. To alleviate this problem, a unified RL-based distributed control frameworks is developed for both homogeneous and heterogeneous multi-agent systems to prevent corrupted sensory data from propagating across the network. To this end, only the leader communicates its actual sensory information and other agents estimate the leader state using a distributed observer and communicate this estimation to their neighbors to achieve consensus on the leader state. The observer cannot be physically affected by any adversarial input. To further improve resiliency, distributed H-infinity control protocols are designed to attenuate the effect of the adversarial inputs on the compromised agent itself. An off-policy RL algorithm is developed to learn the solutions of the game algebraic Riccati equations arising from solving the H-infinity control problem. No knowledge of the agent dynamics is required and it is shown that the proposed RL-based H-infinity control protocol is resilient against adversarial inputs.
We study the problem of allocating impressions to sellers in e-commerce websites, such as Amazon, eBay or Taobao, aiming to maximize the total revenue generated by the platform. When a buyer searches for a keyword, the website presents the buyer with a list of different sellers for this item, together with the corresponding prices. This can be seen as an instance of a resource allocation problem in which the sellers choose their prices at each step and the platform decides how to allocate the impressions, based on the chosen prices and the historical transactions of each seller. Due to the complexity of the system, most e-commerce platforms employ heuristic allocation algorithms that mainly depend on the sellers' transaction records and without taking the rationality of the sellers into account, which makes them susceptible to price manipulations. In this paper, we put forward a general framework of reinforcement mechanism design, which uses deep reinforcement learning to design efficient algorithms, taking the strategic behaviour of the sellers into account. We apply the framework to the problem of allocating impressions to sellers in large e-commerce websites, a problem which is modeled as a Markov decision process, where the states encode the history of impressions, prices, transactions and generated revenue and the actions are the possible impression allocations at each round. To tackle the problem of continuity and high-dimensionality of states and actions, we adopt the ideas of the DDPG algorithm to design an actor-critic gradient policy algorithm which takes advantage of the problem domain in order to achieve convergence and stability. Our algorithm is compared against natural heuristics and it outperforms all of them in terms of the total revenue generated. Finally, contrary to the DDPG algorithm, our algorithm is robust to settings with variable sellers and easy to converge.
We model the spread of news as a social learning game on a network. Agents can either endorse or oppose a claim made in a piece of news, which itself may be either true or false. Agents base their decision on a private signal and their neighbors' past actions. Given these inputs, agents follow strategies derived via multi-agent deep reinforcement learning and receive utility from acting in accordance with the veracity of claims. Our framework yields strategies with agent utility close to a theoretical, Bayes optimal benchmark, while remaining flexible to model re-specification. Optimized strategies allow agents to correctly identify most false claims, when all agents receive unbiased private signals. However, an adversary's attempt to spread fake news by targeting a subset of agents with a biased private signal can be successful. Even more so when the adversary has information about agents' network position or private signal. When agents are aware of the presence of an adversary they re-optimize their strategies in the training stage and the adversary's attack is less effective. Hence, exposing agents to the possibility of fake news can be an effective way to curtail the spread of fake news in social networks. Our results also highlight that information about the users' private beliefs and their social network structure can be extremely valuable to adversaries and should be well protected.
The beer game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. We develop a variant of the Deep Q-Network algorithm to solve this problem. Extensive numerical experiment show the effectiveness of our algorithm. Unlike most algorithms in literature, our algorithm does not have any limits on the parameter values, and it provides good solutions even if the agents do not follow a rational policy. The algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in supply chain problems.
Agent-Based Computing is a diverse research domain concerned with the building of intelligent software based on the concept of "agents". In this paper, we use Scientometric analysis to analyze all sub-domains of agent-based computing. Our data consists of 1,064 journal articles indexed in the ISI web of knowledge published during a twenty year period: 1990-2010. These were retrieved using a topic search with various keywords commonly used in sub-domains of agent-based computing. In our proposed approach, we have employed a combination of two applications for analysis, namely Network Workbench and CiteSpace - wherein Network Workbench allowed for the analysis of complex network aspects of the domain, detailed visualization-based analysis of the bibliographic data was performed using CiteSpace. Our results include the identification of the largest cluster based on keywords, the timeline of publication of index terms, the core journals and key subject categories. We also identify the core authors, top countries of origin of the manuscripts along with core research institutes. Finally, our results have interestingly revealed the strong presence of agent-based computing in a number of non-computing related scientific domains including Life Sciences, Ecological Sciences and Social Sciences.
Many important stable matching problems are known to be NP-hard, even when strong restrictions are placed on the input. In this paper we seek to identify simple structural properties of instances of stable matching problems which will allow the design of efficient algorithms. We focus on the setting in which all agents involved in some matching problem can be partitioned into k different types, where the type of an agent determines his or her preferences, and agents have preferences over types (which may be refined by more detailed preferences within a single type). This situation could arise in practice if agents form preferences based on some small collection of agents' attributes. The notion of types could also be used if we are interested in a relaxation of stability, in which agents will only form a private arrangement if it allows them to be matched with a partner who differs from the current partner in some particularly important characteristic. We show that in this setting several well-studied NP-hard stable matching problems (such as MAX SMTI, MAX SRTI, and MAX SIZE MIN BP SMTI) belong to the parameterised complexity class FPT when parameterised by the number of different types of agents, and so admit efficient algorithms when this number of types is small.
In this paper, we propose a novel distributed formation control strategy, which is based on the measurements of relative position of neighbors, with global orientation estimation in 3-dimensional space. Since agents do not share a common reference frame, orientations of the local reference frame are not aligned with each other. Under the orientation estimation law, a rotation matrix that identifies orientation of local frame with respect to a common frame is obtained by auxiliary variables. The proposed strategy includes a combination of global orientation estimation and formation control law. Since orientation of each agent is estimated in the global sense, formation control strategy ensures that the formation globally exponentially converges to the desired formation in 3-dimensional space.
Rear end collisions are deadliest in nature and cause most of traffic casualties and injuries. In the existing research, many rear end collision avoidance solutions have been proposed. However, the problem with these proposed solutions is that they are highly dependent on precise mathematical models. Whereas, the real road driving is influenced by non-linear factors such as road surface situations, driver reaction time, pedestrian flow and vehicle dynamics, hence obtaining the accurate mathematical model of the vehicle control system is challenging. This problem with precise control based rear end collision avoidance schemes has been addressed using fuzzy logic, but the excessive number of fuzzy rules straightforwardly prejudice their efficiency. Furthermore, these fuzzy logic based controllers have been proposed without using proper agent based modeling that helps in mimicking the functions of an artificial human driver executing these fuzzy rules. Keeping in view these limitations, we have proposed an Enhanced Emotion Enabled Cognitive Agent (EEEC_Agent) based controller that helps the Autonomous Vehicles (AVs) to perform rear end collision avoidance with less number of rules, designed after fear emotion, and high efficiency. To introduce a fear emotion generation mechanism in EEEC_Agent, Orton, Clore & Collins (OCC) model has been employed. The fear generation mechanism of EEEC_Agent has been verified using NetLogo simulation. Furthermore, practical validation of EEEC_Agent functions has been performed using specially built prototype AV platform. Eventually, the qualitative comparative study with existing state of the art research works reflect that proposed model outperforms recent research.
Blind spots are one of the causes of road accidents in the hilly and flat areas. These blind spot accidents can be decreased by establishing an Internet of Vehicles (IoV) using Vehicle-2-Vehicle (V2V) and Vehicle-2-Infrastrtructure (V2I) communication systems. But the problem with these IoV is that most of them are using DSRC or single Radio Access Technology (RAT) as a wireless technology, which has been proven to be failed for efficient communication between vehicles. Recently, Cognitive Radio (CR) based IoV have to be proven best wireless communication systems for vehicular networks. However, the spectrum mobility is a challenging task to keep CR based vehicular networks interoperable and has not been addressed sufficiently in existing research. In our previous research work, the Cognitive Radio Site (CR-Site) has been proposed as in-vehicle CR-device, which can be utilized to establish efficient IoV systems. H In this paper, we have introduced the Emotions Inspired Cognitive Agent (EIC_Agent) based spectrum mobility mechanism in CR-Site and proposed a novel emotions controlled spectrum mobility scheme for efficient syntactic interoperability between vehicles. For this purpose, a probabilistic deterministic finite automaton using fear factor is proposed to perform efficient spectrum mobility using fuzzy logic. In addition, the quantitative computation of different fear intensity levels has been performed with the help of fuzzy logic. The system has been tested using active data from different GSM service providers on Mangla-Mirpur road. This is supplemented by extensive simulation experiments which validate the proposed scheme for CR based high-speed vehicular networks. The qualitative comparison with the existing-state-of the-art has proven the superiority of the proposed emotions controlled syntactic interoperable spectrum mobility scheme within cognitive radio based IoV systems.
Humans are going to delegate the rights of driving to the autonomous vehicles in near future. However, to fulfill this complicated task, there is a need for a mechanism, which enforces the autonomous vehicles to obey the road and social rules that have been practiced by well-behaved drivers. This task can be achieved by introducing social norms compliance mechanism in the autonomous vehicles. This research paper is proposing an artificial society of autonomous vehicles as an analogy of human social society. Each AV has been assigned a social personality having different social influence. Social norms have been introduced which help the AVs in making the decisions, influenced by emotions, regarding road collision avoidance. Furthermore, social norms compliance mechanism, by artificial social AVs, has been proposed using prospect based emotion i.e. fear, which is conceived from OCC model. Fuzzy logic has been employed to compute the emotions quantitatively. Then, using SimConnect approach, fuzzy values of fear has been provided to the Netlogo simulation environment to simulate artificial society of AVs. Extensive testing has been performed using the behavior space tool to find out the performance of the proposed approach in terms of the number of collisions. For comparison, the random-walk model based artificial society of AVs has been proposed as well. A comparative study with a random walk, prove that proposed approach provides a better option to tailor the autopilots of future AVS, Which will be more socially acceptable and trustworthy by their riders in terms of safe road travel.
Making roads safer by avoiding road collisions is one of the main reasons for inventing Autonomous vehicles (AVs). In this context, designing agent-based collision avoidance components of AVs which truly represent human cognition and emotions look is a more feasible approach as agents can replace human drivers. However, to the best of our knowledge, very few human emotion and cognition-inspired agent-based studies have previously been conducted in this domain. Furthermore, these agent-based solutions have not been validated using any key validation technique. Keeping in view this lack of validation practices, we have selected state-of-the-art Emotion Enabled Cognitive Agent (EEC_Agent), which was proposed to avoid lateral collisions between semi-AVs. The architecture of EEC_Agent has been revised using Exploratory Agent Based Modeling (EABM) level of the Cognitive Agent Based Computing (CABC) framework and real-time fear emotion generation mechanism using the Ortony, Clore & Collins (OCC) model has also been introduced. Then the proposed fear generation mechanism has been validated using the Validated Agent Based Modeling level of CABC framework using a Virtual Overlay MultiAgent System (VOMAS). Extensive simulation and practical experiments demonstrate that the Enhanced EEC_Agent exhibits the capability to feel different levels of fear, according to different traffic situations and also needs a smaller Stopping Sight Distance (SSD) and Overtaking Sight Distance (OSD) as compared to human drivers.
Agent-based modeling and simulation tools provide a mature platform for development of complex simulations. They however, have not been applied much in the domain of mainstream modeling and simulation of computer networks. In this article, we evaluate how and if these tools can offer any value-addition in the modeling & simulation of complex networks such as pervasive computing, large-scale peer-to-peer systems, and networks involving considerable environment and human/animal/habitat interaction. Specifically, we demonstrate the effectiveness of NetLogo - a tool that has been widely used in the area of agent-based social simulation.
Threshold models and their dynamics may be used to model the spread of `behaviors' in social networks. Regarding such from a modal logical perspective, it is shown how standard update mechanisms may be emulated using action models -- graphs encoding agents' decision rules. A small class of action models capturing the possible sets of decision rules suitable for threshold models is identified, and shown to include models characterizing best-response dynamics of both coordination and anti-coordination games played on graphs.
The 9th Semantic Ambient Media Experience (SAME) proceedings where based on the academic contributions to a two day workshop that was held at Curtin University, Perth, WA, Australia. The symposium was held to discuss visualisation, emerging media, and user-experience from various angles. The papers of this workshop are freely available through http://www.ambientmediaassociation.org/Journal under open access as provided by the International Ambient Media Association (iAMEA) Ry. iAMEA is hosting the international open access journal entitled "International Journal on Information Systems and Management in Creative eMedia", and the series entitled "International Series on Information Systems and Management in Creative eMedia". For any further information, please visit the website of the Association: http://www.ambientmediaassociation.org.
Multi-sensor state space models underpin fusion applications in networks of sensors. Estimation of latent parameters in these models has the potential to provide highly desirable capabilities such as network self-calibration. Conventional solutions to the problem pose difficulties in scaling with the number of sensors due to the joint multi-sensor filtering involved when evaluating the parameter likelihood. In this article, we propose a separable pseudo-likelihood which is a more accurate approximation compared to a previously proposed alternative under typical operating conditions. In addition, we consider using separable likelihoods in the presence of many objects and ambiguity in associating measurements with objects that originated them. To this end, we use a state space model with a hypothesis based parameterisation, and, develop an empirical Bayesian perspective in order to evaluate separable likelihoods on this model using local filtering. Bayesian inference with this likelihood is carried out using belief propagation on the associated pairwise Markov random field. We specify a particle algorithm for latent parameter estimation in a linear Gaussian state space model and demonstrate its efficacy for network self-calibration using measurements from non-cooperative targets in comparison with alternatives.
The key challenge in multiagent learning is learning a best response to the behaviour of other agents, which may be non-stationary: if the other agents adapt their strategy as well, the learning target moves. Disparate streams of research have approached non-stationarity from several angles, which make a variety of implicit assumptions that make it hard to keep an overview of the state of the art and to validate the innovation and significance of new works. This survey presents a coherent overview of work that addresses opponent-induced non-stationarity with tools from game theory, reinforcement learning and multi-armed bandits. Further, we reflect on the principle approaches how algorithms model and cope with this non-stationarity, arriving at a new framework and five categories (in increasing order of sophistication): ignore, forget, respond to target models, learn models, and theory of mind. A wide range of state-of-the-art algorithms is classified into a taxonomy, using these categories and key characteristics of the environment (e.g., observability) and adaptation behaviour of the opponents (e.g., smooth, abrupt). To clarify even further we present illustrative variations of one domain, contrasting the strengths and limitations of each category. Finally, we discuss in which environments the different approaches yield most merit, and point to promising avenues of future research.
We propose a novel computational method to extract information about interactions among individuals with different behavioral states in a biological collective from ordinary video recordings. Assuming that individuals are acting as finite state machines, our method first detects discrete behavioral states of those individuals and then constructs a model of their state transitions, taking into account the positions and states of other individuals in the vicinity. We have tested the proposed method through applications to two real-world biological collectives, termites in an experimental setting and human pedestrians in an open space. For each application, a robust tracking system was developed in-house, utilizing interactive human intervention (for termite tracking) or online agent-based simulation (for pedestrian tracking). In both cases, significant interactions were detected between nearby individuals with different states, demonstrating the effectiveness of the proposed method.
Humanity faces numerous problems of common-pool resource appropriation. This class of multi-agent social dilemma includes the problems of ensuring sustainable use of fresh water, common fisheries, grazing pastures, and irrigation systems. Abstract models of common-pool resource appropriation based on non-cooperative game theory predict that self-interested agents will generally fail to find socially positive equilibria---a phenomenon called the tragedy of the commons. However, in reality, human societies are sometimes able to discover and implement stable cooperative solutions. Decades of behavioral game theory research have sought to uncover aspects of human behavior that make this possible. Most of that work was based on laboratory experiments where participants only make a single choice: how much to appropriate. Recognizing the importance of spatial and temporal resource dynamics, a recent trend has been toward experiments in more complex real-time video game-like environments. However, standard methods of non-cooperative game theory can no longer be used to generate predictions for this case. Here we show that deep reinforcement learning can be used instead. To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation. Our experiments highlight the importance of trial-and-error learning in common-pool resource appropriation and shed light on the relationship between exclusion, sustainability, and inequality.
A significant amount of research in recent years has been dedicated towards single agent deep reinforcement learning. Much of the success of deep reinforcement learning can be attributed towards the use of experience replay memories within which state transitions are stored. Function approximation methods such as convolutional neural networks (referred to as deep Q-Networks, or DQNs, in this context) can subsequently be trained through sampling the stored transitions. However, considerations are required when using experience replay memories within multi-agent systems, as stored transitions can become outdated due to agents updating their respective policies in parallel . In this work we apply leniency  to multi-agent deep reinforcement learning (MA-DRL), acting as a control mechanism to determine which state-transitions sampled are allowed to update the DQN. Our resulting Lenient-DQN (LDQN) is evaluated using variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) outlined by Busoniu et al. . The LDQN significantly outperforms the existing hysteretic DQN (HDQN)  within environments that yield stochastic rewards. Based on results from experiments conducted using vanilla and double Q-learning versions of the lenient and hysteretic algorithms, we advocate a hybrid approach where learners initially use vanilla Q-learning before transitioning to double Q-learners upon converging on a cooperative joint policy.
In social dilemmas individuals face a temptation to increase their payoffs in the short run at a cost to the long run total welfare. Much is known about how cooperation can be stabilized in the simplest of such settings: repeated Prisoner's Dilemma games. However, there is relatively little work on generalizing these insights to more complex situations. Solving this problem is extremely important if we wish to construct artificially intelligent agents that interact with each other and humans in real world scenarios. We show how to use modern reinforcement learning methods to generalize a highly successful Prisoner's Dilemma strategy: tit-for-tat. We construct agents that act in ways that are simple to understand, nice (begin by cooperating), provokable (try to avoid being exploited), and forgiving (following a bad turn try to return to mutual cooperation). We show both theoretically and experimentally that such agents can maintain cooperation in more complex environments. We also show that such agents teach simple reactive agents to be cooperative. In contrast, we show that standard 'reactive self-play' paradigms, which have made great strides in zero-sum games, may not be enough for constructing agents that maintain cooperation in positive-sum interactions.
This paper deals with the problem of properly simulating the Internet of Things (IoT). Simulating an IoT allows evaluating strategies that can be employed to deploy smart services over different kinds of territories. However, the heterogeneity of scenarios seriously complicates this task. This imposes the use of sophisticated modeling and simulation techniques. We discuss novel approaches for the provision of scalable simulation scenarios, that enable the real-time execution of massively populated IoT environments. Attention is given to novel hybrid and multi-level simulation techniques that, when combined with agent-based, adaptive Parallel and Distributed Simulation (PADS) approaches, can provide means to perform highly detailed simulations on demand. To support this claim, we detail a use case concerned with the simulation of vehicular transportation systems.
Finding asymptotically-optimal paths in multi-robot motion planning problems could be achieved, in principle, using sampling-based planners in the composite configuration space of all of the robots in the space. The dimensionality of this space increases with the number of robots, rendering this approach impractical. This work focuses on a scalable sampling-based planner for coupled multi-robot problems that provides asymptotic optimality. It extends the dRRT approach, which proposed building roadmaps for each robot and searching an implicit roadmap in the composite configuration space. This work presents a new method, dRRT* , and develops theory for scalable convergence to optimal paths in multi-robot problems. Simulated experiments indicate dRRT* converges to high-quality paths while scaling to higher numbers of robots where the naive approach fails. Furthermore, dRRT* is applicable to high-dimensional problems, such as planning for robot manipulators
Trust models are widely used in various computer science disciplines. The main purpose of a trust model is to continuously measure trustworthiness of a set of entities based on their behaviors. In this article, the novel notion of "rational trust modeling" is introduced by bridging trust management and game theory. Note that trust models/reputation systems have been used in game theory (e.g., repeated games) for a long time, however, game theory has not been utilized in the process of trust model construction; this is where the novelty of our approach comes from. In our proposed setting, the designer of a trust model assumes that the players who intend to utilize the model are rational/selfish, i.e., they decide to become trustworthy or untrustworthy based on the utility that they can gain. In other words, the players are incentivized (or penalized) by the model itself to act properly. The problem of trust management can be then approached by game theoretical analyses and solution concepts such as Nash equilibrium. Although rationality might be built-in in some existing trust models, we intend to formalize the notion of rational trust modeling from the designer's perspective. This approach will result in two fascinating outcomes. First of all, the designer of a trust model can incentivise trustworthiness in the first place by incorporating proper parameters into the trust function, which can be later utilized among selfish players in strategic trust-based interactions (e.g., e-commerce scenarios). Furthermore, using a rational trust model, we can prevent many well-known attacks on trust models. These two prominent properties also help us to predict behavior of the players in subsequent steps by game theoretical analyses.
Before reaching full autonomy, vehicles will gradually be equipped with more and more advanced driver assistance systems (ADAS), effectively rendering them semi-autonomous. However, current ADAS technologies seem unable to handle complex traffic situations, notably when dealing with vehicles arriving from the sides, either at intersections or when merging on highways. The high rate of accidents in these settings prove that they constitute difficult driving situations. Moreover, intersections and merging lanes are often the source of important traffic congestion and, sometimes, deadlocks. In this article, we propose a cooperative framework to safely coordinate semi-autonomous vehicles in such settings, removing the risk of collision or deadlocks while remaining compatible with human driving. More specifically, we present a supervised coordination scheme that overrides control inputs from human drivers when they would result in an unsafe or blocked situation. To avoid unnecessary intervention and remain compatible with human driving, overriding only occurs when collisions or deadlocks are imminent. In this case, safe overriding controls are chosen while ensuring they deviate minimally from those originally requested by the drivers. Simulation results based on a realistic physics simulator show that our approach is scalable to real-world scenarios, and computations can be performed in real-time on a standard computer for up to a dozen simultaneous vehicles.
Blackboard systems are motivated by the popular view of task forces as brainstorming groups in which specialists write promising ideas to solve a problem in a central blackboard. Here we study a minimal model of blackboard system designed to solve cryptarithmetic puzzles, where hints are posted anonymously on a public display (standard blackboard) or are posted together with information about the reputations of the agents that posted them (reputation blackboard). We find that the reputation blackboard always outperforms the standard blackboard, which, in turn, always outperforms the independent search. The asymptotic distribution of the computational cost of the search, which is proportional to the total number of agent updates required to find the solution of the puzzle, is an exponential distribution for those three search heuristics. Only for the reputation blackboard we find a nontrivial dependence of the mean computational cost on the system size and, in that case, the optimal performance is achieved by a single agent working alone, indicating that, though the blackboard organization can produce impressive performance gains when compared with the independent search, it is not very supportive of cooperative work.
Jun 15 2017 cs.MA
We summarise the results of RoboCup 2D Soccer Simulation League in 2016 (Leipzig), including the main competition and the evaluation round. The evaluation round held in Leipzig confirmed the strength of RoboCup-2015 champion (WrightEagle, i.e. WE2015) in the League, with only eventual finalists of 2016 competition capable of defeating WE2015. An extended, post-Leipzig, round-robin tournament which included the top 8 teams of 2016, as well as WE2015, with over 1000 games played for each pair, placed WE2015 third behind the champion team (Gliders2016) and the runner-up (HELIOS2016). This establishes WE2015 as a stable benchmark for the 2D Simulation League. We then contrast two ranking methods and suggest two options for future evaluation challenges. The first one, "The Champions Simulation League", is proposed to include 6 previous champions, directly competing against each other in a round-robin tournament, with the view to systematically trace the advancements in the League. The second proposal, "The Global Challenge", is aimed to increase the realism of the environmental conditions during the simulated games, by simulating specific features of different participating countries.
In the context of solving large distributed constraint optimization problems (DCOP), belief-propagation and approximate inference algorithms are candidates of choice. However, in general, when the factor graph is very loopy (i.e. cyclic), these solution methods suffer from bad performance, due to non-convergence and many exchanged messages. As to improve performances of the Max-Sum inference algorithm when solving loopy constraint optimization problems, we propose here to take inspiration from the belief-propagation-guided dec-imation used to solve sparse random graphs (k-satisfiability). We propose the novel DeciMaxSum method, which is parameterized in terms of policies to decide when to trigger decimation, which variables to decimate, and which values to assign to decimated variables. Based on an empirical evaluation on a classical BP benchmark (the Ising model), some of these combinations of policies exhibit better performance than state-of-the-art competitors.
In this paper, we address the problem of controlling a network of mobile sensors so that a set of hidden states are estimated up to a user-specified accuracy. The sensors take measurements and fuse them online using an Information Consensus Filter (ICF). At the same time, the local estimates guide the sensors to their next best configuration. This leads to an LMI-constrained optimization problem that we solve by means of a new distributed random approximate projections method. The new method is robust to the state disagreement errors that exist among the robots as the ICF fuses the collected measurements. Assuming that the noise corrupting the measurements is zero-mean and Gaussian and that the robots are self localized in the environment, the integrated system converges to the next best positions from where new observations will be taken. This process is repeated with the robots taking a sequence of observations until the hidden states are estimated up to the desired user-specified accuracy. We present simulations of sparse landmark localization, where the robotic team achieves the desired estimation tolerances while exhibiting interesting emergent behavior.
We present a real-time algorithm, SocioSense, for socially-aware navigation of a robot amongst pedestrians. Our approach computes time-varying behaviors of each pedestrian using Bayesian learning and Personality Trait theory. These psychological characteristics are used for long-term path prediction and generating proximic characteristics for each pedestrian. We combine these psychological constraints with social constraints to perform human-aware robot navigation in low- to medium-density crowds. The estimation of time-varying behaviors and pedestrian personalities can improve the performance of long-term path prediction by 21%, as compared to prior interactive path prediction algorithms. We also demonstrate the benefits of our socially-aware navigation in simulated environments with tens of pedestrians.
Jun 06 2017 cs.MA
The efficient use of available resources is a key factor in achieving success on both personal and organizational levels. One of the crucial resources in knowledge economy is time. The ability to force others to adapt to our schedule even if it harms their efficiency can be seen as an outcome of social stratification. The principal objective of this paper is to use time allocation to model and study the global efficiency of social stratification, and to reveal whether hierarchy is an emergent property. A multi-agent model with an evolving social network is used to verify our hypotheses. The network's evolution is driven by the intensity of inter-agent communications, and the communications as such depend on the preferences and time resources of the communicating agents. The entire system is to be perceived as a metaphor of a social network of people regularly filling out agenda for their meetings for a period of time. The overall efficiency of the network of those scheduling agents is measured by the average utilization of the agent's preferences to speak on specific subjects. The simulation results shed light on the effects of different scheduling methods, resource availabilities, and network evolution mechanisms on communication system efficiency. The non-stratified systems show better long-term efficiency. Moreover, in the long term hierarchy disappears in overwhelming majority of cases. Some exceptions are observed for cases where privileges are granted on the basis of node degree weighted by relationship intensities but only in the short term.
In this paper, we develop a distributed intermittent communication and task planning framework for teams of mobile robots. The goal of the robots is to accomplish complex tasks, captured by local Linear Temporal Logic formulas, and share the collected information with all other robots and possibly also with a user. Specifically, we consider situations where the robot communication capabilities are not sufficient to form reliable and connected networks while the robots move to accomplish their tasks. In this case, intermittent communication protocols are necessary that allow the robots to temporarily disconnect from the network in order to accomplish their tasks free of communication constraints. We assume that the robots can only communicate with each other when they meet at common locations in space. Our distributed control framework jointly determines local plans that allow all robots fulfill their assigned temporal tasks, sequences of communication events that guarantee information exchange infinitely often, and optimal communication locations that minimize the total distance traveled by the robots. Simulation and experimental results verify the efficacy of the proposed distributed controllers.
We consider the following control problem on fair allocation of indivisible goods. Given a set $I$ of items and a set of agents, each having strict linear preference over the items, we ask for a minimum subset of the items whose deletion guarantees the existence of a proportional allocation in the remaining instance; we call this problem Proportionality by Item Deletion (PID). Our main result is a polynomial-time algorithm that solves PID for three agents. By contrast, we prove that PID is computationally intractable when the number of agents is unbounded, even if the number $k$ of item deletions allowed is small, since the problem turns out to be W-hard with respect to the parameter $k$. Additionally, we provide some tight lower and upper bounds on the complexity of PID when regarded as a function of $|I|$ and $k$.
The multi-agent path-finding (MAPF) problem has recently received a lot of attention. However, it does not capture important characteristics of many real-world domains, such as automated warehouses, where agents are constantly engaged with new tasks. In this paper, we therefore study a lifelong version of the MAPF problem, called the multi-agent pickup and delivery (MAPD) problem. In the MAPD problem, agents have to attend to a stream of delivery tasks in an online setting. One agent has to be assigned to each delivery task. This agent has to first move to a given pickup location and then to a given delivery location while avoiding collisions with other agents. We present two decoupled MAPD algorithms, Token Passing (TP) and Token Passing with Task Swaps (TPTS). Theoretically, we show that they solve all well-formed MAPD instances, a realistic subclass of MAPD instances. Experimentally, we compare them against a centralized strawman MAPD algorithm without this guarantee in a simulated warehouse system. TP can easily be extended to a fully distributed MAPD algorithm and is the best choice when real-time computation is of primary concern since it remains efficient for MAPD instances with hundreds of agents and tasks. TPTS requires limited communication among agents and balances well between TP and the centralized MAPD algorithm.
Inspired by previous work on emergent language in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal language significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
May 30 2017 cs.MA
Society has become more dependent on automated intelligent systems, at the same time, these systems have become more and more complicated. Society's expectation regarding the capabilities and intelligence of such systems has also grown. We have become a more complicated society with more complicated problems. As the expectation of intelligent systems rises, we discover many more applications for artificial intelligence. Additionally, as the difficulty level and computational requirements of such problems rise, there is a need to distribute the problem solving. Although the field of multiagent systems (MAS) and distributed artificial intelligence (DAI) is relatively young, the importance and applicability of this technology for solving today's problems continue to grow. In multiagent systems, the main goal is to provide fruitful cooperation among agents in order to enrich the support given to all user activities. This paper deals with the development of a multiagent system aimed at solving the reservation problems encountered in rural tourism. Due to their benefits over the last few years, online travel agencies have become a very useful instrument in planning vacations. A MAS concept (which is based on the Internet exploitation) can improve this activity and provide clients with a new, rapid and efficient way of making accommodation arrangements.
The existence of a coalition strategy to achieve a goal does not necessarily mean that the coalition has enough information to know how to follow the strategy. Neither does it mean that the coalition knows that such a strategy exists. The article studies an interplay between the distributed knowledge, coalition strategies, and coalition "know-how" strategies. The main technical result is a sound and complete trimodal logical system that describes the properties of this interplay.
Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
This paper explores the use of Answer Set Programming (ASP) in solving Distributed Constraint Optimization Problems (DCOPs). The paper provides the following novel contributions: (1) It shows how one can formulate DCOPs as logic programs; (2) It introduces ASP-DPOP, the first DCOP algorithm that is based on logic programming; (3) It experimentally shows that ASP-DPOP can be up to two orders of magnitude faster than DPOP (its imperative programming counterpart) as well as solve some problems that DPOP fails to solve, due to memory limitations; and (4) It demonstrates the applicability of ASP in a wide array of multi-agent problems currently modeled as DCOPs. Under consideration in Theory and Practice of Logic Programming (TPLP).
In this paper we describe a simulation platform that supports studies on the impact of crime on urban mobility. We present an example of how this can be achieved by seeking to understand the effect, on the transport system, if users of this system decide to choose optimal routes of time between origins and destinations that normally follow. Based on real data from a large Brazilian metropolis, we found that the percentage of users who follow this policy is small. Most prefer to follow less efficient routes by making bus exchanges at terminals. This can be understood as an indication that the users of the transport system privilege the security factor.