In this paper, we conduct an empirical study of the ordered collective dynamics exhibited by a population of artificial intelligence (AI) agents. Our intention is to put AI agents into a simulated natural context and then to understand their induced dynamics at the population level. In particular, we aim to verify whether the principles developed in the real world can also be used to understand an artificially created intelligent population. To achieve this, we simulate a large-scale predator-prey world whose laws are designed using only findings, or their logical equivalents, that have been discovered in nature. We endow the agents with intelligence based on deep reinforcement learning and scale the population size up to millions. Our results show that the population dynamics of AI agents, driven only by each agent's individual self-interest, reveal an ordered pattern that is similar to the Lotka-Volterra model studied in population biology. We further discover emergent behaviors of collective adaptation when studying how the agents' grouping behaviors change with the environmental resources. Both findings could be explained by the self-organization theory in nature.
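For reference, the classic Lotka-Volterra model mentioned above can be sketched numerically. The snippet below is only an illustration of the textbook equations, not of the agent-based simulation described in the abstract; the parameter values, the initial populations, and the function names `derivatives`, `simulate`, and `invariant` are assumptions chosen for the example.

```python
import math


def derivatives(prey, pred, alpha, beta, delta, gamma):
    """Right-hand side of the textbook Lotka-Volterra equations."""
    d_prey = alpha * prey - beta * prey * pred    # prey growth minus predation
    d_pred = delta * prey * pred - gamma * pred   # predator growth minus death
    return d_prey, d_pred


def simulate(prey, pred, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5,
             dt=0.01, steps=2000):
    """Integrate with classical RK4; return the list of (prey, pred) states.

    All parameter values are illustrative assumptions.
    """
    p = (alpha, beta, delta, gamma)
    trajectory = [(prey, pred)]
    for _ in range(steps):
        k1 = derivatives(prey, pred, *p)
        k2 = derivatives(prey + 0.5 * dt * k1[0], pred + 0.5 * dt * k1[1], *p)
        k3 = derivatives(prey + 0.5 * dt * k2[0], pred + 0.5 * dt * k2[1], *p)
        k4 = derivatives(prey + dt * k3[0], pred + dt * k3[1], *p)
        prey += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        pred += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
        trajectory.append((prey, pred))
    return trajectory


def invariant(prey, pred, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5):
    """Quantity conserved along exact Lotka-Volterra orbits."""
    return (delta * prey - gamma * math.log(prey)
            + beta * pred - alpha * math.log(pred))


trajectory = simulate(10.0, 5.0)
prey_counts = [x for x, _ in trajectory]
print("prey range:", min(prey_counts), max(prey_counts))
```

Because exact orbits of this model are closed curves, both populations oscillate indefinitely; the near-constant value of `invariant` along the trajectory is a quick check that the integration is behaving.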
In this paper, we investigate how to learn to control a group of cooperative agents with limited sensing capabilities, such as robot swarms. The agents have only very basic sensor capabilities, yet in a group they can accomplish sophisticated tasks, such as distributed assembly or search and rescue. Learning a policy for a group of agents is difficult due to distributed partial observability of the state. Here, we follow a guided approach where a critic has central access to the global state during learning, which simplifies the policy evaluation problem from a reinforcement learning point of view. For example, we can obtain the positions of all robots of the swarm from a camera image of the scene. This camera image is available only to the critic and not to the control policies of the robots. We follow an actor-critic approach, where the actors base their decisions only on locally sensed information. In contrast, the critic is learned based on the true global state. Our algorithm uses deep reinforcement learning to approximate both the Q-function and the policy. The performance of the algorithm is evaluated on two tasks with simple simulated 2D agents: 1) finding and maintaining a certain distance to each other and 2) locating a target.
Decentralized control of robots has attracted considerable research interest. However, some of this research made unrealistic assumptions that ignore collision avoidance. This report focuses on collision-free control of multiple robots in both complete coverage and search tasks in arbitrary unknown 2D and 3D areas. All algorithms are decentralized, as robots have limited abilities, and they are mathematically proved. The report starts with the grid selection for the two tasks. Grid patterns simplify the representation of the area, and robots only need to move in straight lines between neighboring vertices. For 100% complete 2D coverage, the equilateral triangular grid is proposed. For complete coverage ignoring the boundary effect, the grid with the fewest vertices is calculated in every situation for both 2D and 3D areas. The second part addresses complete coverage in 2D and 3D areas. A decentralized collision-free algorithm using the selected grid is presented, which drives robots to the sections that are furthest from the reference point. The area can be static or expanding, and the algorithm is simulated in MATLAB. Thirdly, three grid-based decentralized random algorithms with collision avoidance are provided to search for targets in 2D or 3D areas. The number of targets can be known or unknown. In the first algorithm, robots choose vacant neighbors randomly, with priority given to unvisited ones, while the second adds a repulsive force to disperse robots that are close to each other. In the third algorithm, if surrounded by visited vertices, the robot uses the breadth-first search algorithm to go to one of the nearest unvisited vertices via the grid. The second search algorithm is verified on Pioneer 3-DX robots. A general way to derive a formula for estimating the search time is demonstrated. The algorithms are compared with five other algorithms in MATLAB to show their effectiveness.
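The breadth-first search step of the third search algorithm can be sketched as follows. This is a minimal single-robot illustration, assuming a 4-connected square grid rather than the grids selected in the report, and it omits collision avoidance and all communication between robots; the names `grid_neighbors` and `path_to_nearest_unvisited` are hypothetical.

```python
from collections import deque


def grid_neighbors(width, height):
    """Adjacency map of a 4-connected square grid (an assumed grid type)."""
    neighbors = {}
    for x in range(width):
        for y in range(height):
            candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            neighbors[(x, y)] = [
                (cx, cy) for cx, cy in candidates
                if 0 <= cx < width and 0 <= cy < height
            ]
    return neighbors


def path_to_nearest_unvisited(start, neighbors, visited):
    """Shortest path from start to a nearest vertex not in visited.

    Returns the path as a list of vertices (including start),
    or None when every reachable vertex has been visited.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex not in visited:
            # Walk the parent links back to start to recover the path.
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parent[vertex]
            return path[::-1]
        for nxt in neighbors.get(vertex, ()):
            if nxt not in parent:
                parent[nxt] = vertex
                queue.append(nxt)
    return None


nbrs = grid_neighbors(3, 3)
visited = set(nbrs) - {(2, 2)}
print(path_to_nearest_unvisited((0, 0), nbrs, visited))
```

Because breadth-first search expands vertices in order of hop count, the first unvisited vertex it dequeues is one of the nearest in terms of grid moves.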
Cooperative motion planning is still a challenging task for robots. Recently, Value Iteration Networks (VINs) were proposed to model motion planning tasks as neural networks. In this work, we extend VINs to solve cooperative planning tasks under non-holonomic constraints. To this end, we interconnect multiple VINs so that each takes the others' outputs into account. Policies for cooperation are generated via iterative gradient descent. Validation in simulation shows that the resulting networks can resolve non-holonomic motion planning problems that require cooperation.
Due to the complexity of the natural world, a programmer cannot foresee all possible situations a connected and autonomous vehicle (CAV) will face during its operation, and hence CAVs will need to learn to make decisions autonomously. By sensing its surroundings and exchanging information with other vehicles and road infrastructure, a CAV will have access to large amounts of useful data. This paper investigates a data-driven driving policy learning framework based on agent-based learning. A reinforcement learning framework is presented, which simulates the self-evolution of a CAV over its lifetime. The results indicate that, over time, the CAVs are able to learn useful policies to avoid crashes and achieve their objectives in more efficient ways. Vehicle-to-vehicle (V2V) communication in particular enables additional useful information to be acquired by CAVs, which in turn enables them to learn driving policies more efficiently. The simulation results indicate that while a CAV can learn to make autonomous decisions, V2V communication of information improves this capability. Future work will investigate complex driving policies such as roundabout negotiations, cooperative learning between CAVs, and deep reinforcement learning to traverse larger state spaces.
In this paper, the problem of energy trading between smart grid prosumers, who can simultaneously consume and produce energy, and a grid power company is studied. The problem is formulated as a single-leader, multiple-follower Stackelberg game between the power company and multiple prosumers. In this game, the power company acts as a leader who determines the pricing strategy that maximizes its profits, while the prosumers act as followers who react by choosing the amount of energy to buy or sell so as to optimize their current and future profits. The proposed game accounts for each prosumer's subjective decision when faced with the uncertainty of profits, induced by the random future price. In particular, the framing effect, from the framework of prospect theory (PT), is used to account for each prosumer's valuation of its gains and losses with respect to an individual utility reference point. The reference point changes between prosumers and stems from their past experience and future aspirations of profits. The followers' noncooperative game is shown to admit a unique pure-strategy Nash equilibrium (NE) under classical game theory (CGT) which is obtained using a fully distributed algorithm. The results are extended to account for the case of PT using algorithmic solutions that can achieve an NE under certain conditions. Simulation results show that the total grid load varies significantly with the prosumers' reference point and their loss-aversion level. In addition, it is shown that the power company's profits considerably decrease when it fails to account for the prosumers' subjective perceptions under PT.
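The framing effect with respect to a reference point can be illustrated with the standard prospect-theory value function. The sketch below is an assumption-laden example, not the utility used in the paper: the functional form and the default parameters (exponents of 0.88 and a loss-aversion coefficient of 2.25) are the commonly cited Tversky-Kahneman estimates, and `pt_value` is a hypothetical name.

```python
def pt_value(outcome, reference, gain_exp=0.88, loss_exp=0.88,
             loss_aversion=2.25):
    """Subjective value of an outcome relative to a reference point.

    Gains (outcome >= reference) are valued concavely; losses are
    scaled up by the loss-aversion coefficient. Parameter defaults
    are assumed, commonly cited estimates.
    """
    deviation = outcome - reference
    if deviation >= 0:
        return deviation ** gain_exp
    return -loss_aversion * (-deviation) ** loss_exp


# The same profit of 5 is a gain for a prosumer expecting 0
# and a loss for a prosumer expecting 10.
print(pt_value(5.0, reference=0.0))
print(pt_value(5.0, reference=10.0))
```

The two calls show how an identical outcome is valued differently depending on the individual reference point, which is the mechanism the abstract attributes to each prosumer's past experience and future aspirations.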
We study the problem of distributed maximum computation in an open multi-agent system, where agents can leave and arrive during the execution of the algorithm. The main challenge comes from the possibility that the agent holding the largest value leaves the system, which changes the value to be computed. As a result, the algorithms must be endowed with mechanisms that allow them to forget outdated information. The focus is on systems in which interactions are pairwise gossips between randomly selected agents. We consider situations where leaving agents can send a last message, and situations where they cannot. For both cases, we provide algorithms that are able to eventually compute the maximum of the values held by agents.
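The challenge described above can be made concrete with a naive pairwise gossip for the maximum. The sketch below covers only the closed-system baseline and then shows why it fails once the maximum holder departs; it does not reproduce the forgetting mechanisms proposed in the paper. The function name `gossip_max`, the example values, and the fixed random seed are assumptions.

```python
import random


def gossip_max(values, rounds, seed=0):
    """Naive max gossip: two random agents adopt the larger estimate."""
    rng = random.Random(seed)
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)   # pick two distinct agents
        estimates[i] = estimates[j] = max(estimates[i], estimates[j])
    return estimates


values = [3, 9, 1, 7, 5]
estimates = gossip_max(values, rounds=200)
print("estimates before departure:", estimates)

# The agent holding the maximum (index 1) leaves without a last message.
remaining_values = values[:1] + values[2:]
remaining_estimates = estimates[:1] + estimates[2:]
print("true maximum now:", max(remaining_values))
print("stale estimates:", remaining_estimates)
```

In this naive scheme an estimate can only grow, so the remaining agents keep reporting the departed agent's value, which motivates a mechanism for forgetting outdated information.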
We consider open multi-agent systems. Unlike the systems usually studied in the literature, here agents may join or leave while the process studied takes place. The system composition and size thus evolve with time. We focus here on systems where the interactions between agents lead to pairwise gossip averages, and where agents either arrive or are replaced at random times. These events prevent any convergence of the system. Instead, we describe the expected system behavior by showing that the evolution of scaled moments of the state can be characterized by a 2-dimensional (possibly time-varying) linear dynamical system. We apply this technique to two cases: (i) systems with fixed size where leaving agents are immediately replaced, and (ii) systems where new agents keep arriving without ever leaving, and whose size thus grows unbounded.
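Case (i), a fixed-size system with replacements, can be simulated with a short script. This is a minimal sketch under assumed conditions: each event is a replacement with probability `p_replace` and a pairwise gossip average otherwise, and new agents draw their value uniformly from [0, 1]. These modelling choices and the name `simulate_open_gossip` are illustrative, not taken from the paper.

```python
import random


def simulate_open_gossip(n, events, p_replace, seed=0):
    """Pairwise gossip averaging with random agent replacements.

    Returns the list of (mean, variance) of the state after each event.
    """
    rng = random.Random(seed)
    state = [rng.random() for _ in range(n)]
    history = []
    for _ in range(events):
        if rng.random() < p_replace:
            # Replacement: a random agent leaves and a fresh one takes its slot.
            state[rng.randrange(n)] = rng.random()
        else:
            # Gossip: two distinct agents move to their common average.
            i, j = rng.sample(range(n), 2)
            state[i] = state[j] = (state[i] + state[j]) / 2
        mean = sum(state) / n
        variance = sum((v - mean) ** 2 for v in state) / n
        history.append((mean, variance))
    return history


closed = simulate_open_gossip(n=10, events=2000, p_replace=0.0)
opened = simulate_open_gossip(n=10, events=2000, p_replace=0.2)
print("closed system final variance:", closed[-1][1])
print("open system final variance:", opened[-1][1])
```

Without replacements the variance decays towards zero while the mean is preserved; with replacements the variance keeps being re-excited, which is consistent with the statement that arrivals and replacements prevent convergence.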
In crowdfunding, an entrepreneur often has to decide how to disclose the campaign status in order to collect as many contributions as possible. We propose information design as a tool to help the entrepreneur improve revenue by influencing backers' beliefs. We introduce a heuristic algorithm to dynamically compute information-disclosure policies for the entrepreneur, followed by an empirical evaluation to demonstrate its competitiveness against the widely adopted immediate-disclosure policy. Our work sheds light on information design in a dynamic setting where agents follow thresholding policies.
Single time-scale distributed estimation of dynamic systems via a network of sensors/estimators is addressed in this letter. In single time-scale distributed estimation, the two fusion steps, consensus and measurement exchange, are implemented only once, in contrast to, e.g., a large number of consensus iterations at every step of the system dynamics. In particular, we discuss the problem of failure in the sensor/estimator network and how to recover distributed estimation by adding new sensor measurements from equivalent states. We separately discuss the recovery for two types of sensors, namely \alpha and \beta sensors. We propose polynomial-order algorithms to find equivalent state nodes in the graph representation of the system in order to recover distributed observability. The polynomial-order solution is particularly significant for large-scale systems.
Observability of complex systems/networks is the focus of this paper, and it is shown to be closely related to the concept of contraction. Indeed, for observable network tracking it is necessary and sufficient to have one node in each contraction measured. Therefore, nodes in a contraction are equivalent for recovering from a loss of observability, implying that contraction size is a key factor for observability recovery. Here, using a polynomial-order contraction detection algorithm, we analyze the distribution of contractions and study its relation to key network properties. Our results show that contraction size is related to the network clustering coefficient and degree heterogeneity. In particular, in networks with a power-law degree distribution, if the clustering coefficient is high, there are fewer contractions, with smaller size on average. The implication is that estimation/tracking of such systems requires fewer measurements, while their observability recovery is more restricted in case of sensor failure. Further, in Small-World networks, higher degree heterogeneity implies that there are more contractions, with smaller size on average. Therefore, the estimation of the corresponding system requires more measurements, and the recovery from measurement failure is also more limited. These results imply that one can tune the properties of synthetic networks to ease their estimation/observability recovery.