Social media platforms provide an environment where people can freely engage in discussions. Unfortunately, they also enable several problems, such as online harassment. Recently, Google and Jigsaw started a project called Perspective, which uses machine learning to automatically detect toxic language. A demonstration website has also been launched, which allows anyone to type a phrase into the interface and instantaneously see its toxicity score. In this paper, we propose an attack on the Perspective toxicity detection system based on adversarial examples. We show that an adversary can subtly modify a highly toxic phrase so that the system assigns it a significantly lower toxicity score. We apply the attack to the sample phrases provided on the Perspective website and show that we can consistently reduce the toxicity scores to the level of non-toxic phrases. The existence of such adversarial examples is very harmful for toxicity detection systems and seriously undermines their usability.
Social media and data mining are increasingly being used to analyse political and societal issues. Characterisation of users into socio-demographic groups is crucial to improve these analyses. Here we undertake the classification of social media users as supporting or opposing ongoing independence movements in their territories. Independence movements occur in territories whose citizens have conflicting national identities; users with opposing national identities will then support or oppose the sense of being part of an independent nation that differs from the officially recognised country. We describe a methodology that relies on users' self-reported location to build datasets for three territories -- Catalonia, the Basque Country and Scotland -- and we test language-independent classifiers using four types of features. We show the effectiveness of the approach to build large annotated datasets, and the ability to achieve accurate, language-independent classification performances ranging from 85% to 97% for the three territories under study.
Through seven publications, this dissertation shows how anonymized mobile phone data can contribute to the social good and provide insights into human behaviour on a large scale. The size of the datasets analysed ranges from 500 million to 300 billion phone records, covering millions of people. The key contributions are two-fold: 1. Big Data for Social Good: Through prediction algorithms, the results show how mobile phone data can be used to predict important socio-economic indicators, such as income, illiteracy and poverty in developing countries. Such knowledge can be used to identify where vulnerable groups in society are and to reduce economic shocks, and it is a critical component for monitoring poverty rates over time. Further, the dissertation demonstrates how mobile phone data can be used to better understand human behaviour during large shocks in society, exemplified by an analysis of data from the terror attack in Norway and a natural disaster on the south coast of Bangladesh. This work leads to an increased understanding of how information spreads, and how millions of people move around. The intention is to identify displaced people faster, cheaper and more accurately than existing survey-based methods. 2. Big Data for efficient marketing: Finally, the dissertation offers an insight into how anonymised mobile phone data can be used to map out large social networks, covering millions of people, to understand how products spread inside these networks. Results show that by including social patterns and machine learning techniques in a large-scale marketing experiment in Asia, the adoption rate is increased by 13 times compared to the approach used by experienced marketers. A data-driven and scientific approach to marketing, through more tailored campaigns, contributes to fewer irrelevant offers for the customers, and better cost efficiency for the companies.
Feb 28 2017 cs.SI
Selfies have become increasingly fashionable in the social media era. People are willing to share their selfies on various social media platforms such as Facebook, Instagram and Flickr. The popularity of selfies has caught researchers' attention, especially that of psychologists. In the computer vision and machine learning areas, little attention has been paid to this phenomenon as a valuable data source. In this paper, we focus on exploring the deeper personal patterns behind people's different kinds of selfie-posting behaviours. We develop this work based on a dataset from WeChat, one of the most extensively used instant messaging platforms in China. In particular, we first propose an unsupervised approach to classify the images posted by users. Based on the classification result, we construct three types of user-level features that reflect user preference, activity and posting habit. Based on these features, for a series of selfie-related tasks, we build classifiers that can accurately distinguish two sets of users with opposite selfie-posting behaviours. We have found that people's interest, activity and posting habit have a great influence on their selfie-posting behaviours. For example, the classification accuracy between selfie-posting addicts and non-addicts reaches 89.36%. We also prove that using users' image information to predict these behaviours achieves better performance than using text information. More importantly, for each set of users with a specific selfie-posting behaviour, we extract and visualize significant personal patterns about them. In addition, we cluster users and extract their high-level attributes, revealing the correlation between these attributes and users' selfie-posting behaviours. In the end, we demonstrate that users' selfie-posting behaviour, as a good predictor, could predict their different preferences toward these high-level attributes accurately.
Retaining players over an extended period of time is a long-standing challenge in the game industry. Significant effort has been devoted to understanding what motivates players to enjoy games. While individuals may have varying reasons to play or abandon a game at different stages within the game, previous studies have looked at the retention problem from a snapshot view. This study, by analyzing in-game logs of 51,104 distinct individuals in an online multiplayer game, uniquely offers a multifaceted view of the retention problem over the players' virtual life phases. We find that key indicators of longevity change with the game level. Achievement features are important for players from the initial to the advanced phases, yet social features become the most predictive of longevity once players reach the highest level offered by the game. These findings have theoretical and practical implications for designing online games that adapt to players' needs.
In this paper, we propose a novel generic model of opinion dynamics over a social network, in the presence of communication among the users that leads to interpersonal influence, i.e., peer pressure. Each individual in the social network has a distinct objective function representing a weighted sum of internal and external pressures. We prove conditions under which a connected group of users converges to a fixed opinion distribution, and conditions under which the group reaches consensus. Through simulation, we study the rate of convergence on large scale-free networks as well as the impact of user stubbornness on convergence in a simple political model.
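To make the "weighted sum of internal and external pressures" in the abstract above concrete, here is a minimal sketch. It is not the paper's model: it assumes quadratic pressures, unit edge weights, and synchronous best-response updates (a Friedkin-Johnsen-style rule), and the names `step`, `simulate`, `innate`, and `stubbornness` are hypothetical.

```python
def step(opinions, innate, stubbornness, neighbors):
    """One synchronous update (assumed rule, not taken from the paper).

    User i minimizes s_i * (x_i - innate_i)^2 + sum_j (x_i - x_j)^2 over
    its neighbors j; the minimizer is the weighted average computed below.
    Each user needs stubbornness > 0 or at least one neighbor.
    """
    updated = []
    for i, x0 in enumerate(innate):
        s = stubbornness[i]
        nbrs = neighbors[i]
        total = s * x0 + sum(opinions[j] for j in nbrs)
        updated.append(total / (s + len(nbrs)))
    return updated


def simulate(innate, stubbornness, neighbors, tol=1e-9, max_iters=10_000):
    """Iterate `step` until opinions change by less than `tol`."""
    opinions = list(innate)
    for _ in range(max_iters):
        updated = step(opinions, innate, stubbornness, neighbors)
        if max(abs(a - b) for a, b in zip(updated, opinions)) < tol:
            return updated
        opinions = updated
    return opinions


# Usage: three users on a triangle, one of whom starts with a different opinion.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
no_stubbornness = simulate([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], triangle)
with_stubbornness = simulate([0.0, 0.0, 1.0], [1.0, 1.0, 1.0], triangle)
```

Under these assumptions, users with zero stubbornness drift to a shared opinion, while positive stubbornness keeps a spread of opinions at the fixed point, which mirrors the distinction between consensus and a fixed opinion distribution.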
Over the past few years, online aggression and abusive behaviors have occurred in many different forms and on a variety of platforms. In extreme cases, these incidents have evolved into hate, discrimination, and bullying, and even materialized into real-world threats and attacks against individuals or groups. In this paper, we study the Gamergate controversy. Started in August 2014 in the online gaming world, it quickly spread across various social networking platforms, ultimately leading to many incidents of cyberbullying and cyberaggression. We focus on Twitter, presenting a measurement study of a dataset of 340k unique users and 1.6M tweets to study the properties of these users, the content they post, and how they differ from random Twitter users. We find that users involved in this "Twitter war" tend to have more friends and followers, are generally more engaged and post tweets with negative sentiment, less joy, and more hate than random users. We also perform preliminary measurements on how the Twitter suspension mechanism deals with such abusive behaviors. While we focus on Gamergate, our methodology to collect and analyze tweets related to aggressive and bullying activities is of independent interest.
For any stream of time-stamped edges that form a dynamic network, an important choice is the aggregation granularity that an analyst uses to bin the data. Picking such a windowing of the data is often done by hand, or left up to the technology that is collecting the data. However, the choice can make a big difference in the properties of the dynamic network. This is the time scale detection problem. In previous work, this problem is often solved with a heuristic as an unsupervised task. As an unsupervised problem, it is difficult to measure how well a given algorithm performs. In addition, we show that the quality of the windowing is dependent on which task an analyst wants to perform on the network after windowing. Therefore the time scale detection problem should not be handled independently from the rest of the analysis of the network. We introduce a framework that tackles both of these issues: By measuring the performance of the time scale detection algorithm based on how well a given task is accomplished on the resulting network, we are for the first time able to directly compare different time scale detection algorithms to each other. Using this framework, we introduce time scale detection algorithms that take a supervised approach: they leverage ground truth on training data to find a good windowing of the test data. We compare the supervised approach to previous approaches and several baselines on real data.
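The windowing choice in the abstract above can be illustrated with a short sketch. It assumes edges arrive as `(u, v, t)` tuples with numeric timestamps and uses fixed-width windows; `window_edges`, `pick_width`, and the `task_score` callback are hypothetical names standing in for whatever downstream task the analyst cares about, not the paper's algorithms.

```python
from collections import defaultdict


def window_edges(edges, width):
    """Bin time-stamped edges (u, v, t) into snapshots of fixed width.

    Returns a dict mapping window index -> set of (u, v) edges seen
    in that window.
    """
    snapshots = defaultdict(set)
    for u, v, t in edges:
        snapshots[int(t // width)].add((u, v))
    return dict(snapshots)


def pick_width(edges, candidate_widths, task_score):
    """Choose the width whose windowing scores best on a downstream task.

    `task_score` is a caller-supplied function (an assumption here) that
    maps the snapshots dict to a number, higher being better.
    """
    edges = list(edges)
    return max(candidate_widths,
               key=lambda w: task_score(window_edges(edges, w)))


# Usage: a toy stream and a placeholder score that prefers two snapshots.
stream = [("a", "b", 0), ("b", "c", 5), ("a", "c", 12)]
snapshots = window_edges(stream, 10)
best = pick_width(stream, [1, 10, 100], lambda s: -abs(len(s) - 2))
```

Scoring each candidate windowing by task performance is what makes different time scale choices directly comparable; in a supervised setting the score would be computed against ground truth on training data.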
Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDoS) attacks, data breaches, and account hijacking) in an unsupervised manner using only a limited, fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.
Group discussions are a way for individuals to exchange ideas and arguments in order to reach better decisions than they could on their own. One of the premises of productive discussions is that better solutions will prevail, and that the idea selection process is mediated by the (relative) competence of the individuals involved. However, since people may not know their actual competence on a new task, their behavior is influenced by their self-estimated competence --- that is, their confidence --- which can be misaligned with their actual competence. Our goal in this work is to understand the effects of confidence-competence misalignment on the dynamics and outcomes of discussions. To this end, we design a large-scale natural setting, in the form of an online team-based geography game, that allows us to disentangle confidence from competence and thus separate their effects. We find that in task-oriented discussions, the more-confident individuals have a larger impact on the group's decisions even when these individuals are at the same level of competence as their teammates. Furthermore, this unjustified role of confidence in the decision-making process often leads teams to under-perform. We explore this phenomenon by investigating the effects of confidence on conversational dynamics.