May 16 2018 cs.AI
The curse of dimensionality in the realm of association rules is twofold. Firstly, we have the well-known exponential increase in computational complexity with increasing item set size. Secondly, there is a related curse concerned with the distribution of (sparse) data itself in high dimension. The former problem is often coped with by projection, i.e., feature selection, whereas the best known strategy for the latter is avoidance. This work summarizes the first attempt to provide a computationally feasible method for measuring the extent of dimension curse present in a data set with respect to a particular class of machine learning procedures. This recent development enables various other methods from geometric analysis to be investigated and applied in machine learning procedures in the presence of high dimension.
Feb 23 2018 cs.SI
It is well known that any bipartite (social) network can be regarded as a formal context $(G,M,I)$. Therefore, such networks give rise to formal concept lattices which can be investigated utilizing the toolset of Formal Concept Analysis (FCA). In particular, the notion of clones in closure systems on $M$, i.e., pairwise interchangeable attributes that leave the closure system unchanged, suggests itself naturally as a candidate to be analyzed in the realm of FCA-based social network analysis. In this study, we investigate the notion of clones in social networks. After building up some theoretical background for the clone relation in formal contexts, we try to find clones in real-world data sets. To this end, we provide an experimental evaluation on nine mostly well-known social networks and provide some first insights on the impact of clones. We conclude our work by deepening the understanding of clones by generalizing them to permutations of higher order.
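The clone relation described above can be illustrated with a minimal sketch. It assumes the closure system is the set of concept intents of the context (the whole attribute set $M$ plus every intersection of object rows) and treats two attributes as clones when exchanging them maps every closed set to a closed set. The toy context and the helper names `intents`, `swap`, and `are_clones` are hypothetical and are not taken from the paper.

```python
def intents(context, attributes):
    """All concept intents: M plus every intersection of object rows."""
    closed = {frozenset(attributes)}
    for row in context.values():
        row = frozenset(row)
        # Intersect the new row with everything found so far.
        closed |= {row & c for c in closed}
    return closed


def swap(s, a, b):
    """Exchange attributes a and b in the attribute set s."""
    if (a in s) == (b in s):
        return s
    return (s - {a, b}) | ({b} if a in s else {a})


def are_clones(closed, a, b):
    """True if swapping a and b leaves the closure system unchanged."""
    return all(swap(s, a, b) in closed for s in closed)


# Hypothetical toy context: objects g1..g3 with attributes a, b, c.
context = {"g1": {"a", "c"}, "g2": {"b", "c"}, "g3": {"c"}}
closed = intents(context, {"a", "b", "c"})

print(are_clones(closed, "a", "b"))  # a and b are interchangeable here
print(are_clones(closed, "a", "c"))  # a and c are not
```

Enumerating all intents is exponential in the worst case, so this brute-force check is only suitable for small contexts.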
Geometric analysis is a very capable theory to understand the influence of the high dimensionality of the input data in machine learning (ML) and knowledge discovery (KD). With our approach we can assess how far the application of a specific KD/ML-algorithm to a concrete data set is prone to the curse of dimensionality. To this end we extend V. Pestov's axiomatic approach to the intrinsic dimension of data sets, based on the seminal work by M. Gromov on concentration phenomena, and provide an adaptable and computationally feasible model for studying observable geometric invariants associated to features that are natural to both the data and the learning procedure. In detail, we investigate data represented by formal contexts and give first theoretical as well as experimental insights into the intrinsic dimension of a concept lattice. Because of the correspondence between formal concepts and maximal cliques in graphs, applications to social network analysis are at hand.
The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy of a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data. Each classification task consists of (tens of) thousands of items. We demonstrate that the presented expected accuracy measures can be a good estimator for kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. Also, we show that the range of considered k can be significantly reduced to speed up the algorithm without negative influence on classification accuracy.
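The idea of choosing k per instance can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the similarity function is assumed to be plain spatial proximity to the query point, the expected accuracy is estimated by leave-one-out kNN on those similar observations, and the function names, the parameter `n_similar`, and the toy data are all hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, point, k, skip=None):
    """Majority vote among the k training points nearest to `point`."""
    neighbors = sorted(
        (i for i in range(len(train)) if i != skip),
        key=lambda i: dist(train[i][0], point),
    )[:k]
    votes = Counter(train[i][1] for i in neighbors)
    return votes.most_common(1)[0][0]


def expected_accuracy(train, similar, k):
    """Leave-one-out accuracy of kNN with this k on the similar observations."""
    hits = sum(
        knn_predict(train, train[i][0], k, skip=i) == train[i][1]
        for i in similar
    )
    return hits / len(similar)


def adaptive_knn_predict(train, point, candidate_ks, n_similar=5):
    """Pick the k with the highest expected accuracy, then classify with it."""
    # Assumed similarity function: the n_similar spatially closest observations.
    similar = sorted(
        range(len(train)), key=lambda i: dist(train[i][0], point)
    )[:n_similar]
    best_k = max(candidate_ks, key=lambda k: expected_accuracy(train, similar, k))
    return knn_predict(train, point, best_k), best_k


# Hypothetical toy data: two well-separated clusters of labeled 2D points.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
         ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"), ((11, 11), "b")]
label, best_k = adaptive_knn_predict(train, (0.5, 0.4), candidate_ks=[1, 3], n_similar=3)
print(label, best_k)
```

Restricting `candidate_ks` to a short list corresponds to the reduced range of k mentioned in the abstract; any other similarity function could replace the proximity-based selection of `similar`.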
Jun 20 2017 cs.SI
Peatland fires and haze events are disasters with national, regional and international implications. The phenomena lead to direct damage to local assets, as well as broader economic and environmental losses. Satellite imagery is still the main and often the only available source of information for disaster management. In this article, we test the potential of social media to assist disaster management. To this end, we compare insights from two datasets: fire hotspots detected via NASA satellite imagery and almost all GPS-stamped tweets from Sumatra Island, Indonesia, posted during 2014. Sumatra Island is chosen as it regularly experiences a significant number of haze events, which affect citizens in Indonesia as well as in nearby countries including Malaysia and Singapore. We analyse temporal correlations between the datasets and their geo-spatial interdependence. Furthermore, we show how Twitter data reveals changes in users' behavior during severe haze events. Overall, we demonstrate that social media is a valuable source of complementary and supplementary information for haze disaster management. Based on our methodology and findings, an analytics tool to improve peatland fire and haze disaster management by the Indonesian authorities is under development.
When evaluating the cause of one's popularity on Twitter, one thing is considered to be the main driver: many tweets. There is debate about the kind of tweet one should publish, but little beyond tweets. Of particular interest is the information provided by each Twitter user's profile page. One of these features is the given name on the profile. Studies in psychology and economics identified correlations of the first name to, e.g., one's school marks or chances of getting a job interview in the US. Therefore, we are interested in the influence of this profile information on the follower count. We addressed this question by analyzing the profiles of about 6 million Twitter users. All profiles are separated into three groups: users that have a first name, English words, or neither in their name field. The assumption is that names and words influence the discoverability of a user and subsequently his/her follower count. We propose a classifier that labels users who will increase their follower count within a month by applying different models based on the user's group. The classifiers are evaluated with the area under the receiver operating characteristic curve and achieve a score above 0.800.
Much attention has been given to the task of gender inference of Twitter users. Although names are strong gender indicators, the names of Twitter users are rarely used as a feature; probably due to the high number of ill-formed names, which cannot be found in any name dictionary. Instead of relying solely on a name database, we propose a novel name classifier. Our approach extracts characteristics from the user names and uses those in order to assign the names to a gender. This enables us to classify international first names as well as ill-formed names.
Understanding the mechanisms by which links are formed is an important and prominent research topic. In this paper, we therefore consider the link prediction problem in face-to-face contact networks, and analyze the predictability of new and recurring links. Furthermore, we study additional influence factors, and the role of stronger ties in these networks. Specifically, we compare neighborhood-based and path-based network proximity measures in a threshold-based analysis for capturing temporal dynamics. The results and insights of the analysis are a first step toward prediction applications for human contact networks, for example, for improving recommendations.
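Two common neighborhood-based proximity measures, common neighbors and the Jaccard coefficient, can be sketched as below. This is an illustration only: the abstract does not list the specific measures used, the threshold here is applied to the proximity score (which may differ from the thresholding in the paper), and the toy contact network and function names are hypothetical.

```python
def common_neighbors(adj, u, v):
    """Number of contacts that u and v share."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared contacts relative to all contacts of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def predict_links(adj, score, threshold):
    """Unlinked node pairs whose proximity score reaches the threshold."""
    nodes = sorted(adj)
    return [
        (u, v)
        for i, u in enumerate(nodes)
        for v in nodes[i + 1:]
        if v not in adj[u] and score(adj, u, v) >= threshold
    ]


# Hypothetical undirected contact network as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

# a and d are not yet linked but share both of their contacts.
print(predict_links(adj, common_neighbors, threshold=2))
print(jaccard(adj, "a", "d"))
```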
This paper focuses on the prediction of real-world talk attendances at academic conferences with respect to different influence factors. We study the predictability of talk attendances using real-world tracked face-to-face contacts. Furthermore, we investigate and discuss the predictive power of user interests extracted from the users' previous publications. We apply Hybrid Rooted PageRank, a state-of-the-art unsupervised machine learning method that combines information from different sources. Using this method, we analyze and discuss the predictive power of contact and interest networks separately and in combination. We find that contact and similarity networks achieve comparable results, and that combinations of different networks can help to improve the prediction quality only to a limited extent. For our experiments, we analyze the predictability of talk attendance at the ACM Conference on Hypertext and Hypermedia 2011, collected using the conference management system Conferator.
With social media and the corresponding social and ubiquitous applications finding their way into everyday life, there is a rapidly growing amount of user generated content yielding explicit and implicit network structures. We consider social activities and phenomena as proxies for user relatedness. Such activities are represented in so-called social interaction networks or evidence networks, with different degrees of explicitness. We focus on evidence networks containing relations on users, which are represented by connections between individual nodes. Explicit interaction networks are then created by specific user actions, for example, when building a friend network. On the other hand, more implicit networks capture user traces or evidences of user actions as observed in Web portals, blogs, resource sharing systems, and many other social services. These implicit networks can be applied for a broad range of analysis methods instead of using expensive gold-standard information. In this paper, we analyze different properties of a set of networks in social media. We show that there are dependencies and correlations between the networks. These allow for drawing reciprocal conclusions concerning pairs of networks, based on the assessment of structural correlations and ranking interchangeability. Additionally, we show how these inter-network correlations can be used for assessing the results of structural analysis techniques, e.g., community mining methods.
Onomastics is "the science or study of the origin and forms of proper names of persons or places." ["Onomastics". Merriam-Webster.com, 2013. http://www.merriam-webster.com (11 February 2013)]. Personal names in particular play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste. With the rise of the Social Web and its applications, users increasingly interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names. The present work shows how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, for given names and city names, respectively. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed. The discovered relations among given names are the foundation of "nameling" [http://nameling.net], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.
All over the world, future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and especially personal taste. Although this task is omnipresent, little research has been conducted on the analysis and application of interrelations among given names from a data mining perspective. The present work tackles the problem of recommending given names by first mining for inter-name relatedness in data from the Social Web. Based on these results, the name search engine "Nameling" was built, which attracted more than 35,000 users within less than six months, underpinning the relevance of the underlying recommendation task. The accruing usage data is then used for evaluating different state-of-the-art recommendation systems, as well as our new NameRank algorithm which we adopted from our previous work on folksonomies and which yields the best results, considering the trade-off between prediction accuracy and runtime performance as well as its ability to generate personalized recommendations. We also show how the gathered inter-name relationships can be used for meaningful result diversification of PageRank-based recommendation systems. As all of the considered usage data is made publicly available, the present work establishes baseline results, encouraging other researchers to implement advanced recommendation systems for given names.
Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet and applying well-known metrics of semantic similarity there. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.
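The first two measures, tag co-occurrence and cosine similarity of co-occurrence distributions, can be sketched as follows. This is a minimal illustration assuming each post is simply a set of tags; the exact weighting used in the paper may differ, and the toy posts and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence(posts):
    """Count how often each pair of tags appears in the same post."""
    co = defaultdict(Counter)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def cosine(co, a, b):
    """Cosine similarity between the co-occurrence vectors of tags a and b."""
    va, vb = co[a], co[b]
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical posts: "python" and "java" never share a post,
# yet they appear in the same tag contexts.
posts = [
    {"python", "programming", "tutorial"},
    {"java", "programming", "tutorial"},
    {"python", "programming"},
    {"java", "programming"},
]
co = cooccurrence(posts)

print(co["python"]["java"])           # direct co-occurrence count
print(cosine(co, "python", "java"))   # similarity of their contexts
```

The example shows why the two measures behave differently: direct co-occurrence misses tags that are used interchangeably and therefore rarely appear together, while the cosine of their co-occurrence vectors still rates them as closely related.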