# NIPS2010 Workshop on Machine Learning for Social Computing

## Schedule

• Date: Saturday, December 11, 2010
• Time: 7:30am - 6:30pm
• Venue: Westin: Alpine A

| Time | Speaker | Title | Note |
|---|---|---|---|
| 7:30-7:35 | | Opening Remarks | |
| 7:35-8:20 | Eric P. Xing | [Invited Talk] Multiscale Community Block model for Network Exploration | |
| 8:20-8:40 | Ramesh Nallapati, Christopher Manning | TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents | paper5.pdf |
| 8:40-9:00 | Brendan O'Connor, Jacob Eisenstein, Eric P. Xing, Noah Smith | Discovering Demographic Language Variation | paper7.pdf |
| 9:00-9:25 | | Coffee Break and Poster Session | paper9.pdf, paper20.pdf, paper14.pdf, paper19.pdf |
| 9:25-10:10 | Lars Backstrom | [Invited Talk] People You May Know: Friend Suggestions on Facebook | |
| 10:10-10:30 | Ankan Saha, Vikas Sindhwani | Dynamic NMFs and Temporal Regularization for Online Analysis of Streaming Text | paper16.pdf |
| 10:30-15:30 | | | |
| 15:30-16:15 | Lee Giles | [Invited Talk] Entity Disambiguation and Social Computing | |
| 16:15-16:35 | Jeon Hyung Kang, Kristina Lerman | Integrating Specialist and Folk Knowledge with Affinity Propagation | paper6.pdf |
| 16:35-16:55 | Piyush Rai, Anusua Trivedi, Hal Daume, Scott DuVall | Multiview Clustering with Incomplete Views | paper12.pdf |
| 16:55-17:30 | | Coffee Break and Poster Session | |
| 17:30-17:50 | Saurabh Kataria, Prasenjit Mitra, Lee Giles | Context Sensitive Topic Models for Author Influence in a Linked Corpus | paper22.pdf |
| 17:50-18:10 | Delip Rao, David Yarowsky | Detecting Latent User Properties in Social Media | paper13.pdf |
| 18:10-18:30 | | General Discussion | |

### Posters

#### TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents

• Authors: Ramesh Nallapati, Christopher Manning
• Abstract: Modeling the influence of entities in networked data is an important problem in information retrieval and data mining. Popular algorithms such as PageRank capture this notion of authority by analyzing the hyperlink structure, but they ignore the topical content of the document. However, authority is often topic dependent, e.g., a web page of high authority in politics may be an unknown entity in sports. In this work, we describe a new model called TopicFlow that combines ideas from network flow and topic modeling, and captures the notion of topic-specific influences of hyperlinked documents in a completely unsupervised fashion. We show that on the task of citation recommendation, the TopicFlow model, when combined with TF-IDF-based cosine similarity, outperforms several competitive baselines by as much as 6.4%. We also present some qualitative visualizations to demonstrate the expressive power of the new model.
• Paper: paper5.pdf
• Slides:

#### Integrating Specialist and Folk Knowledge with Affinity Propagation

• Authors: Jeon Hyung Kang, Kristina Lerman
• Abstract: Knowledge on the social Web grows each time a user annotates a resource, for example, a Web page, a scientific article, a photo, or a video. While attaching descriptive labels, known as tags, to resources is still the most popular form of annotation, some social Web sites also allow users to create structured annotations. For example, social bookmarking sites Delicious (http://del.icio.us) and Bibsonomy (http://bibsonomy.org) allow users to specify broader–narrower relations between tags, and the social photo-sharing site Flickr (http://flickr.com) allows users to organize photos within folder-like hierarchies. While such annotations reflect individual users' needs and requirements for organizing the content they create, collectively, social annotations provide valuable evidence for harvesting social knowledge. Folksonomies, or taxonomies of concepts automatically extracted from social annotations of many users, will eventually help us better search for, browse, organize, manage, and integrate information on the Web.
• Paper: paper6.pdf
• Slides:

#### Discovering Demographic Language Variation

• Authors: Brendan O'Connor, Jacob Eisenstein, Eric Xing, Noah Smith
• Abstract: Even within a single language community, speakers from different backgrounds demonstrate substantial linguistic variation. Salient speaker characteristics include geography [9, 6], race [12], and socioeconomic status [8, 4]; they impact language at the phonological, lexical, and morphosyntactic levels [14]. Sociolinguistics and dialectology feature a strong quantitative tradition of studying the relationship between language and social and geographical identity. In general, these approaches begin by identifying both the communities of interest and the relevant linguistic dimensions of variability; for example, a researcher might identify the term “yinz” as characteristic of Pittsburgh dialect [3], and then model its relationship to the socioeconomic status of the speaker. Thus, while this approach has a quantitative foundation in modeling the relationship between linguistic and extra-linguistic data, it requires extensive fieldwork and linguistic expertise to identify the “inputs” that are to be analyzed. In this paper, we propose a new exploratory methodology for discovering demographic and geographic language variation from text and metadata. We unite these information sources in a Bayesian generative model, which explains both linguistic variation and demographic features through a set of generative distributions, each of which is associated with a (latent) community of speakers. Thus, our model is capable of discovering both the relevant sociolinguistic communities and the key dimensions of linguistic variation.
• Paper: paper7.pdf
• Slides:

#### Multiview Clustering with Incomplete Views

• Authors: Piyush Rai, Anusua Trivedi, Hal Daume, Scott DuVall
• Abstract: Multiview clustering algorithms allow leveraging information from multiple views of the data and therefore lead to improved clustering. A number of kernel-based multiview clustering algorithms work by using the kernel matrices defined on the different views of the data. However, these algorithms assume availability of features from all the views of each example, i.e., assume that the kernel matrix for each view is complete. We present an approach that allows these algorithms to be applicable even when only one (the primary) view is complete and the auxiliary views are incomplete (i.e., features from these views are available only for some of the examples). Taking kernel CCA-based multiview clustering as an example, we apply our method to webpage clustering with multiple views of the data, where one view is the page text and the other view is the social tags assigned to the webpage. We consider the case where the tags are available only for a small subset of the webpages, which means that the tag view is incomplete. Experimental results establish the effectiveness of the proposed method.

#### Detecting Latent User Properties in Social Media

• Authors: Delip Rao, David Yarowsky
• Abstract: The ability to identify user attributes such as gender, age, regional origin, and political orientation solely from user language in social media such as Twitter or similar highly informal content has important applications in advertising, personalization, and recommendation. This paper includes a novel investigation of stacked-SVM-based classification algorithms over a rich set of original features, applied to classifying these four user attributes. We propose new sociolinguistics-based features for classifying user attributes in Twitter-style informal written genres, as distinct from the other primarily spoken genres previously studied in the user-property classification literature. Our models, singly and in ensemble, significantly outperform baseline models in all cases.

#### Classifying Text Messages for Emergency Response

• Authors: Cornelia Caragea, Hyun-Woo Kim, Prasenjit Mitra, John Yen
• Abstract: In case of emergencies (e.g., earthquakes, flooding), rapid responses are needed in order to address victims' requests for help. Hence, the ability to classify tweets and text messages automatically, together with the ability to deliver the relevant information to the appropriate personnel, is essential for enabling the personnel to work in a timely and efficient manner to address the most urgent needs, and to understand the emergency situation better. The choice of features used to encode tweets and text message data is crucial for the performance of the learning algorithms. Here, we present a comparative study of four types of feature representations used to enable learning classifiers from such data. These feature representations are obtained using a “bag of words” approach, feature abstraction, feature selection, and Latent Dirichlet Allocation (LDA). The results of our experiments on a real-world text message data set show that feature abstraction can yield better performing models than those obtained by using a “bag of words”, feature selection, and LDA.

#### Dynamic NMFs and Temporal Regularization for Online Analysis of Streaming Text

• Authors: Vikas Sindhwani, Ankan Saha
• Abstract: Learning a dictionary of basis elements with the objective of building compact data representations is a problem of fundamental importance in statistics, machine learning and signal processing. In many settings, data points appear as a stream of high dimensional feature vectors. Streaming datasets present new twists to the problem. On one hand, basis elements need to be dynamically adapted to the statistics of incoming datapoints, while on the other hand, early detection of rising new trends is desirable in many applications. The analysis of social media streams formed by tweets and blog posts is a prime example of such a setting, where topics of social discussions need to be continuously tracked and new emerging themes are required to be rapidly detected. We formalize such problems in terms of online learning of dynamic non-negative matrix factorizations (NMF) with novel forms of temporal regularization. We describe a scalable optimization framework for our algorithms and report empirical results on detection and tracking of topics over simulated document streams and real-world news stories.
• Slides:

#### The Actor-Topic Model for Extracting Social Networks in Literary Narrative

• Authors: Asli Celikyilmaz, Dilek Hakkani-Tur, Hua He, Greg Kondrak, Denilson Barbosa
• Abstract: We present a generative model for conversational dialogues, namely the actor-topic model (ACTM), that extends the author-topic model (Rosen-Zvi et al., 2004) to identify the actors of a given conversation in literary narratives. Thus ACTM assigns each instance of quoted speech to an appropriate character. We model dialogues in a literary text, which take place between two or more actors conversing on different topics, as distributions over topics, which are also mixtures of the term distributions associated with multiple actors. This follows the linguistic intuition that rich contextual information can be useful in understanding dialogues, eventually affecting the social network construction. We propose ACTM to ideally lead our research on social network extraction in literary narratives. Our experiments on nineteenth-century English novels indicate that exploiting the content structure of dialogues can yield significant improvements over a baseline using language models, which is based on local context, in constructing social interactions.

#### Context Sensitive Topic Models for Author Influence in a Linked Corpus

• Authors: Saurabh Kataria, Prasenjit Mitra, Lee Giles
• Abstract: In a document network such as a citation network of scientific documents, web-logs, etc., the content produced by authors exhibits their interest in certain topics, whereas some authors tend to influence other authors' interests. In this work, we propose to model the influence of cited authors along with the interests of citing authors. Moreover, we hypothesize that apart from the citations present in a document, the context surrounding the citation mention provides extra topical information about the cited authors. However, associating terms in the context with the cited authors remains an open problem. We propose novel document generation schemes that incorporate the context while modeling the interests of citing authors and the influence of the cited authors simultaneously. We apply the proposed models to two text corpora: the CiteSeer and CiteULike datasets. The experiments based upon log-likelihood fit on the test documents suggest significant improvements over the baseline models.

#### Finding Credible Sources in Twitter based on Relevance and Social Structure

• Authors: Kevin Canini, Bongwon Suh, Peter Pirolli
• Abstract: As an increasingly large amount of knowledge is shared between users in Twitter, it is becoming a popular source of relevant information to many people. In Twitter, information is transferred primarily via a social relationship called “following”. Identifying users to follow who are highly relevant to a particular topic of interest can be a difficult task. To address this problem, we introduce a novel method of automatically identifying and ranking Twitter users according to their relevancy to a given topic. The algorithm combines the standard Twitter text search mechanism with information about the social relationships in the network, effectively leveraging the opinions of the crowd. We performed a study comparing the ranked lists generated by the algorithm with lists provided by a commercial website specifically designed for the same purpose. Our initial findings show a good potential for automatically identifying interesting and relevant users.

#### Ruling out latent homophily in social networks

• Authors: Greg Ver Steeg, Aram Galstyan
• Abstract: Despite recent high-profile studies identifying counter-intuitive behaviors (e.g. obesity (Christakis and Fowler, 2007)) as being socially contagious, Shalizi and Thomas (2010) have demonstrated that homophily on latent attributes is indistinguishable from influence. For sociologists to unequivocally identify influence effects in networks they must rule out the possibility of latent homophily as an explanation. This requires either undertaking the Sisyphean task of measuring every hidden attribute that might influence the formation of links in social networks, or, as is our goal, determining the conditions for distinguishing influence from homophily even in the presence of unobserved attributes. Our test is inspired by the Bell inequalities: a simple inequality involving observed probability distributions which is obeyed by classical physics, but violated by quantum physics. We show any model producing correlations between actors through static latent homophily alone will obey certain constraints, and we develop and test a technique to detect violation of these constraints.