The pervasive use of social media sites such as Facebook, Instagram, LinkedIn, and Twitter has been producing vast amounts of a new form of data, simply known as social media data. It is mostly user-generated, informal, incomplete, and multimedia, and is often accompanied by information about time and location.
This form of data is an exceptionally rich resource that allows researchers in this field to study and understand human behavior and activities in unprecedented ways. The data is large in scale, updates quickly and constantly, and spreads far and wide in an increasingly “flat,” or digital, world. As an example of mining social media data, two researchers at HP Labs analyzed movie mentions in 2.9 million tweets from 1.2 million users over three months. They discovered Twitter data could be used to predict box-office sales with impressive accuracy, outperforming market research. Since then, many successful ad hoc experiments using Twitter data to make predictions have followed.
A natural question, therefore, is: can we repeat the same success in scientific research? And if so, what challenges do we need to address?
The first challenge is gathering sufficient data for a particular application. Put another way, is the social media data on a specific topic “big,” or sufficient, enough? The second challenge is evaluating the success of social media mining: determining whether the mined results are meaningful rather than spurious or accidental, or whether there is simply not enough data to draw any conclusion at all.
Let us first examine the bigness of social media data. Collectively, it is certainly big. But when we probe from the viewpoint of data mining, the answer becomes less straightforward. As many researchers have reported, social media data often roughly follows a power-law distribution: there are a few prominent figures and a large number of common users. These distinguished figures attract an inordinate number of followers, and a significant amount of data can be collected about them. This means there is a lot of information about one item, say a celebrity and her whereabouts, but not enough about other things, such as the energy consumption of the average user.
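The “few prominent figures, many common users” shape can be made concrete with a small simulation. The sketch below assumes an idealized Zipf rank-frequency rule, posts(rank) ≈ C / rank, which is a common stand-in for the power-law distributions reported in the literature; the numbers are illustrative, not measured from any real site.

```python
# Hedged sketch: simulate posts-per-user counts under an idealized
# Zipf (power-law) rule to illustrate how a tiny fraction of users
# accounts for most of the data.
NUM_USERS = 10_000

# The user at rank r produces roughly 10_000 / r posts.
posts = [10_000 // rank for rank in range(1, NUM_USERS + 1)]

total = sum(posts)
top_1_percent = sum(posts[: NUM_USERS // 100])  # the 100 most prolific users
print(f"top 1% of users produce {top_1_percent / total:.0%} of all posts")
```

Running this shows the top 1 percent of users contributing over half of all posts, while the long tail of ordinary users each contributes only a post or two, which is exactly the “thin data” problem described next.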
The true treasure is often embedded in the data that is less frequently shared, including that about common users, which is valuable because they are potential consumers who can help increase profits for businesses. If we look at ourselves or our friends and ask whether the data produced about us is big, the answer is: not really. For example, there may be a lot of data about where a prominent figure shops, eats, and so on throughout the day, but this is not the case for average users, who share only a few things about their day. This can make it difficult to collect enough data to tell a full story about who the average person is.
Therefore, social media mining faces a dilemma: we have abundant data about those we don’t need information about, but only little, or “thin,” data about those we want to know more about. To enable effective mining, we need to make this thin data thick. One way to do so is to identify the same users across multiple social media sites, gathering more data on them to better understand their online behaviors and interests. Such a behavior-based approach can help tackle this problem.
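One simple way a behavior-based linking approach might work is to compare candidate accounts by the overlap of their behavioral features. The sketch below is an assumption for illustration only: the usernames and feature sets are hypothetical, and it uses plain Jaccard similarity on shared topics, whereas real systems combine much richer signals such as posting times, language style, and shared links.

```python
# Hedged sketch of behavior-based user linking across two sites.
# All names and feature sets here are hypothetical.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical behavioral features: topics each account posts about.
site_a = {"alice_k": {"ml", "nlp", "running", "coffee"}}
site_b = {"a.kim":   {"ml", "running", "coffee", "photography"},
          "bob77":   {"cars", "football"}}

# Link an account on site A to its most behaviorally similar account on site B.
target = site_a["alice_k"]
best = max(site_b, key=lambda user: jaccard(target, site_b[user]))
print(best, round(jaccard(target, site_b[best]), 2))  # prints: a.kim 0.6
```

Linking the same person’s thin accounts across several sites in this way aggregates their data into a single thicker profile, which is the goal described above.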
The second challenge has to do with empirical evaluation. Let’s look at how we conventionally conduct evaluation for data mining to ensure that results are verifiable and reproducible. A common approach is to evaluate a data-mining algorithm or model on data with ground truth, following a standard procedure such as cross-validation. Here, data with ground truth usually means we have both training and test datasets. Training data is used for learning, and test data is used exclusively for evaluation. Therefore, if other researchers would like to compare algorithms, they simply use the same data and follow the same procedure to obtain results. If the experimental results show no significant differences, the results are considered verifiable and reproducible.
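The standard procedure described above can be sketched in a few lines. In k-fold cross-validation, the labeled data is split into k folds; each fold serves once as the held-out test set while the rest is used for training. The “model” below is a trivial majority-class predictor, an assumption chosen only to keep the example self-contained.

```python
# Minimal sketch of k-fold cross-validation on data with ground truth.
# The model (majority-class prediction) is deliberately trivial.
from collections import Counter

def k_fold_accuracy(labels, k=5):
    n = len(labels)
    scores = []
    for i in range(k):
        test = labels[i * n // k:(i + 1) * n // k]        # held-out fold
        train = labels[:i * n // k] + labels[(i + 1) * n // k:]
        majority = Counter(train).most_common(1)[0][0]    # "train" the model
        scores.append(sum(y == majority for y in test) / len(test))
    return sum(scores) / k  # average accuracy across the k folds

labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical ground-truth labels
print(f"mean accuracy: {k_fold_accuracy(labels):.2f}")  # prints: mean accuracy: 0.70
```

Because the data and the procedure are fixed, any researcher who reruns this gets the same score, which is precisely what makes conventional evaluation verifiable and reproducible, and what is missing when ground truth is unavailable.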
However, for social media mining, we usually do not have training or test datasets. When we study how information spreads, for example, we suspect it could be driven by either influence or homophily (information sharing due to user similarity), and we do not always know how to differentiate the two. Recent research papers present two such cases: detecting migration patterns in social media and verifying sample goodness without ground truth.
The above examples illustrate why social media mining differs from other forms of data mining and hopefully piques your interest to learn more about this new and exciting area. If you would like to better understand how to mine and analyze a social media site, I recommend reading two recently published books, which come with free PDF downloads: Twitter Data Analytics (Springer, 2013), which includes sample datasets and code used to generate the examples featured in the book, and Social Media Mining: An Introduction (Cambridge University Press, 2014). You’re also welcome to leave your questions for me in the comments section below.
Huan Liu is an IEEE Fellow and a professor of computer science and engineering at Arizona State University, in Phoenix. He is also the director of the school’s Data Mining and Machine Learning Laboratory, where he is at work on two projects: TweetTracker and TweetXplorer.