Data Miners

IEEE Fellows have developed innovative ways to extract data from large repositories

15 September 2014

This article is part of our series highlighting IEEE Fellows in celebration of the Fellow program’s 50th anniversary year.

Several IEEE Fellows have been digging deep into data repositories with analytical tools to find value among all the nuggets of information there. They’ve built systems to analyze all kinds of data coming from sources that include product reviews, social media, and even the game of baseball.

[Photo of Michael Berthold. Credit: Konstanz University]

IEEE Fellow Michael Berthold helped develop KNIME, short for Konstanz Information Miner, an open-source data analytics, reporting, and integration platform. Berthold, chair for bioinformatics and information mining at Konstanz University, in Germany, is also president of KNIME AG, the company formed in Zurich to market the system.

The platform has been around since 2006 and was first used for pharmaceutical research. Today it is being applied to data of every kind: numbers, images, molecules, signals, complex networks—you name it. The second version of KNIME was released in April.

Berthold’s research also focuses on using machine-learning methods for the interactive analysis of large information repositories in the life sciences. Machine learning deals with the construction and study of systems that learn from data rather than following explicitly programmed instructions. His research results are available on (where else?) the KNIME platform.

Berthold was elevated to IEEE Fellow in 2011 “for contributions to approximate learning algorithms for life science data mining.”


[Photo of Bing Liu. Credit: Roberta Dupuis-Devlin/University of Illinois at Chicago]

Retailers need all the help they can get to beat the competition, and positive product reviews are one way to boost sales. But some retailers have stooped to having their own employees post glowing online reviews of their merchandise. At other times, they’ve even paid people to write them. This practice is deceptive, of course, and it can be difficult to spot the fakes. Newly elevated Fellow Bing Liu has developed a way to detect these misleading raves with an analytic method called opinion mining.

Opinion mining, also known as sentiment analysis, relies on natural language processing, text analysis, and computational linguistics to identify and extract from written language subjective information about people’s evaluations, their attitudes, and even their emotions. Liu is a professor in the department of computer science at the University of Illinois at Chicago. He was elevated this year “for contributions to data mining.”

In a 2012 interview with the blog Content26, Liu explained that researchers have built detection models that draw on two kinds of signals: linguistic features of a review’s content, and metadata features such as the star rating, the reviewer’s user ID and geographic location, the host IP address, and the time when the review was posted. Using this data, Liu developed detection software to rid the Web of fake reviews, fake comments, and fake blogs.

“If a product is not selling well but has a large number of positive reviews, these reviews are clearly suspicious,” he said. The good news is that almost all review-hosting sites are actively dealing with this problem. “It gets harder and harder for impostors to post fake reviews,” he noted.
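
The rule of thumb Liu describes, weak sales paired with many positive reviews, can be illustrated with a minimal Python sketch. This is not Liu's detection software: the `ProductStats` record, the function names, and every threshold below are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductStats:
    """Hypothetical per-product summary; field names are illustrative."""
    product_id: str
    units_sold: int
    star_ratings: List[int]  # one entry per posted review, 1 to 5 stars


def positive_review_count(stats: ProductStats, positive_threshold: int = 4) -> int:
    """Count reviews at or above the (assumed) 'positive' star threshold."""
    return sum(1 for stars in stats.star_ratings if stars >= positive_threshold)


def is_suspicious(stats: ProductStats,
                  max_units_sold: int = 50,
                  min_positive_reviews: int = 20) -> bool:
    """Flag products that sell poorly yet have many positive reviews.

    Both cutoffs are arbitrary assumptions for this sketch.
    """
    return (stats.units_sold <= max_units_sold
            and positive_review_count(stats) >= min_positive_reviews)


products = [
    ProductStats("slow-seller", units_sold=12, star_ratings=[5] * 30 + [1] * 2),
    ProductStats("best-seller", units_sold=4000, star_ratings=[5] * 300 + [3] * 50),
    ProductStats("quiet-item", units_sold=8, star_ratings=[4, 5, 2]),
]

flagged = [p.product_id for p in products if is_suspicious(p)]
print(flagged)
```

A real detector would combine a signal like this with the linguistic and metadata features mentioned earlier rather than rely on a single cutoff.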


[Photo of Huan Liu. Credit: Arizona State University]

IEEE Fellow Huan Liu (no relation) has developed ways to track public emergencies and natural disasters, and to monitor potential terrorist threats through crowd-sourcing capabilities. He is a professor in the School of Computing, Informatics, and Decision Systems Engineering, part of the Ira A. Fulton Schools of Engineering at Arizona State University, in Tempe. He is also director of its Data Mining and Machine Learning Laboratory, where he is at work on two projects: TweetTracker and TweetXplorer.

TweetTracker collects, filters, and sorts tweets based on a variety of inputs. It can search for keywords such as “hurricane” and filter by geographic region, specific names mentioned in the tweets, hashtags, URLs, and language. The system then gathers and stores the targeted tweets.
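
The kind of filtering described above can be sketched in a few lines of Python. This is an illustration only, not TweetTracker's code: the `Tweet` and `TweetFilter` classes, their fields, and the choice to require every configured criterion to match are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tweet:
    """Hypothetical tweet record; not the Twitter API's data model."""
    text: str
    lang: str
    hashtags: List[str] = field(default_factory=list)
    coords: Optional[Tuple[float, float]] = None  # (latitude, longitude)


@dataclass
class TweetFilter:
    """Hypothetical filter; an empty criterion means 'do not filter on this'."""
    keywords: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    languages: List[str] = field(default_factory=list)
    # (min_lat, min_lon, max_lat, max_lon)
    bounding_box: Optional[Tuple[float, float, float, float]] = None

    def matches(self, tweet: Tweet) -> bool:
        """Return True only if the tweet satisfies every configured criterion."""
        text = tweet.text.lower()
        if self.keywords and not any(k.lower() in text for k in self.keywords):
            return False
        if self.hashtags:
            wanted = {h.lower() for h in self.hashtags}
            if not wanted & {h.lower() for h in tweet.hashtags}:
                return False
        if self.languages and tweet.lang not in self.languages:
            return False
        if self.bounding_box is not None:
            if tweet.coords is None:
                return False
            lat, lon = tweet.coords
            min_lat, min_lon, max_lat, max_lon = self.bounding_box
            if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
                return False
        return True


tweets = [
    Tweet("Hurricane warning issued for the coast", "en", ["storm"], (25.8, -80.2)),
    Tweet("Huracán se acerca", "es", ["huracan"], (18.5, -66.1)),
    Tweet("Great game tonight", "en", [], (40.7, -74.0)),
]

storm_filter = TweetFilter(
    keywords=["hurricane"],
    languages=["en"],
    bounding_box=(24.0, -88.0, 31.0, -79.0),
)

# Keep only the tweets that pass the filter, as a stand-in for "gathers and stores".
stored = [t for t in tweets if storm_filter.matches(t)]
print([t.text for t in stored])
```

Only the first tweet passes here, because it is the only one that contains the keyword, is in English, and falls inside the bounding box.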

TweetXplorer then uses this information to create visual analytics such as maps and graphs that enable an analyst to dig deeper. The systems could help first responders, law enforcement agencies, and even the military monitor situations around the world by tracking related tweets and retweets.

Liu was elevated in 2012 “for contributions to feature selection in data mining and knowledge discovery.”


New technology being tested this season in three ballparks by Major League Baseball could change the thinking about how the games are played. The three stadiums fitted with the tracking equipment are Miller Park in Milwaukee, Target Field in Minneapolis, and Citi Field in New York.

[Photo of Claudio Silva. Credit: Polytechnic School of Engineering]

Developed by IEEE Fellow Claudio Silva, the technology is expected to revolutionize the way people evaluate baseball games by using data to connect all the actions that happen on the field—batting, pitching, fielding, base running—and determine how they work together, as well as pinpoint how an individual player’s game stands out. Silva is a professor of computer science and engineering at New York University’s Polytechnic School of Engineering, in Brooklyn.

The data comes from cameras that track the players and radar that tracks the ball’s route across the field. The cameras are mounted in clusters and simulate human eyes, providing stereoscopic vision to judge depth and movement. Software records the position of each player on the field 30 times a second, while radar tracks the flight path of the ball 20,000 times per second. It all adds up to some 7 terabytes of data per game.
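
To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. The sampling rates and the 7-terabyte total come from the description above. The three-hour game length and the decimal definition of a terabyte are assumptions, and the radar figure is only an upper bound, since the ball is not in flight for the whole game.

```python
# Figures quoted in the article.
PLAYER_SAMPLES_PER_SECOND = 30
RADAR_SAMPLES_PER_SECOND = 20_000
BYTES_PER_GAME = 7 * 10**12  # 7 TB, assuming decimal terabytes

# Assumption for this sketch: a game lasts about three hours.
ASSUMED_GAME_SECONDS = 3 * 60 * 60


def samples_per_game(rate_hz: int, game_seconds: int = ASSUMED_GAME_SECONDS) -> int:
    """Number of samples if a sensor ran continuously for the whole game."""
    return rate_hz * game_seconds


def average_bytes_per_second(total_bytes: int = BYTES_PER_GAME,
                             game_seconds: int = ASSUMED_GAME_SECONDS) -> float:
    """Average data rate implied by the per-game total."""
    return total_bytes / game_seconds


player_samples = samples_per_game(PLAYER_SAMPLES_PER_SECOND)      # per tracked player
radar_samples_upper = samples_per_game(RADAR_SAMPLES_PER_SECOND)  # upper bound
average_mb_per_second = average_bytes_per_second() / 10**6

print(f"Position samples per player: {player_samples:,}")
print(f"Radar samples (upper bound): {radar_samples_upper:,}")
print(f"Average data rate: about {average_mb_per_second:.0f} MB/s")
```

Under these assumptions, each tracked player accounts for 324,000 position samples, and the 7-terabyte total works out to an average of roughly 650 megabytes per second.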

When a batter hits the ball, the system gauges the ball’s speed, launch angle, distance, and hang time; each fielder’s first step, acceleration, speed, and route to the ball; and each base runner’s speed and route. Silva said anyone who watches baseball, from club owners to players to fans, will be privy to a new baseball world that was completely unexpected.

In a March interview with MLB.com, Silva said the biggest challenge in developing the system was ensuring that the data received reflects actual game play. To accomplish this, he said, he had to design a validation scheme “where we recorded our own video, and designed algorithms that would independently generate some of the metrics to be compared to the data that we were getting out of the system.

“To be one of the first few people to have the luxury of looking at the new data stream was a true privilege,” Silva added. “This data is so rich and there are so many interesting things we can do with it. We’re going to… find layers and layers of features that we never could see before.”

Silva was elevated in 2013 “for contributions to geometric computing and visualization.”

Visit the IEEE Fellow website to learn more about the Fellow program. Nominations for the class of 2016 are being accepted now.
