Society is at a turning point, and it’s the result of big data. Many believe the explosion of ways to identify, collect, store, and process information will provide an unprecedented ability for people to understand and control the natural world and especially the social world.
This trend toward exploiting incredibly large data sets, which could also be used to make predictions, is generally lumped under the term “big data.” The term first emerged in the 1980s to describe the impact computers had on the social sciences in the 1960s and 1970s. Indeed, the need to understand larger and larger data sets was a driving force behind the development of computational technology.
The roots of big data go back much further than the current “information age.” The need to grapple with data sets beyond one person’s native ability began almost 10,000 years ago, when our ancestors abandoned hunter-gatherer lifestyles for agriculture. The agricultural revolution led to population concentrations with more complex political organizations and more extensive craft specialization and trade. In turn, this required better ways to account for people and goods.
Archaeologists believe that alphabetic and numeric systems and arithmetic originated in the ancient agricultural Near East. Their purpose: to keep track of crops and livestock collected as taxes by central theocratic bureaucracies. Calculating devices such as the abacus soon followed.
Consider just one use of today’s big data with a deep history and a major impact on computational technology: keeping track of a country’s citizenry. This has often been accomplished through a periodic counting, or census. Many references to censuses exist in the ancient world, from Egyptian tomb inscriptions and the Hebrew Bible to, perhaps most famously, the “worldwide” Roman census described in the Book of Luke in the New Testament.
In 2 C.E., Han Dynasty China conducted a census—the largest for centuries to come—whose accuracy is considered remarkable for the time. It counted 59.6 million individuals and 12.36 million households. In 1086, William I “the Conqueror” of England commissioned the Domesday Book, in which were recorded the names of all the landowners in his kingdom and their possessions. His aim: to better collect taxes. Since William did not have access to the tax records of the king he defeated, Harold Godwinson*, he had to start from scratch.
An even greater need for accurate counts of citizens emerged with the formation of highly centralized and bureaucratic modern nation-states. At the time of the Domesday Book, the population of what was to become the United Kingdom was probably under 2 million.
The 17th-century Enlightenment polymath Sir William Petty, a founder of the U.K.’s Royal Society, developed a number of statistical techniques to estimate this population, which in his day was probably about 6 million. By the time George III ordered the first modern British census to help govern his burgeoning empire in 1801, the count was almost 11 million people. George’s action followed well after the first systematic census in Europe had been conducted by Great Britain’s rival Prussia way back in 1719, and after the first census by his breakaway colony, the United States, was made in 1790.
It was in the United States that an accurate population count took on its greatest importance and had the greatest impact on statistics, especially on tabulating and processing technology. Population counts were needed not just for military musters and economic planning but also for political representation in the growing young nation. The U.S. Constitution mandates “an actual enumeration” every 10 years. The first U.S. census, in 1790, revealed about 4 million inhabitants. By 1870 the number had reached more than 38 million, and clerks had great difficulty processing the data by hand.
Accordingly, the U.S. Census Bureau began to experiment with ways to automate the process. Based on his work for the 1880 census, an engineer named Herman Hollerith, a member of the bureau’s technical team, felt he could improve the process. He got busy and, in 1884, filed a patent for an electromechanical device that rapidly read information encoded by punching holes on a paper tape or a set of cards [see photo, above]. In 1889 Hollerith’s Tabulating Machine Co. was chosen to process the 1890 census. The project was successful; on 16 August 1890, the population of the United States was put at 62,622,250 people.
Because Hollerith had a monopoly on the new method of data processing, he charged a premium for his equipment. In reaction, the bureau developed its own punch-card equipment based on his ideas. After a number of patent battles, which Hollerith ultimately lost, competitors made innovations in the field of tabulation and calculation. His company went on to evolve into International Business Machines (now better known as IBM).
The Census Bureau was not done with innovation. The census of 1940 revealed more than 132 million inhabitants, putting a strain on the Hollerith-type system. A company formed by John Presper Eckert and John Mauchly, who invented the ENIAC computer during World War II, received a contract from the bureau to develop an electronic computer to tabulate the census. Remington Rand, a pioneer in calculating machines, bought Eckert and Mauchly’s company, and on 31 March 1951 delivered the Univac, which stored data on magnetic tape [see top photo] instead of punch cards. The bureau immediately put it to work on its 1950 data.
The need to store and manipulate large data sets from the census was critical to the evolution of the computer and to the birth of the information age, which some date to around 1948, with the publication of Claude E. Shannon’s information theory and Norbert Wiener’s Cybernetics. Those same years also saw the invention of the transistor and the running of the first stored-program computer, developments that ultimately fed the rise of big data.
If a key characteristic of big data proves to be its value in prediction, one story about Univac drives this point home. As a publicity stunt during the 1952 U.S. presidential election, Remington Rand set up a Univac to monitor and predict its outcome, which pollsters projected to be very close. When the computer predicted in the early evening a landslide for Dwight D. Eisenhower, journalists at the CBS television network doubted the result and refused to announce it until much later that night. In perhaps its first predictive foray, big data proved to be right.
*This article has been corrected from the version that appeared in print.