David Weinberger on Social Science and Big Data

Excerpts below from David Weinberger’s “The Machine That Would Predict the Future” (Dec 2011 Scientific American). Full text pdf available here.

Summary: The article contrasts two ways of organizing the proliferating social science data known as Big Data. On the one hand, Weinberger describes the attempt to centralize all data in one machine; on the other, the “cloud” model of links and interconnections between multiple data centers.

Thesis: the problem with the ‘machine to fit all data’ is that this centralization undermines the ability to draw the context out of the data analytically. And context is the knowledge that actual actors need because, by definition, actors operate in particular locations (or fields) within the overall system producing the data. Actual, particular, localized individual actors benefit more from the unit of analysis provided by the cloud model. (This is my reading of the thesis, not necessarily the author’s.)


In the summer and fall of last year, the Greek financial crisis tore at the seams of the global economy.

Having run up a debt that it would never be able to repay, the country faced a number of potential outcomes, all unpleasant. Efforts to slash spending spurred riots in the streets of Athens, while threats of default rattled global financial markets. Many economists argued that Greece should leave the euro zone and devalue its currency, a move that would in theory help the economy grow. “Make no mistake: an orderly euro exit will be hard,” wrote New York University economist Nouriel Roubini in the Financial Times. “But watching the slow disorderly implosion of the Greek economy and society will be much worse.”

No one was sure exactly how the scenario would play out, though. Fear spread that if Greece were to abandon the euro, Spain and Italy might do the same, weakening the central bond of the European Union. Yet the Economist opined that the crisis would “bring more fiscal-policy control from Brussels, turning the euro zone into a more politically integrated club.” From these consequences would come yet further-flung effects. Migrants heading into the European Union might shift their travel patterns into a newly affordable Greece. A drop in tourism could limit the spread of infectious disease. Altered trade routes could disrupt native ecosystems. The question itself is simple—Should Greece drop the euro?—but the potential fallout is so far-reaching and complex that even the world’s sharpest minds found themselves unable to grasp all the permutations.

Questions such as this one are exactly what led Dirk Helbing, a physicist and the chair of sociology at the Swiss Federal Institute of Technology Zurich, to propose a €1-billion computing system that would effectively serve as the world’s crystal ball.

Helbing’s system would simulate not just one area of finance or policy or the environment. Rather it would simulate everything all at once—a world within the world—spitting out answers to the toughest questions policy makers face. The centerpiece of this project, the Living Earth Simulator, would attempt to model global-scale systems—economies, governments, cultural trends, epidemics, agriculture, technological developments, and more—using torrential data streams, sophisticated algorithms, and as much hardware as it takes. The European Commission was so moved by Helbing’s pitch that it chose his project as the top-ranked of six finalists in a competition to receive €1 billion.

The system is the most ambitious expression of the rise of “big data,” a trend that is striking many scientists as being on a par with the invention of the telescope and microscope. The exponential growth of digitized information is bringing together computer science, social science and biology in ways that let us address questions we just otherwise could not have posed, says Nicholas Christakis, a social scientist and professor of medicine at Harvard University. As an example, he points to the ubiquity of mobile phones that create oceans of information about where individuals are going, what they are buying, and even traces of what they are thinking. Combine that with other kinds of data—genomics, economics, politics, and more—and many experts believe we are on the cusp of opening up new worlds of inquiry.

“Scientific advance is often driven by instrumentation,” says David Lazer, an associate professor in the College of Computer and Information Science at Northeastern University and a supporter of Helbing’s project. Tools attract the tasks, or as Lazer puts it: “Science is like the drunk looking for his keys under the lamppost because the light is better there.” For Helbing’s supporters, the ranks of which include dozens of respected scientists all over the world, €1 billion can buy a pretty bright light.

Many scientists are not convinced of the need to gather the world’s data in a centralized collection, though. Better, they argue, to form data clouds on the Internet, connected by links to make them useful to all. A shared data format will give more people the opportunity to poke around through the data, find hidden connections and create a marketplace of competing ideas.

. . .

[F]inding correlations in sets of data is nothing out of the ordinary for modern science, even if those sets are now gigantic and the correlations span astronomical distances. . . . Yet this type of agent-based modeling works only in a very narrow set of circumstances, according to Gary King, director of the Institute for Quantitative Social Science at Harvard. In the case of a highway or the hajj, everyone is heading in the same direction, with a shared desire to get where they are going as quickly and safely as possible. Helbing’s FuturICT system, in contrast, aims to model systems in which people are acting for the widest variety of reasons (from selfish to altruistic); where their incentives may vary widely (getting rich, getting married, staying out of the papers); where contingencies may erupt (the death of a world leader, the arrival of UFOs); where there are complex feedback loops (an expert’s financial model brings her to bet against an industry, which then panics the market); and where there are inputs, outputs and feedback loops from related models. The economic model of a city, for example, depends on models of traffic patterns, agricultural yields, demographics, climate and epidemiology, to name a few.

Beyond the problem of sheer complexity, scientists raise a number of interrelated challenges that such a comprehensive system would have to overcome. To begin with, we don’t have a good theory of social behavior from which to start. King explains that when we have a solid idea of how things work—in physical systems, for example—we can build a model that successfully predicts outcomes. But whatever theories of social behavior we do have fall far short of the laws of physics in predictive power.

To further add to the challenge, news of a model’s conclusions can alter the situation it is modeling. “This is the big scientific question,” says Alessandro Vespignani, director of the Center for Complex Networks and Systems Research at Indiana University and the project’s lead data planner. “How can we develop models that include feedback loops or real-time data monitors that let us continuously update our algorithms and get new predictions” even as the predictions affect their own conditions?

The models also have to be incredibly intricate and particular. For example, if you ask an economic model if your city should reclaim some land and if the model does not take into account how that decision affects the food chain, it can generate a result that might be good economics but disastrous for the environment. With 10 million species, simply learning which one eats what is a daunting task. Further, relevant variances in food do not stop at the species level. Jesse Ausubel, an environmental scientist at the Rockefeller University, points out that by analyzing the DNA of the contents of the stomachs of bats, we can know for sure exactly what bats eat. But the food source of bats in a specific cave might be different from the food source of bats of the same species a few miles away. Without crawling through the guano-coated particularities cave by cave, experts relying on interrelated models may encounter unreliable and cascading effects.

