The Rise of Big Data
Are datasets the most valuable scientific instrument?
The importance of observation—the crux of the scientific method—remains unchanged from the early days of scientific discovery. The methods by which observations are made, however, have changed greatly. Consider astronomy. In the early days, under a black expanse of night punctuated by brilliant fiery lights, a group of science-minded people looked up at the sky and recorded what they saw—the fullness of the moon, the locations and formations of the stars.
Observation with the naked eye was the norm until the 17th century, when the invention of the telescope revolutionized astronomy, allowing scientists to see beyond what their eyes could show them—a literal portal into the unknown.
Now, a new revolution is taking place, in astronomy and across nearly all scientific disciplines: a data revolution. Scientific data collection has become almost entirely automated, allowing for the collection of vast amounts of data at record speed. These massive datasets allow researchers from various organizations and locales to mine and manipulate the data, making new discoveries and testing hypotheses from the contents of a spreadsheet.
"The astronomy community was able to switch to the idea that they can use a database as a telescope," says Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, Johns Hopkins University, as well as a researcher in the Sloan Digital Sky Survey (SDSS), a 10+ year effort to map one-third of the sky.
Thanks to projects like the SDSS and open access data from the Hubble Space Telescope, would-be Galileos don't need access to a telescope, or even a view of the night sky, to make discoveries about our universe. Instead, huge datasets (so-called "big data") can provide the optimal view of the sky, or, for that matter, the chemical base pairs that make up DNA.
How big is 'big data'?
It is hard to estimate exactly how much data exists today compared to the early days of computers. But, "the amount of personal storage has expanded dramatically due to items like digital cameras and 'intellectual prosthetics,' like iPhones," says Johannes Gehrke, professor, Department of Computer Science, Cornell University. "For example, if you bought a hard drive 20 years ago, you would have had 1.5 to 2 gigabytes of storage. Today, you can easily get 2 terabytes. That's a factor of 1,000."
It is not just the amount of data that has changed; the way we interact with and access that data has changed too, says Gehrke, a 2011 winner of the New York Academy of Sciences Blavatnik Awards for Young Scientists. "There is an entire industry that has sprung up around our ability to search and manage data—look at Google and Microsoft," says Szalay.
But what is big data? Is a 2-terabyte file considered big data? Not anymore. "It's a moving target," says Szalay. "In 1992, we thought a few terabytes was very challenging." Now, the average portable, external hard drive can store a few terabytes of data. An easy definition of big data is "more data than a traditional data system can handle," says Gehrke.
Searching for structure
Scientists working on large-scale projects, like the SDSS, or those in genomics or theoretical physics, now deal with many terabytes, even petabytes, of information. How is it possible to make sense of so much data?
"We have the data—we can collect it—but the bottleneck occurs when we try to look at it," says Szalay. Szalay is currently working on a project at Johns Hopkins to build a data-driven supercomputer (called a data scope) that will be able to analyze the big datasets generated by very large computer simulations, such as simulations of turbulence. "We are able to provide scientists who don't usually have access to this kind of computing power with an environment where they can play with very large simulations over several months; with this computer we are providing a home to analyze big data."
The rub? Scientists need to be fluent in computation and data analysis to use such resources. "Disciplines in science have been growing apart because they are so specialized, but we need scientists, regardless of their specific niche, to get trained in computation and data analytics. We need scientists to make this transition to ultimately increase our knowledge," says Szalay.
Two fields in particular are garnering attention from scientists for their ability to provide structure when data is overwhelming: data visualization and machine learning.
Data visualization takes numbers that are either generated by a large calculation or acquired with a measurement and turns them into pictures, says Holly Rushmeier, chair and professor, Department of Computer Science, Yale University, and a judge for the Academy's Blavatnik Awards for Young Scientists. For example, a project might take numbers representing flow going through a medium and turn them into an animation.
"Visualization allows you to look at a large volume of numbers and look for patterns, without having a preconceived notion of what that pattern is," says Rushmeier. In this way, visualization is both a powerful debugging tool (allowing researchers to see, through the creation of a nonsensical picture, if there might be a flaw with their data) and an important means for communication of data, whether to other researchers or to the general public (as in the case of weather forecasts). So perhaps the old adage needs to be re-written: Is a picture now worth a thousand lines of code?
"There are many flavors of visualization," says Rushmeier. Information can be mapped onto a natural structure, such as valves being mapped onto the heart, or an entirely new picture can be created (data without a natural structure is referred to as high-dimensional data). The classic example of high-dimensional data is credit card data, says Rushmeier, "but there is a lot of high-dimensional data in science."
Rushmeier is currently immersed in 3D mapping, working closely with an ornithologist who studies bird vision. He records light waves to which birds are sensitive, from the UV to the infrared, to get a better sense of how bird vision evolved and for what purposes (e.g., mating and survival). Through 3D mapping, Rushmeier is able to take the ornithologist's numerical data and simulate the actual viewpoint of the bird onto different 3D surfaces.
Learning without limits?
"To stop a conversation dead in its tracks, I tell people I work in statistics. To get a conversation going, I say I work in artificial intelligence," jokes David Blei. Both are true—Blei, associate professor, computer science, Princeton University, works in the field of machine learning, a field that encompasses both statistical and computational components.
The goal of machine learning is to build algorithms that find patterns in big datasets, says Blei. Patterns can either be predictive or descriptive, depending on the goal. "A classic example of a predictive machine-learning task is spam filtering," says Blei. A descriptive task could, for instance, help a biologist pinpoint information about a specific gene from a large dataset.
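To make the spam-filtering example concrete, here is a minimal sketch of a predictive task. It assumes a naive Bayes word-count classifier, which is one common textbook approach and not necessarily what Blei or any real mail service uses. The function names, the four training messages, and the labels are all hypothetical, and the sketch assumes both labels appear in the training data.

```python
import math
from collections import Counter

def train(messages):
    """Count words per label from (text, label) pairs; labels are 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the higher naive Bayes log-score for the text."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        # Start from the prior: how common this label was in training.
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen word/label pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to friday", "ham"),
    ("lunch on friday with the team", "ham"),
]
word_counts, label_counts = train(training)
print(predict("free prize money", word_counts, label_counts))
print(predict("team meeting friday", word_counts, label_counts))
```

The pattern the algorithm "finds" here is simply which words show up more often under each label; a real filter would learn from far more messages and richer features.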
Machine learning is not only used by technology companies and scientists—it is a part of our daily lives. The Amazon shopping and Netflix recommendations that pop up almost instantaneously on our computer and TV screens are a result of complex machine-learning algorithms, and the recommendations are often eerily spot-on. But it is important to remember that getting from data to real information requires a step, says Blei. This is especially true when machine learning is applied to science and medicine.
"We need more work in exploratory data analysis," says Blei, as well as careful validation of algorithms, to avoid making irresponsible conclusions. Interestingly, Blei says that quality of data is not as important to the final result as it might seem; instead, quantity of data is paramount when it comes to drawing conclusions through machine learning. And enormous datasets abound in science—just consider all of the raw data generated by The Human Genome Project.
Now, says Blei, the analysis of data sources (like Twitter) poses an equally big challenge. "Unlike a dataset, a data source has no beginning and no end."
A prediction that doesn't require a complex algorithm? The fields of data visualization and machine learning, as well as other forms of data science, will continue to grow in importance as datasets and data sources get bigger over time and everyone, from neuroscientists to corporations, looks for a way to turn data into meaningful information.
Diana Friedman is executive editor of The New York Academy of Sciences Magazine.
Modeling Our World: Understanding Nature Through Numbers
When seeking to understand whether an invasive species could comfortably settle down and thrive in any given area, it's not highly desirable to test the idea with, say, real African giant pouched rats on a real island in Florida. Even experiments that involve placing organisms, like dangerous parasites, into experimental fields in BL3 or BL4 facilities raise concerns. Lawrence Roberge, associate professor of anatomy and physiology at Laboure College, touts the value of computer modeling as a safe, quick, cost-effective, and most of all, accurate, alternative to real-world ecological experiments.
Using GARP (genetic algorithm rule set prediction) analysis—the "gold standard of ecological modeling for invasive species"—Roberge is able to tell whether an invasive species has the potential to survive and thrive in a particular geographical area—an important defense against potential bioterrorism. "Invasives can cause damage that can weaken a nation. For example, a parasite that affects wheat or corn could lead to changes in the food supply, economic output, and eventually, social unrest." GARP analysis uses a two-step approach that first models an ecological space (abiotic and biotic) and then projects that model into a particular geographical space. "For the model to work right, data collection must be primary," says Roberge, who used the model for his doctoral dissertation on invasive species. "You need really detailed survey data on the species, including its population and distribution, and the geographical space." If such data are available, the model can churn out impressive information about whether a pathogen can "take" to a new location and how fast it could spread.
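The two-step idea (model the ecological space, then project it onto geography) can be illustrated with a much simpler stand-in than GARP. The sketch below uses a plain range "envelope" over environmental variables, not GARP's genetic algorithm, and every variable name, survey record, and map cell in it is hypothetical.

```python
def fit_envelope(occurrences):
    """Step 1: for each environmental variable, record the range where the species was seen."""
    variables = occurrences[0].keys()
    return {
        v: (min(o[v] for o in occurrences), max(o[v] for o in occurrences))
        for v in variables
    }

def project(envelope, cells):
    """Step 2: list the map cells whose conditions fall inside every fitted range."""
    return [
        name
        for name, env in cells.items()
        if all(lo <= env[v] <= hi for v, (lo, hi) in envelope.items())
    ]

# Hypothetical survey records: conditions at sites where the species was observed.
occurrences = [
    {"temp_c": 22, "rain_mm": 900},
    {"temp_c": 28, "rain_mm": 1400},
    {"temp_c": 25, "rain_mm": 1100},
]

# Hypothetical grid cells in the region being assessed.
cells = {
    "cell_a": {"temp_c": 24, "rain_mm": 1200},
    "cell_b": {"temp_c": 10, "rain_mm": 1000},
    "cell_c": {"temp_c": 27, "rain_mm": 600},
}

envelope = fit_envelope(occurrences)
suitable = project(envelope, cells)
print(envelope)
print(suitable)
```

Even this toy version shows why Roberge stresses data collection: the fitted ranges are only as good as the survey records behind them.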
Ecological modeling is not limited to invasive species. Daniel B. Botkin realized 40 years ago that if he put down all the known information about how trees grow, including physiology, population distribution, and competing species, he could whittle down this information into statements and create a program that is a model of tree growth. Over one summer, and with surprisingly few lines of code, Botkin (professor emeritus, Department of Ecology, Evolution, and Marine Biology, University of California, Santa Barbara) and two colleagues from the IBM Thomas J. Watson Research Center, theoretical physicist James Janak and hydrologist James Wallis, created this very model.
"Creating a model in a computer is standard science in the sense that the primary statements are hypotheses," says Botkin. In order to learn something new from the model, it must rely on precise data that are readily available and it must be validated, adds Botkin, whose model is now in use in more than 50 countries around the world. Currently, Botkin is working with scientists in Australia who have tree data that are more precise than the data here in the U.S. "This will allow us to extend the model's validation," he says. The model already shows realistic first forest growth when run from scratch; it can create a forest that changes from deciduous to boreal at a specific elevation range; and it has accurately recreated all the forest data that Botkin was able to find.
"If a model is realistic, you can learn a tremendous amount from it," says Botkin. Changing the statements in the model can give you information about forest growth under different conditions that could not be observed over a single lifetime, or even several lifetimes. "However," he cautions, "if a model has nothing to do with reality, it can misguide you."
Safety In Numbers: Can Big Data Preserve Privacy?
It stands to reason that any data worth collecting could also be worth stealing. So in a world of big datasets and digitized information, is our data safe? "Previously, if you wanted to steal my tax return, you'd have to break into my accountant's office, take it, and physically transport it out of the office. Now, it's certainly easier to steal large amounts of data at one time," says Johannes Gehrke, professor, Department of Computer Science, Cornell University. This is a reality we deal with every day. Despite countless technologies aimed at thwarting digital data thieves (e.g., encryption, digital rights management, secure hardware), examples of data theft abound: credit cards are disabled and new ones are mailed on a regular basis because massive datasets of credit card numbers are "compromised."
But security is not a one-dimensional concept, says Gehrke. Rather than data simply being secure or not, security represents a ratio of risk and reward. Gehrke gives the example of building a 10-foot fence around your house. Your house might be more secure, but you'd be giving up views of your neighborhood, the curb appeal of your house, and ease of access. Similarly, people strike a balance between risk and reward every time they put data online for the sake of convenience—the ability to shop online or do banking online. Striking this balance is something we do willingly, if not always consciously.
It is the realm of privacy where a larger grey area appears. Privacy, the ability to be left alone, or to hide in a crowd, is a distinct concept from that of security, says Gehrke. And privacy can pose problems for researchers dealing with sensitive information. "For example, a hospital may want to share data about its patients for the benefit of medical research, but you shouldn't be able to tell I was a patient." It can be hard to publish data in such a way that the confidentiality of the person's identity is preserved, but the data is still meaningful, says Gehrke. "Privacy introduces a lot of noise into data. We want a balance of data utility and privacy."
Data privacy is a continually evolving field where algorithms will play an important role. For example, if the hospital wants to publish information about how many of its patients have a given disease, it could take this number, add some noise, then publish this slightly randomized number—and already it becomes much harder to infer whether a particular individual was in the hospital. "Finding algorithms that randomize data such that an attacker can learn little, but a researcher could still learn enough from data is one of the major open problems in data privacy today," says Gehrke.
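The "add some noise, then publish" step can be sketched in a few lines. This assumes Laplace-distributed noise scaled by a privacy parameter epsilon, the approach used in differential privacy; the article does not say which noise distribution Gehrke's group uses, so the function name, the epsilon value, and the patient count below are illustrative assumptions.

```python
import random

def noisy_count(true_count, epsilon, rng):
    """Return true_count plus Laplace noise with scale 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon  # one patient changes the count by at most 1
    # The difference of two independent exponential draws is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Hypothetical example: 120 patients with a given disease.
rng = random.Random(0)  # seeded only so the sketch is repeatable
published = noisy_count(120, epsilon=0.5, rng=rng)
print(round(published))
```

A smaller epsilon means more noise and stronger privacy, but a less useful number for researchers, which is the utility-versus-privacy balance Gehrke describes.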
Gehrke and his colleagues in the Cornell Database Group have recently been approached by the World Bank to help generate such privacy-protecting algorithms for use in the developing world. To learn more about the group's work on data privacy, visit www.cs.cornell.edu/bigreddata/privacy/.