Image: infocux Technologies/Flickr
When subatomic particles smash together at the Large Hadron Collider in Switzerland, they create showers of new particles whose signatures are recorded by four detectors. The LHC captures 5 trillion bits of data — more information than all of the world’s libraries combined — every second. After the judicious application of filtering algorithms, more than 99 percent of those data are discarded, but the four experiments still produce a whopping 25 petabytes (25×1015 bytes) of data per year that must be stored and analyzed. That is a scale far beyond the computing resources of any single facility, so the LHC scientists rely on a vast computing grid of 160 data centers around the world, a distributed network that is capable of transferring as much as 10 gigabytes per second at peak performance.
Original story reprinted with permission from Quanta Magazine, an editorially independent division of SimonsFoundation.org whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.
The LHC’s approach to its big data problem reflects just how dramatically the nature of computing has changed over the last decade. Since Intel co-founder Gordon E. Moore first defined it in 1965, the so-called Moore’s law — which predicts that the number of transistors on integrated circuits will double every two years — has dominated the computer industry. While that growth rate has proved remarkably resilient, for now, at least, “Moore’s law has basically crapped out; the transistors have gotten as small as people know how to make them economically with existing technologies,” said Scott Aaronson, a theoretical computer scientist at the Massachusetts Institute of Technology.
Instead, since 2005, many of the gains in computing power have come from adding more parallelism via multiple cores, with multiple levels of memory. The preferred architecture no longer features a single central processing unit (CPU) augmented with random access memory (RAM) and a hard drive for long-term storage. Even the big, centralized parallel supercomputers that dominated the 1980s and 1990s are giving way to distributed data centers and cloud computing, often networked across many organizations and vast geographical distances.
These days, “People talk about a computing fabric,” said Stanford University electrical engineer Stephen Boyd. These changes in computer architecture translate into the need for a different computational approach when it comes to handling big data, which is not only grander in scope than the large data sets of yore but also intrinsically different from them.
The demand for ever-faster processors, while important, isn’t the primary focus anymore. “Processing speed has been completely irrelevant for five years,” Boyd said. “The challenge is not how to solve problems with a single, ultra-fast processor, but how to solve them with 100,000 slower processors.” Aaronson points out that many problems in big data can’t be adequately addressed by simply adding more parallel processing. These problems are “more sequential, where each step depends on the outcome of the preceding step,” he said. “Sometimes, you can split up the work among a bunch of processors, but other times, that’s harder to do.” And often the software isn’t written to take full advantage of the extra processors. “If you hire 20 people to do something, will it happen 20 times faster?” Aaronson said. “Usually not.”
Researchers also face challenges in integrating very differently structured data sets, as well as the difficulty of moving large amounts of data efficiently through a highly distributed network.
Those issues will become more pronounced as the size and complexity of data sets continue to grow faster than computing resources, according to California Institute of Technology physicist Harvey Newman, whose team developed the LHC’s grid of data centers and trans-Atlantic network. He estimates that if current trends hold, the computational needs of big data analysis will place considerable strain on the computing fabric. “It requires us to think about a different kind of system,” he said.
Memory and Movement
Emmanuel Candes, an applied mathematician at Stanford University, was once able to crunch big data problems on his desktop computer. But last year, when he joined a collaboration of radiologists developing dynamic magnetic resonance imaging — whereby one could record a patient’s heartbeat in real time using advanced algorithms to create high-resolution videos from limited MRI measurements — he found that the data no longer fit into his computer’s memory, making it difficult to perform the necessary analysis.
Addressing the storage-capacity challenges of big data is not simply a matter of building more memory, which has never been more plentiful. It is also about managing the movement of data. That’s because, increasingly, the desired data is no longer at people’s fingertips, stored in a single computer; it is distributed across multiple computers in a large data center or even in the “cloud.”
There is a hierarchy to data storage, ranging from the slowest, cheapest and most abundant memory to the fastest and most expensive, with the least available space. At the bottom of this hierarchy is so-called “slow memory” such as hard drives and flash drives, the cost of which continues to drop. There is more space on hard drives, compared to the other kinds of memory, but saving and retrieving the data takes longer. Next up this ladder comes RAM, which is must faster than slow memory but offers less space is more expensive. Then there is cache memory — another trade-off of space and price in exchange for faster retrieval speeds — and finally the registers on the microchip itself, which are the fastest of all but the priciest to build, with the least available space. If memory storage were like real estate, a hard drive would be a sprawling upstate farm, RAM would be a medium-sized house in the suburbs, cache memory would be a townhouse on the outskirts of a big city, and the register memory would be a tiny studio in a prime urban location.
Longer commutes for stored data translate into processing delays. “When computers are slow today, it’s not because of the microprocessor,” Aaronson said. “The microprocessor is just treading water waiting for the disk to come back with the data.” Big data researchers prefer to minimize how much data must be moved back and forth from slow memory to fast memory. The problem is exacerbated when the data is distributed across a network or in the cloud, because it takes even longer to move the data back and forth, depending on bandwidth capacity, so that it can be analyzed.
One possible solution to this dilemma is to embrace the new paradigm. In addition to distributed storage, why not analyze the data in a distributed way as well, with each unit (or node) in a network of computers performing a small piece of a computation? Each partial solution is then integrated to find the full result. This approach is similar in concept to the LHC’s, in which one complete copy of the raw data (after filtering) is stored at the CERN research facility in Switzerland that is home to the collider. A second copy is divided into batches that are then distributed to data centers around the world. Each center analyzes its chunk of data and transmits the results to regional computers before moving on to the next batch.
Alon Halevy, a computer scientist at Google, says the biggest breakthroughs in big data are likely to come from data integration. Image: Peter DaSilva for Quanta Magazine
Boyd’s system is based on so-called consensus algorithms. “It’s a mathematical optimization problem,” he said of the algorithms. “You are using past data to train the model in hopes that it will work on future data.” Such algorithms are useful for creating an effective SPAM filter, for example, or for detecting fraudulent bank transactions.
This can be done on a single computer, with all the data in one place. Machine learning typically uses many processors, each handling a little bit of the problem. But when the problem becomes too large for a single machine, a consensus optimization approach might work better, in which the data set is chopped into bits and distributed across 1,000 “agents” that analyze their bit of data and each produce a model based on the data they have processed. The key is to require a critical condition to be met: although each agent’s model can be different, all the models must agree in the end — hence the term “consensus algorithms.”
The process by which 1,000 individual agents arrive at a consensus model is similar in concept to the Mechanical Turk crowd-sourcing methodology employed by Amazon — with a twist. With the Mechanical Turk, a person or a business can post a simple task, such as determining which photographs contain a cat, and ask the crowd to complete the task in exchange for gift certificates that can be redeemed for Amazon products, or for cash awards that can be transferred to a personal bank account. It may seem trivial to the human user, but the program learns from this feedback, aggregating all the individual responses into its working model, so it can make better predictions in the future.
In Boyd’s system, the process is iterative, creating a feedback loop. The initial consensus is shared with all the agents, which update their models in light of the new information and reach a second consensus, and so on. The process repeats until all the agents agree. Using this kind of distributed optimization approach significantly cuts down on how much data needs to be transferred at any one time.
Pages: 1 2 View All