Data!
We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes (ZB) in 2013 and forecast a tenfold growth by 2020, to 44 zettabytes.1 A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes (EB), one million petabytes (PB), or one billion terabytes (TB). That’s more than one disk drive for every person in the world.
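The unit arithmetic above can be checked with a short sketch. The helper names `convert` and `annual_growth_rate` are hypothetical, introduced only for illustration, and the growth figure assumes a constant yearly rate between 2013 and 2020, which the IDC estimate does not itself state.

```python
# Decimal (SI) byte units as used in the text: 1 ZB = 10**21 bytes.
BYTES_PER_UNIT = {
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two of the units above (hypothetical helper)."""
    return value * BYTES_PER_UNIT[from_unit] / BYTES_PER_UNIT[to_unit]


def annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` over `years` years.

    Assumption: growth is spread evenly across the period.
    """
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    for unit in ("EB", "PB", "TB"):
        print(f"1 ZB = {convert(1, 'ZB', unit):,.0f} {unit}")
    rate = annual_growth_rate(4.4, 44, 2020 - 2013)
    print(f"Implied growth: about {rate:.0%} per year")
```

Under that even-growth assumption, a tenfold increase over seven years works out to a little under 40 percent per year.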
This flood of data is coming from many sources. Consider the following:2
- The New York Stock Exchange generates about 4 to 5 terabytes of data per day.
- Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
- Ancestry.com, the genealogy site, stores around 10 petabytes of data.
- The Internet Archive stores around 18.5 petabytes of data.
- The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes of data per year.
So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines) or in scientific or financial institutions, isn’t it? Does the advent of big data affect smaller organizations or individuals?
I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer and took photographs throughout his adult life. His entire corpus of medium-format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos.
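The text does not say how many years the grandfather’s archive covers, but the three figures it does give (10 gigabytes in total, 5 gigabytes in 2008, and a 35-fold rate) imply a span. A minimal sketch, using a hypothetical helper `implied_span_years`:

```python
def implied_span_years(archive_gb, recent_gb_per_year, rate_ratio):
    """Years the older archive must span for the stated rate ratio to hold.

    The older rate is recent_gb_per_year / rate_ratio, so the span is
    archive_gb divided by that rate. Hypothetical helper for illustration.
    """
    return archive_gb * rate_ratio / recent_gb_per_year


if __name__ == "__main__":
    years = implied_span_years(archive_gb=10, recent_gb_per_year=5, rate_ratio=35)
    print(f"Implied span of the older archive: {years:.0f} years")
```

The result, 70 years, is consistent with photographs taken “throughout his adult life.”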
More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment in which an individual’s interactions (phone calls, emails, documents) were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of 1 gigabyte per month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.
The trend is for every individual’s data footprint to grow, but perhaps more significantly, the amount of data generated by machines as a part of the Internet of Things will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, and retail transactions all contribute to the growing mountain of data.
The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.
Initiatives such as Public Data Sets on Amazon Web Services and Infochimps.org exist to foster the “information commons,” where data can be freely (or for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.
Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kinds of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.
It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms, often they can be beaten simply by having more data (and a less sophisticated algorithm).3
The good news is that big data is here. The bad news is that we are struggling to store and analyze it.
1. These statistics were reported in a study entitled “The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things.” ↩
2. All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At NYSE, The Data Deluge Overwhelms Traditional Databases”; Rich Miller, “Facebook Builds Exabyte Data Centers for Cold Storage”; Ancestry.com’s “Company Facts”; Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid project’s welcome page. ↩
3. The quote is from Anand Rajaraman’s blog post “More data usually beats better algorithms,” in which he writes about the Netflix Challenge. Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009. ↩