Querying All Your Data
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset, or at least a good portion of it, can be processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words:
This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and furthermore, they were able to use what they had learned to improve the service for their customers.