
Think of the SQL GROUP BY functionality as an example. You have a giant terabyte-scale text file that is a CSV of observations (say the raw US census results). You want to group people by some facets (say zipcode and household income) and compute some complex, arbitrary function for each group.

Your map function scans the rows and, for each one, outputs the kv pair (zipcode+","+income, csv line). All the csv lines in a group go to the same instance of the reduce function, where you can run any code you like (compute averages, do deep learning, etc.). The output is whatever you computed for each group.
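To make that concrete, here is a rough single-process sketch in Python. It is not how Hadoop or any real framework is written. The three-column layout (zipcode, income bracket, age), the sample rows, and the names map_fn, reduce_fn and run_mapreduce are all made up for illustration, and the "shuffle" is just an in-memory dict standing in for the framework routing keys to reducers.

```python
import csv
from collections import defaultdict
from statistics import mean

# Hypothetical rows: zipcode, household income bracket, age.
SAMPLE = """\
94110,high,34
94110,high,40
94110,low,22
10001,low,61
"""

def map_fn(line):
    """Emit one (key, value) pair: key is zipcode+","+income, value is the raw CSV line."""
    zipcode, income, _age = next(csv.reader([line]))
    yield zipcode + "," + income, line

def reduce_fn(key, lines):
    """Called once per group with every line for that key.
    Average age is only a placeholder; any per-group code could go here."""
    ages = [int(next(csv.reader([line]))[2]) for line in lines]
    return key, mean(ages)

def run_mapreduce(lines):
    # "Shuffle": collect values under their key so a single reduce call sees the whole group.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

results = run_mapreduce(SAMPLE.splitlines())
```

In a real cluster the mappers and reducers run on different machines and the framework does the grouping, but the contract is the same: the mapper only emits pairs, and the reducer only sees the pairs for its key.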

This is a pretty simple example, but it does demonstrate where the power of MapReduce comes from: the map and reduce steps are arbitrary functions, and the only communication allowed between them is restricted and one-way, from the mapper to the reducer. It should also help you understand the glib "you can implement SQL on mapreduce" comment below, which is what Apache Hive does.


