Learn MapReduce with Playing Cards

https://www.youtube.com/watch?v=bcjSe0xCHbE&ab_channel=JesseAnderson

The key difference of map-reduce on single node vs multiple node is the shuffling magic.

On single node is simple, each type of mapped data are reduced separately

On a node cluster, each node have partitioned groups of data from mapping phase. These data are reorganized between nodes, so each node only need to care about certain groups of data, not all the data. And continues with reduce phase.

this also allows potential for replication, in case some nodes fails
In MapReduce world, we call each group as data with the same key.

Hadoop’s MapReduce on HDFS

breaks large files into smaller chuncks(blocks). think 64kb, or 128kb
various nodes can operate on different chunks of the same file at the same time. Data is processed and mapped based on keys
The magic: When finished, all the data is combined based on the key, and reduced.
Far more efficient than one node operating on a single file
scales linearly to the number of nodes