How does Hadoop process records split across block boundaries?

The splits are computed on the client side by InputFormat.getSplits, so a look at FileInputFormat gives the following information:
- For each input file, get the file length and the block size, and compute the split size as max(minSize, min(maxSize, blockSize)), where maxSize corresponds to mapred.max.split.size and minSize to mapred.min.split.size.
- Divide the file into different FileSplits based on the split size calculated above. What's important here is that each FileSplit is initialized with a start parameter corresponding to the byte offset in the input file; the division is done purely on byte offsets, so there is still no handling of record (line) boundaries at this point (see the sketch after this list).
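To make the two steps above concrete, here is a minimal, self-contained Java sketch of the same logic. It is not the actual Hadoop source: the real FileInputFormat.getSplits also applies a slop factor (the last chunk may grow to roughly 1.1x splitSize before a new split is created) and records which hosts hold each block, both of which are omitted here. The max(minSize, min(maxSize, blockSize)) formula does match FileInputFormat.computeSplitSize; everything else is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // Mirrors FileInputFormat.computeSplitSize:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Divide a file of 'length' bytes into (start, length) byte ranges.
    // Note: pure byte arithmetic -- nothing here looks at where records
    // (e.g. lines) begin or end.
    static List<long[]> getSplits(long length, long blockSize,
                                  long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        List<long[]> splits = new ArrayList<>();
        long remaining = length;
        while (remaining > 0) {
            long start = length - remaining;
            long size = Math.min(splitSize, remaining);
            splits.add(new long[] { start, size });
            remaining -= size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. a 300 MB file with 128 MB blocks and default min/max sizes
        long mb = 1024L * 1024L;
        for (long[] s : getSplits(300 * mb, 128 * mb, 1L, Long.MAX_VALUE)) {
            System.out.println("split start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

For the 300 MB example this prints three splits: [0, 128 MB), [128 MB, 256 MB) and [256 MB, 300 MB). A line of text will often straddle one of those boundaries, which is exactly the situation the question asks about and which the RecordReader, not getSplits, has to resolve.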