I’ve recently started consulting as a Data Scientist, and wanted to share an interesting Hadoop issue. One of the people on my team was having trouble getting her code to run: the job kept failing because it ran out of space.
The code was expected to write only about 3 GB of output, yet it blew through her 30 GB quota. The systems guys upped her quota to 100 GB, and it still failed. As I helped her dig into the code, the cause became obvious.
The Pig script chose to use 135 reducers, and each reducer writes its own output file to HDFS. The Oracle Big Data Appliance (BDA) had been set up with a 256 MB block size, and while a block is being written the Name Node charges the quota for the worst case: a full block for every open file, times the replication factor. So the maximum space the job could use was:
(135 reducers) × (256 MB per block) × (replication factor of 3) ≈ 103 GB
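If you want to watch this happen, HDFS can report a directory’s space quota and remaining space (both figures already account for replication). The path below is just a placeholder for her home directory:

    # Show quota and usage for an HDFS directory (placeholder path).
    # Output columns: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA
    #                 DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    hadoop fs -count -q /user/jane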
So, the solution to making this work within the quota was to reduce the number of reducers, which in this case didn’t hurt the parallelism of the job; a rough sketch of the change is below. Writing the output with a smaller block size would also work, but wasn’t needed for this job.
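To give a flavor of what both fixes look like, here’s a sketch in Pig. The relation and field names are made up, but SET default_parallel and the PARALLEL clause are standard Pig, and SET can also pass a Hadoop property such as the block size down to the job (the exact property name depends on the Hadoop version):

    -- Cap the number of reducers, and therefore output files, for the whole script.
    SET default_parallel 20;

    -- Or set parallelism on a specific reduce-side operator
    -- (relation and field names here are hypothetical).
    grouped = GROUP events BY user_id PARALLEL 20;

    -- Alternative lever: write smaller blocks instead of fewer files.
    -- SET dfs.blocksize 134217728;   -- 128 MB on newer Hadoop; older versions use dfs.block.size

With 20 reducers, for example, the worst-case reservation becomes (20) × (256 MB) × (3) ≈ 15 GB, which fits comfortably inside even the original 30 GB quota.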
Since I’m working with Hadoop now, I’ll be posting more about it.