How to optimize the running efficiency of a MapReduce job
Optimizing a MapReduce program mainly comes down to two things: computing performance and I/O operations.

In practice, this covers the following aspects:

1. Task scheduling

a. Try to select idle nodes for computation

b. Try to assign the task to the machine where the InputSplit is located
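
The two rules above can be sketched as a small selection routine. This is only an illustration of the idea, not Hadoop's actual scheduler: the Node class, the pick method, and the host names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class LocalityScheduler {
    // Hypothetical model of a worker node; not a Hadoop class.
    static final class Node {
        final String host;
        final boolean idle;

        Node(String host, boolean idle) {
            this.host = host;
            this.idle = idle;
        }
    }

    // Prefer an idle node that stores the InputSplit locally;
    // otherwise fall back to any idle node; otherwise return empty.
    static Optional<Node> pick(List<Node> nodes, Set<String> splitHosts) {
        Optional<Node> local = nodes.stream()
                .filter(n -> n.idle && splitHosts.contains(n.host))
                .findFirst();
        if (local.isPresent()) {
            return local;
        }
        return nodes.stream().filter(n -> n.idle).findFirst();
    }
}
```

Running the task where the data already sits avoids copying the split over the network, which is why the local node is tried first.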

2. Data preprocessing and InputSplit size

Try to process a small number of large files rather than a large number of small files. To get there, you can preprocess the data once before the job runs and merge the small files.

If you would rather not merge the data yourself, you can use CombineFileInputFormat, which packs several files into a single InputSplit. Refer to its API documentation for specific usage.
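
To see why packing helps, here is a minimal sketch that greedily groups small files into splits up to a size limit. It is not the algorithm CombineFileInputFormat really uses (that class also takes node and rack locality into account); SplitPacker and its parameters are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPacker {
    // Greedily group file sizes (in bytes) into splits no larger than maxSplitBytes.
    // A file larger than the limit ends up in a split of its own.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

Each resulting split would be handled by one Map task, so fewer splits means fewer task start-ups for the same amount of data.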

3. Number of Map and Reduce tasks

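The number of Map tasks should be chosen with the running time of each Map in mind: if individual Maps finish in only a few seconds, larger InputSplits and fewer Maps are usually more efficient. The number of Reduce tasks is usually derived from the cluster's total Reduce capacity (number of nodes × Reduce slots or containers per node), multiplied by 0.95 or 1.75. With 0.95, all Reduces can start as soon as the Maps finish; with 1.75, faster nodes run a second wave of Reduces, which balances the load better.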
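
The arithmetic behind the 0.95 and 1.75 factors can be sketched as follows, assuming the factor is applied to the cluster's total Reduce capacity as Hadoop's documentation describes. ReducerCount and its parameter names are hypothetical; in a real job the result would be passed to job.setNumReduceTasks(...).

```java
final class ReducerCount {
    // factor is typically 0.95 (a single wave of Reduces)
    // or 1.75 (roughly two waves, for better load balancing).
    static int suggest(int nodes, int reduceSlotsPerNode, double factor) {
        int capacity = nodes * reduceSlotsPerNode;
        // Never suggest fewer than one Reduce task.
        return Math.max(1, (int) Math.floor(capacity * factor));
    }
}
```

For a cluster of 10 nodes with 4 Reduce slots each, this gives 38 Reduce tasks with the 0.95 factor and 70 with the 1.75 factor.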

4. Use a Combiner

A Combiner merges Map output locally before it is sent to the Reduce side, which can greatly reduce network traffic. Refer to the API documentation for details.
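
The effect of local merging can be illustrated with a word count in plain Java. This is a simulation of the idea rather than Hadoop code, and CombinerSketch is a hypothetical name. In a real job the Combiner is registered with job.setCombinerClass(...), and its operation should be commutative and associative, because the framework may apply it zero, one, or several times.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombinerSketch {
    // Without a Combiner, every word becomes one (word, 1) record
    // that has to travel to the Reduce side.
    static int recordsWithoutCombiner(List<String> words) {
        return words.size();
    }

    // With a Combiner, counts are summed per word on the Map side first,
    // so only one record per distinct word is sent.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }
}
```

For the Map output a, b, a, a, b, c this reduces six records to three: (a, 3), (b, 2), (c, 1).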

5. Compression

You can compress intermediate data (the Map output) to reduce network traffic, at the cost of some CPU time for compression and decompression.
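
As a rough illustration of the saving, the sketch below compresses repetitive key/value text with the JDK's gzip classes and restores it again. It does not use Hadoop; CompressionSketch is a hypothetical name, and gzip is chosen only because it ships with the JDK (real jobs often prefer a faster codec such as Snappy or LZ4). In Hadoop 2 and later, Map output compression is normally switched on through the mapreduce.map.output.compress and mapreduce.map.output.compress.codec properties.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressionSketch {
    // Compress a byte array with gzip.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Restore the original bytes from gzip-compressed data.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Intermediate Map output tends to contain many repeated keys, which is the kind of data that compresses well.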

6. Custom comparator

You can define a custom key type with its own comparison logic to control how keys are sorted, which makes more complex processing possible.
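
A minimal sketch of such comparison logic, written in plain Java: a composite key is ordered by user id, then by newest timestamp first. The Key class and its fields are hypothetical. In Hadoop, logic like this would typically live in the compareTo method of a WritableComparable key, or in a comparator registered with job.setSortComparatorClass(...).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CompositeKeySort {
    // Hypothetical composite key: a user id plus a timestamp.
    static final class Key {
        final String userId;
        final long timestamp;

        Key(String userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }
    }

    // Order by userId ascending, then by timestamp descending (newest first).
    static final Comparator<Key> BY_USER_THEN_NEWEST = (a, b) -> {
        int byUser = a.userId.compareTo(b.userId);
        if (byUser != 0) {
            return byUser;
        }
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Return a sorted copy, leaving the input list untouched.
    static List<Key> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(BY_USER_THEN_NEWEST);
        return copy;
    }
}
```

With keys ordered this way, each Reduce call sees a user's records newest first without having to sort them itself.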