Top 10 Ways to Optimize Hive

Used well, Hive can dig a great deal of information out of data. Anyone who has worked with Hive has probably had a similar experience: a whole day passes and only a handful of Hive jobs have actually run. On very large or skewed data sets Hive's performance tends to be mediocre, which is why alternatives such as Presto and Spark SQL have appeared. The focus here is how to optimize Hive, illustrated with examples.

I. Table join optimization
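One common join optimization, sketched here with hypothetical table and column names, is to let Hive turn a join against a small dimension table into a map-side join so that the shuffle and reduce phase is avoided:

set hive.auto.convert.join=true;                -- let Hive convert joins with small tables automatically
set hive.mapjoin.smalltable.filesize=25000000;  -- what counts as a "small" table, in bytes (about 25M)

-- Or force it for one query with a hint (fact_orders and dim_users are hypothetical tables):
select /*+ MAPJOIN(d) */ f.user_id, d.city
from fact_orders f
join dim_users d on f.user_id = d.user_id;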

II. Replace union all with insert into

If a union all has more than 2 branches, or each branch handles a large amount of data, it should be split into multiple insert into statements; in actual testing this improved execution time by about 50%. The example reference is as follows:
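A sketch of the pattern, using hypothetical table names (source_a, source_b, source_c and target_table):

insert overwrite table target_table
select * from (
  select * from source_a
  union all
  select * from source_b
  union all
  select * from source_c
) t;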

It can be rewritten as follows:
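Continuing the same hypothetical tables, as separate insert into statements:

insert into table target_table select * from source_a;
insert into table target_table select * from source_b;
insert into table target_table select * from source_c;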

III. order by & sort by

order by : sorts the query result globally, which is time-consuming; in strict mode it also requires a LIMIT clause, so either add one or set hive.mapred.mode=nonstrict.

sort by : sorts data within each reducer only; the result is not globally ordered, but it is much more efficient.
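A minimal contrast of the two, on a hypothetical orders table:

-- Global order, handled by a single reducer; without a LIMIT this needs nonstrict mode.
select user_id, amount from orders order by amount desc;

-- Each reducer sorts only its own output; pair it with distribute by to control
-- which rows go to which reducer.
select user_id, amount from orders distribute by user_id sort by amount desc;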

IV. transform+python

Logic that is inconvenient to express in HiveQL can be implemented in Python and embedded in the Hive query through the transform clause, with the results written back to a Hive table. Example syntax is as follows:
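A minimal sketch, in which the script name (my_transform.py), its path, and the table and column names are all hypothetical; the script reads tab-separated rows from stdin and writes tab-separated rows to stdout:

add file /path/to/my_transform.py;

insert overwrite table target_table
select transform(user_id, event_time)
       using 'python my_transform.py'
       as (user_id, event_date)
from source_table;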

If there are other dependent resources besides the Python script, you can use ADD ARCHIVE.

V. Limit statement for fast results

In general, a LIMIT statement still executes the entire query and only then returns a subset of the results. There is a configuration property that can be turned on to avoid this by sampling the data source instead.
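A sketch of the relevant settings (the values shown are illustrative and defaults vary between Hive versions):

set hive.limit.optimize.enable=true;      -- sample the source data when LIMIT is used
set hive.limit.row.max.size=100000;       -- how many rows the sample should cover
set hive.limit.optimize.limit.file=10;    -- maximum number of files to sample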

Disadvantage: it is possible that some useful input data will never be processed.

VI. Local mode

For small data sets, the overhead of triggering and scheduling a distributed job for a query can exceed the time spent on the actual work, so Hive can handle all the tasks on a single machine (or, in some cases, in a single process) by using local mode.

Hive can be made to automatically start this optimization when appropriate by setting the value of the property hive.exec.mode.local.auto to true, or this configuration can be written in the $HOME/.hiverc file.

Local mode can only really be used when a job meets the following conditions:
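Roughly: the total input must be small, the number of map tasks must be low, and at most one reduce task may be needed. A sketch of the properties involved (names and defaults can differ slightly between Hive versions):

set hive.exec.mode.local.auto=true;                      -- let Hive decide when to run locally
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- total input size must be below this (128M here)
set hive.exec.mode.local.auto.input.files.max=4;         -- number of input files must be below this
-- In addition, the job must require no more than one reduce task.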

VII. Parallel execution

Hive turns a query into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, and so on. By default only one stage executes at a time, but stages that do not depend on each other can be run in parallel.

It will be more resource intensive.
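A sketch of the properties that control this:

set hive.exec.parallel=true;              -- allow independent stages to run at the same time
set hive.exec.parallel.thread.number=8;   -- how many stages may run concurrently (8 is the usual default)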

VIII. Adjusting the number of mappers and reducers

Suppose the input directory contains a single file a of size 780M. Hadoop splits it into 7 blocks (6 blocks of 128M and 1 block of 12M), which produces 7 map tasks.

Suppose the input directory contains 3 files a, b and c of sizes 10M, 20M and 130M. Hadoop splits them into 4 blocks (10M, 20M, 128M and 2M), which produces 4 map tasks.

That is, a file larger than the block size (128M) is split, while a file smaller than the block size is treated as a single block.

map execution time: time for map task startup and initialization + time for logical processing.

Reduce the number of maps

If there is a large number of small files (smaller than 128M), each one produces its own map task. The way to handle this is to merge them before the maps run, as in the sketch below:
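A sketch of a typical merge configuration (the 100M value is illustrative):

set mapred.max.split.size=100000000;           -- maximum split size, about 100M
set mapred.min.split.size.per.node=100000000;  -- minimum split size per node
set mapred.min.split.size.per.rack=100000000;  -- minimum split size per rack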

The first three parameters determine the size of the merged splits: files larger than the block size (128M) are split at 128M; pieces larger than 100M but smaller than 128M are split at 100M; and everything smaller than 100M (small files plus the remainders left over from splitting large files) is merged together.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; - Merge small files before execution.

Increasing the number of maps

When the input files are large, the task logic is complex, and the maps run slowly, consider increasing the number of maps so that each map processes less data and the whole job finishes faster.
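One common way to do this, sketched with an illustrative value, is to lower the maximum split size so the input is carved into more splits and therefore more maps:

set mapred.max.split.size=50000000;   -- about 50M per split instead of 128M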

Adjusting the number of reducers

set mapred.reduce.tasks=?

If it is not set explicitly, Hive estimates the number of reducers from the total size of the input: number of reducers = total input size / bytes per reducer (capped at a configurable maximum).
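A sketch of the properties behind that estimate (defaults differ between Hive versions; the values here are illustrative):

set hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes handled by one reducer
set hive.exec.reducers.max=1009;                     -- upper bound on the estimated reducer count
set mapred.reduce.tasks=10;                          -- or simply fix the reducer count explicitly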

IX. Strict mode
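Strict mode (hive.mapred.mode=strict) blocks a few query patterns that are risky on large tables; a sketch of what it rejects:

set hive.mapred.mode=strict;
-- With strict mode on, Hive refuses:
--   queries on a partitioned table without a partition filter in the where clause;
--   order by without a limit clause;
--   joins that would produce a Cartesian product (no join condition).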

X. Data skew

Symptoms:

The job progress stays at 99% (or 100%) for a long time, and the task monitoring page shows that only a small number (one or a few) of reduce subtasks are still unfinished, because the amount of data they process is far larger than that of the other reduces. Typically the record count of a single reduce differs from the average by a factor of 3 or more, and its running time is well above the average.

Cause:

Solution: Parameter tuning
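A sketch of the usual skew-related parameters:

set hive.map.aggr=true;             -- partial aggregation on the map side (combiner-like)
set hive.groupby.skewindata=true;   -- extra MR job that first spreads group-by keys randomly
set hive.optimize.skewjoin=true;    -- handle skewed join keys in a separate job
set hive.skewjoin.key=100000;       -- row-count threshold for treating a join key as skewed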