The Role of Hive Partitions

Without partitions, Hive performs a full table scan for every query. With a small amount of data a full scan is tolerable, but with a large amount, such as several years' worth, every query has to scan all of it, which wastes both time and cluster resources. This is where partitioning proves its value. When designing a Hive table that holds several years of data, time can be chosen as the partition field; how fine the time granularity should be depends on business requirements. Partitioning then greatly narrows the range of data a query touches. For example, if the partition field is in units of days, a query for March 2020 only needs to restrict the partition field to 2020-03-01 ~ 2020-03-31. Hive uses that condition to locate the March 2020 data directly and processes it according to the query's logic, without scanning the rest of the several years of data.
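As a minimal sketch of the idea above, the HQL below creates a partitioned table and queries one month of it. The table name loan_events and its columns are hypothetical and chosen only for illustration; the partition fields year, month, day, and hour follow the ones used later in this article, with day holding the full date as a string.

```sql
-- Hypothetical table: the name and the data columns are illustrative assumptions.
CREATE TABLE IF NOT EXISTS loan_events (
  event_id STRING,
  user_id  STRING,
  amount   DOUBLE
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
STORED AS ORC;

-- The filter is on a partition field, so Hive only reads the partitions
-- whose day value falls in March 2020 instead of scanning the whole table.
SELECT user_id, SUM(amount) AS total_amount
FROM loan_events
WHERE day BETWEEN '2020-03-01' AND '2020-03-31'
GROUP BY user_id;
```

Running EXPLAIN on the query is one way to check which partitions Hive actually plans to read.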

Difference:

1. Static partitioning requires the partition value to be specified by hand; the value is not taken from the source data.

2. Dynamic partitioning uses a field of the source data as the partition value. Static and dynamic partitions can be mixed in one statement, but a dynamic partition cannot be the parent of a static one: the static partitions must be the upper levels, and the dynamic partitions are created beneath each of them.

As this shows, the main purpose of Hive partitioning is to narrow the range of data a query has to read, which improves query speed and performance.

Hive's static partitioning means manually setting the partition's value to a fixed value. This works well when inserting into a small number of partitions.

In the clause partition(year="2020", month="04", day="2020-04-10", hour="22"), the year, month, day, and hour are each given a specific value by hand, which is why such a partition is called a static partition.
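A static-partition insert using that clause might look like the sketch below. Both tables are assumptions made for illustration: loan_events is taken to be partitioned by (year, month, day, hour), and loan_events_staging is taken to be an unpartitioned source table with an event_time timestamp column.

```sql
-- Hypothetical tables and columns; only the PARTITION clause comes from the article.
-- Every partition value is written out by hand, so this is a static partition insert.
INSERT OVERWRITE TABLE loan_events
PARTITION (year='2020', month='04', day='2020-04-10', hour='22')
SELECT event_id, user_id, amount
FROM loan_events_staging
WHERE event_time >= '2020-04-10 22:00:00'
  AND event_time <  '2020-04-10 23:00:00';
```

Because the target partition is fixed in the statement, the SELECT list contains only the data columns, and the WHERE clause has to pick the rows that belong to that hour.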

Hive's dynamic partitioning replaces the fixed values of a static partition with values that are determined at run time.

In the related HQL, the clause partition(year, month, day, hour) lists the partition fields without fixed values, so year, month, day, and hour change with the data being inserted. Nothing has to be specified by hand, which is very convenient when inserting into a large number of partitions. The number of partitions still has to be weighed against business needs, though. Every partition costs IO resources: the more partitions there are, the more IO is consumed, and the more query time and performance suffer.
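A fully dynamic insert could be sketched as follows. The tables loan_events (partitioned by year, month, day, hour) and loan_events_staging (with an event_time timestamp column) are hypothetical. In Hive's syntax the dynamic partition fields are listed by name only, and their values are taken by position from the last columns of the SELECT list.

```sql
-- All four partitions are dynamic, so dynamic partitioning must be enabled
-- and the mode must be nonstrict.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical tables and columns, for illustration only.
INSERT OVERWRITE TABLE loan_events
PARTITION (year, month, day, hour)
SELECT event_id,
       user_id,
       amount,
       date_format(event_time, 'yyyy'),        -- value for year
       date_format(event_time, 'MM'),          -- value for month
       date_format(event_time, 'yyyy-MM-dd'),  -- value for day
       date_format(event_time, 'HH')           -- value for hour
FROM loan_events_staging;
```

The last four SELECT expressions must appear in the same order as the fields in the PARTITION clause, since Hive matches them by position rather than by name.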

When creating dynamic partitions, automatic partition insertion often fails. Log analysis usually points to one of three causes: dynamic partitioning is not enabled, strict mode rejects the insert, or the default limit on the number of partitions is too low. The following parameters related to dynamic partitioning are worth understanding.

-- Hive default configuration values

-- Enables or disables dynamic partitioning (false before Hive 0.9.0; later releases default to true)

hive.exec.dynamic.partition=false;

-- In strict mode at least one partition value must be specified statically; set to nonstrict to allow all partitions to be dynamic

hive.exec.dynamic.partition.mode=strict;

-- Maximum number of dynamic partitions that a single mapper or reducer may create; exceeding it raises an error

hive.exec.max.dynamic.partitions.pernode=100;

-- Maximum total number of dynamic partitions that a single SQL statement may create; exceeding it raises an error

hive.exec.max.dynamic.partitions=1000;

-- Maximum number of files that can be created globally, tracked by a Hadoop counter; exceeding it raises an error

hive.exec.max.created.files=100000;
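To address the three failure causes mentioned above, these parameters can be overridden for the current session before running the insert. The limit values below are arbitrary examples, not recommendations; suitable numbers depend on how many partitions and files the job really produces.

```sql
-- Cause 1: dynamic partitioning is not enabled.
SET hive.exec.dynamic.partition=true;

-- Cause 2: strict mode rejects an insert in which every partition is dynamic.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Cause 3: the default limits are too low for the job (example values only).
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.created.files=200000;
```

Settings made with SET apply only to the current session, so they do not change the cluster-wide defaults.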

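When you want to insert partitioned data for a particular period of a day, consider combining static and dynamic partitioning: fix the upper-level partitions, for example the day, as static values, and let the lower-level one, for example the hour, be filled in dynamically.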
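One possible way to write such a mixed insert is sketched below. The tables loan_events (partitioned by year, month, day, hour) and loan_events_staging (with an event_time timestamp column) are hypothetical. The static partitions come first in the PARTITION clause, and only hour is left dynamic.

```sql
-- Dynamic partitioning must be enabled. Strict mode is acceptable here,
-- because at least one partition value is given statically.
SET hive.exec.dynamic.partition=true;

-- Hypothetical tables and columns, for illustration only.
INSERT OVERWRITE TABLE loan_events
PARTITION (year='2020', month='04', day='2020-04-10', hour)
SELECT event_id,
       user_id,
       amount,
       date_format(event_time, 'HH')   -- value for the dynamic hour partition
FROM loan_events_staging
WHERE event_time >= '2020-04-10 00:00:00'
  AND event_time <  '2020-04-11 00:00:00';
```

Hive then creates one hour partition under day=2020-04-10 for each distinct hour value found in the selected rows.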