Hive SQL statement execution order

The order of execution of sql statements in Hive is as follows:

from ... where ... join ... on ... select ... group by ... select ... having ... distinct ... order by ... limit ... union/union all

Let's analyze this with a sql statement:

The above sql statement can be executed successfully, let's look at the order in which it is executed in MR:

Map phase :

Reduce phase :

Map phase :

Reduce phase :

This is the right order for the execution. Reduce phase :

Is the above execution order correct or not, we can see it by explaining the execution plan, it is too much, let's look at it in stages.

We see that Stage-5 is the root, that is, Stage-5 is executed first, Stage-2 depends on Stage-5, and Stage-0 depends on Stage-2.

The table scanning operation is at ① in the figure, note that table b is scanned first, that is, the table behind the left join, and then filtering operation is carried out (at ② in the figure). In our sql statement, we are filtering table a, but Hive will automatically do the same for table b, which can reduce the amount of associated data.

First scan table a (①); then filter idno > '112233' (②); then do a left join with idno as the key (③); and then do an output operation after performing the join operation, which is three fields, including two fields of select and one field of group by (③); and then do a left join with idno (④); and then do a left join with idno (⑤). The output is three fields, including two fields of select and one field of group by (④); then group by operation, the grouping method is hash (⑤); and then sorting operation, forward sorting according to idno (⑥).

First group by operation, note that the grouping is mergepartial mergepartial mergepartial grouping (① in the figure); then select operation, this time there are only two fields in the output, the number of output rows is 30304 rows (② in the figure); next having filtering operation, filtering out the count_user>1 field, and then filtering out the count_user>1 field, and then filtering out the count_user>1 field. user>1 field, the number of output lines is 10101 (③); then limit to limit the number of output lines (④); ⑤ indicates whether the file is compressed or not, false is not compressed.

Limit the final output to 10 lines.

The analysis of the SQL execution plan above summarizes the following points:

What are the specialties of Guangzhou South College and how are they ranked?

On how banks guard against and control risks.

Samsung i8190 GPRS Internet problem

What specialties are there in new energy professional schools?

What is the most profitable software to do questionnaires?

The same starting line to start, digging medical, Internet giants hit which stops

What does big data do?

What platform is the digital economy that Zhu Shaoping is talking about

I want to ask why the horned frog will spit out all the food after I feed it (I don't know how to get it out anyway).

Can credit report check all online loans in your name?