The order of execution of sql statements in Hive is as follows:
from ... where ... join ... on ... select ... group by ... select ... having ... distinct ... order by ... limit ... union/union all
Let's analyze this with a sql statement:
The above sql statement can be executed successfully, let's look at the order in which it is executed in MR:
Map phase :
Reduce phase :
Map phase :
Reduce phase :
This is the right order for the execution. Reduce phase :
Is the above execution order correct or not, we can see it by explaining the execution plan, it is too much, let's look at it in stages.
We see that Stage-5 is the root, that is, Stage-5 is executed first, Stage-2 depends on Stage-5, and Stage-0 depends on Stage-2.
The table scanning operation is at ① in the figure, note that table b is scanned first, that is, the table behind the left join, and then filtering operation is carried out (at ② in the figure). In our sql statement, we are filtering table a, but Hive will automatically do the same for table b, which can reduce the amount of associated data.
First scan table a (①); then filter idno > '112233' (②); then do a left join with idno as the key (③); and then do an output operation after performing the join operation, which is three fields, including two fields of select and one field of group by (③); and then do a left join with idno (④); and then do a left join with idno (⑤). The output is three fields, including two fields of select and one field of group by (④); then group by operation, the grouping method is hash (⑤); and then sorting operation, forward sorting according to idno (⑥).
First group by operation, note that the grouping is mergepartial mergepartial mergepartial grouping (① in the figure); then select operation, this time there are only two fields in the output, the number of output rows is 30304 rows (② in the figure); next having filtering operation, filtering out the count_user>1 field, and then filtering out the count_user>1 field, and then filtering out the count_user>1 field. user>1 field, the number of output lines is 10101 (③); then limit to limit the number of output lines (④); ⑤ indicates whether the file is compressed or not, false is not compressed.
Limit the final output to 10 lines.
The analysis of the SQL execution plan above summarizes the following points: