Suppose there is a login table login(login record on the same day, only one uid), and a user registration table regusers(registered users on the same day, only one uid in the field), both tables contain one field, uid.
in query
If you want to query the registered users who logged in on the same day, you need to use the in query, and you need to use the in query to find the registered users who logged in on the same day. hive sql as follows:
select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is not null
If the login table and the regusers tables are partitioned by day and the field is dt, then the hive sql to query the registered users who logged in on January 1, 2013 is as follows:
select login.uid from login day_login left outer join
(select uid from regusers where dt='20130101') day_regusers
on day_login.uid=day_regusers.uid where day_login.dt='20130101' and day_regusers.uid is not null
not in query
If you want to query the old users who logged in on the day (assuming here that the non-day_regusers are the old users), you need to use the not in query, hive sql as follows:
select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is null;
If the login table and the regusers table are partitioned by day, and the field is dt, then to query the old users who logged in on January 1, 2013, the hive sql is as follows:
select login.uid from login day_login left outer join
(select uid from regusers where dt='20130101') day_regusers
on day_login.uid=day_ regusers.uid where day_login.dt='20130101' and day_regusers.uid is null;
Hive join optimization
======================================== ================
Because the business scenarios and underlying technical architectures of hive and traditional relational databases are very different, some of the skills from the traditional database world may no longer be relevant in Hive. About
hive optimization and principles, application of the article, the previous also introduced a number of one after another, but most of them are biased towards the theoretical level, this article introduces an example, from the example step by step to deepen the hive
tuning knowledge and awareness.
1, requirements
Requirements I made a simplified, very simple, two tables to do a join, ask for the specified city, the daily pv, with the traditional RDBMS SQL written out like this:
1
2
3
4
5
6
7
6
This is the first time that I've seen this. >7
8
9
10
11
SELECT t.statdate,
c.cname,
count(t.cookieid)
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON (t.area1= c.cname
OR t.area2 = c.cname
OR t.area3 = c.cname)
WHERE t.statdate>='20140818' AND t.statdate<='20140824'
AND platform='pc'
GROUP BY t.statdate,
c.cname;
How about it? You have no problem reading the requirements based on SQL, right?
2, non-equivalent join problem
Then paste this SQL into hive to execute, then you will find the error:
1
FAILED: SemanticException [Error 10019]: Line 5:32 OR not supported in JOIN currently 'cname'
This is because hive is limited by the MapReduce algorithmic model, which only supports equi-joins. To realize the non-equivalent joins described above, you can use the full Cartesian product
1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT t .statdate,
c.cname,
count(t.cookieid)
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t
WHERE t.statdate>= '20140818'
AND t.statdate<='20140824'
AND platform='pc'
AND (t.area1= c.cname
OR t.area2 = c.cname
OR t. area3 = c.cname)
GROUP BY t.statdate,
c.cname;
Then take this statement and execute it.
3. Optimization: reduce side join VS Cartesian product
If you do take this statement and execute it on Hive, and then it just so happens that you have a table that is still very large, then congratulations. The cluster administrator will probably come after you.
A friendly reminder that statements like the Cartesian product should be used with caution in Hive, as the m * n mapping results in big data scenarios are well known. Hive provides an environment variable: hive.mapred.mode=strict; to prevent the Cartesian product from being executed:
1
FAILED: SemanticException [Error 10052]: In strict mode, cartesian product is not allowed. If you really want to perform the operation, set hive.mapred.mode=nonstrict
From the observation in 2, we know that we follow the join with the on
This is the reduce side join. condition, we are doing a reduce side join, if you follow with where you are doing a Cartesian product, but here we have a single sql
and we can't do a reduce side join, is there any other way?
4, rewrite the non-equivalent join: union all
Since non-equivalent join is not allowed, let's change the idea, multiple sub-queries union all, and then summarize:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21<
22
23
24
25
26
27
28
29
30
SELECT dt,
name,
count(cid)
< p>FROM(SELECT t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t. area1 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT t .statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 = c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT t.statdate dt,
c .cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 =c.cname
WHERE t.statdate>=' 20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt,
name;
5. OPTIMIZATION: map side join
The above statement is a reduce side join, and we know from our requirements and business that tmpdb.city is a dictionary table with a very small amount of data, so we can try to rewrite the above statement as a mapjoin:
1
2
3
4<
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19<
20
21
22
23
24
25
26
27
28
29
30
SELECT dt,
name,
count(cid)
FROM
(SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area1 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt,
name;
6. Optimization without limits: turn on parallel and control the number of reduces
When the above statement is executed, you can see the execution plan and status information, as well as the combination of your
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
explain SQL....
...
STAGE DEPENDENCIES:
Stage-11 is a root stage
Stage-1 depends on stages: Stage-11
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2, Stage-6, Stage-9
Stage-12 is a root stage
Stage-5 depends on stages: Stage-12
Stage-6 depends on stages: Stage-5
Stage-13 is a root stage
Stage-8 depends on stages: Stage-13
Stage-9 depends on stages: Stage-8
Stage-9 depends on stages: Stage-8
< p> Stage-0 is a root stage...
We prefix the SQL with the following environment variable options:
1
2
set mapred.reduce.tasks=60;
set hive.exec.parallel=true;
Let the execution plans for Stage-11, Stage-12, and Stage-13 execute in parallel and keep the number of reduce tasks under control.
The complete statement is as follows:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
hive -e"
SET mapred.reduce.tasks=60;
SET hive.exec.parallel=TRUE;
SELECT dt,
name,
count(cid)
FROM
(SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area1 = c. cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 = c. cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 = c. cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt
name;
" > a1.txt
The final optimization is that the statement in 2 didn't produce results for three hours. 5 is about 8 times faster than 4, 6 is about 2 times faster than 5, and the final result is 10 minutes.