What does hive not in achieve?

Currently, hive does not support the syntax of including query clauses in in or not in, so it can only be realized through left join.

Suppose there is a login table login(login record on the same day, only one uid), and a user registration table regusers(registered users on the same day, only one uid in the field), both tables contain one field, uid.

in query

If you want to query the registered users who logged in on the same day, you need to use the in query, and you need to use the in query to find the registered users who logged in on the same day. hive sql as follows:

select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is not null

If the login table and the regusers tables are partitioned by day and the field is dt, then the hive sql to query the registered users who logged in on January 1, 2013 is as follows:

select login.uid from login day_login left outer join

(select uid from regusers where dt='20130101') day_regusers

on day_login.uid=day_regusers.uid where day_login.dt='20130101' and day_regusers.uid is not null

not in query

If you want to query the old users who logged in on the day (assuming here that the non-day_regusers are the old users), you need to use the not in query, hive sql as follows:

select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is null;

If the login table and the regusers table are partitioned by day, and the field is dt, then to query the old users who logged in on January 1, 2013, the hive sql is as follows:

select login.uid from login day_login left outer join

(select uid from regusers where dt='20130101') day_regusers

on day_login.uid=day_ regusers.uid where day_login.dt='20130101' and day_regusers.uid is null;

Hive join optimization

======================================== ================

Because the business scenarios and underlying technical architectures of hive and traditional relational databases are very different, some of the skills from the traditional database world may no longer be relevant in Hive. About

hive optimization and principles, application of the article, the previous also introduced a number of one after another, but most of them are biased towards the theoretical level, this article introduces an example, from the example step by step to deepen the hive

tuning knowledge and awareness.

1, requirements

Requirements I made a simplified, very simple, two tables to do a join, ask for the specified city, the daily pv, with the traditional RDBMS SQL written out like this:

This is the first time that I've seen this. >7

SELECT t.statdate,

c.cname,

count(t.cookieid)

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON (t.area1= c.cname

OR t.area2 = c.cname

OR t.area3 = c.cname)

WHERE t.statdate>='20140818' AND t.statdate<='20140824'

AND platform='pc'

GROUP BY t.statdate,

c.cname;

How about it? You have no problem reading the requirements based on SQL, right?

2, non-equivalent join problem

Then paste this SQL into hive to execute, then you will find the error:

FAILED: SemanticException [Error 10019]: Line 5:32 OR not supported in JOIN currently 'cname'

This is because hive is limited by the MapReduce algorithmic model, which only supports equi-joins. To realize the non-equivalent joins described above, you can use the full Cartesian product

SELECT t .statdate,

c.cname,

count(t.cookieid)

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t

WHERE t.statdate>= '20140818'

AND t.statdate<='20140824'

AND platform='pc'

AND (t.area1= c.cname

OR t.area2 = c.cname

OR t. area3 = c.cname)

GROUP BY t.statdate,

c.cname;

Then take this statement and execute it.

3. Optimization: reduce side join VS Cartesian product

If you do take this statement and execute it on Hive, and then it just so happens that you have a table that is still very large, then congratulations. The cluster administrator will probably come after you.

A friendly reminder that statements like the Cartesian product should be used with caution in Hive, as the m * n mapping results in big data scenarios are well known. Hive provides an environment variable: hive.mapred.mode=strict; to prevent the Cartesian product from being executed:

FAILED: SemanticException [Error 10052]: In strict mode, cartesian product is not allowed. If you really want to perform the operation, set hive.mapred.mode=nonstrict

From the observation in 2, we know that we follow the join with the on

This is the reduce side join. condition, we are doing a reduce side join, if you follow with where you are doing a Cartesian product, but here we have a single sql

and we can't do a reduce side join, is there any other way?

4, rewrite the non-equivalent join: union all

Since non-equivalent join is not allowed, let's change the idea, multiple sub-queries union all, and then summarize:

21<

SELECT dt,

name,

count(cid)

< p>FROM

(SELECT t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t. area1 =c.cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT t .statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area2 = c.cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT t.statdate dt,

c .cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area3 =c.cname

WHERE t.statdate>=' 20140818'

AND t.statdate<='20140824'

AND platform='pc') tmp_trackflow

GROUP BY dt,

name;

5. OPTIMIZATION: map side join

The above statement is a reduce side join, and we know from our requirements and business that tmpdb.city is a dictionary table with a very small amount of data, so we can try to rewrite the above statement as a mapjoin:

19<

SELECT dt,

name,

count(cid)

FROM

(SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area1 =c.cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area2 =c.cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area3 =c.cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc') tmp_trackflow

GROUP BY dt,

name;

6. Optimization without limits: turn on parallel and control the number of reduces

When the above statement is executed, you can see the execution plan and status information, as well as the combination of your

explain SQL....

...

STAGE DEPENDENCIES:

Stage-11 is a root stage

Stage-1 depends on stages: Stage-11

Stage-2 depends on stages: Stage-1

Stage-3 depends on stages: Stage-2, Stage-6, Stage-9

Stage-12 is a root stage

Stage-5 depends on stages: Stage-12

Stage-6 depends on stages: Stage-5

Stage-13 is a root stage

Stage-8 depends on stages: Stage-13

Stage-9 depends on stages: Stage-8

< p> Stage-0 is a root stage

...

We prefix the SQL with the following environment variable options:

set mapred.reduce.tasks=60;

set hive.exec.parallel=true;

Let the execution plans for Stage-11, Stage-12, and Stage-13 execute in parallel and keep the number of reduce tasks under control.

The complete statement is as follows:

hive -e"

SET mapred.reduce.tasks=60;

SET hive.exec.parallel=TRUE;

SELECT dt,

name,

count(cid)

FROM

(SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area1 = c. cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area2 = c. cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc'

UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,

c.cname name,

t.cookieid cid

FROM tmpdb.city c

JOIN ecdata.ext_trackflow t ON t.area3 = c. cname

WHERE t.statdate>='20140818'

AND t.statdate<='20140824'

AND platform='pc') tmp_trackflow

GROUP BY dt

name;

" > a1.txt

The final optimization is that the statement in 2 didn't produce results for three hours. 5 is about 8 times faster than 4, 6 is about 2 times faster than 5, and the final result is 10 minutes.

What about Nanjing University of Posts and Telecommunications Yancheng Big Data Research Institute Co.

How to calculate the new college entrance examination assignment system What is the recommended combination of selected subjects

Is it possible for Big Data staff to find out personal ID numbers

Tongnan intelligent pipe gallery big data

Why can't you borrow after paying back the installment?

How to improve the conversion rate and transaction rate of Amazon products?

Which bank is the abbreviation of Microtronic Bank?

Hefei credit office data collection to confirm what it means

Which is better for Zhangye operator big data?

Harbin Institute of Vocational and Technical College to open which majors, Harbin Institute of Vocational and Technical College enrollment list of majors summary