Big data is slow.
This is really a problem of finding a difference set (the rows in one table that have no match in the other), and SQL offers far fewer ways to find a difference set than to find an intersection.

Of the difference-set methods, filtering with the not in keyword is logically the easiest to understand, and it is what most people think of first. It works when the amount of data is small, but it has a major defect: it is extremely inefficient on large tables, to the point of being barely usable. In my tests on large tables, a not in statement often took several hours to return its results, and the most extreme case took more than a day! Until the results come back, the query appears to hang, which makes it look as if an empty set has been returned. That is not what is happening; the database engine simply has not finished the operation.

When a usable index exists, we can filter out the difference set between the two tables with a not exists clause, which is very efficient. Taking the statement from the question as an example, it can be rewritten as follows:

The original statement filters the difference set with not in, which performs extremely poorly on large tables:

select ipdz from ipdz_b where ipdz not in (select ipdz_d from zj_b);

Filtering the difference set with not exists instead returns results much faster, provided the large table has a usable index:

select b.ipdz from ipdz_b b where not exists (
select 1 from zj_b a where a.ipdz_d = b.ipdz);
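
To check that the two forms really return the same difference set, here is a minimal sketch that runs both against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for whatever engine you actually use. The column types, the NOT NULL constraints, the index name idx_zj_b_ipdz_d and the helper name difference_sets are my own assumptions, not part of the original question.

```python
import sqlite3


def difference_sets(ipdz_rows, zj_rows):
    """Return (not_in_result, not_exists_result) for the two statements above."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: text columns, NOT NULL so both forms behave alike.
        conn.execute("CREATE TABLE ipdz_b (ipdz TEXT NOT NULL)")
        conn.execute("CREATE TABLE zj_b (ipdz_d TEXT NOT NULL)")
        conn.executemany("INSERT INTO ipdz_b (ipdz) VALUES (?)",
                         [(v,) for v in ipdz_rows])
        conn.executemany("INSERT INTO zj_b (ipdz_d) VALUES (?)",
                         [(v,) for v in zj_rows])
        # Hypothetical index on the column probed by the subquery.
        conn.execute("CREATE INDEX idx_zj_b_ipdz_d ON zj_b (ipdz_d)")

        not_in = [r[0] for r in conn.execute(
            "SELECT ipdz FROM ipdz_b "
            "WHERE ipdz NOT IN (SELECT ipdz_d FROM zj_b) "
            "ORDER BY ipdz")]
        not_exists = [r[0] for r in conn.execute(
            "SELECT b.ipdz FROM ipdz_b b "
            "WHERE NOT EXISTS (SELECT 1 FROM zj_b a WHERE a.ipdz_d = b.ipdz) "
            "ORDER BY b.ipdz")]
        return not_in, not_exists
    finally:
        conn.close()


if __name__ == "__main__":
    print(difference_sets(["10.0.0.1", "10.0.0.2", "10.0.0.3"], ["10.0.0.2"]))
```

A toy table like this says nothing about speed; it only confirms that the rewrite keeps the result unchanged.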

Be careful not to fall into the not exists trap! It runs very efficiently when a usable index exists, but without one it performs just as badly as not in on large tables!
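
Since everything hinges on whether the index is actually used, it is worth asking the engine for its query plan before running the statement on a large table. The sketch below does this with SQLite's EXPLAIN QUERY PLAN, again only as a stand-in; other engines have their own EXPLAIN syntax and output. The schema, the index name idx_zj_b_ipdz_d and the helper name not_exists_plan are assumptions for illustration, and the exact plan text depends on the SQLite version.

```python
import sqlite3

NOT_EXISTS_SQL = (
    "SELECT b.ipdz FROM ipdz_b b "
    "WHERE NOT EXISTS (SELECT 1 FROM zj_b a WHERE a.ipdz_d = b.ipdz)"
)


def not_exists_plan(with_index):
    """Return the plan description lines SQLite reports for the NOT EXISTS query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema; the real tables may look different.
        conn.execute("CREATE TABLE ipdz_b (ipdz TEXT NOT NULL)")
        conn.execute("CREATE TABLE zj_b (ipdz_d TEXT NOT NULL)")
        if with_index:
            # Hypothetical index on the column used in the correlated subquery.
            conn.execute("CREATE INDEX idx_zj_b_ipdz_d ON zj_b (ipdz_d)")
        rows = conn.execute("EXPLAIN QUERY PLAN " + NOT_EXISTS_SQL).fetchall()
        # The human-readable description is the last column of each plan row.
        return [row[-1] for row in rows]
    finally:
        conn.close()


if __name__ == "__main__":
    for flag in (False, True):
        print("with index:" if flag else "without index:")
        for line in not_exists_plan(flag):
            print("   ", line)
```

If the plan for your real tables does not mention the index you expected, treat that as a sign that not exists will hit the slow case described above.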

When no usable index exists, I suggest finding the difference set through the null values produced by a left (or right) join. However, we need to watch for, and carefully handle, the extra rows that joining the two tables can produce.

The following statements use the table structure from the question as an example; the result set still comes back quickly:

Filtering the difference set with a left join:

select b.ipdz from ipdz_b b left join zj_b a
on a.ipdz_d = b.ipdz where a.ipdz_d is null;

This assumes that ipdz_d in table a (zj_b) is unique. If it is not unique, the statement needs to be adjusted as follows:

select b.ipdz from ipdz_b b left join (
select distinct ipdz_d from zj_b) a
on a.ipdz_d = b.ipdz where a.ipdz_d is null;
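
To make the extra-rows issue concrete, here is a small sketch, again using an in-memory SQLite database as a stand-in, that counts the joined rows before the is null filter for both variants and then returns the difference set. The schema and the helper name left_join_difference are assumptions for illustration.

```python
import sqlite3


def left_join_difference(ipdz_rows, zj_rows):
    """Return {variant: (joined_row_count, difference_set)} for both left joins."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema; zj_b.ipdz_d is deliberately allowed to repeat.
        conn.execute("CREATE TABLE ipdz_b (ipdz TEXT NOT NULL)")
        conn.execute("CREATE TABLE zj_b (ipdz_d TEXT NOT NULL)")
        conn.executemany("INSERT INTO ipdz_b (ipdz) VALUES (?)",
                         [(v,) for v in ipdz_rows])
        conn.executemany("INSERT INTO zj_b (ipdz_d) VALUES (?)",
                         [(v,) for v in zj_rows])

        joins = {
            # Joins zj_b directly: one output row per matching zj_b row.
            "plain": "FROM ipdz_b b LEFT JOIN zj_b a ON a.ipdz_d = b.ipdz",
            # Joins a de-duplicated subquery: at most one match per ipdz_b row.
            "dedup": "FROM ipdz_b b "
                     "LEFT JOIN (SELECT DISTINCT ipdz_d FROM zj_b) a "
                     "ON a.ipdz_d = b.ipdz",
        }
        result = {}
        for name, join in joins.items():
            joined = conn.execute("SELECT COUNT(*) " + join).fetchone()[0]
            diff = [r[0] for r in conn.execute(
                "SELECT b.ipdz " + join +
                " WHERE a.ipdz_d IS NULL ORDER BY b.ipdz")]
            result[name] = (joined, diff)
        return result
    finally:
        conn.close()


if __name__ == "__main__":
    print(left_join_difference(["10.0.0.1", "10.0.0.2"], ["10.0.0.2"] * 3))
```

With three copies of the same ipdz_d in zj_b, the plain join produces four rows before filtering while the de-duplicated join produces two. In this sketch the rows that survive the is null filter are the same either way, so what the distinct subquery changes is the size of the intermediate join.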

Summary:

With a small amount of data, feel free to use not in. It is simple in logic and easy to write.

When the amount of data is large and a usable index exists, choose not exists first (because it is the most efficient);

When the amount of data is large, avoid not in regardless of whether an index is available. When no usable index exists, avoid not exists as well. In that case, I suggest using a left join or right join to return the difference set, which performs better.

The latter two methods are harder to understand logically, and the join approach also has to deal with the extra rows produced by the join, so the statements are more troublesome to write.