Big data problem, urgent need for help!
Big data problems are, to be precise, space-limitation problems on very large amounts of data. There are 7 solutions, as follows (figures from Zuo Chengyun's foundation class):

First consider the straightforward approach: one big HashMap whose key is an integer and whose value is the number of times that integer appears. This gives the word frequencies, from which the TOP 10 follows. Now estimate the memory: a 4-byte unsigned integer ranges from 0 to a little over 4.2 billion (a signed one from about -2.1 billion to 2.1 billion), so the range exceeds 4 billion. In the worst case all 4 billion numbers are different, so the HashMap holds 4 billion records; each record's key (an unsigned integer) is 4 bytes and its value (the count, an int) is another 4 bytes, 8 bytes per record in total. That is 32 billion bytes, roughly 32 GB (1 billion bytes can be estimated as 1 GB), so the hash table blows far past the memory limit.

Here, first a quick review of the properties of a hash function:

Property 1: the input domain is infinite, while the output domain is relatively limited.

Property 2: there is no random component; a hash function is a deterministic rule. The same input always produces the same output, and different inputs may produce the same output (a hash collision).

Property 3: even when inputs are very close to each other, the outputs are highly discrete and bear no relation to any pattern in the input. This is the most critical property.

Property 4: taking the output modulo some number, the residues are still discretely and evenly spread.
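
As an illustration of these properties, here is a minimal sketch; using Python's hashlib with MD5 as the hash function is an arbitrary choice for demonstration, not something prescribed by the original:

```python
import hashlib

def bucket(x: int, m: int) -> int:
    """Deterministic hash of an integer, reduced modulo m (property 4)."""
    digest = hashlib.md5(str(x).encode()).hexdigest()   # same input -> same digest (property 2)
    return int(digest, 16) % m

# Ten consecutive inputs land in unrelated buckets (property 3):
print([bucket(x, 100) for x in range(1000, 1010)])
# Repeating an input always gives the same bucket (property 2):
print(bucket(1000, 100) == bucket(1000, 100))   # True
```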

Working backwards: how many records can a HashMap hold in 1 GB of memory? Conservatively, say 100 million, which means the HashMap can handle at most 100 million kinds of numbers (distinct values, not occurrences). So how do we process the file of 4 billion integers? Run each number through a hash function and take the result modulo 100, which yields a value from 0 to 99, and write the number to the corresponding one of 100 small files. By property 3, different inputs spread evenly over 0 to 99, so if the 4 billion numbers contain K distinct values, each small file ends up with roughly K/100 distinct values, well under 100 million. Then process the small files one at a time with a HashMap to count word frequencies, getting each of the 100 files' TOP 10; the overall TOP 10 comes from merging those lists. Because a hash function gives the same output for the same input, all occurrences of a given number land in the same file, so no number is split across files.
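
A hedged sketch of this split-and-count strategy, assuming the 4 billion integers sit one per line in a text file; the file names, the MD5 hash, and the exact bucket count are illustrative, not from the original:

```python
import hashlib
from collections import Counter
from heapq import nlargest

NUM_FILES = 100

def bucket_of(n: int) -> int:
    # Deterministic hash then modulo: every occurrence of n goes to the same small file.
    return int(hashlib.md5(str(n).encode()).hexdigest(), 16) % NUM_FILES

# Phase 1: stream the big file and distribute the numbers into 100 small files.
outs = [open(f"bucket_{i}.txt", "w") for i in range(NUM_FILES)]
with open("numbers.txt") as big:
    for line in big:
        outs[bucket_of(int(line))].write(line)
for f in outs:
    f.close()

# Phase 2: count each small file on its own (its HashMap now fits in memory)
# and keep that file's TOP 10.
top10_per_file = []
for i in range(NUM_FILES):
    counts = Counter()
    with open(f"bucket_{i}.txt") as small:
        for line in small:
            counts[int(line)] += 1
    top10_per_file.append(nlargest(10, counts.items(), key=lambda kv: kv[1]))
```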

Use a bitmap: a single bit records whether a number has appeared or not. With a hash table, a key-value pair marks whether a number occurs, and key and value each take 4 bytes, so one record costs 64 bits (8 bytes). With a bitmap, 1 bit represents 1 number, and the bits cover the whole range: a little over 4.2 billion bits / 8 ≈ 530 million bytes, i.e. 500+ MB (1 billion bytes ≈ 1 GB), which fits easily in 1 GB.
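
A minimal bitmap sketch under the same assumptions (numbers one per line); a bytearray covering the full 32-bit range is 2^32 / 8 = 512 MB, matching the estimate above:

```python
# One bit per possible 32-bit value: 2^32 bits = 512 MB.
bits = bytearray(2**32 // 8)

def mark(n: int) -> None:
    bits[n >> 3] |= 1 << (n & 7)            # set the bit for n

def seen(n: int) -> bool:
    return bool(bits[n >> 3] & (1 << (n & 7)))

with open("numbers.txt") as f:
    for line in f:
        mark(int(line))

print(seen(42))   # True exactly when 42 appeared in the file
```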

Use two bits to record how often a number occurs: 00 means 0 times, 01 means 1 time, 10 means 2 times, and 11 means 3 times or more (once the count reaches 3, 11 stays 11). From the final table you can read off every number that appeared exactly 2 times. Compared with the one-bit bitmap this doubles the space, to about 1 GB, so it is still feasible.
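
The same idea with two bits per value, sketched under the same assumptions; 2^32 values × 2 bits = 1 GB of counters:

```python
# Two bits per possible 32-bit value, packed 4 values per byte: 1 GB in total.
counts = bytearray(2**32 // 4)

def add_one(n: int) -> None:
    byte, shift = n >> 2, (n & 3) * 2
    cur = (counts[byte] >> shift) & 3
    if cur < 3:                             # 11 means "3 or more" and stays 11
        counts[byte] = (counts[byte] & (0xFF ^ (3 << shift))) | ((cur + 1) << shift)

def times(n: int) -> int:
    return (counts[n >> 2] >> ((n & 3) * 2)) & 3    # 0, 1, 2, or 3+

with open("numbers.txt") as f:
    for line in f:
        add_one(int(line))

# Every n with times(n) == 2 occurred exactly twice.
```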

A bitmap cannot be used here: 3 KB is far too small. First work out how long an unsigned-int array 3 KB can hold: one unsigned number takes 4 B, and 3 KB / 4 B = 750; the closest power of 2 not exceeding 750 is 512, so allocate an unsigned-int array arr of length 512 (arr obviously occupies no more than 3 KB). The numbers in the problem range from 0 to 2^32 - 1, i.e. 2^32 possible values in total. Since 512 is also a power of 2, the 2^32 values divide evenly into 512 sub-ranges of 8,388,608 each: arr[0] covers the 0th sub-range (0 to 8,388,607) and counts how many input numbers fall into it, arr[1] the next, and so on. Because there are only 4 billion numbers in total, no counter can overflow (4 billion is below 2^32 - 1, which is over 4.2 billion, and each counter is a 32-bit unsigned number). After counting every number into its sub-range, at least one sub-range must end up with a count below 8,388,608 (otherwise the total would be at least 512 × 8,388,608 = 2^32, more than 4 billion); say it is sub-range a. On the next pass, use the same 3 KB to split sub-range a itself into 512 smaller sub-ranges and count again. Keep narrowing this way and you can always find a number that does not appear.
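
A hedged sketch of the whole 512-way narrowing, again assuming one integer per line and that the file can be rescanned on every pass:

```python
def find_absent(path: str) -> int:
    """Return one 32-bit value that never appears in the file."""
    lo, hi = 0, 2**32                       # current candidate range [lo, hi)
    while hi - lo > 1:
        buckets = min(512, hi - lo)
        width = (hi - lo) // buckets        # size of each sub-range
        counts = [0] * buckets              # 512 counters stay well under 3 KB
        with open(path) as f:               # rescan the file once per pass
            for line in f:
                n = int(line)
                if lo <= n < hi:
                    counts[(n - lo) // width] += 1
        # A sub-range that counted fewer than `width` numbers must be missing one.
        i = next(i for i, c in enumerate(counts) if c < width)
        lo, hi = lo + i * width, lo + (i + 1) * width
    return lo
```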

Overall time complexity: log base 512 of 2^32, i.e. 32/9 ≈ 3.6, so at most 4 passes over the file, which is a very small number. Reading the file line by line also uses very little memory: the file is not loaded into memory all at once; instead each line is fetched from the file on disk via an offset, and the space for the previous line can be released once the next line has been read. So keeping a file handle and an offset is enough to read the file line by line.

The entire range is 0 to 2^32 - 1. Compute the midpoint Mid, count how many numbers fall in 0 to Mid (call it a) and how many fall in Mid + 1 to the end (call it b). At least one of a and b must be under-full, i.e. smaller than the number of values its half can hold; bisect that half again, and eventually you locate a number that never appears. The file is traversed log base 2 of 2^32 = 32 times.
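
This is essentially the sketch above reduced to two buckets per pass; a minimal version under the same assumed input format, keeping only a couple of counters besides the range bounds:

```python
def find_absent_bisect(path: str) -> int:
    lo, hi = 0, 2**32 - 1                   # inclusive candidate range
    while lo < hi:
        mid = (lo + hi) // 2
        a = 0                               # how many numbers fall in [lo, mid]
        with open(path) as f:
            for line in f:
                n = int(line)
                if lo <= n <= mid:
                    a += 1
        # [lo, mid] can hold mid - lo + 1 distinct values; if it is not full,
        # something is missing there, otherwise the right half must be not full.
        if a < mid - lo + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo
```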

For problems with a space limit, start from the range of the data and think in terms of statistics over sub-ranges.

Use a hash function to distribute the URLs across many machines; on each machine, use a hash function again to split the data into small files, then do the per-range statistics within each small file to find the duplicate URLs.

Use a heap, or an external sort, to merge the results coming from the individual processing units.

Use the 1 GB of memory to split the file into streams; the 1 GB is what holds the hash table. By the properties of the hash function, identical URLs always go to the same file, and the big file of 10 billion URLs is split into small files sized so that each one can be counted within 1 GB. The hash table's key is a URL of 64 bytes, and its value is a long of 8 bytes (counts can reach 10 billion, beyond what an unsigned 32-bit integer can hold). Then work out how many such records 1 GB can hold at most, which tells you how many distinct URLs a small file may contain; from that, assuming the worst case that all 10 billion URLs are different, you can work backwards to how many small files are needed so that 1 GB is never exceeded.

Calculation: 64 + 8 = 72 bytes per record; the hash table also has internal index overhead, so budget generously and call it 100 bytes per record. 1 GB is about 1 billion bytes, so the hash table holds at most roughly 10 million records, i.e. 10 million distinct URLs. In the worst case all 10 billion URLs are different, and 10 billion / 10 million = 1,000 small files are needed to guarantee that 1 GB is not exceeded. So run every URL in the big file through the hash function, take the result modulo 1,000, and write it into the corresponding small file (by the nature of the hash function, the distinct URLs are split almost evenly, so each file gets roughly 10 million distinct URLs and will not exceed that by much). Then, within the same 1 GB, count the word frequencies in each small file and take its TOP 100; 1,000 files give 1,000 TOP 100 lists. In each file, build a big root heap (max-heap) ordered by word frequency.
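
The splitting step mirrors the integer case above (hash the URL, take it modulo 1,000, write it to that small file). A hedged sketch of the per-file step, assuming each small file holds one URL per line and that names like urls_0.txt are placeholders:

```python
from collections import Counter
from heapq import nlargest

def top100_of_file(path: str) -> list[tuple[str, int]]:
    """Count URL frequencies in one small file and keep its TOP 100."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[line.strip()] += 1
    # nlargest keeps a bounded heap of size 100 instead of sorting everything.
    return nlargest(100, counts.items(), key=lambda kv: kv[1])

# One TOP 100 list per small file, 1,000 lists in total.
top_lists = [top100_of_file(f"urls_{i}.txt") for i in range(1000)]
```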

The tops of all these heaps then form another big root heap, giving a heap of heaps, a two-dimensional heap (the binary-tree structure in the figure above). For example, the figure above shows three heaps, one containing a, b, c and one containing α, β, θ; the heap-top elements, a, α and so on, form a big root heap.

As shown in the figure above, if heapifying finds that α is the largest, α is swapped with a, and each element is swapped together with its whole chain (its underlying heap). α is then output as the TOP 1 word frequency of the entire data set.

As shown in the figure above, after α is output, β moves up to the top of its own heap, but β is not necessarily the global maximum, so the big root heap formed by the heap tops is heapified again; if a is now the global maximum, the chain under β and the chain under a are swapped, and so on. The cycle repeats: each round the heap of heaps outputs one maximum, the element beneath it moves up, and the heap of heaps is adjusted again with whole chains being swapped. The two-dimensional heap outputs one element per round, so 100 rounds give the TOP 100.

Choosing each maximum by plain traversal of the heap tops costs O(100); with the heap structure this is accelerated to O(log 100). This is the essence of external sorting: each decision is made by looking only at the current top of every heap and comparing sizes.
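
A hedged sketch of this merge, using Python's heapq (a min-heap, so counts are negated to play the role of the big root heap) over the heap tops; `top_lists` is assumed to be the per-file TOP 100 output from the previous sketch, each list sorted by count in descending order:

```python
import heapq

def global_top_k(top_lists, k=100):
    """Merge per-file TOP lists by repeatedly popping the largest heap top."""
    # One entry per file: (-count, url, file_index, position inside that file's list).
    heap = [(-lst[0][1], lst[0][0], i, 0) for i, lst in enumerate(top_lists) if lst]
    heapq.heapify(heap)

    result = []
    while heap and len(result) < k:
        neg_count, url, i, pos = heapq.heappop(heap)     # current global maximum
        result.append((url, -neg_count))
        if pos + 1 < len(top_lists[i]):                  # the next element of that file moves up
            nxt_url, nxt_count = top_lists[i][pos + 1]
            heapq.heappush(heap, (-nxt_count, nxt_url, i, pos + 1))
    return result
```

One hundred pops give the global TOP 100; because every URL is routed to exactly one file by the hash, its full count lives in a single list, so the global TOP 100 is guaranteed to appear among the per-file TOP 100s.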

Assume the space limit is again 3 KB. As before, split the range into 512 sub-ranges, each small enough to count. Suppose a numbers fall into the first sub-range, b into the second, c into the third, and so on; count the word frequency of every sub-range, then accumulate a + b + c + ... and see on which sub-range the running total first crosses 2 billion. The 2-billionth number lies in that sub-range.

For example, if the first i sub-ranges add up to 1.9 billion and the first i + 1 add up to 2.1 billion, then the 2-billionth number is in sub-range i + 1 and is the 100-millionth number within it. Next, split sub-range i + 1 into 512 smaller sub-ranges and count again, looking for where the running total first reaches 100 million, and so on; eventually the exact number is counted out.
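
A hedged sketch of this order-statistic search, reusing the 512-counter passes from the earlier sketch (one integer per line, file rescanned each pass; k = 2,000,000,000 for the 2-billionth smallest):

```python
def kth_number(path: str, k: int) -> int:
    """Find the k-th smallest (1-based) 32-bit value in the file."""
    lo, hi = 0, 2**32                       # candidate range [lo, hi); k is the rank inside it
    while hi - lo > 1:
        buckets = min(512, hi - lo)
        width = (hi - lo) // buckets
        counts = [0] * buckets
        with open(path) as f:
            for line in f:
                n = int(line)
                if lo <= n < hi:
                    counts[(n - lo) // width] += 1
        # Accumulate until the running total first reaches k, then recurse there.
        running = 0
        for i, c in enumerate(counts):
            if running + c >= k:
                k -= running                # rank within the chosen sub-range
                lo, hi = lo + i * width, lo + (i + 1) * width
                break
            running += c
    return lo

# Example: kth_number("numbers.txt", 2_000_000_000) locates the 2-billionth number.
```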