Big data problems
Big data problems are, to be exact, problems of space limitation under a very large amount of data. There are seven solution techniques (from Zuo Shen's basic class):

Consider first using one big HashMap: the key is the integer and the value is the number of times it occurs, so that word frequencies can be counted and the TOP 10 extracted. Now estimate the memory this needs. The range of a 4-byte unsigned integer is 0 to a little over 4.2 billion (a signed integer covers about -2.1 billion to 2.1 billion), so the range holds more than 4 billion values. In the worst case, the 4 billion input numbers are all different, so the HashMap holds 4 billion records; in each record the key (an unsigned integer) is 4 bytes and the value (the count, an int) is 4 bytes, 8 bytes per record in total, so 32 billion bytes altogether, that is, about 32 GB (1 billion bytes can be estimated as 1 GB), which is far too much memory.

The following is a quick review of the properties of a hash function:

Feature 1. The input domain is infinite, while the output domain is relatively limited.

Feature 2. There is no random component; it is a deterministic function. The same input always produces the same output, while different inputs may produce the same output (a hash collision).

Feature 3. Even if the inputs are very close together, the outputs are discrete and unrelated to any pattern in the inputs. This is the most critical feature.

Feature 4. Taking the output modulo some number m, the results are still discrete and roughly uniform over 0 to m - 1.
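As a small illustration of features 3 and 4, here is a sketch in Python; MD5 is used purely as an example hash, and the bucket count of 100 and the sample of 10,000 consecutive integers are arbitrary choices for the demo. Even though the inputs are very close together, the hash-then-modulo bucket indices come out roughly evenly spread.

```python
import hashlib
from collections import Counter

def bucket_of(x: int, m: int = 100) -> int:
    """Hash an integer (MD5 here, purely as an example) and take the result mod m."""
    digest = hashlib.md5(str(x).encode()).hexdigest()
    return int(digest, 16) % m

# The inputs 0..9999 are consecutive and highly regular, yet the buckets they
# map to are spread roughly evenly over 0..99 (features 3 and 4).
counts = Counter(bucket_of(x) for x in range(10_000))
print(min(counts.values()), max(counts.values()))  # both values land near 100
```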

How many records can a HashMap fit in 1 GB of memory? A conservative estimate is 100 million, which means the number of distinct keys this HashMap handles should not exceed 100 million. How to achieve that? For the large file of 4 billion integers, run each number through the hash function and take the result modulo 100; the result is a number from 0 to 99, which selects one of 100 small files. By feature 3 of the hash function, different inputs are spread evenly over 0 to 99, so if there are k distinct numbers among the 4 billion, each small file ends up with roughly k/100 of them, well under 100 million distinct numbers per file. Then process the small files one by one with a HashMap to count word frequencies and take each file's TOP 10. Because the hash function maps equal inputs to equal outputs, the same number never falls into two different files. Finally merge the 100 per-file TOP 10 lists to get the global TOP 10.
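A minimal sketch of this split-count-merge pipeline, assuming the 4 billion integers sit in a text file with one number per line; the file names, the use of MD5 as the hash, and the helper names are illustrative assumptions, not part of the original problem:

```python
import hashlib
import heapq
from collections import Counter

NUM_PARTS = 100  # modulus from the text; 40 would also work, 100 is safer

def part_of(x: int) -> int:
    # Same hash + mod as above: equal numbers always land in the same partition.
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % NUM_PARTS

def split(big_file: str) -> None:
    outs = [open(f"part_{i}.txt", "w") for i in range(NUM_PARTS)]
    with open(big_file) as f:
        for line in f:                      # read line by line, O(1) extra memory
            outs[part_of(int(line))].write(line)
    for o in outs:
        o.close()

def top10_of_part(i: int):
    freq = Counter()
    with open(f"part_{i}.txt") as f:
        for line in f:
            freq[int(line)] += 1
    return freq.most_common(10)             # TOP 10 of this partition

def global_top10(big_file: str):
    split(big_file)
    candidates = []
    for i in range(NUM_PARTS):
        candidates.extend(top10_of_part(i))
    # Each number lives in exactly one partition, so its count is never split
    # across files; the 10 largest of the combined lists are the global TOP 10.
    return heapq.nlargest(10, candidates, key=lambda kv: kv[1])
```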

In fact, a modulus of 40 would already be enough: the number of distinct values among the 4 billion numbers is at most 4 billion, so k/40 is at most 100 million, which meets the 1 GB requirement above; but 100 leaves a safer margin than 40.

Use a bitmap to record whether each number appears. With a hash table, each number needs a key-value pair, and key and value each take 4 bytes, so one record occupies 8 bytes (64 bits). With a bitmap, one bit per possible value is enough: the range has a little over 4.2 billion values, so a little over 4.2 billion bits / 8 = a little over 500 million bytes, i.e. a little over 500 MB (1 billion bytes is about 1 GB); this fits comfortably within 1 GB.
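A minimal bitmap sketch in Python: one bit per 32-bit value, about 512 MB in total; `numbers` is assumed to be any iterable over the input.

```python
def build_bitmap(numbers):
    """Mark which 32-bit unsigned values appear: 1 bit per value, ~512 MB total."""
    bits = bytearray((1 << 32) // 8)        # 2^32 bits = 536,870,912 bytes
    for x in numbers:
        bits[x >> 3] |= 1 << (x & 7)        # set bit number x
    return bits

def appears(bits, x):
    """True if value x was marked in the bitmap."""
    return (bits[x >> 3] >> (x & 7)) & 1 == 1
```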

Use two bits per number to record its frequency: 00 means it appears 0 times, 01 once, 10 twice, and 11 three or more times (once the two bits reach 11, they stay 11). After one pass, every number whose two bits read 10 appeared exactly twice. This takes twice the space of the plain bitmap, roughly 1 GB, so it still fits.
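A sketch of the 2-bits-per-value counter under the same assumptions; the table takes 2^32 x 2 bits, about 1 GB, and counts saturate at 3:

```python
def build_two_bit_counts(numbers):
    """2 bits per value: 00 = 0 times, 01 = 1, 10 = 2, 11 = 3 or more (saturating)."""
    table = bytearray((1 << 32) // 4)            # 2^32 values * 2 bits = ~1 GB
    for x in numbers:
        idx, shift = x >> 2, (x & 3) * 2
        cur = (table[idx] >> shift) & 3
        if cur < 3:                               # saturate at 3
            table[idx] = (table[idx] & (0xFF ^ (3 << shift))) | ((cur + 1) << shift)
    return table

def appeared_exactly_twice(table, x):
    """True if value x has the pattern 10, i.e. appeared exactly two times."""
    return (table[x >> 2] >> ((x & 3) * 2)) & 3 == 2
```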

A bitmap cannot be used here; 3 KB is far too small for it. First work out how long an unsigned array 3 KB can hold: an unsigned number is 4 bytes, and 3 KB / 4 B is about 750, so the largest power of 2 not exceeding that is 512. Apply for an unsigned integer array arr of length 512 (clearly no more than 3 KB). The numbers in the problem range from 0 to 2^32 - 1, that is, 2^32 possible values in total. Because 512 is also a power of 2, the 2^32 values divide evenly into 512 equal ranges, each of size 8,388,608; arr[0] counts how many input numbers fall into range 0 (values 0 to 8,388,607), and so on for each range. Moreover, because there are only 4 billion numbers in total, arr[0] cannot overflow (4 billion is less than 2^32 - 1, a little over 4.2 billion, and each counter is a 32-bit unsigned number). Once every number has been counted into its range, there must be at least one range whose count is less than 8,388,608, since 512 x 8,388,608 = 2^32 is more than 4 billion; some value in that range never appears. Suppose range i is the under-full one: next, restrict attention to range i, split it again into 512 equal sub-ranges with the same 3 KB, and keep dividing down until you locate a number that does not appear.

Overall time complexity: log base 512 of 2^32, i.e. only about four rounds of scanning, which is a very small number. Reading the file line by line also takes very little memory: the whole file is not loaded into memory at once; instead, an offset into the file on disk is kept, so the space for the previous line can be released when the next line is read. A file handle that remembers its offset is therefore enough to read the file line by line.
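A sketch of this 512-way narrowing, assuming the input can be re-read from disk on every round; `read_numbers` is a hypothetical callable that returns a fresh iterator each time, for example `lambda: (int(line) for line in open("numbers.txt"))`. The 512-entry count array is the only real memory used, well under 3 KB.

```python
def find_absent_number(read_numbers, lo=0, hi=(1 << 32) - 1, parts=512):
    """Find a 32-bit value that never appears, using only a 512-entry count
    array (512 * 4 bytes = 2 KB) per round; re-reads the input each round."""
    while lo < hi:
        size = (hi - lo + 1 + parts - 1) // parts   # bucket width at this level
        counts = [0] * parts
        for x in read_numbers():
            if lo <= x <= hi:
                counts[(x - lo) // size] += 1
        # Fewer numbers than values in [lo, hi], so some bucket must be under-full.
        for i, c in enumerate(counts):
            bucket_lo = lo + i * size
            bucket_hi = min(bucket_lo + size - 1, hi)
            if c < bucket_hi - bucket_lo + 1:        # this bucket misses a value
                lo, hi = bucket_lo, bucket_hi
                break
    return lo  # range narrowed to a single value that never appeared
```

Each round shrinks the range by a factor of 512, matching the log base 512 of 2^32 bound above.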

The whole range is 0 to 2^32 - 1. Compute the midpoint mid; count how many input numbers fall into [0, mid] (call it a) and how many fall into [mid + 1, end] (call it b). At least one of the two halves must be under-full, holding fewer numbers than it has possible values; descend into an under-full half and repeat, and you eventually pin down a number that does not appear. The number of passes is log base 2 of 2^32, that is, 32.
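The same idea with only a couple of plain variables instead of a 512-entry array, as a sketch under the same `read_numbers` assumption; the range halves each round, so there are 32 passes in total.

```python
def find_absent_by_bisection(read_numbers):
    """Locate a 32-bit value that never appears using just a few variables.
    `read_numbers()` must return a fresh iterator over the input on every call."""
    lo, hi = 0, (1 << 32) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        left = sum(1 for x in read_numbers() if lo <= x <= mid)
        if left < mid - lo + 1:   # the left half has a gap
            hi = mid
        else:                     # left half is at least full, so the gap is on the right
            lo = mid + 1
    return lo
```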

For problems with a space limit, the general starting point is the range of the data and the idea of per-interval statistics.

Use a hash function to distribute the URLs across multiple machines, then on each machine use a hash function again to split that machine's data into small files. After processing each small file partition, the duplicate URLs are found.
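A sketch of that two-level split; the machine and file counts are made-up illustrative values, and MD5 again stands in for whatever hash function is used:

```python
import hashlib

NUM_MACHINES = 100        # illustrative values, not taken from the problem statement
FILES_PER_MACHINE = 1000

def route(url: str):
    """Two-level hash split: identical URLs always land on the same machine and file."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    machine = h % NUM_MACHINES
    file_id = (h // NUM_MACHINES) % FILES_PER_MACHINE
    return machine, file_id

def duplicates_in_small_file(urls):
    """Within one small file everything fits in memory, so a set finds duplicates."""
    seen, dups = set(), set()
    for u in urls:
        (dups if u in seen else seen).add(u)
    return dups
```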

Merge the results of the individual processing units using heaps and external sorting (multi-way merging).

The large file is split using the 1 GB of memory, which is used to hold a hash table. Because the hash function sends the same URL to the same file every time, and the split is sized so that each small file's hash table stays within 1 GB, the big file of 10 billion URLs can be divided into small files. The hash table's key is a URL of 64 bytes, and its value is a long of 8 bytes (the count can reach 10 billion, which exceeds the range of a 32-bit unsigned integer). Work out how many such records fit in 1 GB; that is the largest number of distinct URLs a small file can tolerate. Then, assuming the worst case where all 10 billion URLs are distinct, infer how many small files are needed so that no file exceeds 1 GB.

Calculation: 64 + 8 = 72 bytes per record, and a hash table also has index overhead, so round up generously and count each record as 100 bytes. 1 GB is about 1 billion bytes, so the hash table can hold at most about 10 million records, that is, 10 million distinct URLs per small file. In the worst case all 10 billion URLs are distinct, which calls for about 1,000 small files; so the big URL file is run through the hash function and taken modulo 1,000 to route each URL to its small file (by the hash function's property, identical URLs always land in the same file). Then, within the 1 GB of memory, count the word frequencies in each small file, take that file's TOP 100 by frequency, and build a big-root heap ordered by word frequency for each file.
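A sketch of the per-file step, assuming each small file holds one URL per line; Python's heapq is a min-heap, so the big-root heap described above is simulated by negating the counts:

```python
import heapq
from collections import Counter

def top100_of_file(path: str):
    """Count URL frequencies in one small file and return its TOP 100 as a
    big-root heap (simulated with negated counts, since heapq is a min-heap)."""
    freq = Counter()
    with open(path) as f:
        for line in f:
            freq[line.strip()] += 1
    top = heapq.nlargest(100, freq.items(), key=lambda kv: kv[1])
    heap = [(-cnt, url) for url, cnt in top]
    heapq.heapify(heap)
    return heap
```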

Combine the tops of all these per-file heaps into a further big-root heap, forming a heap of heaps, a two-dimensional heap (the binary tree structure in the figure above). For example, the figure shows three per-file heaps, {a, b, c}, {A, B, C} and {α, β, θ}; their top elements a, A and α form the upper big-root heap.

As shown in the figure above, if after adjusting the upper heap α turns out to be the largest, then when α is swapped with a, α's whole chain (its underlying file heap) is swapped along with it, and α is output as TOP 1 of the global word frequencies.

As shown in the figure above, after α is output, β comes up to the top of α's file heap. But β is not necessarily the global maximum, so the big-root heap formed by the heap tops is re-heapified; if A is the global maximum at this point, A's chain is swapped with β's chain, and so on. The cycle repeats: each round the upper heap outputs one maximum, the element below fills the vacated top of its file heap, and the whole chain is swapped up in the upper heap. The two-dimensional heap outputs one element per round, so 100 rounds give the TOP 100.

If you simply traversed the heap tops each round, the cost per round would be O(100); with the upper heap structure it drops to O(log 100). This is essentially external sorting (merging): each round, one output element is decided by comparing the current tops of the heaps.
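A sketch of the two-dimensional heap merge over the per-file heaps produced above (still with negated counts to simulate big-root heaps in heapq):

```python
import heapq

def global_top100(file_heaps):
    """file_heaps: list of per-file heaps as built above (entries are (-count, url)).
    The upper heap holds each file heap's current top plus the index of its heap."""
    upper = []
    for i, h in enumerate(file_heaps):
        if h:
            neg_cnt, url = h[0]
            upper.append((neg_cnt, url, i))
    heapq.heapify(upper)

    result = []
    while upper and len(result) < 100:
        neg_cnt, url, i = heapq.heappop(upper)   # global maximum, O(log #files)
        result.append((url, -neg_cnt))
        heapq.heappop(file_heaps[i])             # remove it from its own heap
        if file_heaps[i]:                        # the next element "comes up"
            nxt_neg, nxt_url = file_heaps[i][0]
            heapq.heappush(upper, (nxt_neg, nxt_url, i))
    return result
```

Because identical URLs were hashed to the same small file, no URL can appear in two different file heaps, so the 100 outputs are the global TOP 100.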

Suppose the given space is limited to 3 KB. As before, split the range into 512 parts, and with 3 KB you can count how many numbers fall into each part: say the first part has count a, the second part count b, the third part count c, and so on for every part. Then accumulate a + b + c + ... and find the part at which the running total first passes 2 billion; the target number lies in that part, and you then continue inside it.

For example, if the running total through part i is 1.9 billion and through part i + 1 is 2.1 billion, then the 2-billionth number lies in part i + 1; next, split part i + 1 into 512 sub-parts in the same way and repeat until the exact number is located.
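A sketch of this rank-finding by cumulative bucket counts, under the same `read_numbers` assumption as before; `k` would be 2 billion for the example above.

```python
def kth_number(read_numbers, k, lo=0, hi=(1 << 32) - 1, parts=512):
    """Find the k-th smallest number (1-based, counting duplicates) using only a
    512-entry count array per round; re-reads the input each round."""
    while lo < hi:
        size = (hi - lo + 1 + parts - 1) // parts
        counts = [0] * parts
        for x in read_numbers():
            if lo <= x <= hi:
                counts[(x - lo) // size] += 1
        cum = 0
        for i, c in enumerate(counts):
            if cum + c >= k:                 # the running total first passes k here
                k -= cum                     # rank of the target inside bucket i
                lo = lo + i * size
                hi = min(lo + size - 1, hi)
                break
            cum += c
    return lo
```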