Massive data analysis and processing methods

One, Bloom filter

Scope of application: can be used to implement a data dictionary, to deduplicate data (membership testing), or to compute set intersections

Basic principle and key points:

The principle is simple: a bit array plus k independent hash functions. To insert an element, set the bits corresponding to each hash value to 1; to look it up, check whether all k corresponding bits are 1. If they are, the element probably exists, but this does not guarantee a 100% correct answer (false positives are possible). A plain Bloom filter also does not support deleting an inserted keyword, because its bits may be shared by other keywords. A simple improvement is the counting Bloom filter, which replaces each bit with a small counter and thereby supports deletion.

Another important question is how to choose the bit-array size m and the number of hash functions k from the number of input elements n. The error rate is minimized when k = (ln 2) * (m/n). For an error rate no greater than E, m must be at least n*lg(1/E) to represent any set of n elements, but m should be larger still, because at least half of the bit array should remain 0. Roughly, m >= n*lg(1/E)*lg(e), i.e. about 1.44 times n*lg(1/E) (lg denotes the base-2 logarithm).

As an example, assume an error rate of 0.01; then m should be roughly 13 times n, which makes k roughly 8.

Note that m and n have different units: m is in bits, while n is in number of elements (more precisely, number of distinct elements). Since a single element is usually many bits long, a Bloom filter normally saves memory.
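
To make the principle and the parameter formulas concrete, here is a minimal Bloom filter sketch in Python. The class name and the use of double hashing to simulate k independent hash functions are illustrative choices, not part of the original discussion; m and k are derived from n and the target error rate E as above.

    import math, hashlib

    class BloomFilter:
        def __init__(self, n, error_rate=0.01):
            # m >= n * lg(1/E) * lg(e) bits, k = (ln 2) * (m / n)
            self.m = int(math.ceil(n * math.log2(1 / error_rate) * math.log2(math.e)))
            self.k = max(1, round(math.log(2) * self.m / n))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Simulate k independent hash functions with double hashing.
            digest = hashlib.md5(item.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:], "big") or 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter(n=1000, error_rate=0.01)
    bf.add("http://example.com/a")
    print("http://example.com/a" in bf)   # True
    print("http://example.com/b" in bf)   # almost certainly False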

Extension:

A Bloom filter maps the elements of a set into a bit array and uses k bits (k being the number of hash functions) to indicate whether an element is in the set. A counting Bloom filter (CBF) extends each bit of the array to a counter, thereby supporting deletion. A spectral Bloom filter (SBF) associates each counter with the number of occurrences of the element; SBF uses the minimum value among an element's counters to approximate its frequency.

Problem example: you are given two files A and B, each storing 5 billion URLs, each URL occupying 64 bytes; the memory limit is 4 GB. Find the URLs common to files A and B. What if there are three or even n files?

Based on this problem, let's work out the memory. 4 GB = 2^32 bytes, i.e. about 4 billion bytes, or roughly 34 billion bits. With n = 5 billion and an error rate of 0.01, about 65 billion bits are needed. We have about 34 billion available, which is not far off, though it will push the error rate up somewhat. Also, if the URLs correspond one-to-one with IPs, they can be converted to IPs, which is considerably simpler.

Two, Hashing

Scope of application: a basic data structure for fast lookup and deletion; usually requires that the total amount of data fit into memory

Basic principles and key points:

Choice of hash function: there are specific hash methods for strings, integers, permutations, and so on.

Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:

The d in d-left hashing means multiple. Let's simplify the problem and look at 2-left hashing. 2-left hashing splits a hash table into two halves of equal length, T1 and T2, and equips T1 and T2 with separate hash functions h1 and h2. When storing a new key, both hash functions are computed, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more (collision-prone) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both hold one key, the new key is stored in the left subtable T1; this is where the "2-left" comes from. When looking up a key, both hashes must be computed and both positions checked.
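
A minimal 2-left hashing sketch in Python, using small per-slot buckets and Python's built-in hash; the bucket count and hash choices are illustrative simplifications.

    class TwoLeftHash:
        def __init__(self, buckets=1024):
            self.t1 = [[] for _ in range(buckets)]   # left subtable
            self.t2 = [[] for _ in range(buckets)]   # right subtable
            self.buckets = buckets

        def _h1(self, key):
            return hash(("left", key)) % self.buckets

        def _h2(self, key):
            return hash(("right", key)) % self.buckets

        def insert(self, key):
            b1, b2 = self.t1[self._h1(key)], self.t2[self._h2(key)]
            # Put the key in the less loaded slot; ties go to the left table.
            (b1 if len(b1) <= len(b2) else b2).append(key)

        def contains(self, key):
            return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]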

Problem example:

1). Given a massive amount of log data, extract the IP that visited Baidu the most times on a given day.

The number of distinct IPs is limited, at most 2^32, so we can consider using a hash to put the IPs directly into memory and then do the counting.
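
Assuming the log format puts the client IP first on each line (a hypothetical format), a single-pass hash-based count might look like this:

    from collections import Counter

    def top_ip(log_path):
        # Count occurrences of each IP in one pass over the log file.
        counts = Counter()
        with open(log_path) as f:
            for line in f:
                ip = line.split()[0]   # assumes the IP is the first field
                counts[ip] += 1
        return counts.most_common(1)[0]

    # Example: print(top_ip("access.log"))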

Three, bit-map

Scope of application: fast lookup, de-duplication, and deletion of data; generally speaking, the data range is within about 10 times the range of an int

Basic principle and key points: use a bit array to indicate whether certain elements exist, for example the set of 8-digit phone numbers

Extension: a Bloom filter can be seen as an extension of a bit-map.

Problem example:

1) A file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

8 digits means at most 99,999,999 distinct numbers, so about 99 million bits are needed, i.e. roughly 12 MB of memory is enough.

2) Find the number of non-repeating integers among 250 million integers; there is not enough memory to hold all 250 million integers.

Extend the bit-map slightly and use 2 bits per number: 00 means not seen, 01 means seen once, and 10 means seen twice or more. Alternatively, instead of 2 bits per number, we can use two separate bit-maps to simulate this 2-bit map.
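
A minimal 2-bit-map sketch in Python. Covering the full 2^32 range needs a 1 GB table, so the max_value parameter here is illustrative and can be reduced for experiments.

    def count_non_repeating(numbers, max_value=1 << 32):
        # 2 bits per value: 0 = unseen, 1 = seen once, 2 = seen twice or more.
        table = bytearray(max_value // 4)        # 4 values packed per byte
        for x in numbers:
            byte, slot = divmod(x, 4)
            shift = slot * 2
            state = (table[byte] >> shift) & 0b11
            if state < 2:
                table[byte] = (table[byte] & ~(0b11 << shift)) | ((state + 1) << shift)
        # Scan the table and count the values whose state is exactly 1 (seen once).
        return sum(1 for b in table
                   for shift in (0, 2, 4, 6)
                   if ((b >> shift) & 0b11) == 1)

    # Example: count_non_repeating([3, 5, 3, 7, 7, 7, 9], max_value=16) returns 2 (5 and 9)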

Four, heap

Scope of application: finding the top n elements in a large amount of data, where n is relatively small, so that the heap fits in memory

Basic principles and key points: use a max-heap for the n smallest elements and a min-heap for the n largest. For example, to find the n smallest, compare the current element with the largest element in the max-heap; if it is smaller, it replaces that largest element. The n elements left at the end are the n smallest. This is suitable for large data volumes with a relatively small n, because a single scan yields all of the top n elements, which is very efficient.

Extension: a double heap, a max-heap combined with a min-heap, can be used to maintain the median.

Problem example:

1) Find the largest 100 numbers among 1,000,000 numbers.

A min-heap of 100 elements is sufficient.
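
A minimal sketch with Python's heapq: keep a min-heap of size n, so the smallest of the current top n sits at the root and is replaced whenever a larger element arrives.

    import heapq

    def top_n(stream, n=100):
        # Maintain a min-heap of the n largest values seen so far.
        heap = []
        for x in stream:
            if len(heap) < n:
                heapq.heappush(heap, x)
            elif x > heap[0]:            # heap[0] is the smallest of the current top n
                heapq.heapreplace(heap, x)
        return sorted(heap, reverse=True)

    # Example: top_n(range(1_000_000), 100)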

Five, Double-layer bucket partitioning (essentially the idea of divide and conquer, with the emphasis on the skill of partitioning)

Scope of application: the k-th largest element, the median, non-repeating or repeating numbers

Basic principles and key points: because the range of the elements is too large for a direct addressing table, the range is narrowed down step by step through several rounds of partitioning until it reaches an acceptable size. It can be narrowed over multiple passes; two layers is just an example.

Extension:

Problem example:

1). Find the number of non-repeating integers among 250 million integers; there is not enough memory to hold all 250 million integers.

This is somewhat like the pigeonhole principle. The number of possible integers is 2^32, so we can divide the 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, it can be solved easily.

2). Find the median of 500 million ints.

This example is even more obvious than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From those counts we can determine which region contains the median, and also which rank within that region the median occupies. A second scan then counts only the numbers that fall in that region.

In fact, if the values are not ints but int64s, we can reduce the range to an acceptable size with three such passes: first divide the int64 range into 2^24 regions and determine which region holds the target number; divide that region into 2^20 subregions and determine which subregion holds it; the number of distinct values in that subregion is then only 2^20, so a direct addressing table can be used for counting.
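
A sketch of the two-pass approach for the 32-bit case, treating the values as unsigned ints; read_ints is a hypothetical callable that re-reads the data from disk on each call.

    def median_of_ints(read_ints, total_count):
        # Pass 1: count how many values fall into each of 2**16 regions of size 2**16.
        counts = [0] * (1 << 16)
        for x in read_ints():
            counts[x >> 16] += 1

        # Locate the region holding the median and the median's rank inside it.
        target = (total_count - 1) // 2      # index of the (lower) median in sorted order
        seen = 0
        for region, c in enumerate(counts):
            if seen + c > target:
                rank_in_region = target - seen
                break
            seen += c

        # Pass 2: count only the values that fall into that region.
        inner = [0] * (1 << 16)
        for x in read_ints():
            if x >> 16 == region:
                inner[x & 0xFFFF] += 1
        seen = 0
        for low, c in enumerate(inner):
            if seen + c > rank_in_region:
                return (region << 16) | low
            seen += c

    # Example: median_of_ints(lambda: iter([5, 1, 9, 3, 7]), 5) returns 5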

Six, Database index

Scope of application: insertion, deletion, and modification of large amounts of data

Basic principles and key points: use proper data design and implementation methods to handle insertion, deletion, modification, and querying of massive data.

Seven, Inverted index

Scope of application: search engines, keyword queries

Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a set of documents.

Taking English as an example, here is the text to be indexed: T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana".

We then obtain the following inverted file index:

"a": {2} "banana": {2} "is ": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1}

A search for the terms "what", "is" and "it" corresponds to the intersection of those sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.

A forward index stores a list of words for each document. Forward-index queries suit workloads such as full-text queries over each document in order and validating each word in a checked document. In a forward index the documents occupy the central position: each document points to the sequence of index entries (words) it contains. An inverted index reverses this relationship: each word points to the documents that contain it.
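
A minimal sketch that builds the inverted index for the three example documents above and answers the conjunctive query by intersecting posting sets:

    def build_inverted_index(docs):
        # word -> set of document ids containing the word
        index = {}
        for doc_id, text in enumerate(docs):
            for word in text.split():
                index.setdefault(word, set()).add(doc_id)
        return index

    def search(index, terms):
        # Conjunctive query: intersect the posting sets of all terms.
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    docs = ["it is what it is", "what is it", "it is a banana"]
    index = build_inverted_index(docs)
    print(search(index, ["what", "is", "it"]))   # {0, 1}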

Extension:

Problem example: a document retrieval system, querying for the documents that contain a given word, for example a common keyword search over academic papers.

Eight, External sorting

Scope of application: sorting of large data, de-duplication

Basic principles and key points: the merge method for external sorting, replacement selection with the loser tree, and the optimal merge tree

Extension:

Problem example:

1). There is a file of 1 GB in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 words with the highest frequency.

This data has an obvious characteristic: the words are at most 16 bytes long, but 1 MB of memory is too little to hold a hash table of them, so sorting can be used instead, with the memory serving as the input buffer.
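
A compact sketch of this approach: sort chunks that fit in memory, spill them to temporary run files, k-way merge the runs with heapq.merge, and keep a 100-element min-heap over the merged, grouped stream. The chunk size and file handling here are illustrative.

    import heapq, os, tempfile
    from itertools import groupby

    def top_words(path, n=100, chunk_lines=50_000):
        # Phase 1: sort chunks that fit in memory and spill them to temporary run files.
        runs = []
        with open(path) as f:
            while True:
                chunk = [line.strip() for _, line in zip(range(chunk_lines), f)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.NamedTemporaryFile("w+", delete=False)
                run.write("\n".join(chunk) + "\n")
                run.seek(0)
                runs.append(run)

        # Phase 2: k-way merge of the sorted runs; equal words are now adjacent,
        # so one pass with a small min-heap finds the n most frequent words.
        heap = []
        for word, group in groupby(heapq.merge(*runs)):
            count = sum(1 for _ in group)
            if len(heap) < n:
                heapq.heappush(heap, (count, word.strip()))
            elif count > heap[0][0]:
                heapq.heapreplace(heap, (count, word.strip()))

        for run in runs:
            run.close()
            os.unlink(run.name)
        return sorted(heap, reverse=True)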

Nine, Trie tree

Scope of application: large amounts of data with many repetitions, but a small number of distinct items, so they can fit in memory

Fundamentals and key points: the implementation, in particular how to represent the children of each node.

Extension: compressed implementations.

Problem example:

1). There are 10 files of 1 GB each; each line of each file holds a user query, and queries may be repeated across the files. Sort the queries by frequency.

2). There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep the strings that have no duplicates. How would you design and implement this?

3). Finding popular queries: the query strings have a fairly high degree of repetition; the total number is 10 million, but after removing duplicates no more than 3 million remain, each no longer than 255 bytes.
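
A minimal counting trie sketch (fitting, say, example 3 with at most 3 million distinct queries); children are stored in per-node dicts and the class names are illustrative.

    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}   # character -> TrieNode
            self.count = 0       # times this exact string was inserted

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, s):
            node = self.root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1

        def items(self, node=None, prefix=""):
            # Yield (string, count) pairs for every stored string.
            node = node or self.root
            if node.count:
                yield prefix, node.count
            for ch, child in node.children.items():
                yield from self.items(child, prefix + ch)

    trie = Trie()
    for q in ["foo", "bar", "foo", "foo", "baz"]:
        trie.insert(q)
    print(sorted(trie.items(), key=lambda kv: -kv[1]))   # [('foo', 3), ('bar', 1), ('baz', 1)]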

Ten, Distributed processing (MapReduce)

Scope of application: a large amount of data, but the variety of data is small enough that it can fit into memory

Basic principles and key points: hand the data over to different machines to process; partition the data, then reduce (merge) the partial results.

Extension:

Problem example:

1). The canonical example application of MapReduce is a process that counts the appearances of each different word in a set of documents (see the sketch after these examples).

2). A massive amount of data is distributed across 100 computers; find a way to efficiently compute the TOP 10 of this data.

3). There are N machines in total, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?
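
A single-machine simulation of the word-count example from problem 1: map emits (word, 1) pairs, the shuffle step groups them by word, and reduce sums each group. The function names are illustrative.

    from collections import defaultdict

    def map_phase(doc):
        # map: emit a (word, 1) pair for every word in the document
        return [(word, 1) for word in doc.split()]

    def reduce_phase(word, counts):
        # reduce: sum the partial counts for one word
        return word, sum(counts)

    def mapreduce_wordcount(docs):
        # shuffle: group the intermediate pairs by key (word)
        grouped = defaultdict(list)
        for doc in docs:
            for word, count in map_phase(doc):
                grouped[word].append(count)
        return dict(reduce_phase(w, c) for w, c in grouped.items())

    docs = ["it is what it is", "what is it", "it is a banana"]
    print(mapreduce_wordcount(docs))   # {'it': 4, 'is': 4, 'what': 2, 'a': 1, 'banana': 1}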