Scope of application: can be used to implement a data dictionary, to detect duplicates in data, or to compute set intersections
Basic principle and key points:
The principle is very simple: a bit array plus k independent hash functions. To insert an element, set the bits corresponding to its hash values to 1; to look it up, check whether all the bits corresponding to the hash functions are 1, and if so the element is considered present. Clearly this process does not guarantee that the result of a lookup is 100% correct. It also does not support deleting a keyword that has already been inserted, because the bits of that keyword may be shared with other keywords. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thus supports deletion.
Another important question is how to determine the size of the bit array m and the number of hash functions k from the number of input elements n. The error rate is minimized when k = (ln 2) * (m/n). For an error rate no greater than E, m must be at least n * log2(1/E) to represent any set of n elements, but m should be larger still, because at least half of the bit array should remain 0. Roughly, m should be >= n * log2(1/E) * log2(e), which is about 1.44 * n * log2(1/E) (log2 is the logarithm base 2).
As an example, assume an error rate E = 0.01. Then m should be roughly 1.44 * log2(100), or about 9.6 times n (round up to roughly 10n), and k = (ln 2) * (m/n) is roughly 7.
Note that m and n have different units here: m is measured in bits, while n is measured in number of elements (number of distinct elements, to be precise). Since a single element is usually many bits long, using a Bloom filter generally saves memory.
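As a minimal sketch of these ideas (not code from any particular library), the following Python class sizes m and k from n and E using the formulas above; deriving the k indices from two halves of an MD5 digest is just one common choice, and the optional num_bits parameter is an added assumption so that a fixed memory budget can be imposed:

import math
import hashlib

class BloomFilter:
    def __init__(self, n, error_rate=0.01, num_bits=None):
        # m >= 1.44 * n * log2(1/E) bits; k = (ln 2) * (m / n)
        self.m = num_bits or int(math.ceil(1.44 * n * math.log2(1 / error_rate)))
        self.k = max(1, round(math.log(2) * self.m / n))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # derive k bit positions from two halves of an MD5 digest (double hashing)
        digest = hashlib.md5(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False is always correct
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))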
Extensions:
Bloom filters map the elements of a set onto a bit array, using k bits (k being the number of hash functions) to indicate whether an element is in the set. The Counting Bloom Filter (CBF) extends each bit of the array to a counter and thus supports element deletion. The Spectral Bloom Filter (SBF) associates each set element with its number of occurrences; SBF uses the minimum value among an element's counters to approximate its frequency.
Problem example: you are given two files A and B, each storing 5 billion URLs, where each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to A and B. What if there are three or even n files?
Based on this question, let's work out the memory budget. 4 GB = 2^32 bytes, about 4 billion bytes, times 8 is about 34 billion bits. With n = 5 billion and an error rate of 0.01, we would need roughly 48 billion bits by the formula above. Only about 34 billion are available, which is not too far off but will push the error rate up somewhat. Also, if these URLs correspond one-to-one with IP addresses, they can be converted to IPs, which makes the problem considerably simpler.
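A hedged sketch of how this might be applied, assuming the BloomFilter class above and illustrative file names a.txt and b.txt:

# Build a filter from file A, then stream file B through it. URLs reported here
# are only *probably* common; capping the bit budget at ~34 billion bits (the
# 4 GB limit) pushes the error rate above 0.01, as discussed above.
bf = BloomFilter(n=5_000_000_000, num_bits=34_000_000_000)
with open("a.txt") as fa:
    for url in fa:
        bf.add(url.strip())
with open("b.txt") as fb:
    probably_common = [url.strip() for url in fb if url.strip() in bf]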
2. Hashing
Scope of application: a basic data structure for fast lookup and deletion; usually requires that the total amount of data fit into memory
Basic principles and key points:
Choice of hash function: strings, integers, and permutations each have specific hash methods suited to them.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.
Expansion:
The d in d-left hashing stands for multiple; to simplify, consider 2-left hashing. 2-left hashing splits a hash table into two halves of equal length, called T1 and T2, and equips T1 and T2 with separate hash functions, h1 and h2. When storing a new key, both hash functions are computed, giving two candidate locations, h1[key] in T1 and h2[key] in T2. We then check which of the two locations already stores more keys (has more collisions) and place the new key in the less loaded location. If both sides are equally loaded, for example both locations are empty or both store one key, the new key is stored in the left subtable T1, which is where the "left" in 2-left comes from. When looking up a key, both hashes must be computed and both locations checked.
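A minimal sketch of 2-left hashing with chained buckets; the table size and the tuple-tagging trick used to obtain two hash functions are illustrative assumptions:

class TwoLeftHash:
    def __init__(self, buckets_per_table=1024):
        self.t1 = [[] for _ in range(buckets_per_table)]
        self.t2 = [[] for _ in range(buckets_per_table)]

    def _buckets(self, key):
        n = len(self.t1)
        b1 = self.t1[hash(("h1", key)) % n]   # location h1[key] in T1
        b2 = self.t2[hash(("h2", key)) % n]   # location h2[key] in T2
        return b1, b2

    def insert(self, key):
        b1, b2 = self._buckets(key)
        # store in the less loaded location; ties go to the left subtable T1
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def __contains__(self, key):
        b1, b2 = self._buckets(key)           # lookup must check both locations
        return key in b1 or key in b2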
Problem example: 1). From massive log data, extract the IP that visited Baidu the most times on a given day.
The number of distinct IPs is limited, at most 2^32, so you can consider using a hash table to put the IPs directly into memory and then do the counting.
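A hedged sketch of that idea, assuming (purely for illustration) a log file named access.log whose lines begin with the client IP:

from collections import Counter

counts = Counter()
with open("access.log") as log:
    for line in log:
        ip = line.split()[0]        # assumed log format: IP is the first field
        counts[ip] += 1
top_ip, top_hits = counts.most_common(1)[0]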
3. bit-map
Scope of application: fast lookup, duplicate detection, and deletion of data; generally the data range is within about 10 times the range of an int
Basic principle and key points: use a bit array to record whether certain elements are present, for example 8-digit phone numbers
Extension: a Bloom filter can be regarded as an extension of the bit-map
Problem example:
1) A file is known to contain a number of phone numbers, each with 8 digits; count the number of distinct numbers.
8 digits go up to 99,999,999, so roughly 100 million bits are needed, which is about 12 MB of memory.
2) Find the number of non-repeating integers among 250 million integers; there is not enough memory to hold all 250 million integers.
Extend the bit-map a little: represent each number with 2 bits, where 0 means it did not occur, 1 means it occurred once, and 2 means it occurred two or more times. Alternatively, instead of 2 bits per number, we can use two separate bit-maps to simulate this 2-bit-map.
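A minimal sketch of the 2-bit-map; the small value range is illustrative (covering the full 2^32 integer range at 2 bits per value would take 1 GB):

class TwoBitMap:
    def __init__(self, value_range):
        self.bits = bytearray((2 * value_range + 7) // 8)

    def _get(self, v):
        byte, shift = (2 * v) // 8, (2 * v) % 8
        return (self.bits[byte] >> shift) & 0b11

    def _set(self, v, state):
        byte, shift = (2 * v) // 8, (2 * v) % 8
        self.bits[byte] = (self.bits[byte] & ~(0b11 << shift) & 0xFF) | (state << shift)

    def add(self, v):
        state = self._get(v)
        if state < 2:                 # states: 0 = unseen, 1 = seen once, 2 = seen twice or more
            self._set(v, state + 1)

    def count_unique(self):
        # numbers that occurred exactly once
        return sum(1 for v in range(len(self.bits) * 4) if self._get(v) == 1)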
4. heap
Scope of application: finding the top n elements of massive data, where n is relatively small so that the heap fits into memory
Basic principles and key points: use a max-heap to find the n smallest elements and a min-heap to find the n largest. For example, to find the n smallest: compare the current element with the largest element in the max-heap, and if it is smaller, replace that largest element. The n elements left at the end are the n smallest. This suits large amounts of data with a relatively small n, because a single scan over the data yields all of the top n elements, which is very efficient.
Extension: a double heap, a max-heap combined with a min-heap, can be used to maintain the median.
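A minimal sketch of the double-heap median; Python's heapq only provides min-heaps, so the lower half is kept as a max-heap of negated values:

import heapq

class RunningMedian:
    def __init__(self):
        self.lo, self.hi = [], []     # lo: max-heap of lower half (negated), hi: min-heap of upper half

    def add(self, x):
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))   # keep every lo element <= every hi element
        if len(self.hi) > len(self.lo):                    # keep lo the same size as hi, or one larger
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2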
Problem example:
1) Find the largest 100 numbers out of 1 million numbers.
Using a min-heap of 100 elements is sufficient.
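A sketch of that approach with stand-in random data; heapreplace keeps the heap at exactly 100 elements once it has filled up:

import heapq
import random

numbers = (random.randrange(10**9) for _ in range(1_000_000))   # stand-in for the 1 million inputs
top100 = []
for x in numbers:
    if len(top100) < 100:
        heapq.heappush(top100, x)
    elif x > top100[0]:               # x beats the smallest of the current top 100
        heapq.heapreplace(top100, x)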
5. Double layer bucket division
Scope of application: finding the k-th largest element, the median, or non-repeating/repeating numbers
Basic principle and key points: because the range of the elements is very large, a direct addressing table cannot be used, so we partition the range several times, gradually narrowing it down until it reaches an acceptable size. The range can be narrowed multiple times; the double layer is just an example.
Extension:
Problem example:
1). Find the number of non-repeating integers among 250 million integers; there is not enough memory to hold all 250 million integers.
This is somewhat like the pigeonhole principle. There are 2^32 possible integers, so we can divide the 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easily solved.
2). Find the median of 500 million ints.
This example is more obvious than the one above. First we divide the int range into 2^16 regions and read through the data, counting how many numbers fall into each region. From these counts we can determine which region the median falls into, and also which rank within that region the median occupies. A second scan then counts only the numbers that fall in that region.
In fact, if the numbers are int64 rather than int, we can apply the same idea three times to get down to an acceptable level: first divide the int64 range into 2^24 regions and determine which region the target rank falls into, then divide that region into 2^20 sub-regions and determine which sub-region the rank falls into. The number of values in that sub-region is then only about 2^20, so a direct addressing table can be used for the final count.
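A minimal sketch of the two-pass version for 32-bit unsigned values, bucketing by the high 16 bits; read_values is a hypothetical generator factory over the data file, passed in so the data can be scanned twice:

def median_two_pass(read_values, total_count):
    # pass 1: count how many values fall into each of the 2^16 high-bit regions
    counts = [0] * (1 << 16)
    for v in read_values():
        counts[v >> 16] += 1

    target = total_count // 2          # 0-based rank; for an even count this is the upper middle value
    seen = 0
    for bucket, c in enumerate(counts):
        if seen + c > target:
            break
        seen += c
    rank_in_bucket = target - seen

    # pass 2: count each value inside the chosen region only
    fine = [0] * (1 << 16)
    for v in read_values():
        if v >> 16 == bucket:
            fine[v & 0xFFFF] += 1
    for low, c in enumerate(fine):
        if rank_in_bucket < c:
            return (bucket << 16) | low
        rank_in_bucket -= c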
6. Database index
Scope of application: insertion, deletion, update, and query of large amounts of data
Fundamentals and key points: use the design and implementation techniques of database indexes to handle insertions, deletions, updates, and queries on massive data.
Expansion:
Examples of problems:
7. Inverted index
Scope of application: search engines, keyword queries
Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store a mapping from a word to its locations in a document or group of documents.
Taking the English language as an example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We then get the following inverted file index:
"a": {2}
"banana ": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what ": {0, 1}
A query for "what", "is" and "it" then corresponds to the intersection of these sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
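A minimal sketch that builds this index for the three texts above and answers an AND query by set intersection:

from collections import defaultdict

docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)       # a word points to the documents containing it

def query(*words):
    postings = [index[w] for w in words]
    return set.intersection(*postings) if postings else set()

print(query("what", "is", "it"))      # {0, 1}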
A forward index, by contrast, stores a list of words for each document. Forward-index queries suit access patterns such as reading the full text of each document in order, or checking which words a given document contains. In a forward index the document is central: each document points to the sequence of index entries (words) it contains. In an inverted index, a word points to the documents that contain it, and it is easy to see the inverted relationship.
Extension:
Problem example: a document retrieval system that returns the documents containing a given word, such as a keyword search over academic papers.
8. External sorting
Scope of application: sorting and de-duplication of big data
Basic principles and key points: the merge method of external sorting, the principle of replacement selection with a loser tree, and the optimal merge tree
Extension:
Problem examples:
1). There is a 1 GB file in which each line is a word; no word is longer than 16 bytes and the memory limit is 1 MB. Return the 100 words with the highest frequency.
This data has an obvious characteristic: words are at most 16 bytes, but with only 1 MB of memory a hash table is somewhat insufficient, so external sorting can be used instead, with the memory serving as the input buffer.
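A rough sketch of that pipeline: sort the file into memory-sized runs, k-way merge the runs with heapq.merge, and count equal neighbours while a min-heap keeps the current top 100. File names and the chunk size are illustrative, and a real solution under a strict 1 MB limit would merge the runs in several passes:

import heapq
from itertools import groupby

CHUNK_LINES = 20_000                  # roughly what ~1 MB of 16-byte words allows

def write_run(words, run_no):
    name = f"run_{run_no}.txt"
    with open(name, "w") as f:
        f.write("\n".join(words) + "\n")
    return name

def sort_runs(path):
    runs, chunk = [], []
    with open(path) as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) >= CHUNK_LINES:
                runs.append(write_run(sorted(chunk), len(runs)))
                chunk = []
    if chunk:
        runs.append(write_run(sorted(chunk), len(runs)))
    return runs

def top100(path):
    run_files = [open(r) for r in sort_runs(path)]
    merged = (line.strip() for line in heapq.merge(*run_files))
    heap = []                                     # min-heap of (count, word), size <= 100
    for word, group in groupby(merged):
        item = (sum(1 for _ in group), word)
        if len(heap) < 100:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)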
9. trie tree
Scope of application: large amounts of data with many repetitions, where the number of distinct items is small enough to fit into memory
Basic principles and key points: how to implement the trie, and how to represent each node's children.
Extension: compressed implementations.
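A minimal sketch of a counting trie, with each node's children stored in a dict (one of several possible child representations):

class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0                # how many times this exact string was inserted

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def count(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count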
Problem example:
1). There are 10 files of 1 GB each; each line of each file holds a user query, and queries may repeat across files. You are asked to sort the queries by frequency.
2). There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep only the distinct strings. How would you design and implement this?
3). Finding popular queries: the query strings repeat heavily; the total number is 10 million, but after removing duplicates there are no more than 3 million, each no longer than 255 bytes.
10. Distributed processing (MapReduce)
Scope of application: a large amount of data where the number of distinct items is small enough to fit into memory
Basic principles and key points: hand the data to different machines to process; partition the data, then reduce (combine) the results.
Extension:
Problem example:
1). The canonical example application of MapReduce is a process that counts the appearances of each different word in a set of documents:
void map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(result);
Here, each document is split into words, and each word is counted with an initial value of "1" by the Map function, using the word as the result key. The framework groups all the pairs with the same key and feeds them to the same call of Reduce, so this function just needs to sum all of its input values to find the total number of appearances of that word.
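For concreteness, here is a toy single-process simulation of the same word count; it is only an illustration of the map / shuffle / reduce flow, with no framework involved:

from collections import defaultdict

docs = {"T0": "it is what it is", "T1": "what is it", "T2": "it is a banana"}

# map phase: emit (word, 1) pairs
intermediate = []
for name, document in docs.items():
    for w in document.split():
        intermediate.append((w, 1))

# shuffle phase: group all values by key
groups = defaultdict(list)
for w, c in intermediate:
    groups[w].append(c)

# reduce phase: sum the partial counts for each word
counts = {w: sum(partial) for w, partial in groups.items()}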
2). A huge amount of data is distributed among 100 computers; find a way to efficiently compute the TOP 10 of this data.
3). There are N machines, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do you find the median of the N^2 numbers?
Classical problem analysis
Tens of millions or billions of data items (with duplicates); count the top N most frequent ones. Two cases: the data can be read into memory at once, or it cannot.
Available ideas: trie tree + heap, database indexing, partitioning into subsets for counting, hashing, distributed computing, approximate statistics, external sorting.
Whether the data "can be read into memory at once" should really refer to the amount of data after removing duplicates. If the de-duplicated data fits into memory, we can build a dictionary for it, for example with a map, hashmap, or trie, and then simply count. While updating the count of each item we can use a heap to maintain the top N most frequent items, but this increases the cost of each update and is not necessarily more efficient than counting everything first and then extracting the top N.
If the data does not fit into memory, we can, on the one hand, consider whether the dictionary method above can be adapted to this situation: the change would be to store the dictionary on disk instead of in memory, following the storage methods of a database.
Of course, a better approach is to use distributed computing, essentially the MapReduce process. First, partition the data into ranges according to the data values or the hashes of the data values (e.g., MD5), assigning different ranges to different machines; ideally each partition should be small enough to be read into memory at once, so that each machine is responsible for one range of values. This is in effect the map step. After obtaining the results, each machine only needs to extract its local top N most frequent items, and then all the local results are combined to select the global top N. This is in effect the reduce step.
In fact, it may be tempting to simply split the data evenly across the machines, but that does not lead to a correct solution, because one value may be spread evenly over many machines while another is concentrated on a single machine. For example, suppose we want the 100 most frequent values: we spread 10 million items over 10 machines, find the top 100 on each machine, and merge. This does not guarantee finding the true 100th value. A value with 10,000 occurrences might be split across the 10 machines with only 1,000 occurrences on each, while on each machine other values that happen to be concentrated there have, say, 1,001 occurrences; the value with 10,000 total occurrences is then eliminated from every local top 100. Even if we let each machine report its top 1,000 and then merge, errors are still possible, because there may be many values whose local counts cluster around 1,001. Therefore the data must not be split evenly at random; instead it must be mapped to machines by hash value, so that each machine handles a whole range of values.
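A minimal single-process sketch of the hash-partition idea, with buckets standing in for machines; the number of partitions is an arbitrary illustration:

import heapq
from collections import Counter

def top_n(items, n=10, partitions=16):
    # every occurrence of a given value hashes to the same bucket,
    # so each value's full count lives on exactly one "machine"
    buckets = [[] for _ in range(partitions)]
    for x in items:
        buckets[hash(x) % partitions].append(x)

    candidates = []
    for bucket in buckets:
        candidates.extend(Counter(bucket).most_common(n))           # local top n per machine
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])      # merge: global top n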
The external-sort approach consumes a lot of IO and is not very efficient. The distributed approach above can also be applied on a single machine: split the total data into several sub-files by value range, process them one by one, and finally merge the words and their frequencies. In effect this can reuse the merge step of an external sort.
We can also consider approximate computation: by exploiting properties of natural language and keeping only the words that are genuinely most frequent in practice as the dictionary, we can make this scale fit into memory.