1. Given two files, a and b, each storing 5 billion urls, with each url taking up 64 bytes, and a memory limit of 4GB, find the urls that files a and b have in common.
Scheme 1: It can be estimated that the size of each file is 5G × 64 = 320GB, which is far larger than the 4GB memory limit, so it is impossible to load the files entirely into memory for processing. Consider a divide-and-conquer approach.
- Traverse file a, compute hash(url)%1000 for each url, and store the url into one of 1000 small files (denoted a0, a1, ..., a999) according to that value. Each small file is then about 300MB.
- Traverse file b and store its urls into 1000 small files (denoted b0, b1, ..., b999) in the same way as for a. After this process, all possibly identical urls are in corresponding small files (ai and bi); small files that do not correspond cannot contain the same url. We then only need to find the identical urls in each of the 1000 pairs of small files.
- To find the identical urls in a pair of small files, store the urls of one of the small files in a hash_set, then iterate over each url of the other small file and check whether it is in the hash_set just built; if it is, it is a common url, and we write it out to a file.
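As a concrete illustration of Scheme 1, here is a minimal single-machine sketch in C++, assuming one url per line; the input names a and b, the output file names, the bucket count of 1000, and the use of std::hash for the split are all illustrative choices (a real run would also have to respect the OS limit on simultaneously open files).

```cpp
// Minimal sketch of Scheme 1: split both inputs by hash(url) % 1000, then
// intersect each pair of buckets with a hash set.
#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

const int kBuckets = 1000;

// Distribute the urls of `path` into kBuckets files named <prefix>0..<prefix>999.
void split(const std::string& path, const std::string& prefix) {
    std::ifstream in(path);
    std::vector<std::ofstream> out(kBuckets);
    for (int i = 0; i < kBuckets; ++i)
        out[i].open(prefix + std::to_string(i));
    std::string url;
    while (std::getline(in, url))
        out[std::hash<std::string>{}(url) % kBuckets] << url << '\n';
}

int main() {
    split("a", "a_part");
    split("b", "b_part");
    std::ofstream result("common_urls");
    for (int i = 0; i < kBuckets; ++i) {
        // Load one small file into a hash set ...
        std::unordered_set<std::string> seen;
        std::ifstream fa("a_part" + std::to_string(i));
        std::string url;
        while (std::getline(fa, url)) seen.insert(url);
        // ... and probe it with each url of the corresponding small file.
        std::ifstream fb("b_part" + std::to_string(i));
        while (std::getline(fb, url))
            if (seen.count(url)) result << url << '\n';
    }
}
```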
Option 2: If a certain error rate is allowed, a Bloom filter can be used; 4GB of memory can represent roughly 34 billion bits. Map the urls of one file onto these 34 billion bits with the Bloom filter, then read the urls of the other file one by one and test each against the Bloom filter; if a url hits, it should be a common url (note that there will be a certain false-positive rate).
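Below is a minimal Bloom filter sketch for Option 2. The bit-array size, the four probes derived from std::hash with different salts, and the example urls are illustrative; a real deployment would size the filter and choose the number of hash functions from the expected number of urls and the acceptable false-positive rate.

```cpp
// A tiny Bloom filter: add() sets k bit positions, possiblyContains() checks them.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(size_t bits) : bits_(bits), table_((bits + 7) / 8, 0) {}

    void add(const std::string& s) {
        for (size_t i = 0; i < kHashes; ++i) set(position(s, i));
    }
    // May return a false positive, never a false negative.
    bool possiblyContains(const std::string& s) const {
        for (size_t i = 0; i < kHashes; ++i)
            if (!get(position(s, i))) return false;
        return true;
    }

private:
    static const size_t kHashes = 4;
    // Derive kHashes different probes by salting the string before hashing.
    size_t position(const std::string& s, size_t seed) const {
        return std::hash<std::string>{}(s + char('A' + seed)) % bits_;
    }
    void set(size_t i) { table_[i / 8] |= uint8_t(1u << (i % 8)); }
    bool get(size_t i) const { return table_[i / 8] & (1u << (i % 8)); }

    size_t bits_;
    std::vector<uint8_t> table_;
};

int main() {
    BloomFilter bf(1 << 20);          // ~1M bits for the demo, not 34 billion
    bf.add("http://example.com/1");   // urls of file a go into the filter
    // urls of file b are then probed one by one:
    std::cout << bf.possiblyContains("http://example.com/1") << '\n';  // 1
    std::cout << bf.possiblyContains("http://example.com/2") << '\n';  // most likely 0
}
```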
2. There are 10 files of 1GB each; each line of each file stores a user query, and each file's queries may contain duplicates. You are asked to sort the queries by their frequency. Option 1:
- Read the 10 files in order and write each query to another set of 10 files (denoted a0, a1, ..., a9) according to hash(query)%10. Each of the newly generated files is also about 1GB (assuming the hash function is sufficiently random).
- Find a machine with about 2GB of RAM and use a hash_map(query, query_count) to count the number of occurrences of each query in each file in turn. Sort the queries by occurrence count using quick/heap/merge sort, then output the sorted queries and their query_count to a file. This yields 10 sorted files (denoted b0, b1, ..., b9). (A sketch of this counting and sorting step is given after this list.)
- Perform a merge sort (a combination of internal and external sorting) on these 10 files.
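A sketch of the counting and sorting step of Option 1, assuming one query per line in a small file produced by the hash split; the file names are illustrative, and std::sort stands in for the quick/heap/merge sort mentioned above.

```cpp
// Count each query of one split file with a hash_map, then sort by count.
#include <algorithm>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::unordered_map<std::string, long long> count;   // query -> query_count
    std::ifstream in("query_part0");
    std::string q;
    while (std::getline(in, q)) ++count[q];

    // Move to a vector and sort by descending count.
    std::vector<std::pair<std::string, long long>> v(count.begin(), count.end());
    std::sort(v.begin(), v.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    std::ofstream out("sorted_part0");
    for (const auto& [query, c] : v) out << query << '\t' << c << '\n';
}
```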
Scheme 2:
The total number of distinct queries is generally limited; it is just that the repetition count is high, so it may be possible to fit all the queries into memory at once. In that case, we can use a trie tree / hash_map etc. to count the number of occurrences of each query directly, and then just do a quick/heap/merge sort by occurrence count.
Option 3:
Similar to Option 1, but after hashing and splitting into multiple files, the files can be handed to multiple machines for processing with a distributed architecture (e.g., MapReduce), and the results merged at the end.
3. There is a 1GB file in which each line is a word; no word is larger than 16 bytes, and the memory limit is 1MB. Return the 100 most frequent words.
Scheme 1: Read the file sequentially; for each word x, compute hash(x)%5000 and store x into one of 5000 small files (denoted x0, x1, ..., x4999) according to that value. Each small file is then about 200KB. If some files exceed 1MB, continue splitting them in the same way until no small file is larger than 1MB. For each small file, count the words and their frequencies (a trie tree / hash_map etc. can be used), take out the 100 most frequent words (a min-heap of 100 nodes can be used), and write these 100 words and their frequencies to a file, which gives another 5000 files. The last step is to merge these 5000 files (a process similar to merge sort).
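A sketch of the per-small-file step of Scheme 1, assuming whitespace-separated words; the file name is illustrative, and a hash_map plus a 100-node min-heap stands in for the trie/hash_map variant described above.

```cpp
// Count word frequencies in one small file and keep the 100 most frequent.
#include <fstream>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    std::unordered_map<std::string, long long> freq;
    std::ifstream in("words_part0");          // one small file, name is illustrative
    std::string w;
    while (in >> w) ++freq[w];

    // Min-heap ordered by count: the heap top is always the weakest candidate.
    using Entry = std::pair<long long, std::string>;   // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& [word, c] : freq) {
        heap.emplace(c, word);
        if (heap.size() > 100) heap.pop();    // evict the least frequent
    }
    while (!heap.empty()) {                   // 100 candidates for the final merge
        std::cout << heap.top().second << '\t' << heap.top().first << '\n';
        heap.pop();
    }
}
```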
4. Massive log data: extract the IP that visited Baidu the most times on a particular day.
Scheme 1: First, take the IPs that accessed Baidu out of that day's logs and write them one by one into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can likewise use the mapping method, e.g., taking the value modulo 1000, to map the large file into 1000 small files, then find the most frequent IP in each small file (a hash_map can be used for the frequency counting, then take the few IPs with the highest frequency) together with its frequency. Finally, among these 1000 candidate IPs, find the one with the highest frequency; that is the answer.
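A sketch of the per-small-file counting step of Scheme 1, assuming one dotted-quad IP per line; the file name is illustrative, and the POSIX inet_addr/inet_ntoa helpers are used only as a convenient way to pack an IP into a 32-bit key.

```cpp
// Count 32-bit IPs with a hash_map keyed by the packed uint32_t, keep the max.
#include <arpa/inet.h>   // inet_addr / inet_ntoa; assumes a POSIX platform
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<uint32_t, long long> freq;    // at most 2^32 distinct keys
    std::ifstream in("ip_part0");                    // illustrative file name
    std::string line;
    while (std::getline(in, line)) ++freq[inet_addr(line.c_str())];

    uint32_t best = 0;
    long long bestCount = 0;
    for (const auto& [ip, c] : freq)
        if (c > bestCount) { best = ip; bestCount = c; }

    in_addr a;
    a.s_addr = best;
    std::cout << inet_ntoa(a) << " appeared " << bestCount << " times\n";
}
```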
5. Find the non-repeating integers among 250 million integers; there is not enough memory to hold all 250 million integers.
Option 1: Use a 2-Bitmap (each number is assigned 2 bits: 00 means not present, 01 means it appears once, 10 means it appears many times, 11 is meaningless). This needs 2^32 × 2 bits = 1GB of memory, which is still acceptable. Then scan the 250 million integers and update the corresponding bits in the Bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan is done, walk the Bitmap and output the integers whose corresponding bits are 01.
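A sketch of the 2-Bitmap bookkeeping. The 1GB table matches the 2^32 × 2 bit estimate above, while a handful of sample integers stands in for the 250 million inputs.

```cpp
// 2 bits per 32-bit value: 00 absent, 01 seen once, 10 seen more than once.
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // 2^32 values * 2 bits = 2^30 bytes = 1 GB (a real run needs that much free memory).
    std::vector<uint8_t> bitmap(1ull << 30, 0);

    auto getBits = [&](uint32_t x) -> int { return (bitmap[x >> 2] >> ((x & 3) * 2)) & 3; };
    auto setBits = [&](uint32_t x, int v) {
        int shift = (x & 3) * 2;
        bitmap[x >> 2] = uint8_t((bitmap[x >> 2] & ~(3 << shift)) | (v << shift));
    };

    uint32_t sample[] = {5, 7, 5, 9, 7, 5};           // stands in for the 250M integers
    for (uint32_t x : sample) {
        int state = getBits(x);
        if (state == 0) setBits(x, 1);                // 00 -> 01 (seen once)
        else if (state == 1) setBits(x, 2);           // 01 -> 10 (seen many times); 10 stays
    }
    for (uint32_t x : {5u, 7u, 9u})                   // a real pass walks all 2^32 values;
        if (getBits(x) == 1) std::cout << x << '\n';  // here we only check the sample keys
}
```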
Scheme 2: The small-file division method of the previous questions can also be used. Find the non-repeating integers within each small file and sort them, then merge the results, taking care to remove duplicate elements.
6. A massive amount of data is distributed across 100 computers; find a way to efficiently count the TOP10 of this data.
Option 1:
- To find the TOP10 on each computer, a heap of 10 elements can be used (a max-heap for the 10 smallest, a min-heap for the 10 largest). For example, to find the 10 largest, first take the first 10 elements and adjust them into a min-heap; then scan the remaining data and compare each element with the heap top: if it is larger than the heap top, replace the top with it and re-adjust the min-heap. The elements left in the heap at the end are the TOP10. (A sketch is given after this list.)
- After finding the TOP10 on each computer, combine the TOP10 lists of the 100 computers, 1000 data items in total, and apply the same method as above to find the overall TOP10.
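A sketch of the heap mechanics described above, with synthetic values standing in for the 100 machines' data; the same top10 routine is applied once per machine and then again to the merged 1000 candidates.

```cpp
// Keep a 10-element min-heap while scanning; its top is the weakest of the
// current TOP10, so any larger element replaces it.
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

std::vector<long long> top10(const std::vector<long long>& data) {
    std::priority_queue<long long, std::vector<long long>, std::greater<long long>> heap;
    for (long long x : data) {
        if (heap.size() < 10) heap.push(x);                       // fill with the first 10
        else if (x > heap.top()) { heap.pop(); heap.push(x); }    // replace the heap top
    }
    std::vector<long long> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}

int main() {
    std::vector<long long> merged;                    // up to 100 * 10 = 1000 candidates
    for (int machine = 0; machine < 100; ++machine) {
        std::vector<long long> local;                 // stand-in for one machine's data
        for (int i = 0; i < 1000; ++i) local.push_back((machine * 1000 + i) % 7919);
        for (long long x : top10(local)) merged.push_back(x);
    }
    for (long long x : top10(merged)) std::cout << x << ' ';   // global TOP10
    std::cout << '\n';
}
```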
7. How do you find the item with the most repetitions in a huge amount of data?
Scheme 1: First hash the data, then map it into small files by taking the hash modulo; find the item with the most repetitions in each small file and record its count. Then, among the results of the previous step, the item with the highest repetition count is the answer (refer to the previous question for details).
8. Given tens or hundreds of millions of data items (with duplicates), count the top N items with the most occurrences.
Option 1: With tens or hundreds of millions of items, the memory of today's machines should be able to hold them. So consider using a hash_map / binary search tree / red-black tree etc. to count the occurrences. Then take out the N items with the most occurrences, which can be done with the heap mechanism mentioned in question 6.
9. There are 10 million strings, some of which are duplicates. You need to remove all the duplicates and keep the strings that have no duplicates. How would you design and implement this?
Option 1: A trie tree is more appropriate for this question, and hash_map should also work.
10. A text file of about 10,000 lines, one word per line: count the top 10 most frequent words in it. Give the idea and a time-complexity analysis.
Option 1: This question is about time efficiency. Count the number of occurrences of each word with a trie tree, which takes O(n*le) time (le denotes the average length of a word). Then find the 10 most frequent words, which can be done with a heap as discussed in the previous question, in O(n*lg10) time. So the total time complexity is the larger of O(n*le) and O(n*lg10).
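A minimal trie sketch for the counting step, assuming lowercase ASCII words so that a fixed 26-way child array suffices; inserting a word of length le costs O(le), so counting n words costs O(n*le) as stated above.

```cpp
// A small counting trie: each node stores how many words end at it.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    long long count = 0;                       // occurrences of the word ending here
    std::unique_ptr<TrieNode> child[26];
};

void insert(TrieNode* root, const std::string& word) {
    TrieNode* node = root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->count;                             // one more occurrence of this word
}

int main() {
    TrieNode root;
    std::vector<std::string> words = {"the", "quick", "the", "fox", "the"};
    for (const std::string& w : words) insert(&root, w);

    // A traversal of the trie (plus the 10-element heap from the previous
    // question) would then yield the top 10 words; here we just check one count.
    TrieNode* node = &root;
    for (char c : std::string("the")) node = node->child[c - 'a'].get();
    std::cout << "the: " << node->count << '\n';   // prints 3
}
```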
11. A text file: find the 10 most frequently occurring words, but this time the file is very long, say hundreds of millions or billions of lines; in short, it cannot be read into memory at once. What is the optimal solution?
Scheme 1: First, use hashing and taking the modulus to break the file into multiple small files. For each single file, use the method of the previous question to find the 10 most frequent words in that piece. Then merge those results to find the final 10 most frequently occurring words.
12. Find the largest 100 numbers out of 100w (1,000,000) numbers.
Option 1: As mentioned in the previous questions, this can be done with a min-heap of 100 elements. The complexity is O(100w*lg100).
Option 2: Use the partitioning idea of quicksort: after each split, only consider the part larger than the pivot; once the part larger than the pivot contains just over 100 elements, sort it with a traditional sorting algorithm and take the first 100. The complexity is O(100w*100).
Option 3: Use partial elimination. Select the first 100 elements and sort them; call this sequence L. Then scan the remaining elements one by one, comparing each element x with the smallest of the sorted 100 elements; if x is larger than this smallest element, delete the smallest element and insert x into sequence L using the idea of insertion sort. Continue until all elements have been scanned. The complexity is O(100w*100).
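A sketch of Option 2's partitioning idea using std::nth_element, which performs the same quicksort-style splitting internally; the synthetic input values are illustrative.

```cpp
// Partition 1,000,000 numbers so the 100 largest end up at the front, then
// sort only those 100.
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> a(1000000);
    for (size_t i = 0; i < a.size(); ++i) a[i] = int((i * 2654435761u) % 100000007);

    // After this call the 100 largest elements occupy a[0..99] (unsorted).
    std::nth_element(a.begin(), a.begin() + 100, a.end(), std::greater<int>());
    std::sort(a.begin(), a.begin() + 100, std::greater<int>());   // sort just the 100

    for (int i = 0; i < 5; ++i) std::cout << a[i] << ' ';         // a peek at the result
    std::cout << '\n';
}
```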
13. Finding popular queries:
A search engine records, via log files, all the search strings a user enters for each search; each query string is 1-255 bytes long. Suppose there are currently 10 million records. These query strings have a relatively high degree of repetition: although the total is 10 million, after removing duplicates there are no more than 3 million distinct strings. The higher the repetition of a query string, the more users query it and the more popular it is. Count the 10 most popular query strings, using no more than 1GB of memory.
(1) Describe your idea for solving this problem;
(2) Give the main processing flow, the algorithm, and the complexity of the algorithm.
Scheme 1: Use a trie tree in which the keyword field stores the number of occurrences of the query string (0 if it has not appeared). Finally, sort the occurrence frequencies with a min-heap of 10 elements.
14. There are N machines in total, with N numbers on each machine. Each machine can store and operate on at most O(N) numbers. How do you find the median of all the numbers?
Option 1: Start with a rough estimate of the range of these numbers; for example, assume here that they are all 32-bit unsigned integers (there are 2^32 of them in total). We divide the integers from 0 to 2^32 - 1 into N range segments, each containing 2^32/N integers. For example, the first segment is 0 to 2^32/N - 1, the second segment is 2^32/N to 2*2^32/N - 1, ..., and the Nth segment is (N-1)*2^32/N to 2^32 - 1. Then the N numbers on each machine are scanned, and the numbers belonging to the first segment are placed on the first machine, the numbers belonging to the second segment on the second machine, ..., and the numbers belonging to the Nth segment on the Nth machine. Note that this process should leave O(N) numbers on each machine. Next we count the numbers on each machine in turn, accumulating the counts, until we find the kth machine where the accumulated count becomes greater than or equal to N^2/2 while the accumulated count up to the (k-1)th machine is less than N^2/2; record that smaller accumulated count as x. Then the median we are looking for is on the kth machine, at position N^2/2 - x. We then sort the numbers on the kth machine and find the (N^2/2 - x)th number, which is the median. The complexity is O(N^2).
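A single-process simulation of Option 1, with the N machines represented as N in-memory buckets; the value of N and the synthetic 32-bit inputs are illustrative.

```cpp
// Partition the value range into N segments, accumulate segment counts to
// locate the segment holding the (N^2/2)-th number, then sort only that segment.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const uint64_t N = 1000;                       // N machines, N numbers each
    const uint64_t kSegment = (1ull << 32) / N;    // width of each range segment

    // Stand-in for the N*N input numbers.
    std::vector<uint32_t> data(N * N);
    for (uint64_t i = 0; i < data.size(); ++i) data[i] = uint32_t(i * 2654435761ull);

    // Scatter the numbers into the N range segments (one bucket per "machine").
    std::vector<std::vector<uint32_t>> machine(N);
    for (uint32_t x : data) machine[std::min<uint64_t>(x / kSegment, N - 1)].push_back(x);

    // Accumulate counts until the running total reaches N^2/2.
    const uint64_t target = N * N / 2;
    uint64_t acc = 0, k = 0;
    while (acc + machine[k].size() < target) acc += machine[k++].size();

    // The median is the (target - acc)-th smallest number inside segment k.
    std::sort(machine[k].begin(), machine[k].end());
    std::cout << "median = " << machine[k][target - acc - 1] << '\n';
}
```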
Option 2: First sort the numbers on each machine. After sorting, use the idea of merge sort to merge the numbers on the N machines into one final sorted order; the N^2/2-th number is the one we want. The complexity is O(N^2*lgN^2).
15. Maximum Gap Problem
Given n real numbers x1, ..., xn, find the maximum difference between 2 adjacent numbers of the n real numbers on the real axis, and give a linear-time algorithm.
Option 1: The first method that comes to mind is to sort the n numbers and then scan them once to determine the maximum adjacent gap. But this method cannot meet the linear-time requirement. Therefore, the following method is adopted:
- Find the maximum value max and the minimum value min among the n numbers.
- Divide the interval [min, max] evenly with n-2 points, i.e., divide [min, max] into n-1 equal intervals (half-open intervals), and regard these intervals as buckets, numbered 1, 2, ..., n-1; the upper bound of bucket i is the same as the lower bound of bucket i+1, i.e., every bucket has the same size d = (max - min)/(n-1). In fact, the boundaries of these buckets form an arithmetic sequence (with min as the first term and d as the common difference), and min is considered to be placed in the first bucket and max in the (n-1)th bucket.
- Put the n numbers into the n-1 buckets: assign each element x[i] to the bucket numbered index = floor((x[i] - min)/d) + 1, and record the maximum and minimum of the data assigned to each bucket.
- Maximum gap: apart from the maximum and minimum values max and min, the remaining n-2 numbers go into the n-1 buckets, so by the pigeonhole principle at least one bucket is empty. Since every bucket has the same size, the maximum gap cannot appear inside a single bucket; it must be the gap between the upper boundary of some bucket i (the maximum stored in it) and the lower boundary of some later bucket j (the minimum stored in it), where every bucket between i and j, if any, is empty. That is, the maximum gap arises between the maximum of bucket i and the minimum of bucket j, and one scan over the buckets is all it takes. (A sketch is given below.)
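A sketch of the bucket algorithm above, assuming at least two distinct real inputs; the buckets here are 0-indexed, and the min and max of each bucket are tracked in two arrays.

```cpp
// Linear-time maximum adjacent gap via n-1 equal-width buckets.
#include <algorithm>
#include <iostream>
#include <vector>

double maxGap(const std::vector<double>& a) {
    int n = a.size();
    double lo = *std::min_element(a.begin(), a.end());
    double hi = *std::max_element(a.begin(), a.end());
    if (n < 3 || lo == hi) return hi - lo;

    // n-1 equal buckets over [lo, hi]; d is the bucket width (max - min)/(n - 1).
    double d = (hi - lo) / (n - 1);
    std::vector<double> bmin(n - 1, hi), bmax(n - 1, lo);
    std::vector<bool> used(n - 1, false);
    for (double x : a) {
        int idx = std::min(int((x - lo) / d), n - 2);      // bucket index 0..n-2
        used[idx] = true;
        bmin[idx] = std::min(bmin[idx], x);
        bmax[idx] = std::max(bmax[idx], x);
    }

    // The answer is the largest distance between the max of one non-empty
    // bucket and the min of the next non-empty bucket.
    double gap = 0, prevMax = lo;
    for (int i = 0; i < n - 1; ++i) {
        if (!used[i]) continue;
        gap = std::max(gap, bmin[i] - prevMax);
        prevMax = bmax[i];
    }
    return gap;
}

int main() {
    std::cout << maxGap({3.0, 6.9, 1.0, 9.5, 2.1}) << '\n';   // prints 3.9
}
```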
16. Merge multiple sets into sets with no intersection: given a collection of sets of strings, merge the sets whose intersection is not empty, with the requirement that after the merging is complete no two of the resulting sets intersect.
(1) Describe your idea for solving this problem;
(2) Give the main processing flow, the algorithm, and the complexity of the algorithm;
(3) Describe possible improvements.
Option 1: Use a union-find (disjoint-set) structure. At first, every string is in its own union-find set. Then scan each input set in turn, sequentially merging adjacent elements. For example, for {aaa, bbb, ccc}, first check whether aaa and bbb are in the same union-find set; if not, merge the sets they are in; then check whether bbb and ccc are in the same union-find set, and if not, merge their sets as well. Next, scan the other input sets; when all the sets have been scanned, the sets represented by the union-find structure are the answer. The complexity should be O(NlgN). To speed up queries we can record the root of each node (path compression), and when merging we can attach the smaller set to the larger one (union by size), which also reduces the complexity.
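A minimal union-find sketch with path compression and union by size, keyed by string; apart from {aaa, bbb, ccc} mentioned above, the example input sets are illustrative.

```cpp
// Disjoint-set over strings: unite neighbouring elements of each input set,
// then strings sharing a root belong to the same merged set.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct DisjointSet {
    std::unordered_map<std::string, std::string> parent;
    std::unordered_map<std::string, int> size;

    std::string find(const std::string& x) {
        if (!parent.count(x)) { parent[x] = x; size[x] = 1; }
        if (parent[x] == x) return x;
        std::string root = find(parent[x]);
        parent[x] = root;                                  // path compression
        return root;
    }
    void unite(const std::string& a, const std::string& b) {
        std::string ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) std::swap(ra, rb);        // union by size
        parent[rb] = ra;
        size[ra] += size[rb];
    }
};

int main() {
    std::vector<std::vector<std::string>> sets = {
        {"aaa", "bbb", "ccc"}, {"bbb", "ddd"}, {"eee", "fff"}};
    DisjointSet ds;
    for (const auto& s : sets)
        for (size_t i = 1; i < s.size(); ++i) ds.unite(s[i - 1], s[i]);   // merge neighbours

    // Strings with the same root now belong to the same merged set.
    for (const auto& s : sets)
        for (const auto& w : s) std::cout << w << " -> " << ds.find(w) << '\n';
}
```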
17. Maximum Subsequence and Maximum Submatrix Problem
Maximum subsequence problem for arrays: given an array with both positive and negative elements, find a contiguous subsequence whose sum is maximal.
Option 1: This problem can be solved with dynamic programming. Let b[i] denote the maximum sum of a subsequence ending with the ith element a[i]; then obviously b[i] = max(b[i-1] + a[i], a[i]). Based on this, it can be quickly implemented in code.
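A sketch of this recurrence (often called Kadane's algorithm); the sample array is illustrative.

```cpp
// b keeps the best sum of a subsequence ending at the current element,
// and the answer is the best b seen anywhere.
#include <algorithm>
#include <iostream>
#include <vector>

int maxSubsequenceSum(const std::vector<int>& a) {
    int b = a[0], best = a[0];
    for (size_t i = 1; i < a.size(); ++i) {
        b = std::max(b + a[i], a[i]);     // b[i] = max(b[i-1] + a[i], a[i])
        best = std::max(best, b);
    }
    return best;
}

int main() {
    std::cout << maxSubsequenceSum({1, -2, 3, 10, -4, 7, 2, -5}) << '\n';   // prints 18
}
```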
Maximum submatrix problem: given a matrix (a two-dimensional array) whose entries can be positive or negative, find a submatrix such that the sum of the submatrix is maximal, and output this sum.
Option 1: This can be solved with an idea similar to the maximum subsequence. If we fix the choice of columns i and j, then within that range the problem actually reduces to a maximum subsequence problem. The columns i and j can be determined by brute-force search.
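A sketch of the column-pair reduction described above: for every pair of columns, each row is collapsed to the sum of its elements in that column range and the 1-D routine is reused, giving O(m^2 * n) for an n x m matrix. The sample matrix is illustrative.

```cpp
// Maximum submatrix sum by fixing a column range and running the 1-D algorithm.
#include <algorithm>
#include <iostream>
#include <vector>

int maxSubsequenceSum(const std::vector<int>& a) {
    int b = a[0], best = a[0];
    for (size_t k = 1; k < a.size(); ++k) {
        b = std::max(b + a[k], a[k]);
        best = std::max(best, b);
    }
    return best;
}

int maxSubmatrixSum(const std::vector<std::vector<int>>& m) {
    size_t rows = m.size(), cols = m[0].size();
    int best = m[0][0];
    for (size_t i = 0; i < cols; ++i) {                   // left column of the strip
        std::vector<int> rowSum(rows, 0);
        for (size_t j = i; j < cols; ++j) {               // right column of the strip
            for (size_t r = 0; r < rows; ++r) rowSum[r] += m[r][j];
            best = std::max(best, maxSubsequenceSum(rowSum));
        }
    }
    return best;
}

int main() {
    std::vector<std::vector<int>> m = {{1, -2, 3}, {-4, 5, -6}, {7, -8, 9}};
    std::cout << maxSubmatrixSum(m) << '\n';   // prints 9
}
```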