Abstract:
We already know that the time complexity of random access to an array element is only O(1), which is extremely efficient. To take advantage of this property of arrays, we need to store each piece of information at the array index that corresponds to it. For example, suppose a store has only four items, numbered 0 to 3. We can store the four items in an array, using each item's number as its index, and then quickly look up any item's information in the array by its number.
Suppose that after a while the store turns a profit and stocks 100 more items, and the merchant wants to classify this larger inventory by category, so each item is now identified by a category number plus a sequential number. The numbering has become more complex and can no longer map directly to array indices, so how do we make an item number correspond to an array index and keep the fast-lookup capability? We can strip off the category number and let the remaining sequential number serve as the array index, and thereby still enjoy the array's efficient random access. In this example, the item number is called the "key" (or "keyword"), the method that converts a key into an array index is called a "hash function", the value produced by the hash function is called a "hash value", and such an array is a hash table.
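The store example can be sketched as follows; the key format and the function name are hypothetical, chosen only to illustrate the key-to-index mapping:

```python
def hash_func(key: str) -> int:
    # Hypothetical hash function for the store example: drop the category
    # prefix and keep the sequential part as the array subscript.
    # Keys are assumed to look like "<category>-<sequence>", e.g. "A-003".
    return int(key.split("-")[1])

table = [None] * 100                        # the array backing the hash table
table[hash_func("A-003")] = "spring water"  # store by hash value
print(table[hash_func("A-003")])            # O(1) random access by key
```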
From this picture of how a hash table works, the key is run through the hash function to compute a hash value, and the hash function is the core of this step. A hash function should satisfy the following three principles: first, the hash value it computes is a non-negative integer; second, if two keys are equal, their hash values are equal; third, if two keys are not equal, their hash values are not equal.
Because the hash value produced by the hash function is used as an array index, and array indices are non-negative integers, the first principle must hold. Two equal keys must hash to the same value, otherwise using hash values to look up data in the table would be out of the question, hence the second principle. The third principle is reasonable but hard to achieve in practice: even widely used hash algorithms occasionally produce the same hash value for different keys, so the third principle cannot be fully satisfied.
As the core of the hash table, the hash function must not drag down the table's performance. After all, every query, insertion, and deletion on the hash table goes through the hash function, so it cannot be too complex or too slow to execute. And since hash conflicts are unavoidable, the hash function should also minimize them, distributing hash values as evenly as possible across the hash table.
To resolve hash conflicts, there are two classes of methods: open addressing and chaining.
Open addressing means that during insertion, when the slot corresponding to the computed hash value is already occupied by other data, we probe for a free slot to insert into. It comes in three variants: linear probing, quadratic probing, and double hashing.
Linear probing is the simplest: on a hash conflict, search the slots sequentially (wrapping from the end of the array back to its head) until an empty slot is found, then insert the data there. Lookup works the same way: use the hash value to fetch the corresponding element from the table and compare it with the target; if they are not equal, keep searching sequentially until the element is found or an empty slot is reached. In the worst case, the time complexity of a lookup can degrade to O(n).
Besides insertion and lookup, hash tables also support deletion, but you cannot simply set the deleted element's slot to empty. If deletion emptied the slot, a lookup would stop as soon as it reached that empty slot, and data stored after the deleted element might never be found. Instead, deletion should mark the element as deleted; when a lookup encounters an element marked deleted, it continues probing rather than stopping there.
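A minimal sketch of linear probing with deletion markers (tombstones); the class name, capacity, and sentinel objects are illustrative:

```python
EMPTY, DELETED = object(), object()  # sentinels; DELETED is the tombstone

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.cap = capacity
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        i = hash(key) % self.cap
        for _ in range(self.cap):
            yield i
            i = (i + 1) % self.cap        # wrap around to the head of the array

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] in (EMPTY, DELETED) or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def find(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:                # a truly empty slot ends the search
                return None
            if s is not DELETED and s[0] == key:
                return s[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is EMPTY:
                return
            if s is not DELETED and s[0] == key:
                self.slots[i] = DELETED   # mark, don't empty, so later probes continue
                return
```

If `delete` wrote `EMPTY` instead of `DELETED`, a later `find` for a colliding key stored past that slot would stop early and wrongly report the key missing.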
Linear probing advances one slot at a time, while quadratic probing uses the square of the step count as the probe distance. For example, if linear probing examines slots hash(key)+0, hash(key)+1, hash(key)+2, ..., then quadratic probing examines hash(key)+0^2, hash(key)+1^2, hash(key)+2^2, ....
Double hashing uses multiple hash functions: when the hash value computed by the first function conflicts, probe with the value computed by the second, and so on. For example, when hash1(key) conflicts, compute hash2(key); if that conflicts again, compute hash3(key), and so on.
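The three probe sequences described above can be sketched numerically; the table length of 10, the starting hash value of 7, and the fallback hash functions are all illustrative assumptions:

```python
CAP = 10   # illustrative table length
h = 7      # assume hash(key) % CAP == 7 for some key

# linear probing: hash(key)+0, hash(key)+1, hash(key)+2, ...
linear = [(h + i) % CAP for i in range(4)]

# quadratic probing: hash(key)+0^2, hash(key)+1^2, hash(key)+2^2, ...
quadratic = [(h + i * i) % CAP for i in range(4)]

# double hashing: on each conflict, fall back to the next hash function;
# these three functions are made up purely for illustration
hash_funcs = [lambda k: k % CAP,
              lambda k: (k // 10) % CAP,
              lambda k: (k * 7) % CAP]
probes = [f(37) for f in hash_funcs]

print(linear)     # [7, 8, 9, 0] -- wraps around to the head of the array
print(quadratic)  # [7, 8, 1, 6]
```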
How much free space remains in a hash table is described by the load factor, which satisfies: load factor = number of elements in the table / length of the table. The larger the load factor, the less free space in the hash table and the greater the likelihood of a hash conflict, so a hash table usually keeps a certain proportion of its slots free.
To maintain that proportion of free space, the hash table's data must be migrated when the load factor reaches a certain threshold, but migration is time-consuming. Imagine it: after allocating a new, larger hash table, every entry in the old table must be run through the hash function again to compute its position in the new table and then stored there, which is troublesome just to think about.
Migration inevitably hurts the hash table's performance, so can the process be improved? In fact, the expensive expansion can be amortized over insertions: when the load factor reaches the threshold, instead of migrating the whole table at once, each insertion moves one entry from the old table to the new one. This spreads out the cost of migration, and insertion keeps its efficient time complexity. While the old and new tables coexist, the lookup operation needs a slight adjustment: query the new table first, and only if the target is not found there, search the old table.
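A sketch of this amortized expansion, under stated assumptions: Python dicts stand in for the old and new tables so the focus stays on the migration logic, and the 0.75 threshold, initial capacity, and class name are all illustrative:

```python
class GrowableTable:
    def __init__(self):
        self.cap = 8          # illustrative initial capacity
        self.old = {}         # stands in for the old table's slots
        self.new = None       # allocated lazily when expansion begins

    def insert(self, key, value):
        # start expanding once the load factor crosses the (assumed) threshold
        if self.new is None and len(self.old) / self.cap > 0.75:
            self.cap *= 2     # "apply for a new, larger hash table"
            self.new = {}
        if self.new is not None:
            self.new[key] = value
            if self.old:
                k, v = self.old.popitem()     # move ONE old entry per insert
                self.new.setdefault(k, v)     # don't clobber a fresher value
            else:
                self.old, self.new = self.new, None   # migration finished
        else:
            self.old[key] = value

    def find(self, key):
        # query the new table first, then fall back to the old one
        if self.new is not None and key in self.new:
            return self.new[key]
        return self.old.get(key)
```

Each insert pays for one move, so no single operation ever bears the full cost of rehashing the whole table.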
Of course, if you want to use memory more efficiently, you can also shrink the hash table when the load factor drops below a certain threshold.
Besides open addressing, hash conflicts can also be resolved with chaining. The slot corresponding to a hash value does not store the data directly; instead, the data is stored in a linked list attached to that slot. During a lookup, the slot is located by the hash value computed by the hash function, and then the linked list attached to that slot is searched for the target data.
The time complexity of chaining depends on the number of slots and how the data is distributed among them. Assuming n entries are evenly distributed across a hash table with m slots, the time complexity of chaining is O(n/m). Chaining does not need to watch the load factor as anxiously as open addressing does, but care is still needed in how the hash function computes hash values, so that list nodes are spread across the slots as evenly as possible and the hash table does not degrade into a single linked list. Hackers sometimes even craft data that exploits the hash function to create conflicts, concentrating entries in a few slots and degrading the hash table's performance drastically.
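A minimal sketch of chaining; Python lists stand in for the per-slot linked lists, and the class name and default slot count are illustrative:

```python
class ChainedTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]   # one chain per slot

    def insert(self, key, value):
        b = self.buckets[hash(key) % self.m]    # locate the slot by hash value
        for i, (k, _) in enumerate(b):
            if k == key:
                b[i] = (key, value)             # key already present: update
                return
        b.append((key, value))                  # otherwise add a node to the chain

    def find(self, key):
        # walk the chain at the slot: O(n/m) on average with even distribution
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        return None
```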
Is a hash table a sitting duck against such malicious behavior? Not at all: when a slot's linked list grows too long, it can be converted into a skip list, a structure covered earlier. After the conversion, the query time complexity only degrades to O(log n), which is still an acceptable range.
Chaining uses storage more efficiently than open addressing: no space needs to be reserved in advance, and a new node is allocated only when new data arrives. Moreover, chaining is less sensitive to the load factor; a rising load factor merely means longer lists at the slots, and an overlong list can be converted into a skip list as just described, so chaining can stay efficient even when the load factor exceeds 1.
Open addressing does not suffer from chaining's problem of overlong lists lowering efficiency. However, the load factor is a barometer for open addressing: a high load factor raises the chance of hash conflicts, forcing repeated probing for free slots at ever-increasing cost. Also, deletion can only mark entries as deleted, so efficiency suffers under frequent insertions and deletions.
Of course, the hash table can be expanded dynamically before this risk arises, but that leaves a large amount of free storage space and lowers storage utilization, a phenomenon all the more pronounced as the data volume grows. So open addressing is better suited to smaller data volumes.
Chaining handles hash conflicts more flexibly and uses storage space more efficiently, but besides the data, each list node must also store a pointer. When the stored objects are small, the space taken by pointers may even double the overall storage; when the objects are large, the pointer overhead is negligible, so chaining is better suited to storing large data objects, and frequent insertions and deletions do not affect it much. Because of these characteristics, chaining is the better choice for large data volumes, large data objects, or frequent data operations.
A hash table is an extension of an array: a hash function computes a hash value from the key, and that value serves as the index of the array slot where the data is stored. Hash tables are highly efficient, but they suffer from hash conflicts, which can be resolved by open addressing or chaining.
Open addressing has lower storage utilization and suits smaller data volumes with infrequent insertions and deletions; for large data volumes or frequent insertions and deletions, chaining is more suitable, and it is also the more broadly applicable of the two.