Several methods of data de-duplication with Informix
In data processing we often need to remove duplicate records, and because the data format differs each time, the required operation differs as well. Based on everyday use, here are several simple methods.

I. Using a database

If the data volume is small, Access is sufficient; for a large volume, use a heavier database such as Informix. First create a table whose structure is identical to the source data, and define a unique index on the columns used as the de-duplication key. Then import the data into this table with an import tool: rows that violate the unique index fail to insert and are automatically discarded, so for each group of duplicates only one row is kept, which achieves the de-duplication. This method suits small data volumes and one-off jobs where you do not want to write a program. For large data volumes, a more powerful database system can apply the same technique; for example, use the Informix dbload utility configured to ignore import errors while loading.

II. Unix shell commands

First use the sort command to order the file's records by the de-duplication key, then use the uniq command to drop the duplicate lines. For example, if the file a.txt contains duplicate lines, remove them by executing:

# sort a.txt > b.txt
# uniq b.txt > c.txt

The file c.txt then holds the de-duplicated data. (The two steps can also be combined as: sort -u a.txt > c.txt.)

III. Writing the program
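As a minimal sketch of the programmatic approach (the file names a.txt and c.txt are illustrative, and the input is assumed to be plain text with one record per line), an awk one-liner keeps only the first occurrence of each line, without requiring a pre-sort and while preserving the original order:

```shell
# Create a sample input file with duplicate lines (illustrative data)
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' > a.txt

# seen[$0]++ evaluates to 0 (false) the first time a line appears,
# so !seen[$0]++ is true exactly once per distinct line: the line is
# printed on its first occurrence and suppressed on every repeat.
awk '!seen[$0]++' a.txt > c.txt

# c.txt now contains: alpha, beta, gamma (one per line)
cat c.txt
```

Unlike the sort/uniq pipeline, this keeps the lines in their original order, at the cost of holding every distinct line in memory.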