Why Data Sampling?
Data mining, a fast-growing field, aims to extract valid patterns and useful rules from data. Its tasks are generally categorized into association-rule mining, classification, and clustering, and they usually involve large datasets in which useful knowledge is hidden. A dataset is called large when it has many records, many attributes, or both: many records lengthen the time needed to match the data against a model, while many attributes enlarge the model itself. Large datasets are a major obstacle for data mining algorithms, which often must traverse the dataset multiple times while searching for patterns and matching models, and it is often impossible to fit the entire dataset into physical memory. As datasets keep growing, the field faces the challenge of developing algorithms that scale to them. A simple and effective way to reduce the size of the data (i.e., the number of records) is sampling, that is, taking a subset of a large dataset. In data mining applications there are two basic approaches to sampling. One is to embed sampling into the data mining algorithm itself, so that the algorithm does not use all the data during execution; the other is to run sampling and data mining as separate steps, relying on the assumption that running the algorithm on a portion of the data yields approximately the same results as running it on the entire dataset. Sampling may pose a problem, however: with small probability its results are inaccurate, although with large probability they closely resemble those obtained from the full dataset.
The reason is that running on a subset of the entire dataset may destroy the intrinsic correlations between attributes, correlations that are especially complex and difficult to capture in high-dimensional data.
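As a concrete illustration of taking a subset of a large dataset, the following is a minimal sketch of one common sampling technique, reservoir sampling (Algorithm R). It draws a uniform random sample of k records in a single pass over a stream, which fits the motivation above: the full dataset need never reside in memory. The function name and parameters are illustrative, not from the original text.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k records from an iterable,
    using one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a reservoir element with probability k / (i + 1),
            # which keeps every record equally likely to be in the sample.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir

# Sample 1% of a stand-in "large" record set without holding it all at once.
sample = reservoir_sample(range(100_000), k=1_000, seed=42)
print(len(sample))
```

A separately drawn sample like this can then be fed to any data mining algorithm unchanged, which corresponds to the second approach described above, running sampling and mining as separate steps.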