First, a word about what "big data" means here. In this article it refers to data larger than your computer's memory, not the multi-terabyte kind of big data: data at that scale cannot be processed on a single machine anyway, so this article would not help you with it.
The two libraries introduced in this article are Dask and Vaex. Both have been around for several years and are now fairly mature.
Dask speeds up data processing mainly through parallel computing.
Vaex claims it can run statistical operations on a billion rows per second, and it also supports visualization and interactive data exploration.
Neither library is fully compatible with the Pandas DataFrame, but their syntax is similar and both support the most common data-processing operations. The difference is that Dask focuses on processing data with cluster technology, while Vaex focuses on handling big data on a single machine.
Randomly generate two CSV files, each with 1 million rows and 1000 columns; each file is 18 GB, 36 GB for the two together. The data are random numbers uniformly distributed between 0 and 100.
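A minimal sketch of how such files can be generated (the author's exact code is not shown; the file names and the 100,000-row chunk size are my assumptions):

```python
import numpy as np
import pandas as pd

n_rows, n_cols, chunk = 1_000_000, 1000, 100_000
cols = [f"col{i}" for i in range(1, n_cols + 1)]

for name in ("big_file_1.csv", "big_file_2.csv"):
    for start in range(0, n_rows, chunk):
        # generate one chunk of uniform random numbers in [0, 100)
        df = pd.DataFrame(np.random.uniform(0, 100, size=(chunk, n_cols)),
                          columns=cols)
        # write the header only once, then append
        df.to_csv(name, mode="w" if start == 0 else "a",
                  header=(start == 0), index=False)
```

Writing in chunks keeps memory use bounded; generating all 1 million rows at once would itself need several gigabytes of RAM.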
Pandas cannot read an 18 GB file like this: when I tried to load one directly, the Jupyter kernel simply died.
First, Vaex. Convert the two CSV files to HDF5 format.
Converting to HDF5: under 7 minutes. After conversion, the two files shrink to 16 GB.
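A hedged sketch of the conversion (the article does not show its exact code; vaex.from_csv with convert=True reads the CSV in chunks and writes an .hdf5 file next to it, and the chunk size here is my assumption):

```python
import vaex

for name in ("big_file_1.csv", "big_file_2.csv"):
    # reads the CSV chunk by chunk and writes big_file_N.csv.hdf5
    vaex.from_csv(name, convert=True, chunk_size=100_000)
```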
Open the files: dv = vaex.open('hdf5_files/*.hdf5'), about 20 minutes; it would be faster if the data were first converted to a binary format.
Display head: dv.head(), almost 20 minutes. Inexplicable!
Quantile calculation: quantile = dv.percentile_approx('col1', 10), finishes in seconds.
Add a new column: dv['col1_binary'] = dv.col1 > dv.percentile_approx('col1', 10), finishes in seconds.
Filter data: dv = dv[dv.col2 > 10], finishes in seconds.
Grouped aggregation: group_res = dv.groupby(by=dv.col1_binary, agg={'col3_mean': vaex.agg.mean('col3')}), finishes in seconds.
Plot a histogram: plot = dv.plot1d(dv.col3, what='count(*)', limits=[0, 100]), finishes in seconds.
Sum all columns: suma = np.sum(dv.sum(dv.column_names)), 40 seconds.
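Collected into one runnable sketch (the glob path and the column names col1/col2/col3 follow the steps above; this is my reconstruction, not the author's script):

```python
import numpy as np
import vaex

dv = vaex.open("hdf5_files/*.hdf5")            # open both converted files

print(dv.head())                               # first rows

quantile = dv.percentile_approx("col1", 10)    # approximate 10th percentile

# virtual column: evaluated lazily, no data is copied
dv["col1_binary"] = dv.col1 > quantile

dv = dv[dv.col2 > 10]                          # lazy filtered view

group_res = dv.groupby(by=dv.col1_binary,
                       agg={"col3_mean": vaex.agg.mean("col3")})

plot = dv.plot1d(dv.col3, what="count(*)", limits=[0, 100])

suma = np.sum(dv.sum(dv.column_names))         # sum of every column
```

The reason most of these finish in seconds is that Vaex memory-maps the HDF5 files and evaluates expressions lazily, only touching the columns a computation actually needs.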
Now the same tests with Dask. Converting to HDF5: more than 12 minutes.
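Again a sketch under my own assumptions (same file names as above; the article does not show its conversion code):

```python
import dask.dataframe as dd

for name in ("big_file_1", "big_file_2"):
    ds = dd.read_csv(f"{name}.csv")         # lazy: nothing is read yet
    ds.to_hdf(f"{name}.hdf5", key="/data")  # triggers the actual work
```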
Open the file: seconds. But that is only because Dask is lazy: nothing is actually computed until .compute() is called, so this number is deceptive (see the sketch below).
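A small illustration of that laziness (file names again my assumption):

```python
import dask.dataframe as dd

ds = dd.read_hdf("big_file_*.hdf5", key="/data")  # returns immediately

mean_lazy = ds.col1.mean()   # still lazy: just a task graph, no work done
print(mean_lazy.compute())   # only now is the data actually scanned
```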
Display head: ds.head(), 9 seconds.
Quantile calculation: quantile = ds.col1.quantile(0.1).compute() was used, and the Jupyter kernel crashed.
Add a new column: ds['col1_binary'] = ds.col1 > ds.col1.quantile(0.1) is not supported, so it could not be tested.
Filter data: ds = ds[ds.col2 > 10], finishes in seconds (the working Dask steps are collected in the sketch below).
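The Dask steps that did run in this test, as one sketch (file names follow the conversion sketch above):

```python
import dask.dataframe as dd

ds = dd.read_hdf("big_file_*.hdf5", key="/data")

print(ds.head())                # ~9 seconds in the test above

# quantile = ds.col1.quantile(0.1).compute()   # crashed the kernel here

ds = ds[ds.col2 > 10]           # the filter itself is lazy and instant
print(len(ds))                  # forcing it with len() scans the data
```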
Grouped aggregation: could not be completed with Dask.
Plotting a histogram: not supported.
Summing all columns: could not be completed either.
One thing after another fails or is unsupported; for this single-machine workload Dask was close to unusable.
These tests should give you a rough sense of the two libraries. If you are interested, run them yourself.