First, a word about what "big data" means here. In this article it refers to data larger than your computer's memory, not the multi-terabyte kind of big data: data at that scale cannot be processed on a single machine anyway, so this article would not help you with it.
The two libraries introduced in this article are Dask and Vaex. Both have been around for several years and are now fairly mature.
Dask speeds up data processing mainly through parallel computing.
Vaex claims it can run statistical operations on a billion rows per second, and it also supports visualization and interactive data exploration.
Neither library is fully compatible with the Pandas DataFrame, but their syntax is similar and both support the most common data-processing operations. The difference is that Dask focuses on processing data with cluster technology, while Vaex focuses on handling big data on a single machine.
Randomly generate two CSV files, each with 1 million rows and 1000 columns; each file is 18 GB, 36 GB for the two together. The data are random numbers uniformly distributed between 0 and 100.
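A minimal sketch of how such files can be generated (the author's exact code is not shown; the file names and the 100,000-row chunk size are my assumptions):

```python
import numpy as np
import pandas as pd

n_rows, n_cols, chunk = 1_000_000, 1000, 100_000
cols = [f"col{i}" for i in range(1, n_cols + 1)]

for name in ("big_file_1.csv", "big_file_2.csv"):
    for start in range(0, n_rows, chunk):
        # generate one chunk of uniform random numbers in [0, 100)
        df = pd.DataFrame(np.random.uniform(0, 100, size=(chunk, n_cols)),
                          columns=cols)
        # write the header only once, then append
        df.to_csv(name, mode="w" if start == 0 else "a",
                  header=(start == 0), index=False)
```

Writing in chunks keeps memory use bounded; generating all 1 million rows at once would itself need several gigabytes of RAM.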
Pandas cannot read an 18 GB file like this: when I tried to load one directly, the Jupyter kernel simply died.
First, Vaex. Convert the two CSV files to HDF5 format.
Converting to HDF5: under 7 minutes. After conversion, the two files shrink to 16 GB.
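A hedged sketch of the conversion (the article does not show its exact code; vaex.from_csv with convert=True reads the CSV in chunks and writes an .hdf5 file next to it, and the chunk size here is my assumption):

```python
import vaex

for name in ("big_file_1.csv", "big_file_2.csv"):
    # reads the CSV chunk by chunk and writes big_file_N.csv.hdf5
    vaex.from_csv(name, convert=True, chunk_size=100_000)
```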
Open the files: dv = vaex.open('hdf5_files/*.hdf5'), about 20 minutes; it would be faster if the data were first converted to a binary format.
Display head: dv.head(), almost 20 minutes. Inexplicable!
Quantile calculation: quantile = dv.percentile_approx('col1', 10), finishes in seconds.
Add a new column: dv['col1_binary'] = dv.col1 > dv.percentile_approx('col1', 10), finishes in seconds.
Filter data: dv = dv[dv.col2 > 10], finishes in seconds.
Grouped aggregation: group_res = dv.groupby(by=dv.col1_binary, agg={'col3_mean': vaex.agg.mean('col3')}), finishes in seconds.
Plot a histogram: plot = dv.plot1d(dv.col3, what='count(*)', limits=[0, 100]), finishes in seconds.
Sum all columns: suma = np.sum(dv.sum(dv.column_names)), 40 seconds.
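Collected into one runnable sketch (the glob path and the column names col1/col2/col3 follow the steps above; this is my reconstruction, not the author's script):

```python
import numpy as np
import vaex

dv = vaex.open("hdf5_files/*.hdf5")            # open both converted files

print(dv.head())                               # first rows

quantile = dv.percentile_approx("col1", 10)    # approximate 10th percentile

# virtual column: evaluated lazily, no data is copied
dv["col1_binary"] = dv.col1 > quantile

dv = dv[dv.col2 > 10]                          # lazy filtered view

group_res = dv.groupby(by=dv.col1_binary,
                       agg={"col3_mean": vaex.agg.mean("col3")})

plot = dv.plot1d(dv.col3, what="count(*)", limits=[0, 100])

suma = np.sum(dv.sum(dv.column_names))         # sum of every column
```

The reason most of these finish in seconds is that Vaex memory-maps the HDF5 files and evaluates expressions lazily, only touching the columns a computation actually needs.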
Now the same tests with Dask. Converting to HDF5: more than 12 minutes.
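Again a sketch under my own assumptions (same file names as above; the article does not show its conversion code):

```python
import dask.dataframe as dd

for name in ("big_file_1", "big_file_2"):
    ds = dd.read_csv(f"{name}.csv")         # lazy: nothing is read yet
    ds.to_hdf(f"{name}.hdf5", key="/data")  # triggers the actual work
```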
Open the file: seconds. But that is only because Dask is lazy: nothing is actually computed until .compute() is called, so this number is deceptive (see the sketch below).
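A small illustration of that laziness (file names again my assumption):

```python
import dask.dataframe as dd

ds = dd.read_hdf("big_file_*.hdf5", key="/data")  # returns immediately

mean_lazy = ds.col1.mean()   # still lazy: just a task graph, no work done
print(mean_lazy.compute())   # only now is the data actually scanned
```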
Display head: ds.head(), 9 seconds.
Quantile calculation: quantile = ds.col1.quantile(0.1).compute() was used, and the Jupyter kernel crashed.
Add a new column: ds['col1_binary'] = ds.col1 > ds.col1.quantile(0.1) is not supported, so it could not be tested.
Filter data: ds = ds[ds.col2 > 10], finishes in seconds (the working Dask steps are collected in the sketch below).
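The Dask steps that did run in this test, as one sketch (file names follow the conversion sketch above):

```python
import dask.dataframe as dd

ds = dd.read_hdf("big_file_*.hdf5", key="/data")

print(ds.head())                # ~9 seconds in the test above

# quantile = ds.col1.quantile(0.1).compute()   # crashed the kernel here

ds = ds[ds.col2 > 10]           # the filter itself is lazy and instant
print(len(ds))                  # forcing it with len() scans the data
```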
Grouped aggregation: could not be completed with Dask.
Plotting a histogram: not supported.
Summing all columns: could not be completed either.
One thing after another fails or is unsupported; for this single-machine workload Dask was close to unusable.
These tests should give you a rough sense of the two libraries. If you are interested, run them yourself.