Python data mining: where to start

I. Basic architecture of Python-based data mining

1. matplotlib, for plotting

2. pandas, the core of data mining, provides various tools for mining and analysis

3. numpy, provides arrays and basic statistics

scipy, provides a wide range of mathematical routines

4. The Python standard library, the basic Python framework

II. Environment setup

1. install python

2. install pip

pandas depends on the pip version; the minimum is 8.0.0. If pip is older than 8, such as 7.2.1, you need to upgrade pip.

On Windows, the command is "python -m pip install -U pip".

On Linux, it is "pip install -U pip".

With the command "pip --version", you can see the pip version number

3. Install pandas

On Windows, the command is "pip install pandas".

On Linux (Debian/Ubuntu), pandas is also available as a system package:

sudo apt-get install python-pandas

4. Install matplotlib

pip install matplotlib
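Once the packages are installed, a quick way to confirm the setup is to import each one and print its version. This is only a minimal sketch; the dictionary name versions is chosen here for illustration.

```python
import matplotlib
import numpy
import pandas

# Collect the version string that each installed package reports.
versions = {
    "matplotlib": matplotlib.__version__,
    "numpy": numpy.__version__,
    "pandas": pandas.__version__,
}

for name, version in sorted(versions.items()):
    print(name, version)
```

If any import fails, that package is not installed for the Python interpreter you are running.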

III. Data types

Common Python types

string, list, tuple, dict, set

The 6 sequence types (Python 2)

list, tuple, string, unicode string, buffer object, xrange

pandas types

ndarray (from numpy), Series, DataFrame

ndarray, the array type, was added because:

list and tuple are built on a pointer + object design: they store void* pointers, each of which points to the data of a specific object.

Because they are void* pointers, both can store any data type, i.e., the element types need not be uniform.

Although this storage is flexible, it has drawbacks when the amount of data is very large, i.e., when dealing with big data.

1. Large storage footprint and wasted memory, because two parts are stored: pointer + data

2. Slow reads: the index is used to find the pointer, then the pointer is used to find the data

So for big data processing the ndarray was introduced, a numerical type similar to a C++ array. Its elements all have the same type, and reads and modifications are fast.

Alias: array. It saves memory, reduces CPU computing time, and comes with a rich set of processing functions.
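The difference between the two storage designs can be sketched as follows. The variable names and the 1000-element size are assumptions made for illustration.

```python
import numpy as np

# A list can mix element types, because each slot only holds a reference.
mixed = [1, "two", 3.0]

# An ndarray stores a single element type in one contiguous buffer.
values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# Raw data size: 1000 elements * 8 bytes per int64 element.
array_bytes = arr.nbytes

# Arithmetic is applied to the whole buffer without a Python-level loop.
doubled = arr * 2
```

Because every element has the same size, the position of element i can be computed directly from the start of the buffer, which is why reads and modifications are fast.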

Series, a fixed-length ordered dictionary,

similar to a one-dimensional array of objects; it is composed of data and an index

added because:

A dict is unordered (before Python 3.7), and its keys and values have a mapping relationship. However, the keys and values are not independent of each other; they are stored together.

If you need to manipulate one, it affects the other. Series solves this: its keys and values are independent and stored separately.

The keys of a Series are fixed-length and ordered. You get the whole index via series.index, and all values via series.values.

The index of a Series can be given a name via series.index.name.

The Series as a whole can also be given a name, via series.name.
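A small sketch of these attributes, using made-up student scores (the labels and numbers are assumptions for illustration only):

```python
import pandas as pd

# Values and index labels are passed separately and stored separately.
scores = pd.Series([88, 92, 91], index=["alice", "bob", "carol"])

# Name the Series as a whole, and name its index.
scores.name = "score"
scores.index.name = "student"

labels = list(scores.index)    # all index labels, in order
data = scores.values.tolist()  # all values, in the same order
bob = scores["bob"]            # dictionary-style lookup by label
```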

DataFrame:

1. a tabular data structure

2. containing an ordered set of columns (similar to an index)

3. which can be thought of as a collection of Series sharing one index

import pandas as pd

data1 = {'name': ['java', 'c', 'python'], 'year': [2, 2, 3]}

frame = pd.DataFrame(data1)
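To see that a DataFrame behaves like a set of Series with a common index, the frame can be taken apart column by column. A minimal sketch; the names name_col, year_col and same_index are introduced here for illustration.

```python
import pandas as pd

data1 = {'name': ['java', 'c', 'python'], 'year': [2, 2, 3]}
frame = pd.DataFrame(data1)

# Each column, taken on its own, is a Series.
name_col = frame['name']
year_col = frame['year']

# Both columns carry the same row index as the frame itself.
same_index = name_col.index.equals(year_col.index)

column_names = list(frame.columns)
```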

------------------------------------------------

IV. Basic data analysis process:

1. Acquisition of data

2. Data preparation: normalize the data and create the indexes it needs

3. Display and describe the data, for debugging

e.g. df.index, df.values, df.head(n), df.tail(n), df.describe()
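The inspection calls can be tried on a small frame. The DataFrame below is hypothetical sample data; the column names row1, col1 and col2 are assumptions chosen to match the names used in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "row1": ["a", "b", "a", "b"],
    "col1": [1, 2, 3, 4],
    "col2": [10.0, 20.0, 30.0, 40.0],
})

first_two = df.head(2)    # first 2 rows
last_one = df.tail(1)     # last row
summary = df.describe()   # count, mean, std, min, quartiles, max of the numeric columns

index_labels = list(df.index)  # default row labels 0..3
raw = df.values                # the underlying 2D array
```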

4. Data selection

Fetch by index, by slice, by row and column, or by rectangular region

By index: df.row1 or df['row1']

By row and column: df.loc[rowlist, columnlist], e.g. df.loc[0:1, ['col1', 'col2']]

By two-dimensional position: df.iloc[0, 0] takes the upper-left cell, and a slice such as df.iloc[0:2, 0:2] takes the first 2 rows and first 2 columns.
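The selection styles can be compared side by side. The sample frame is hypothetical, with column names assumed to match the ones used in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "row1": ["a", "b", "a", "b"],
    "col1": [1, 2, 3, 4],
    "col2": [10.0, 20.0, 30.0, 40.0],
})

# By name: both forms return the column called row1.
by_attribute = df.row1
by_key = df["row1"]

# loc selects by label; the end of a label slice is included (rows 0 and 1).
block = df.loc[0:1, ["col1", "col2"]]

# iloc selects by position; the end of a position slice is excluded.
cell = df.iloc[0, 0]
corner = df.iloc[0:2, 0:2]
```

Note the difference in slice ends: 0:1 with loc returns two rows, while 0:2 with iloc is needed to get the same two rows.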

5. Simple statistics and processing

Statistical averages, maximum values, etc.

6. Grouping

df.groupby(df.row1)
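A groupby call only defines the groups; a statistic is then applied to each group. A minimal sketch on hypothetical data (the column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "row1": ["a", "b", "a", "b"],
    "col1": [1, 2, 3, 4],
})

# Simple statistics over the whole column.
overall_mean = df["col1"].mean()
overall_max = df["col1"].max()

# Group the rows by the values of row1, then average col1 within each group.
group_means = df.groupby(df.row1)["col1"].mean()
```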

7. Merging

append: appending (DataFrame.append was removed in pandas 2.0 in favor of concat),

concat: concatenation; it covers what append does, and can also merge two different 2D data structures

join,

an SQL-style join on a shared field, like the SQL condition where a.row1 = b.row1
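The two merging styles can be sketched with two small hypothetical tables a and b that share a row1 field. The table contents are assumptions for illustration; pd.merge is used here for the SQL-style join.

```python
import pandas as pd

a = pd.DataFrame({"row1": [1, 2, 3], "x": ["a1", "a2", "a3"]})
b = pd.DataFrame({"row1": [2, 3, 4], "y": ["b2", "b3", "b4"]})

# Concatenation: stack the rows of one table under another.
stacked = pd.concat([a, a], ignore_index=True)

# SQL-style inner join: keep only rows where a.row1 = b.row1.
joined = pd.merge(a, b, on="row1", how="inner")
```

Changing how to "left", "right" or "outer" gives the corresponding SQL join types.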

------------------------------------------------

V. Advanced Data Processing and Visualization:

1. Cluster Analysis

Clustering is an important part of the descriptive and predictive tasks of data mining. It is based on similarity,

dividing similar objects into groups and subsets by static classification.

There are many third-party libraries in Python that provide clustering algorithms.

There are many clustering algorithms; among them, K-means is widely used because of its simplicity and speed.

The basic principle is:

1. find the centers of the dataset,

2. compute distances using the mean squared deviation, so that each data point converges into one group and the groups are fully separated from each other

Example:

>>> from pylab import *

>>> from scipy.cluster.vq import *

>>> list1=[88,64,96,85]

>>> list2=[92,99,95,94]

>>> list3=[91,87,99,95]

>>> list4=[78,99,97,81]

>>> list5=[88,78,98,84]

>>> list6=[100,95,100,92]

>>> tempdate = (list1, list2, list3, list4, list5, list6)

>>>

>>> tempdate

([88, 64, 96, 85], [92, 99, 95, 94], [91, 87, 99, 95], [78, 99, 97, 81], [88, 78, 98, 84], [100, 95, 100, 92])

>>> date = vstack(tempdate)

>>>

>>> date

array([[ 88,  64,  96,  85],

       [ 92,  99,  95,  94],

       [ 91,  87,  99,  95],

       [ 78,  99,  97,  81],

       [ 88,  78,  98,  84],

       [100,  95, 100,  92]])

>>> centroids,abc=kmeans(date,2) # Find the cluster centers; the second parameter is the number of clusters, e.g. 5 for five clusters

>>> centroids # The centroids, one per cluster; each is the column-wise mean of the rows in that cluster

array([[88, 71, 97, 84],

[90, 95, 97, 90]])

>>>

>>> result,cde=vq(date,centroids) # Classify the dataset, based on cluster centers

>>> result

array([0, 1, 1, 1, 0, 1])
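The two steps of the basic principle (assign each point to its nearest center, then recompute each center as the mean of its points) can be sketched in plain numpy on the same six rows. This is an illustration of the idea, not scipy's implementation; the function name kmeans_sketch, the fixed iteration count, and the choice of the first two rows as starting centers are all assumptions.

```python
import numpy as np

data = np.array([
    [88, 64, 96, 85],
    [92, 99, 95, 94],
    [91, 87, 99, 95],
    [78, 99, 97, 81],
    [88, 78, 98, 84],
    [100, 95, 100, 92],
], dtype=float)


def kmeans_sketch(points, initial_centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    centers = initial_centers.copy()
    for _ in range(iterations):
        # Squared distance from every point to every center: shape (n_points, n_centers).
        distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Each point joins the group of its nearest center.
        labels = distances.argmin(axis=1)
        # Each center moves to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels


# Assumption: start from the first two rows instead of random starting points.
centers, labels = kmeans_sketch(data, data[:2])
```

With this starting point the rows of list1 and list5 form one group and the other four rows form the second, and the resulting centers agree with the centroids printed above apart from the fractional part.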

2. Plotting fundamentals

Python's plotting library consists of two parts:

The plotting API: matplotlib provides the various plotting interfaces.

The integrated library: pylab (which bundles the commonly used numpy and matplotlib functions) makes plotting faster and more convenient.

import numpy as np

import matplotlib.pyplot as plt

t = np.arange(0,10)

plt.plot(t, t+2)

plt.plot(t,t, 'o', t,t +2, t,t**2, 'o') # sets of (x, y); drawn as lines by default, 'o' draws the points as a scatter

plt.bar(t,t**2) # bar chart

plt.show()

--------------------

import pylab as pl

t = pl.arange(0,10)

pl.plot(t, t+2)

pl.show()

3. Controlling matplotlib figure attributes

Color, style

Names: plot title, horizontal axis, vertical axis

plt.title('philip\'s python plot')

plt.xlabel('date')

plt.ylabel('value')

Other: pl.figure(figsize=(8,6),dpi =100)

pl.plot(x,y, color='red', linewidth=3, label='line1')

pl.legend(loc='upper left')

Subplot

pl.subplot(211) # the overall figure can be divided into a two-dimensional grid;

# the first digit is the number of rows, the second is the number of columns, and the third is the index, counted from 1 at the upper left along the current row and then the next row.

# If any of the numbers has 2 digits, such as 11, separate the arguments with ',', e.g. pl.subplot(4, 3, 11)

pl.axes([left, bottom, width, height]) # Each value is a fraction of the figure in the range (0,1); left is the distance from the left edge, bottom is the distance from the bottom edge
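The attribute controls above can be combined in one figure. The following is a sketch under a few assumptions: the Agg backend is selected so that it runs without a display, the figure is written to an in-memory buffer instead of being shown, and the data and inset position are made up for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
fig = plt.figure(figsize=(8, 6), dpi=100)

# 2 rows, 1 column, first cell (the index starts at 1).
top = plt.subplot(2, 1, 1)
top.plot(x, x + 2, color='red', linewidth=3, label='line1')
top.set_title("philip's python plot")
top.set_xlabel('date')
top.set_ylabel('value')
top.legend(loc='upper left')

# Second cell of the same grid.
bottom = plt.subplot(2, 1, 2)
bottom.bar(x, x ** 2)

# Free-standing axes: left, bottom, width, height as fractions of the figure.
inset = plt.axes((0.2, 0.25, 0.2, 0.15))
inset.plot(x, x, 'o')

# Render to memory; use plt.show() instead in an interactive session.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```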

4. pandas plotting

Series and DataFrame support plotting directly; their plot interface wraps the matplotlib calls, for example:

series.plot()

df.close.plot() # specific parameters similar to the matplotlib common interface
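A minimal sketch of plotting straight from pandas, assuming a hypothetical frame with a close column of made-up prices and the Agg backend so that no display is required:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

df = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2, 11.0]},
    index=pd.date_range("2020-01-01", periods=4),
)

# Plotting a column returns a matplotlib Axes, so the usual controls apply.
ax = df["close"].plot(color='red', linewidth=2)
ax.set_ylabel("close")

# The kind parameter switches the chart type; this draws a new bar chart.
bar_ax = df.plot(kind="bar")
```

Here df["close"] selects the same column as the attribute form df.close.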

Attribute control

As with the common matplotlib interface, you can adjust the various chart types: bar charts, line charts, and so on.

--------common-----------------

list, tuple, dict

--------numpy / pandas-----------------

ndarray, Series, DataFrame
