How to become a data analyst? What skills are required

Before you set out to become a data analyst, be clear about what you want to accomplish: what problems you want to solve, or what goals you want to reach, through this skill. With a goal in mind, you can plan your learning clearly and map out the body of knowledge it requires. Only with a clear, goal-oriented plan can you focus on the most useful material and avoid letting irrelevant information drag down your learning efficiency.

1. Clarify the knowledge framework and learning path

If you want to become a data analyst, start by browsing job listings for the position on recruitment sites; this will give you a preliminary sense of the knowledge structure you should master. Looking at data analyst postings, employers' skill requirements can be summarized as follows:

Basic operation of a SQL database and basic data management;

Using Excel/SQL for basic data extraction, analysis, and presentation;

Using a scripting language, such as Python or R, for data analysis;

The ability to obtain external data is a plus, e.g. via crawlers or familiarity with publicly available datasets;

Basic data visualization skills and ability to write data reports;

Familiarity with commonly used data mining algorithms: regression analysis, decision trees, classification, clustering methods;

What is an efficient learning path? The process of data analysis itself. You can roughly follow the steps "data acquisition - data storage and extraction - data preprocessing - data modeling and analysis - data visualization" on the road to becoming a data analyst. Proceeding step by step in this order, you will know what each part is meant to accomplish, which knowledge points need to be learned, and which knowledge is unnecessary for the time being. Each time you finish a part you will have some tangible output; that positive feedback and sense of achievement will make you willing to invest more time. And with concrete problems as your goal, your efficiency will naturally not be low.

Following the process above, we divide analysts into two categories, those who need to obtain external data and those who do not, and summarize the learning paths as follows:

1. Analysts who need to obtain external data:

Python basics

Python crawlers

SQL language

Python scientific computing packages: pandas, numpy, scipy, scikit-learn

Statistical fundamentals

Regression analysis methods

Basic algorithms for data mining: classification, clustering

Model optimization: feature extraction

Data visualization: seaborn, matplotlib

2. Analysts who do not need external data:

SQL language

Python basics

Python scientific computing packages: pandas, numpy, scipy, scikit-learn

Statistics fundamentals

Regression analysis methods

Basic data mining algorithms: classification, clustering

Model optimization: feature extraction

Data visualization: seaborn, matplotlib

Next, let's look at what to learn in each part, and how to learn it.

Data acquisition: open data, Python crawler

If you only ever work with data from your company's own database and have no need for external data, you can skip this part.

There are two main ways to get external data.

The first is to obtain public datasets: some research institutions, companies, and governments release open data, which you can download from their sites. These datasets are usually well curated and of relatively high quality.

The other way to get external data is crawling.

For example, you can use a crawler to collect job postings for a given position from a job board, rental listings for a city from a rental site, the highest-rated movies on Douban, Zhihu's most-upvoted answers, or NetEase Cloud Music's most-commented tracks. Based on data crawled from the Internet, you can analyze a particular industry or a particular group of people.

Before crawling, you need some Python basics: data types (lists, dictionaries, tuples, etc.), variables, loops, and functions, plus how to use established Python libraries (urllib, BeautifulSoup, requests, scrapy) to implement a web crawler. If you are a beginner, it is recommended to start with urllib and BeautifulSoup. (PS: the subsequent data analysis stages also require this Python knowledge, so problems you run into later can be looked up in the same material.)
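To make the urllib + BeautifulSoup route concrete, here is a minimal sketch of the parsing half of a crawler. The HTML snippet and its class names are invented for illustration; in a real crawler the page would come from urllib.request.urlopen(url).read() instead of a string.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a job-board page; a real crawler would
# download this with urllib.request.urlopen(url).read().
html = """
<html><body>
  <div class="job"><h2>Data Analyst</h2><span class="salary">15k-25k</span></div>
  <div class="job"><h2>Data Engineer</h2><span class="salary">20k-35k</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk every job card and pull out the fields we care about.
jobs = [
    {"title": div.h2.get_text(),
     "salary": div.find("span", class_="salary").get_text()}
    for div in soup.find_all("div", class_="job")
]
print(jobs)
```

Once data is extracted as a list of dictionaries like this, it drops straight into pandas for the analysis steps described later.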

There are not many systematic crawler tutorials online; for getting started, crawling Douban's pages is recommended: on the one hand the page structure is relatively simple, and on the other hand Douban is relatively friendly to crawlers.

After mastering the basics, you will still need some advanced skills, such as regular expressions, simulating user login, using proxies, throttling the crawl frequency, and using cookie information, in order to cope with different sites' anti-crawler restrictions.
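Two of those techniques, sending a browser-like User-Agent header and throttling request frequency, can be sketched with the standard library alone. The URL is a placeholder, not a real endpoint:

```python
import time
import urllib.request

# Attach a browser-like User-Agent, one of the simplest ways past
# naive anti-crawler checks. The URL is a placeholder.
req = urllib.request.Request(
    "https://example.com/page",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

def polite_sleep(last_request_time, min_interval=1.0):
    """Sleep just long enough to keep at least min_interval seconds
    between consecutive requests, then return the current time."""
    elapsed = time.time() - last_request_time
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    return time.time()

# urllib stores header keys capitalized, e.g. "User-agent".
print(req.get_header("User-agent"))
```

In a real crawl loop you would call `polite_sleep` before each `urlopen(req)` so the target site is not hammered.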

Beyond that, commonly used e-commerce sites, Q&A sites, review sites, second-hand trading sites, dating sites, and job boards are all good practice targets. These sites yield data that is well suited to analysis and, most crucially, there is plenty of mature reference code available.

Data access: SQL language

You may wonder why Excel hasn't come up. For datasets under about 10,000 rows, Excel is fine for general analysis; once the data volume grows, it becomes overwhelmed, and a database solves that problem. Moreover, most companies store their data in SQL databases, so as an analyst you also need to know how to operate SQL to query and extract data.

SQL, as the classic database query language, makes it possible to store and manage huge amounts of data, and makes data extraction far more efficient. You need to master the following skills:

Extracting data under specific conditions: the data in an enterprise database is large and complex, and you need to extract just the part you need. For example, you can pull all sales data for 2018, the 50 best-selling items of the year, or the consumption data of users in Shanghai and Guangdong; SQL lets you do all of this with simple commands.

Adding, deleting, querying, and updating records: these are the most basic database operations, achievable with simple commands; you just need to remember them.

Grouping and aggregating data, and creating joins between multiple tables: these are SQL's more advanced operations. Associations across multiple tables are very useful when dealing with multi-dimensional data or multiple datasets, and they let you work with more complex data.
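The skills above can all be tried without a company warehouse. A sketch using Python's built-in sqlite3 module, with invented table and column names, shows conditional extraction, aggregation, a join, and grouping in one place:

```python
import sqlite3

# In-memory database standing in for an enterprise warehouse;
# the users/orders schema and the numbers are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL, year INTEGER)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Shanghai"), (2, "Guangdong"), (3, "Beijing")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(1, 1, 120.0, 2018), (2, 1, 80.0, 2018),
                 (3, 2, 300.0, 2018), (4, 3, 50.0, 2017)])

# Extraction under a condition: total sales in 2018.
cur.execute("SELECT SUM(amount) FROM orders WHERE year = 2018")
total_2018 = cur.fetchone()[0]

# Join + grouping: spend per city for selected cities.
cur.execute("""
    SELECT u.city, SUM(o.amount) AS spend
    FROM orders o JOIN users u ON o.user_id = u.id
    WHERE u.city IN ('Shanghai', 'Guangdong')
    GROUP BY u.city
    ORDER BY spend DESC
""")
by_city = cur.fetchall()
print(total_2018, by_city)
```

The same SELECT/JOIN/GROUP BY syntax carries over almost unchanged to MySQL or PostgreSQL in a company setting.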

Data preprocessing: Python (pandas)

Often the data we get is not clean: it contains duplicates, missing values, outliers, and so on. At that point you need to clean the data and handle the records that would distort the analysis, in order to reach more accurate conclusions.

Take air quality data, for example: many days go unmonitored because of equipment problems, some records are duplicated, and readings taken while equipment was failing are invalid. Or take user behavior data: many invalid operations are meaningless for the analysis and need to be deleted.

We then need to handle these cases with appropriate methods. For missing data, for instance, do we drop the record outright, or fill it in with neighboring values? These are all questions to consider.

For data preprocessing, learning to use pandas will let you cope with ordinary data cleaning without any trouble. You need to master the following:

Selection: data access by labels, specific values, Boolean indexes, etc.

Missing value handling: deleting or filling in rows with missing data

Duplicate value handling: identifying and deleting duplicates

Whitespace and outlier handling: clearing unnecessary spaces and extreme, abnormal values

Related operations: descriptive statistics, Apply, histograms, etc.

Merging: merge operations that conform to various logical relationships

Grouping: splitting data, applying functions separately, and recombining the results

Reshaping: quickly generating pivot tables
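Several of these steps, dropping duplicates, filtering outliers, filling missing values, and grouping, fit in a few lines of pandas. The air-quality-style data below is invented, with a sensor-failure reading of 9999 standing in for an outlier:

```python
import numpy as np
import pandas as pd

# Toy data with the usual problems: a duplicate row, a missing value,
# and an obvious outlier (9999 = failed sensor). Values are invented.
df = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Shanghai", "Shanghai", "Shanghai"],
    "pm25": [80.0, 80.0, np.nan, 45.0, 9999.0],
})

df = df.drop_duplicates()                                   # duplicate handling
df = df[(df["pm25"] < 1000) | df["pm25"].isna()].copy()     # crude outlier removal
df["pm25"] = df["pm25"].fillna(df["pm25"].mean())           # fill missing with mean

# Grouping: one descriptive statistic per city.
means = df.groupby("city")["pm25"].mean()
print(df)
print(means)
```

Whether to fill missing values with the mean, a neighboring value, or to drop the row entirely is exactly the judgment call discussed above; the mean is used here only for illustration.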

Knowledge of Probability Theory and Statistics

What does the overall distribution of the data look like? What are a population and a sample? How are basic statistics such as the median, mode, mean, and variance applied? If there is a time dimension, how do things change over time? How do you perform hypothesis tests in different scenarios? Most data analysis methods derive from statistical concepts, so statistical knowledge is essential. The required knowledge is as follows:

Basic statistics: mean, median, mode, percentiles, extreme values, etc.

Other descriptive statistics: skewness, variance, standard deviation, significance, etc.

Other statistical knowledge: population and sample, parameters and statistics, error bars

Probability distributions and hypothesis testing: common distributions, the hypothesis testing process

Other probability theory: conditional probability, Bayes' theorem, etc.
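The descriptive statistics and the hypothesis-testing process above can be tried in a few lines of Python. The sample below is invented (think of it as daily sales figures), and scipy is assumed to be available for the t-test:

```python
from statistics import mean, median, mode, pstdev

from scipy import stats

# Invented sample standing in for, say, daily sales figures.
sample = [12, 15, 15, 18, 20, 22, 25, 30]

# Basic descriptive statistics.
print(mean(sample), median(sample), mode(sample), pstdev(sample))

# One-sample t-test: is the sample mean compatible with a
# hypothesized population mean of 20?
t_stat, p_value = stats.ttest_1samp(sample, popmean=20)
print(t_stat, p_value)
```

A large p-value here means the data give no reason to reject the hypothesized population mean, which is the judgment described in the hypothesis-testing step.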

With basic statistical knowledge, you can already do elementary analysis. By visualizing the metrics that describe the data, you can draw a lot of conclusions: which items are the top 100, what the average is, what the trend of the last few years looks like, and so on.

For these visual analyses you can use the Python package Seaborn, which makes it easy to draw all kinds of plots and arrive at instructive results. Once you understand hypothesis testing, you can judge whether there is a real difference between a sample statistic and a hypothesized population value, and verify whether a result lies within an acceptable range.
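Seaborn is a layer on top of matplotlib, so a trend chart of the "last few years" kind mentioned above can be sketched with matplotlib directly. The yearly figures are invented, and the Agg backend is used so the script runs without a display:

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to file, no window
import matplotlib.pyplot as plt

# Invented yearly averages; real values would come from the earlier steps.
years = [2014, 2015, 2016, 2017, 2018]
avg_price = [18000, 21000, 25000, 31000, 34000]

fig, ax = plt.subplots()
ax.bar(years, avg_price)
ax.set_xlabel("Year")
ax.set_ylabel("Average price")
ax.set_title("Trend over the last few years")
fig.savefig("trend.png")

saved = os.path.exists("trend.png") and os.path.getsize("trend.png") > 0
print(saved)
```

Swapping `ax.bar` for `seaborn.barplot` gives the same chart with Seaborn's styling; the underlying figure/axes objects are identical.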

Python data analysis

If you have looked around, you know there are a lot of Python data analysis books on the market today, every one of them thick, which makes the barrier to learning feel high. In fact, only a small part of their content is truly essential. For example, just implementing hypothesis testing for different cases in Python already lets you validate your data quite well.

Mastering regression analysis, for example linear and logistic regression, lets you regress most data and draw reasonably accurate conclusions. DataCastle's training contests "house price prediction" and "job prediction", for instance, can both be tackled with regression analysis. This part requires the following knowledge:

Regression analysis: linear regression, logistic regression

Basic classification algorithms: decision trees, random forests, etc.

Basic clustering algorithms: k-means, etc.

Feature engineering basics: how to optimize the model with feature selection

Tuning methods: how to adjust parameters to optimize the model

Python data analysis packages: scipy, numpy, scikit-learn, and so on
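A house-price-style regression of the kind those training contests involve can be sketched with scikit-learn. The data here is synthetic, generated from a known linear relationship, so we can check that the fitted model recovers it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "house price" data: price = 50,000 + 300 * area + noise.
# Real contest data would replace this generated sample.
rng = np.random.default_rng(0)
area = rng.uniform(50, 150, size=200).reshape(-1, 1)
price = 50_000 + 300 * area.ravel() + rng.normal(0, 1_000, size=200)

# Fit a linear regression and predict the price of a 100 m^2 home.
model = LinearRegression().fit(area, price)
pred = model.predict([[100.0]])[0]
print(model.coef_[0], model.intercept_, pred)
```

Logistic regression follows the same fit/predict pattern via `sklearn.linear_model.LogisticRegression`, with class labels instead of continuous prices.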

At this stage of data analysis, focus on understanding regression analysis; with it, most problems can be solved. Combining descriptive statistics with regression analysis, you can already reach solid conclusions.

Of course, as you practice more you will encounter complex problems, and you may need to understand more advanced algorithms such as classification and clustering. You will then learn which algorithms suit which types of problems, and for model optimization you will need to learn how to improve prediction accuracy through feature extraction and parameter tuning. This starts to have the flavor of data mining and machine learning; in fact, a good data analyst can be considered a junior data mining engineer.
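Clustering, one of those more advanced methods, can be tried in the same scikit-learn style. The two well-separated point clouds below are synthetic, so k-means should split them cleanly:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic, well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Ask k-means for 2 clusters; with blobs this far apart it should
# assign one label per blob.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

same_a = len(set(labels[:50])) == 1  # all of blob A in one cluster
same_b = len(set(labels[50:])) == 1  # all of blob B in one cluster
print(same_a, same_b)
```

On real data the clusters are rarely this clean; choosing the number of clusters and the features to cluster on is where the feature-engineering and tuning skills above come in.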

System practice

At this point you have the basic ability to analyze data, but you still need hands-on practice with different cases and business scenarios. Once you can complete an analysis task independently, you have already surpassed most data analysts on the market.

How do you practice?

Start with the public datasets mentioned above: find data in a direction that interests you, and try analyzing it from different angles to see what valuable conclusions you can draw.

Another angle is to look for analyzable problems in everyday life and work; for example, the e-commerce, recruitment, and social platforms mentioned above all offer plenty of questions worth mining.

At first your ideas may not be well formed, but as you accumulate experience you will gradually find directions for analysis and learn the common analytical dimensions, such as top-N lists, averages, regional distribution, age distribution, correlation analysis, trend forecasting, and so on. With more experience you will develop your own feel for data, which is what we usually call data thinking.

You can also read industry analyst reports to see how good analysts frame a problem and which dimensions they analyze; it is actually not that hard.

Once you have mastered beginner-level analysis, you can also try data analysis contests, such as the three DataCastle training contests tailored for data analysts, where you can submit your answers to be scored and ranked:

Employee Departure Prediction Training Contest

King County Home Price Forecasting Training Contest in the U.S.

Beijing PM2.5 Concentration Analysis Training Contest

The best time to plant a tree was ten years ago; the second-best time is now. Go find a dataset and get started!