Thinking in Data
Recommendations

The Internet has matured, and the Internet of Things is being built.

Everyone produces data, but only a few have the ability to play with it.

With data, insiders are the first to gain an almost prophetic perspective, while the rest of us can't even find our bearings!

From accurate advertising to predicting and influencing the U.S. presidential election, why is data so amazing?

First, the plain value of data

1. The value of data

a. What is data

Anything that can be recorded electronically is data.

This is not limited to numbers. It also includes electronically recorded content such as voice input, photos taken with digital cameras, and videos recorded on cell phones. The definition may seem narrow, but it helps us better understand how the data industry is changing and develop a view of data that keeps pace with the times.

b. What is the use of data

For any individual business, the value of data must be tied to the core demands of that business. Only when the business value of data is clearly stated will customers readily pay for data, data companies readily generate revenue, and the data industry suffer less confusion. So what is the value of data?

We can look at this issue from three aspects:

Revenue. The most typical example is Baidu's paid search advertising. Through in-depth analysis of user search data and accurate matching, it brings advertisers a large volume of traffic. The revenue growth this creates is the value of data.

Expenses. Based on information captured through IoT technology, a TV manufacturer realized that only 1% of the users of a particular TV set were still using the old VGA video interface. It decided to eliminate the interface, a decision that saved the company hundreds of millions of dollars in annual costs. This is the value of data in the form of reduced expenses.

Risk. Many commercial banks have online application systems, where the risk is generally higher than when contracts are signed offline. Data analytics can help them distinguish more accurately between good and bad online applicants. This is the indirect value that data brings to a company in the form of reduced business risk.

2. What is Data Thinking

To explain the most important concept in this book, data thinking, we have to introduce the statistical term regression analysis, a method for determining the quantitative relationship between two or more interdependent variables.

As the old saying goes: "Use the Tao to control the art, and use the art to drive the Tao." At the level of "Tao", regression analysis is a way of thinking, under whose guidance we can define "business problems" as "data-analyzable problems". At the level of "art", regression analysis is a practical data analysis tool, which will be introduced in the last chapter of this interpretation.

What kind of problem can be considered a data-analyzable problem? You need to find two kinds of variables:

Dependent variable Y: a variable that changes in response to changes in other variables. It represents the core demand of the business.

Independent variable X: a relevant variable used to explain the dependent variable Y. In layman's terms, a change in the independent variable X affects the dependent variable Y. The choice of X reflects the data analyst's insight into the business.

Example

Suppose Mr. A borrows 10,000 yuan from you. You might start by analyzing Mr. A's usual behavior, then consider whether your relationship is solid enough, whether he signed an IOU, what his family situation is, and so on, and finally gauge the likelihood that Mr. A will repay the money. Here the likelihood that Mr. A repays the money is the dependent variable Y, while his character, your relationship, the IOU, and his family situation are independent variables X.

Data thinking means defining "business problems" as "data-analyzable problems". The specific approach is to accurately locate the core business demand (the dependent variable Y) in a tangle of business problems, find the relevant factors that affect it (the independent variables X), and then use a variety of data analysis tools for further study.
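
As a minimal sketch of this framing, the loan example can be written down as records with one outcome Y and several factors X. The field names and all six records below are invented for illustration, and the yes/no encoding of each factor is an assumption, not something the book prescribes.

```python
from dataclasses import dataclass


@dataclass
class LoanCase:
    # Independent variables X (hypothetical yes/no encodings of the factors above)
    reliable_habits: bool
    close_relationship: bool
    signed_iou: bool
    stable_family: bool
    # Dependent variable Y: did the borrower repay?
    repaid: bool


# Made-up records, for illustration only
cases = [
    LoanCase(True, True, True, True, True),
    LoanCase(True, False, True, True, True),
    LoanCase(False, True, False, True, False),
    LoanCase(False, False, False, False, False),
    LoanCase(True, True, False, False, True),
    LoanCase(False, True, True, False, False),
]


def repayment_rate(records, factor):
    """Share of repaid loans (Y), split by whether one X factor is True or False."""
    rates = {}
    for value in (True, False):
        group = [r for r in records if getattr(r, factor) == value]
        rates[value] = sum(r.repaid for r in group) / len(group) if group else None
    return rates


for factor in ("reliable_habits", "close_relationship", "signed_iou", "stable_family"):
    print(factor, repayment_rate(cases, factor))
```

Once Y and the candidate X factors are named like this, comparing repayment rates across each factor is already a simple analysis of a data-analyzable problem.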

In the next chapter, we focus on the question, why is it so important to have a data mindset?

Second, what exactly is big data

Without an understanding of data analytics, it is easy to mythologize big data and credit it with magical powers. In fact, big data is not so mysterious. It is inextricably linked to statistics, which many people have already encountered.

1. The relationship between big data and statistics

In this issue, Prof. Hansheng Wang mentions at least two aspects of the relationship between big data and statistics:

a. The core concern of statistics is the analytical modeling of data and the use of models to describe business uncertainty, which contributes greatly to big data.

b. Big data is not a substitute for sampling. On the contrary, the bigger the data, the more important sampling becomes.
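
One way to see why sampling stays useful is a small simulation. The sketch below is not from the book: the population is synthetic, and the numbers (200,000 records, a sample of 1,000, a mean of 100) are arbitrary assumptions chosen only to show that a modest random sample already estimates the population mean closely.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "big data": 200,000 synthetic order amounts
population = [rng.gauss(100.0, 20.0) for _ in range(200_000)]

# A simple random sample of 1,000 records
sample = rng.sample(population, 1_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
# Standard error of the sample mean, estimated from the sample itself
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error about {std_error:.2f})")
```

The sample uses 0.5% of the records, and the standard error tells us how far the estimate is likely to be from the full-data answer.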

2. How accurate is big data

"Inaccurate prediction is the norm; accurate prediction is the anomaly." This quote from Prof. Wang punctures the rosy expectations that many people hold for predictions.

Why be so pessimistic about accuracy? The nature of science makes it so. Statistical studies turn up a large number of correlations, of which only a tiny fraction are causal relationships, yet the importance of causality remains irreplaceable.

Correlation: a non-deterministic interdependence of objective phenomena. Example: the rooster crows and the sun rises.

Causation: the relationship between a first event (the cause) and a second event (the effect), where the second event is considered to be the result of the first. Example: press the power button and the computer turns on.

We often confuse this pair of concepts. Sometimes events A and B, which are not even correlated, are superstitiously assumed to be causally related simply because they often occur together, which leads to plenty of laughable mistakes.

Therefore, distinguishing between correlation and causation is not only the key to understanding big data, but also a critical step in developing scientific literacy - saying no to pseudoscience!
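
The difference can be made concrete with a small simulation. This sketch is an illustration under assumed conditions, not an example from the book: a hidden common cause Z drives both A and B, so A and B are strongly correlated, yet setting A at random leaves B unchanged. The variable labels in the comments are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(1)
n = 5_000

# Hidden common cause Z (say, hot weather) drives both A and B.
z = [rng.gauss(0, 1) for _ in range(n)]
a = [zi + rng.gauss(0, 0.5) for zi in z]  # e.g. ice cream sales
b = [zi + rng.gauss(0, 0.5) for zi in z]  # e.g. sunburn cases

# Intervention: set A at random, independently of Z. B is generated as before.
a_forced = [rng.gauss(0, 1) for _ in range(n)]

observed = pearson(a, b)           # strong correlation in observational data
intervened = pearson(a_forced, b)  # close to zero once A is set by hand

print(f"observed correlation:   {observed:.2f}")
print(f"after intervening on A: {intervened:.2f}")
```

If A really caused B, forcing A to new values would move B as well. Here it does not, which is the signature of correlation without causation.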

Third, everyone should have a data mindset

Data thinking is a necessary form of literacy. We live in the information age, and all of us deal with data to some degree. Without data thinking, we are like people who speculate in markets without understanding economics, easily fooled into paying a "stupidity tax"!

1. Improve the efficiency of communication

At work, we often run into this situation: data experts speak the language of technology, while the requesting department talks about business problems (some data-analyzable, some not), and communication between the two sides never goes smoothly.

To solve this problem, professionals need to escape the curse of their own knowledge, the requesting department needs to overcome its fear of data, and the company needs to cultivate a data mindset from top to bottom. Decision makers need to recognize which problems are relevant to data, and requesting departments need to be able to articulate their core demands.

In this regard, Mr. Fan vividly describes having a data mindset as "being able to open your mouth and order the meat straight out of the pot".

This can greatly improve communication efficiency and maximize the value of data analysis!

2. Capture business opportunities

On the other hand, a data mindset may also be useful for entrepreneurs, especially in startups that have a strong connection to data. Having a data mindset can help entrepreneurs seize business opportunities, but it requires going through the following three steps:

a. In the direction my startup is taking, can data help me?

b. If data is important, sort out the dependent variable Y and independent variable X in the business.

c. At a strategic level, ensure that Y and X are supplied in high quality and accumulated over time.

3. Data thinking in life

If a person is not an entrepreneur, and the business problems they deal with have nothing to do with data analysis, what is the use of cultivating data thinking? In fact, data thinking can offer inspiration for most of the small matters in life. The key is how you use it.

First, developing a data mindset helps you develop a habit of thinking in a targeted way: what is the purpose of the analysis? What is the core demand? What is the dependent variable Y?

Second, once you've clarified the purpose, you'll be able to focus on the relevant independent variable, X, and you won't be stuck in a state of confusion where you can't see what you're focusing on.

Finally, you can try the simplest analyses. Professional modeling may be out of reach, but at least you can distinguish between correlation and causation.

Fourth, a variety of data analysis methods

Having read this far, are you interested in data analysis? The book also introduces several common data analysis tools. If you are interested, you can study them and then try using them to solve data-analyzable problems.

1. Regression analysis

At the level of "art", regression analysis is a family of statistical models. There are five main types: linear regression, 0-1 regression, ordinal regression, count regression, and survival regression.

Linear regression, more strictly known as ordinary linear regression, is characterized by the fact that the dependent variable Y must be continuous data, while few restrictions are placed on the explanatory variable X. In the world of data, linear regression can be applied to stock investment, customer lifetime value, healthcare, and so on.
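
A minimal sketch of ordinary linear regression with a single explanatory variable, fitted by least squares. The data points and the variable names `visits` and `spend` are made up for illustration and do not come from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: X = visits per month, Y = monthly spend (a continuous variable)
visits = [1, 2, 3, 4, 5]
spend = [12.0, 19.0, 31.0, 42.0, 48.0]

intercept, slope = fit_line(visits, spend)
print(f"spend = {intercept:.2f} + {slope:.2f} * visits")
```

The slope is the estimated change in Y for a one-unit change in X, which is the "quantitative relationship" that regression analysis is after.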

0-1 regression is a regression model in which the dependent variable Y takes only two possible values, 0 or 1. For example, gender is recorded as "male" or "female", a purchase decision is "buy" or "don't buy", and a cancer diagnosis is "has cancer" or "does not have cancer". 0-1 regression can be applied to Internet credit, personalized recommendation, social friend recommendation, and so on.
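
One common form of 0-1 regression is logistic regression, used here as an assumed stand-in since the summary does not name a specific model. The sketch below fits it with plain gradient ascent on made-up repayment data. The learning rate and step count are arbitrary choices.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(Y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed outcome minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Made-up data: X = past on-time repayments, Y = 1 if the new loan was repaid
on_time = [0, 1, 1, 2, 3, 3, 4, 5, 6, 6]
repaid = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(on_time, repaid)
p_low = sigmoid(b0 + b1 * 1)   # predicted repayment probability with 1 past repayment
p_high = sigmoid(b0 + b1 * 5)  # predicted repayment probability with 5 past repayments
print(f"P(repay | 1 past repayment)  = {p_low:.2f}")
print(f"P(repay | 5 past repayments) = {p_high:.2f}")
```

The model outputs a probability between 0 and 1 rather than a yes/no answer, which is what makes it useful for ranking applicants by risk.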

Ordinal regression is a regression model in which the dependent variable Y is ordinal data (data that expresses an order). For example, suppose we ask all book lovers to rate the author featured in this issue by degree of preference: 1 means like very much, 2 means like a little, 3 means neutral, 4 means dislike a little, and 5 means dislike very much. This is ordinal data. Common application scenarios for ordinal regression include movie ratings (1 to 5 stars) and satisfaction ratings for e-commerce products (1 to 5 stars).

Count regression. If the dependent variable Y is count data (a non-negative integer), the corresponding regression model is count regression. Count regression is often applied to the RFM model in customer relationship management (for example, the number of customer visits in a given period) and to studies of the two-child policy (the number of children a couple chooses to have).

Survival regression is short for survival data regression, i.e., regression models where the dependent variable Y is survival data (portraying how long a phenomenon or individual has survived), such as human lifespan, the lifespan of an electronic product, or how long a startup company has survived.

2. Data Visualization

The most basic data visualization method is the statistical chart, and a good statistical chart should meet four criteria: accurate, effective, concise, and attractive. Common statistical charts include bar charts, stacked bar charts, pie charts, histograms, line charts, scatter plots, box plots, and stem-and-leaf plots.

3. Machine Learning

Machine learning covers a large class of excellent methods for analyzing and modeling data, and it is a must for readers who aspire to become data scientists. The main methods it covers are naive Bayes, decision trees (including random forests), neural networks (including deep learning), and K-means clustering.
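
Of the methods listed, K-means clustering is the easiest to show in a few lines. The sketch below runs the standard assign-and-update loop on one-dimensional, made-up spending figures. The starting centers are an arbitrary assumption.

```python
def k_means_1d(values, centers, iterations=20):
    """K-means (Lloyd's algorithm) in one dimension, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters


# Made-up monthly spend figures with two visible groups
spend = [8, 10, 12, 11, 9, 95, 100, 105, 98, 102]
centers, clusters = k_means_1d(spend, centers=[0, 50])
print(centers)
print(clusters)
```

No labels are supplied here. The algorithm finds the low-spend and high-spend groups from the data alone, which is what distinguishes clustering from the regression models above.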

4. Unstructured data

Whether data is structured or unstructured is a relative and subjective concept. Of course, there is consensus on some cases: widely recognized kinds of unstructured data include Chinese text, data structures, images, and so on.

Case Study

The fact that text data is unstructured does not mean we cannot analyze it. Take the novel The Heaven Sword and Dragon Saber as an example. Whom does Zhang Wuji love most: Zhao Min, Zhou Zhiruo, Yin Li, or Xiao Zhao? The book uses data analysis to find the answer!

The first step is to extract the main characters of the novel and the titles by which they are called. The next step is to determine the unit of analysis, which here is the natural paragraph. Then comes the question of how to define "whom does Zhang Wuji love" as a data-analyzable problem. The book analyzes the characters from several angles, such as frequency of appearance, time of appearance, and intimacy. The most important of these is intimacy, which is measured by the number of times a character appears in the same natural paragraph as Zhang Wuji.

As the saying goes, love grows with time spent together. By this measure, Zhang Wuji has the most opportunities for intimacy with Zhao Min, so he is most likely to fall in love with her.

Note: Details of this case can be obtained from the WeChat official account Dog and Bear Club (ID: CluBear).
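
The intimacy measure described above can be sketched as a paragraph-level co-occurrence count. The paragraphs below are invented placeholders, not text from the novel, and the alias "the Princess" is a hypothetical title added only to show how a character's alternative names would be handled. The counts therefore say nothing about the book's actual result.

```python
from collections import Counter


def co_occurrence(paragraphs, protagonist, aliases):
    """Count paragraphs in which each character appears together with the protagonist.

    `aliases` maps a character's canonical name to every title used for them.
    """
    counts = Counter()
    for paragraph in paragraphs:
        if protagonist not in paragraph:
            continue  # the unit of analysis must mention the protagonist
        for name, titles in aliases.items():
            if any(title in paragraph for title in titles):
                counts[name] += 1
    return counts


# Invented placeholder paragraphs, NOT text from the novel
paragraphs = [
    "Zhang Wuji met Zhao Min at the inn.",
    "Zhou Zhiruo practised alone.",
    "Zhang Wuji and the Princess argued, while Xiao Zhao listened.",
    "Zhang Wuji thought of Zhou Zhiruo.",
]
aliases = {
    "Zhao Min": ["Zhao Min", "the Princess"],  # "the Princess" is a hypothetical alias
    "Zhou Zhiruo": ["Zhou Zhiruo"],
    "Xiao Zhao": ["Xiao Zhao"],
}

counts = co_occurrence(paragraphs, "Zhang Wuji", aliases)
print(counts.most_common())
```

Each character is counted at most once per paragraph, so the result is the number of shared paragraphs rather than the number of name mentions.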

Conclusion

This is a book that sharpens your thinking. It does not give you much methodology, it cannot change your life overnight, and you may even find it a little laborious to listen to. However, once in a while it is worth stepping out of your comfort zone to try to understand scientific questions you did not dare touch before, and then be surprised to find, "Oh! So that's it!" How is that not progress?

Author Introduction

Hansheng Wang

Professor in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, Director of the Center for Business Intelligence at Peking University, and founder of the WeChat official account "Dog and Bear Club". He is a Fellow of the American Statistical Association (2014), a recipient of the National Outstanding Youth Fund (2016), and an Associate Editor of several international academic journals, including the Journal of the American Statistical Association (JASA), the Journal of Business & Economic Statistics (JBES), the Journal of the Pan-Chinese Statistical Association (PCSA), and Science China Mathematics.

Essential Interpretation

The following is the essential interpretation of the book "Thinking in Data", provided for book lovers to study and reference. You are welcome to share it, but it may not be used for commercial purposes without permission.

Table of Contents

First, the plain value of data

Second, what exactly is big data

Third, everyone should have a data mindset

Fourth, a variety of data analysis methods

Text

Even a car with a powerful engine cannot reach its destination if the driver has no sense of direction. Big data is the same: without the data thinking that turns business problems into data-analyzable problems, big data cannot create business value, no matter how much it is mythologized.

Big data is very hot, but few people really know how to work with it. Professor Hansheng Wang is one of them. Amid the noise of new media, Prof. Wang takes a different approach, using a sincere and truth-seeking academic temperament to help us develop a data mindset in our work and life.