Let's start by looking at the process of conventional data analysis. It begins with a requirement and, after a number of intermediate stages, reaches data preprocessing. Most people who do data analysis start from that step: they get the data and begin preprocessing, analysis, modeling, and visualization or product output. What I want to share today is the part in between, from the moment we have a requirement to the data collection stage. Data analysis methods are needed there too, including a quantitative part within data collection itself: before we can collect anything, we need to quantify what we are going to collect.
First, understanding and communicating the requirement.
We receive a requirement. It may come from a customer, or it may be a single sentence from a leader. Different people communicate with very different levels of detail: some are very clear and list their needs one by one, while others talk around the subject at length and finally say, "my needs are roughly like this, you figure it out." That kind of requirement is very diffuse and open-ended.
At that point the topic may be summed up in a single sentence, and our first step is to refine it into a research question. Let me give an example: we all had breakfast in the cafeteria this morning. When eating boiled eggs, you may find that some eggshells peel off easily, you can almost blow them off, while others are very hard to peel. If the leader asked you to explain this with data analysis, where would you start?
So we get a very small question from everyday life: why are some boiled eggs hard to peel and some easy? When you get this question, isn't your first reaction "I need data"? Then the leader says, "you can go now." This is data we have to design and collect ourselves.
People who do regular data analysis are used to analyzing organized data provided by others, but deciding how to collect data and what data to collect is usually also part of data analysis work.
Look at a second example: almost every year the media reports that after the college entrance examination, the divorce rate among candidates' families rises. The Civil Affairs Bureau would very much like to dispel this rumor. If we were asked to do it, what could we do?
Look at a third example: for a long time there was a particularly hot post on Zhihu asking whether pure friendship between men and women exists.
When we get this requirement, as data analysts our first step is to break it into several sub-topics (sub-requirements). This splitting is not done out of thin air: when you hear the topic, several conclusions come to mind. For example, Zhihu has a couple of very highly upvoted answers. One says that pure friendship between men and women does exist, and "the uglier, the purer"; another says that pure friendship only appears after marriage. Because our research energy is limited, we need to split the big topic into several sub-topics and pick one or two to study first. So we split it into: do all men and women have pure friendships, or only some of them; "the uglier, the purer" - is friendship related to looks, what is the relationship, and how much do looks contribute; "friendship is only pure after marriage (or menopause)" - is friendship related to age; how pure is "pure"; and does purity change over time. These are the refined sub-studies, from which we select the content and then collect the data.
In the second step, once we have the research content, we need to operationalize the concepts.
Maybe you haven't heard this term, but we have certainly heard of another one: abstraction, also called conceptualization. Operationalization is the reverse of that process.
Take the hard-to-peel egg as an example. It can be operationalized into two parts: the completeness of the egg and the time it takes to peel the shell, where completeness measures how much of the egg remains after peeling. Through operationalization we turn a macro concept into measurable micro concepts. Once we have micro concepts, the next question is how to measure them. Peeling time can be measured directly with a timer, but how do we measure completeness? After peeling, the surface of the egg may be pitted, or half of the egg may have come off with the shell, and so on. There is a rigorous methodology for operationalizing concepts. Abstraction, when done by different people, often gives different results and has poor repeatability; operationalization, by contrast, is highly repeatable across people, because it follows a set of theory: define the concept, classify the concept, and design natural indicators.
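To make this concrete, here is a minimal sketch of how the two operationalized indicators could be recorded for each egg. The weight-ratio definition of completeness and all the field names are my own assumptions for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass

@dataclass
class PeeledEggRecord:
    """One observation for the boiled-egg study (field names are illustrative)."""
    egg_id: int
    peel_time_s: float        # peeling time, measured with a timer, in seconds
    weight_before_g: float    # weight of the boiled egg before peeling, in grams
    weight_after_g: float     # weight of the edible part left after peeling, in grams

    @property
    def completeness(self) -> float:
        """Share of the egg that survived peeling; 1.0 means nothing was lost."""
        return self.weight_after_g / self.weight_before_g

# usage: one row of collected data
rec = PeeledEggRecord(egg_id=1, peel_time_s=23.4, weight_before_g=55.0, weight_after_g=52.8)
print(round(rec.completeness, 3))  # -> 0.96
```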
After operationalization, we have to design indicators as well as measurement tools.
Returning to the topic of pure friendship between men and women, we define an indicator called "purity of friendship". This is not something that can be pinned down in one or two sentences, so we have to design a specialized measurement tool. Some people may say: I do data analysis, let the product manager handle this. That is usually how it is done, but there is often a big difference between data produced by someone who has mastered data analysis methods and data produced by someone who has not - the difference between something that can land and something that cannot.
With a measurement tool in hand, we next need to test it: validity, ease of use (for different groups), reliability, and sensitivity.
The measurement tool can be validated using data analysis methods: item analysis, exploratory factor analysis, confirmatory factor analysis, cluster analysis, IRT, and so on. If the measurement tool is not valid, nothing that follows will be valid, so the measurement tool is very important. This validation can take two years or more, and at the end you get a streamlined, effective measurement tool. Of course, there are now some well-established, already-validated scales; again, such scales are widely used in psychology.
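As one example of this kind of check, here is a minimal reliability sketch: Cronbach's alpha computed by hand on simulated Likert responses to a hypothetical 4-item "purity of friendship" scale. The items and the data are invented purely to show the mechanics.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of scale items (rows = respondents, cols = items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the summed scale score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# toy responses to a hypothetical 4-item scale (1-5 Likert), driven by one latent trait
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
responses = pd.DataFrame(
    {f"q{i}": np.clip(np.round(3 + latent + rng.normal(scale=0.8, size=200)), 1, 5)
     for i in range(1, 5)}
)
print(f"alpha = {cronbach_alpha(responses):.2f}")  # high, since the items are deliberately correlated
```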
So, after the measurement tool is tested, does data collection begin? Not yet.
The next step is theoretical model design.
In big data work, including data mining and related analyses, the usual approach involves an input layer and an output layer; that is the conventional model. But in the real world many models are not like this, for example Bayesian models. The people who raise the requirement certainly do not design these things; they may never have heard of Bayes or Markov. Only someone who understands data analysis methods can decide, based on the business, how many relationships are involved and whether each relationship is unidirectional or bidirectional. Such diagrams are first designed by researchers who understand the data analysis methods, and only then do we move on to the data collection stage.
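To show what "designing the diagram first" can look like in practice, here is a minimal sketch that writes a hypothesized relationship diagram down explicitly. The variables and the directions of the arrows are assumptions I made up for the friendship example, not an actual model from the study.

```python
import networkx as nx

# Hypothesized theoretical model for the "pure friendship" study (all edges are
# assumptions for illustration): a directed edge means "is assumed to influence".
model = nx.DiGraph()
model.add_edges_from([
    ("looks", "friendship_purity"),              # unidirectional: looks -> purity
    ("age", "friendship_purity"),                # unidirectional: age -> purity
    ("contact_frequency", "friendship_purity"),
    ("friendship_purity", "contact_frequency"),  # and back: a bidirectional relationship
])

# Listing the relationships tells us which variables the collection plan must cover.
print("variables to measure:", sorted(model.nodes))
print("number of hypothesized relationships:", model.number_of_edges())
```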
Data collection can be done by crawling the web, importing data directly from a database, collecting offline, and so on.
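For the database route, here is a minimal sketch of pulling already-collected records into a DataFrame; the table name and columns are placeholders, and a throwaway in-memory table is created only so the query actually runs.

```python
import sqlite3
import pandas as pd

# Minimal sketch of the "import directly from a database" route.
# The table name and columns are placeholders, not a real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendship_survey (respondent_id INTEGER, q1 INTEGER, q2 INTEGER)")
conn.executemany("INSERT INTO friendship_survey VALUES (?, ?, ?)",
                 [(1, 4, 5), (2, 2, 3)])

df = pd.read_sql_query("SELECT * FROM friendship_survey", conn)
conn.close()
print(df)
```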
Here we should also mention sampling. There are many kinds of sampling methods, and different theoretical models call for different ones. Take a case: in the north of China there is heating in winter, and heating produces air pollution. Does heating have an impact on human life expectancy?
How do we verify this with data analysis, and how do we collect the data? The conventional approach would be to take some people in the north and some in the south and see whether heating has an effect on life expectancy.
But this involves verifying causality, and a causal relationship has three prerequisites: first, the two events must be correlated; second, the cause must occur before the effect; and third, confounding factors must be controlled.
For this case, some scholars proposed an improvement on the conventional method, called regression discontinuity (breakpoint regression). Instead of sampling from the north and the south in general, they sampled around China's north-south dividing line. We know that life expectancy is related to many factors, and choosing people on the two banks of the Huai River effectively ensures that their living environment and other factors are roughly the same. They reached the conclusion that heating reduces people's life expectancy by 5.5 years. Another example: does getting into a tier-1 (first-batch) university have any impact on future development, and how big is it? The scholar took a province's tier-1 admission cutoff, went 5 points above and below it, and compared the later development of people within that 10-point band. This case again uses regression discontinuity. In other real-world scenarios people generally have to consider multiple factors, so do we need to cover every combination of those factors? Actually, no. Japanese statisticians invented orthogonal experimental design, which picks specific, well-covering combinations for collection.
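Here is a minimal sketch of a sharp regression discontinuity on simulated exam scores. Every number in it, including the cutoff at 500 and the true jump of 2.0, is invented to show the mechanics, not to reproduce the scholars' estimates.

```python
import numpy as np
import statsmodels.api as sm

# Minimal sharp regression-discontinuity sketch on simulated exam scores.
rng = np.random.default_rng(42)
cutoff = 500                                   # hypothetical tier-1 admission line
score = rng.uniform(450, 550, size=2000)       # running variable: exam score
treated = (score >= cutoff).astype(float)      # 1 = above the line (admitted)
outcome = 0.02 * score + 2.0 * treated + rng.normal(scale=1.0, size=2000)

# Keep only observations in a narrow band around the cutoff (like the +/- 5 points above).
band = np.abs(score - cutoff) <= 5
X = sm.add_constant(np.column_stack([score[band] - cutoff, treated[band]]))
fit = sm.OLS(outcome[band], X).fit()
print(fit.params)   # the coefficient on `treated` estimates the jump at the cutoff (~2.0)
```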
Next comes secondary sampling of the data.
Yoshinoya was running various promotions. After an improvement to the marketing platform, the experimental group had its display image replaced with a photo of a sexy female model plus promotional text, while the control group just used an ordinary picture with text. The result was very surprising: the experimental group's promotion performance was much lower than the control group's. To find the reason, secondary sampling was used. One secondary-sampling method is the PSM (propensity score matching) model, which matches control-group and experimental-group records one by one and can effectively correct sample selection bias.
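A minimal sketch of PSM, assuming logistic-regression propensity scores and one-to-one nearest-neighbour matching; the covariates and the data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Minimal propensity-score-matching sketch on simulated data.
# Covariates (age, past_orders) and group assignment are invented for illustration.
rng = np.random.default_rng(7)
n = 1000
X = np.column_stack([rng.normal(35, 10, n),        # age
                     rng.poisson(3, n)])           # past_orders
treated = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (X[:, 0] - 35) + 0.2 * (X[:, 1] - 3)))))

# 1) Estimate propensity scores: P(treated | covariates).
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2) For each treated unit, find the control unit with the closest propensity score.
ps_t, ps_c = ps[treated == 1].reshape(-1, 1), ps[treated == 0].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(ps_c)
_, idx = nn.kneighbors(ps_t)

# The matched pairs (treated unit i <-> control unit idx[i]) now have comparable
# covariate distributions, so outcome comparisons are less biased by self-selection.
print("matched pairs:", len(idx))
```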
These are the various kinds of quantitative work that need to be done before the data analysis itself.