What are the analytical misconceptions about big data?

1. Insufficient data sample size

When we analyze a niche feature or a specific user behavior, the audience may be small and usage low; or, while extracting the data, we may add many filter conditions or intersect several user behaviors and attributes, leaving us with very few user samples.

Conclusions drawn from such a small sample are likely to be wrong. So how large is large enough? There is no fixed threshold; it usually has to be judged in the context of the specific scenario.

Suggestion: lengthen the time window or drop unimportant filter conditions to obtain a sufficient sample size.
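
As a rough illustration of why tiny samples mislead, here is a minimal sketch (the click-through counts are hypothetical, and it uses a simple normal-approximation confidence interval) showing how the margin of error of a rate estimate shrinks as the sample grows:

```python
import math

def margin_of_error(successes, n, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical click-through counts: the same 10% rate, very different certainty.
for successes, n in [(3, 30), (30, 300), (300, 3000)]:
    moe = margin_of_error(successes, n)
    print(f"n={n:5d}  rate={successes/n:.1%}  ±{moe:.1%}")
```

With only 30 users the estimate is 10% plus or minus roughly 11 percentage points, which is essentially useless; at 3,000 users the same rate is pinned down to about plus or minus 1 point.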

2. Selection bias and survivorship bias

One of the major theoretical cornerstones of statistics is the central limit theorem.

Put simply, the means of random samples drawn from a population will cluster around the population mean.

Following this principle, we often draw random samples and estimate the population by analyzing them, and the conclusions will generally be close to the real situation. The question is whether the data we collect is actually random.
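
A minimal sketch of this principle, using a hypothetical skewed "playback count" population, shows that the means of repeated random samples stay close to the population mean:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: per-user daily playback counts (skewed, not normal).
population = [random.expovariate(1 / 20) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Means of many random samples cluster around the population mean.
sample_means = [
    statistics.mean(random.sample(population, 500)) for _ in range(200)
]
print(f"population mean ≈ {true_mean:.2f}")
print(f"sample means range {min(sample_means):.2f} .. {max(sample_means):.2f}")
```

The guarantee only holds when the samples are genuinely random, which is exactly what breaks down in the scenario below.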

Take a real business scenario: during the rollout of a software upgrade, metrics such as daily active users, plays per user, and playback time per user are compared to judge whether the new version is better received than the old one. This sounds reasonable, but it hides a selection bias: when a new version is released, the first users to upgrade are usually the most active ones. This group naturally outperforms the average user on these metrics, so higher numbers do not mean the new version is better.
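
A small simulation of this effect, assuming a hypothetical playback-time distribution and assuming the early upgraders are simply the most active tenth of users, shows how the biased sample inflates the metric even though nothing about the product changed:

```python
import random
import statistics

random.seed(7)

# Hypothetical population: daily playback time per user (minutes).
users = [random.gauss(30, 10) for _ in range(50_000)]
overall_mean = statistics.mean(users)

# Early upgraders tend to be the heaviest users: sample only the top 10% by activity.
early_upgraders = sorted(users)[-5_000:]
upgrader_mean = statistics.mean(early_upgraders)

print(f"all users:        {overall_mean:.1f} min")
print(f"early upgraders:  {upgrader_mean:.1f} min  (looks 'better' with zero product change)")
```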

3. Dirty data

Dirty data is data that is grossly implausible or meaningless for the actual business; it is usually caused by program bugs, third-party attacks, or network transmission anomalies.

This kind of data is particularly destructive: it can trigger program errors and seriously distort the accuracy of metrics.
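
As a sketch of one common mitigation (the durations and the plausibility range below are assumptions for illustration, not a universal rule), clearly impossible records can be filtered out before computing metrics:

```python
import statistics

# Hypothetical playback durations in seconds; a few records are clearly broken
# (a negative value from a client bug, an absurdly large value from a logging glitch).
durations = [120, 95, 300, -4, 180, 7_200_000, 240, 0, 150]

# Simple plausibility rule: a single session should last between 1 second and 24 hours.
clean = [d for d in durations if 1 <= d <= 24 * 3600]

print(f"raw mean:   {statistics.mean(durations):,.0f} s")   # dominated by the glitch
print(f"clean mean: {statistics.mean(clean):,.0f} s")
```

A single glitched record here drags the raw average from about three minutes to hundreds of thousands of seconds, which is why dirty data should be flagged or removed before any indicator is reported.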

That is all Qingteng Xiaobo has to share about the analytical misconceptions of big data. If you have a strong interest in big data engineering, I hope this article has helped you. If you want more tips and materials for data analysts and big data engineers, you can read the other articles on this site.