Big data improves the efficiency of statistics and algorithms from the top down

When we develop these computational systems, whether the work is software or computation, we are really talking about how big data analytics is conceptualized: when things go wrong, how do we still reach a high level of accuracy? And that is only the beginning of the problem. In fact, as computational scientists we run into many problems, some of which are statistical, but we often do not have a statistician working jointly with us to think them through and solve them.

For example, take the consistency of a result: there is a theory of bootstrap procedures, and just as with ordinary bootstrap procedures there are limits that get reached. Running through all of this, from top to bottom, is a trade-off between computation and statistics. What does that mean? Our understanding of data computation is that more data requires more computation, more computational power. How do we deal with that? Parallel processing? Subsampling? From the statistician's point of view, if you give me more data I will be happier, because I can reach higher accuracy, make fewer errors, and get more correct answers at lower cost. For the people doing the computing, however, it is not such good news, and that leads to a new idea called algorithmic weakening: if I do not have much data, I can afford to process it with a strong, expensive algorithm; if I have too much data, processing slows down, so I switch to a weaker, faster one. From the computational point of view, weakening the algorithm lets me process the data faster. From the statistical point of view, being able to process more data still improves the quality of the answer. So while the computational budget stays the same, we are able to process more data, at a faster rate, and the price we pay is the weakening of the algorithm.
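As a rough illustration of that trade-off, here is a minimal sketch entirely of my own construction rather than anything from the talk: under a fixed budget of basic operations, a "strong" but expensive location estimator (a median of pairwise means) can only afford a small sample, while a "weak" but cheap one (a plain mean) can afford far more data, and the extra data can more than make up for the weaker algorithm.

```python
# Illustrative sketch of algorithmic weakening under a fixed compute budget.
# The estimators and the budget are assumptions of mine, not from the talk.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
TRUE_LOC = 1.0
BUDGET = 1_000_000   # fixed budget of basic operations (illustrative)

def strong_estimate(x):
    """Hodges-Lehmann estimator: median of all pairwise means, roughly O(n^2) work."""
    return np.median([(a + b) / 2 for a, b in combinations(x, 2)])

def weak_estimate(x):
    """Plain sample mean: O(n) work -- the deliberately 'weakened' algorithm."""
    return x.mean()

# Under the same budget, the strong algorithm can afford about sqrt(BUDGET)
# samples, while the weak one can afford about BUDGET samples.
n_strong = int(np.sqrt(BUDGET))   # ~1,000 samples
n_weak = BUDGET                   # 1,000,000 samples

x_strong = rng.normal(TRUE_LOC, 1.0, n_strong)
x_weak = rng.normal(TRUE_LOC, 1.0, n_weak)

print("strong algorithm, few samples :", abs(strong_estimate(x_strong) - TRUE_LOC))
print("weak algorithm,   many samples:", abs(weak_estimate(x_weak) - TRUE_LOC))
```

On a typical run the weak estimator, fed a thousand times more data within the same budget, comes out closer to the true value than the strong estimator does.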

Now look at this pair of coordinates, which you may not look at very often: the horizontal axis is the number of samples we take, and the vertical axis is the running time, and we ask how large the error is. Think of the risk as fixed, say a region where the error rate is 0.01. For a statistician, fixing the risk means there has to be a certain minimum number of samples before that result can be achieved; this is classical estimation theory, and it is very well understood. Similarly, on the computer-science side there is a corresponding lower bound: no matter how many samples you have, you must have enough running time, otherwise you cannot solve the problem. That point is very clear.
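To make the statistical side of that picture concrete, here is a small worked example with numbers of my own, using the simplest estimator I can think of: for the mean of n independent samples with variance sigma squared, the risk (mean squared error) is sigma squared over n, so holding the risk at 0.01 forces a minimum sample size.

```python
# Minimal illustration (my own numbers): fixed risk implies a minimum number
# of samples. For the sample mean of n i.i.d. observations with variance
# sigma^2, the mean squared error is sigma^2 / n.
sigma2 = 1.0        # assumed noise variance
target_risk = 0.01  # the fixed error level mentioned above

n_min = sigma2 / target_risk
print(f"risk <= {target_risk} requires n >= {n_min:.0f} samples")
# Below this sample size no amount of extra running time helps; above it,
# the question becomes how much running time the algorithm itself needs.
```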

So let's look at an actual algorithm. There is a certain running time and a fixed risk, and with the algorithms on the right-hand side of the plot, by weakening the algorithm we are able to handle more data. Let me say a little more about that with what we call the denoising problem. By denoising I mean that the data we see is a signal with noise added to it. How do we denoise? First we assume the possible answer, the signal X, lies in some set that we know to reasonable accuracy, and what we observe is a noisy version Y, so the inference is a kind of prediction. A natural estimate is to find the point in that set that is closest to Y. But the set containing X may be very complex, so that I cannot compute this projection; instead I replace the set with a convex relaxation of it, over which I can find the optimal point, and I need to keep that relaxation of a feasible size. For any fixed risk the answer depends on X: on one side there is the risk, on the other the complexity of the set, and the two are traded off against each other. For more on this you can look at the theoretical treatments in the literature, from which such equilibrium curves can be derived.
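To pin down that description, here is a minimal sketch under assumptions of my own: the unknown signal is sparse, we observe it plus Gaussian noise, and instead of projecting onto the combinatorial set of sparse vectors we project onto a convex relaxation of it, an L1 ball, which is easy to compute.

```python
# Sketch of denoising by projection onto a convex relaxation (my own setup).
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma = 1000, 10, 0.1

x = np.zeros(d)
x[:k] = 1.0                          # the true sparse signal
y = x + sigma * rng.normal(size=d)   # noisy observation Y = X + noise

def project_l1_ball(v, radius):
    """Euclidean projection onto {z : ||z||_1 <= radius} (sort-and-threshold)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Project Y onto the L1 ball that contains the true signal -- the convex
# relaxation of the sparsity constraint.
x_hat = project_l1_ball(y, radius=np.abs(x).sum())

print("risk of raw observation :", np.mean((y - x) ** 2))
print("risk of projected answer:", np.mean((x_hat - x) ** 2))
```

On a typical run the projected estimate has a much smaller risk than the raw observation; the L1 ball is just one convenient relaxation of sparsity, chosen here for illustration.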

Now look at the relationship: if you want to reach a certain risk, you have to have a certain number of sampling points. Here there is a set C, and maybe this C is also computationally difficult to work with, so we take a relaxation of C, a weakened version of it, so that we can compute with it more easily. We can build these relaxations in hierarchical tiers, nested domains ordered by their computational complexity. Each also has a statistical complexity, and there is a trade-off between the two; you can calculate this curve from the mathematics. Here is an example. Say the answer is X, and someone has just described what the relaxation means; then you can set the running time and the sampling complexity and work out the answer. Compare a simple C with a complex C: the running time decreases while the risk stays at a constant level, so the algorithm is simpler, can be used on big data without increasing the risk, and is also easier to analyse. If the signal has graph structure, the running time is determined by the parameters of the relaxation; with a constant number of samples you can compute column by column and still reach the accuracy we expect in constant running time. You can look at these formulas for yourselves.
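As one hedged illustration of such a hierarchy, again under assumptions of my own rather than the formulas referred to above: take the same sparse-denoising setup and compare a tighter relaxation (an L1 ball, whose projection requires a sort) with a much weaker one (a coordinate-wise box, whose projection is just clipping). The weaker relaxation runs faster but needs more averaged observations to reach the same risk, which is the kind of trade-off curve being described.

```python
# Two relaxations of the same sparsity set, traded off at a fixed target risk.
# All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, k, sigma, target_risk = 1000, 10, 0.5, 2e-3

x = np.zeros(d)
x[:k] = 1.0                     # true signal: k-sparse with entries in [-1, 1]

def project_l1_ball(v, radius):
    """Tighter relaxation of the sparsity set; projection costs O(d log d)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_box(v, bound):
    """Weaker relaxation (a box containing the same signals); projection is O(d)."""
    return np.clip(v, -bound, bound)

def samples_needed(project, **kwargs):
    """Smallest number m of averaged observations that reaches the target risk."""
    for m in range(1, 500):
        y_bar = x + sigma / np.sqrt(m) * rng.normal(size=d)  # average of m observations
        if np.mean((project(y_bar, **kwargs) - x) ** 2) <= target_risk:
            return m
    return None

print("L1-ball relaxation (slower projection), samples needed:",
      samples_needed(project_l1_ball, radius=float(k)))
print("box relaxation (faster projection),     samples needed:",
      samples_needed(project_box, bound=1.0))
```

On typical runs the box relaxation needs roughly an order of magnitude more observations than the L1 ball to reach the same risk, while each of its projections is cheaper; sweeping over a family of relaxations between these two extremes traces out the kind of curve mentioned above.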

So what I would like you to remember from this analysis is that, with this kind of theoretical computer science, the point is to be able to hold the accuracy at a given level. Because we care about the risk, the quality, the statistical side, the algorithms of computer science can help us solve the bigger problems, the problems that come with big data. At the same time we have a great deal of data theory that we can apply, and we should think not only in statistical terms but also in computational terms.

Perhaps you will also take some basic theory courses in statistics, and of course if you are studying statistics you should also take a computer science course. For those of you studying both disciplines, think about the two together: it should not be that statisticians think only about statistics and computer scientists think only about the computational side of things. We need to address the statistical side, the risk, as well, so that we can handle 100,000 sampling points without running into problems.