What is the big data problem?
In our view, computer science is largely about the management of resources. The most typical resources are time, space, and energy. Data was not previously treated as a resource; it was an object that consumed resources. Today, however, data is itself regarded as a resource, one from which we can extract value and knowledge. We combine the data resource with the time and space we spend on it into a system that lets us make timely, cost-effective, high-quality decisions, so we have to weigh these resources against one another. But data is very different from time and space. Give someone more time or more space and they are happier; it is not true that the more data you are given, the happier you will be. Walk into a company and ask what their biggest problem is, and they will usually tell you it is that they have too much data. As things stand, more and more data is going to cause us more and more problems. So we need solutions, and there are two kinds: a statistical approach and a computational approach. The statistical one is the subtler of the two, so we will spend more time on it.

1. The questions we ask grow in complexity faster than the data does. Data scientists often describe a database table in which the rows represent people and the columns record characteristics of those people. A basic database might have thousands of rows, meaning information about thousands of people, with only a handful of columns per person: age, address, height, income. That is enough to know something about each person in the database. Now consider millions of rows, because we are really interested in the details and the individuality of each person: whether you live in Tianjin, whether you like Michael Jackson, whether you like to ride a bicycle, your probability of suffering from a certain disease, and so on. So the number of rows, the number of people, keeps growing, and so does the number of columns describing them. We can keep adding columns: what this person ate yesterday, their music and reading preferences, characteristics of their genes, and so on. But the real problem is that we are not only interested in individual columns; we are interested in combinations of columns. Living in Tianjin, liking to ride a bicycle, and having apples as your favorite fruit is one specific combination of columns. And while the numbers of rows and columns grow only linearly, the number of column combinations we might want to examine grows exponentially. Take a medical example: one column records liver disease, with 1 meaning the person has liver disease and 0 meaning they do not, while other columns describe conditions that might be good predictors of liver disease. Suppose, hypothetically, that people who live in Tianjin, like to ride bicycles, and like to eat bananas tend to get liver disease.
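To make the combinatorial point concrete, here is a minimal sketch, assuming a hypothetical table with 100 descriptive columns (the column count is made up for illustration, not taken from the text): the number of single-column patterns grows linearly, but the number of multi-column combinations explodes.

```python
import math

# Hypothetical figure chosen only to illustrate the scaling.
n_cols = 100                     # descriptive columns per person

single_columns = n_cols          # patterns built from one column: linear growth
triples = math.comb(n_cols, 3)   # patterns built from any 3 columns
all_subsets = 2 ** n_cols        # patterns built from any subset of columns

print(f"single columns:        {single_columns}")
print(f"3-column combinations: {triples:,}")        # 161,700
print(f"all column subsets:    {all_subsets:.3e}")  # about 1.3e+30
```

With only 100 columns there are already 161,700 three-column patterns like "lives in Tianjin, rides a bicycle, eats bananas" to consider, which is why the number of candidate patterns outruns the data itself.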
If you then go to the doctor, and the doctor asks where you live and you say Tianjin, asks what you do on weekends and you say ride your bicycle, asks what your favorite fruit is and you say bananas, the doctor tells you that you need to have your liver checked. This is, of course, only a hypothetical. In any such data set you have to look at the data, make arguments, and find patterns that are actually meaningful. But as the data gets bigger and bigger, finding meaningful patterns and information gets harder and harder. So big data is not automatically a good thing, and having more data does not automatically lead to more knowledge. Big data is, in fact, the biggest trouble: turning data into knowledge keeps getting harder, and we need to take deliberate action if we want to arrive at anything truly meaningful. We statisticians worry about exactly this: how do we eliminate the noise and recover the knowledge the data really contains? Statistical procedures are algorithms that have to run on computers, and bigger data takes more time to run, so we can no longer make decisions as quickly. When the problems get really large, we no longer know how to run statistical programs fast enough to decide in time, which brings us to the second issue. The first is statistical; the second is computational.

2. Big data can prevent complex algorithms from running in an acceptable timeframe. This is the computational aspect. Algorithms need time to run: to read input, compute, and produce output. Some decisions must be made within seconds; online auctions, for example, decide in a few seconds, and within that window we have to feed data to the algorithm and get an answer back. When there is more data, the method may not finish at all, or it may take far too long to run. What do we do then? Do we discard data? And what does discarding cost us? Deleting data frees up space in the database, but it throws information away; keeping everything means the algorithms run slower and slower, until processing takes too long. We are facing big problems in which time, space, and data must be traded off against one another, with data sizes that keep growing and no algorithms that scale well enough to handle them. I think this is a truly fundamental, existential problem.
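To give a feel for the timing pressure, here is a small, purely illustrative sketch; the pairwise-distance routine and the row counts are my own stand-ins, not anything from the text. It times a procedure whose cost grows roughly quadratically in the number of rows, the kind of scaling that quickly exhausts a budget of a few seconds.

```python
import time
import numpy as np

def pairwise_distances(x):
    """All pairwise Euclidean distances between rows of x.
    Time and memory both grow roughly like n**2 in the number of rows."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
for n in (500, 1000, 2000, 4000):
    x = rng.normal(size=(n, 50))          # n people, 50 hypothetical columns
    start = time.perf_counter()
    pairwise_distances(x)
    elapsed = time.perf_counter() - start
    print(f"n = {n:5d}   elapsed ~ {elapsed:.3f} s")

# Doubling the number of rows roughly quadruples the time and memory,
# so a fixed decision budget of a few seconds is exhausted long before
# the data stops growing.
```

The point of the sketch is only the trend: when cost grows faster than the data, neither discarding data nor waiting longer is a satisfying answer, which is exactly the trade-off described above.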