Cathy and I were confused about these concepts for a long time, until we met. We usually have breakfast on Wednesdays, and whenever we talk about this phenomenon, we share an uneasy feeling that a new trend really is emerging behind the noise, one that may be far-reaching and may represent a profound change in our entire cultural paradigm brought about by data. Being in the business ourselves, Cathy and I felt that we should play to our strengths and explore the reasons behind these phenomena rather than ignore them.
Before we dive in, we should introduce the era of big data, which has been hyped in the media and which you, like us, may find confusing and incoherent. This chapter then explains how we cut through the fog to find what lies behind it: Rachel decided to offer an introductory data science course at Columbia University, Cathy blogged the content of that course as it went along, and all of it was eventually collected into the book you are now reading.
1.1 The Big Data and Data Science Buzz
Let's put the hype aside first, since many of you are probably as skeptical about data science as we are. We raise this right at the start to let you know that we're just like you! If you are skeptical too, you are likely to contribute to the healthy development of data science, so that it can have a positive impact on society, gain legitimacy as a discipline, and take its place among the other disciplines.
Let's start by breaking down why big data and data science are so shrouded in confusion.
1. Most of the basic terms lack strict definitions. What exactly is big data? What does data science mean? What is the relationship between the two? Is data science the science of big data? Is data science only used by high-tech companies like Google and Facebook? Why do some people regard big data as cross-disciplinary (spanning astronomy, finance, technology, and so on) but data science as just a techie thing? And how big is big? These terms and concepts are so ambiguous that they are close to meaningless.
2. Researchers in data science, in both academia and industry, do not get the public respect they deserve. They have been hard at work in this field for many years, and that work builds on decades, if not centuries, of work by their predecessors in statistics, computer science, mathematics, engineering, and other disciplines. Yet the media tells the public that machine learning algorithms were invented just last week and that big data didn't exist before Google came along. This is ridiculous. Many of the methods and techniques in use today, and the challenges we face, simply evolved from methods, techniques, and challenges that already existed. We don't deny that new things and new technologies have emerged; we just feel that we should show due respect for history and for the findings of those who came before us.
3. The media has gone crazy. People heap laurels on data scientists and describe them as wizards who have mastered the mysteries of the universe, with a level of frenzy comparable to that before the financial crisis. Amid so much publicity it is easy to obscure the truth and distort the facts. The more noise the hype contains, the less truly valid information it carries. So the longer "big data" is inflated by the media, the more easily the public will be misled, and the harder it will be for them to recognize the truly beneficial aspects of the concept (if any).
4. Statisticians feel that what they are doing is data science. In other words, it's supposed to be their job. Put yourself in the shoes of a statistician, dear readers, and think about what it would be like to have someone take your job. Data science is also often trivialized in the media as a simple application of statistics and machine learning in the tech world. As we'll explain in the book, it's not as if we're putting "old wine" like statistics and machine learning into new bottles and calling it data science. It definitely qualifies as a separate discipline.
5. Anything that has to call itself a science is not really a science. There may be some truth in this statement, but it doesn't mean that the term data science is meaningless; what it represents may simply be less a science than some kind of technology.
1.2 Breaking Out of the Fog
Rachel's experience between getting her PhD in statistics and her job at Google may help answer some of our questions, as she says:
After I started at Google, I quickly realized that the things I used on the job were very different from what I had learned when I was doing my PhD in statistics. It's not that my knowledge of statistics was useless. On the contrary, what I learned in school provided a framework for thinking about problems, and much of what I learned in statistics provides a solid theoretical and practical foundation for my day-to-day work.
During my time at work, I have found it necessary to master many things I didn't learn in school, such as computation, programming, data visualization, and a good deal of domain knowledge. This experience is both unique and universal. I have a background in statistics, so these were the gaps I needed to fill; someone with a background in computer science, sociology, or physics would have to fill whatever gaps their own training left. Everyone has their own unique knowledge structure, and it's important that we work closely together as a team, complementing each other's strengths and weaknesses, to solve data problems.
Most people will draw the obvious conclusion from this story: once you enter the workplace, you realize that what you learned in school falls far short of what the actual work requires, so the statistics taught from textbooks will naturally differ from the statistical methods applied in industry. We have some views of our own on this.
Why should statistics in school be so different from statistics in industry? Why should many school programs be so disconnected from reality?
The difference is not only between statistics in school and statistics in industry. A common feeling among many data scientists is that their work exposes them to a broader body of knowledge, methodologies, and processes (see Chapter 2 for details), all of which have their foundations in statistics and computer science.
Strip away all the media gloss given to data science, and only one thing is true: data science is new. It has only just been born, yet it has been given so much glory that it has filled people with unrealistic fantasies that will eventually be shattered. We need to protect data science, because overhyping it may kill this emerging field prematurely.
Rachel decided to look into data science as a cultural phenomenon, and she wanted to find out how other people felt about it. She started talking to people at Google, to people at a lot of startups and tech companies, and to faculty at universities (especially statistics departments).
From these conversations, Rachel felt that the contours of data science were becoming clearer. She dug in further and decided to offer an introductory data science course at Columbia, while Cathy serialized the course's handouts on her blog. We expected that by the end of the course, we and our students would have a clear understanding of the nature of data science. Now that we have assembled the course into a book, we also hope to help more people understand data science.