Three Roles in Big Data Mining


I'm new to data mining and machine learning. I only came into contact with them last July at Amazon, and only passively, because the job required it: I had no prior exposure, and the work was machine learning for demand forecasting. Later, after moving to Taobao, I spent a few months, on my own initiative and out of personal interest, on data mining related to user addresses, and came away with some shallow insights. In any case, corrections and discussion are welcome.

Also, note that the title of this article imitates the American TV drama "Game of Thrones: A Song of Ice and Fire". In the world of data, we see many really awesome, powerful, and interesting cases. But data, like a throne, symbolizes power and conquest, and the path up to it is just as gut-wrenching.

Three roles in data mining

While working on machine learning at Amazon, I noticed three roles among the people who work with data there.

Data Analyzer. This is the person who analyzes the data, finds patterns in it, and finds the Training Data that suits the data model in each scenario. These people are also the ones who clean the dirty data.

Research Scientist. This role focuses on building data models for different requirements. They jokingly refer to themselves as an otherworldly, singular species, like Sheldon from The Big Bang Theory. These people basically do science with data.

Software Developer. These are the software development engineers. Mostly, they implement the data models built by the Scientists and hand them to the Data Analyzers to work with. These people usually know more about the various machine learning algorithms.

I believe other companies doing data mining or machine learning have these same three kinds of work, or three kinds of people. As I see it:

The most technical is the Scientist, because this is the person who decides on the data modeling, the extraction of the most meaningful vectors, and the choice among different methods. I don't think you can find this type of person in this country.

The most laborious and tiring job belongs to the Data Analyzer, and their work is also the most important of the three roles (note: I used "most" three times). No matter how good your model and your algorithm are, on a pile of bad data they can only produce garbage. As the saying goes: Garbage In, Garbage Out! But this is the dirtiest and most exhausting job, and the one that most easily makes people give up.

The least skilled is the Software Developer. Many people working with data in China now think algorithms are the most important thing, and many technicians are studying machine learning algorithms. Wrong: the most important people are the two above, the Data Analyzer who painstakingly cleans the data and the Scientist who really understands data modeling. Things like K-Means, K Nearest Neighbor, Bayesian methods, regression, decision trees, and random forests are all very mature, and they are not Artificial Intelligence. To put it bluntly, these algorithms are to machine learning and data mining what Quick Sort and the like are to software design: they carry basically no technical challenge. Of course, I'm not saying that algorithms aren't important, only that they are the least important part of data processing as a whole.

Quality of data

The Buzz Word that is currently in vogue - Big Data - is quite misleading. In my eyes, data is not big or small, only good or bad.

What I feel most strongly when dealing with data is the importance of data quality. I will illustrate this with a few cases:

Case 1: The standard of data

At Amazon, every product has a unique ID called the ASIN (Amazon Standard Identification Number), which identifies the product uniquely (from the barcode). In other words, no matter how a listing describes the item, as long as the ASIN is the same, it is exactly the same item.

This is unlike Taobao, where a search for an iPhone returns a bunch of different iPhones: some are called "Value iPhone", some "Apple iPhone", some "Smartphone iPhone", some "iPhone White/Black" ...... Merchants use these different descriptions of the same product to attract users. But this causes two problems:

1) Bad user experience. A product-centered business model gives consumers a significantly better experience than a merchant-centered one.

2) As long as you can't read (recognize) the data correctly, whatever algorithms and models you put behind it are useless.

So, once you work with data, you will find that if data standards have not been established, nothing you do is of any use. Data standards are the first gate of data quality; without them, you cannot do anything. Giving each piece of data a unique identifier is only the most basic step. Data standards go beyond that: more importantly, the data must be abstracted into mathematical vectors, because without vectors there is nothing for the later mining to work on.
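To make the "unique identification" step concrete, here is a minimal sketch of collapsing free-form listing titles into one canonical key. It is only an illustration under loud assumptions: the `NOISE_WORDS` list and the `canonical_key` and `group_listings` helpers are hypothetical names invented here, not anything Amazon or Taobao uses, and a real catalog would need far more than word filtering.

```python
import re
from collections import defaultdict

# Hypothetical marketing and attribute words that sellers add to titles.
# A real list would be much longer and maintained by people who know the catalog.
NOISE_WORDS = {"value", "apple", "smartphone", "white", "black"}

def canonical_key(title: str) -> str:
    """Reduce a free-form listing title to a crude canonical key."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    kept = [t for t in tokens if t not in NOISE_WORDS]
    # Sort so that word order in the title does not matter.
    return " ".join(sorted(kept))

def group_listings(titles):
    """Group listing titles that collapse to the same canonical key."""
    groups = defaultdict(list)
    for title in titles:
        groups[canonical_key(title)].append(title)
    return dict(groups)

listings = ["Value iPhone", "Apple iPhone", "Smartphone iPhone", "iPhone White/Black"]
groups = group_listings(listings)
print(groups)
```

With this toy word list, all four titles end up under the single key "iphone". The hard part in practice is exactly the human work of deciding which words are noise and which distinguish real products.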

So you will see that much of the work of cleaning data is aggregating chaotic data, which amounts to establishing data standards. There is absolutely no shortage of human work here. It comes down to this:

Smart people define the standards before the data is created and clean the data as it is created.

Average people do this work after the data has been generated and has piled up in large quantities.

Also, a word about Amazon's ASIN. It has been around for over a decade, and nothing I read on Amazon's intranet says why they have such an ID. I don't think Amazon proposed a product ID because it was working on data mining; more likely it is because Amazon's business model is designed to be "product-centric". Even today the ASIN still has many, many problems: the same ASIN does not completely guarantee that the goods are the same, and different ASINs do not mean that the goods are different, but it holds for more than 90% of goods. Amazon has a dedicated Category Team, with many business people who work hard every day correcting ASIN data.

Case 2: Accuracy of data

User addresses are another kind of data I have worked on analyzing. I remember the excitement of seeing data on hundreds of millions of user addresses. But the excitement soon faded. Because the addresses were filled in by the users themselves, there were many pitfalls, and none of them was easy to handle.

The first is fake or wrong addresses, entered because some merchants cheat or some users are testing:

Some enter things like "the address does not exist" or "13243234asdfasdi". My program can recognize these.

Some are hard for my program to recognize, for example "Cosmic Road Earth neighborhood". But people can recognize these.

Some cannot be recognized even by a person, for example "Room 540, 5th Floor, China Southern Airlines Building, No. 23, Dongsihuanzhonglu, Beijing, China", which doesn't exist at all.
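The first kind, the addresses a program can catch, might be filtered with something like the sketch below. This is not the program I actually used; the `looks_fake` function, the `PLACEHOLDER_PHRASES` list, and the "single token with a long digit run" rule are all assumptions made up for illustration.

```python
import re

# Hypothetical placeholder phrases; in practice this list grows as you read the data.
PLACEHOLDER_PHRASES = ("does not exist", "no address")

def looks_fake(address: str) -> bool:
    """Heuristically flag addresses that are obviously not real."""
    text = address.strip().lower()
    if not text:
        return True
    # Users who are testing often type a literal placeholder sentence.
    if any(phrase in text for phrase in PLACEHOLDER_PHRASES):
        return True
    # Keyboard mashing: one unbroken token containing a long run of digits.
    if len(text.split()) == 1 and re.search(r"\d{5,}", text):
        return True
    return False

for sample in ["the address does not exist",
               "13243234asdfasdi",
               "Cosmic Road Earth neighborhood"]:
    print(sample, "->", looks_fake(sample))
```

Note that "Cosmic Road Earth neighborhood" passes this filter untouched, which is exactly the second kind of address: a person spots it at once, but simple rules do not.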

The second is real addresses that are hard to deal with because users don't write them in a standard way, for example:

Abbreviations: "Jianguomenwai Street" and "Jianwai Street", "Industrial and Commercial Bank of China" and "ICBC" ......

Misprints: "Chaoyangmen", "Tonghuihe" ......

Reversed order: "East Fourth Ring Middle Road, Chaoyang Park" and "Chaoyang Park (by East Fourth Ring)" ......

Aliases: some people write the developer's name for the neighborhood, "Dongheng International", while others write the administrative name, "Bali Zhuang Dongli" ......

There are too many examples to list. Clearly, inaccurate data makes everything harder to handle. There is a very good analogy: working with data is like mining gold. If the ore is rich in gold, digging is easy and results come quickly; if the gold content is low, digging is hard and the results are poor.
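For the abbreviation and alias cases, the simplest approach I can imagine is a hand-maintained lookup table, sketched below. The `ALIASES` table and the `normalize_address` function are hypothetical; the entries are just the examples from this article, and building the real table is the bulk of the human work.

```python
import re

# Hypothetical alias table: each known variant maps to one canonical spelling.
ALIASES = {
    "jianwai street": "jianguomenwai street",
    "icbc": "industrial and commercial bank of china",
    "dongheng international": "bali zhuang dongli",
}

def normalize_address(address: str) -> str:
    """Lower-case, collapse whitespace, and rewrite known variants."""
    text = " ".join(address.lower().split())
    # Replace longer variants first so a short alias never clobbers a longer one.
    for variant in sorted(ALIASES, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, ALIASES[variant], text)
    return text

print(normalize_address("Jianwai Street   ICBC"))
print(normalize_address("Dongheng International") == normalize_address("Bali Zhuang Dongli"))
```

A table like this does nothing for misprints or reordered parts, which need fuzzier matching, so it only raises the gold content a little.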

Above, I gave two cases, designed to illustrate -

1) Data has no size, only gold content and an amount of garbage.

2) How important data cleansing is, and how much human work it takes.

So, this work is best done bit by bit as the data is generated.

There is a saying that if your data is 60% accurate, users are bound to yell at you for what you've built. If it is around 80% accurate, users will say, not bad! Only when accuracy reaches 90% will users find it really awesome. But getting from 80% to 90% costs much more than getting from 60% to 80%. Most data mining teams stop at around 70%, because going beyond that is pretty exhausting work.

Business Scenarios for Data

I wonder how many data mining teams really realize the important relationship between business scenarios and data mining? We need to know that it is simply not possible to make data mining and analytics models that can satisfy all businesses.

Recommending music or video is a completely different scenario from recommending a product in e-commerce. In e-commerce, if you bought something and did not return it, I can believe with high probability that you like it. For music and video, however, you cannot simply conclude that a user likes a song or a video because they listened to it or watched it. So recommendation algorithms also differ completely in implementation difficulty across business scenarios.

Speaking of recommendation algorithms, do you, like me, sometimes get the feeling that a recommendation is just an algorithm that sorts by different dimensions? Personally, I think recommendation is rather tricky in some business scenarios. For example, there are two kinds of recommendation (I don't mean the split between user-based and item-based):

One is popularity-based recommendation, which recommends whatever is popular. That may be fine, but these may be things the user already knows about. For example, when I come to Beijing and want to find a restaurant, you always recommend roast duck; when I want to go somewhere, you always recommend Tiananmen, the Forbidden City, and the Temple of Heaven (because most people who come to Beijing come to eat roast duck and visit Tiananmen). I already know these well, so why do I need you to recommend them? In addition, popular things can usually be gamed by paid posters (the "water army").

The other is personalized recommendation, which requires analyzing the user's individual preferences. The good part is that it always gives me what I like. The bad part is that my tastes may change with age and environment, and always recommending what matches the user's tastes cannot help the user discover anything fresh. For example, I like spicy food, but if you always recommend Sichuan and Hunan cuisine, after a while I will get bored too.

Sometimes recommendation is not a democratic vote but a suggestion from a professional user or a veteran; sometimes it is not about what is popular but about what is fresh and unknown to me. As you can see, different business scenarios and different product forms can play out completely differently.
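As a toy illustration of the "you keep recommending what I already know" problem, the sketch below ranks items by popularity and then drops whatever the user has already had. Everything in it is assumed for illustration: the `HISTORY` data, the `popular_items` and `recommend` functions, and the simplification that "already knows" means "already appears in this user's history".

```python
from collections import Counter

# Hypothetical history: user -> things they have already tried.
HISTORY = {
    "alice": ["roast duck", "hot pot"],
    "bob": ["roast duck", "dumplings"],
    "carol": ["roast duck", "hot pot", "noodles"],
}

def popular_items(history):
    """Rank items by how many times they appear across all users."""
    counts = Counter(item for items in history.values() for item in items)
    return [item for item, _ in counts.most_common()]

def recommend(user, history, k=2):
    """Popularity ranking, minus what this user already knows."""
    seen = set(history.get(user, []))
    return [item for item in popular_items(history) if item not in seen][:k]

print(popular_items(HISTORY)[0])
print(recommend("alice", HISTORY))
```

Plain popularity would put roast duck first for everyone. Filtering out the known items is one crude way to push something fresher to the top, though it does nothing against rankings inflated by paid posters.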

Also, even within the same e-commerce business, books, cell phones, and apparel take completely different business forms. I used to do Demand Forecasting at Amazon - using historical data to predict users' future needs.

Books, cell phones, home appliances, and the like are called Hard Line products at Amazon. You can think of them as "standard products" (though not always). Forecasts for them are relatively accurate, and you can even predict demand by product attribute.

But for clothing and other so-called Soft Line products, Amazon has worked at it for more than a decade and still cannot forecast very well, because too many factors interfere: the user's preference for color and style, whether it fits, whether family and friends like it ...... These things change too easily. When more people buy an item, it may actually stop selling well, so there is no good way to predict it, let alone satisfy a Stock/Vendor Manager's request to "predict demand for a certain color of clothes or shoes from a certain brand".

For demand forecasting, I've found that people who have been in the industry for a long time are the best at it, and machine learning is a crapshoot. Machine learning only makes sense if you are dealing with thousands of different goods and categories.

Data mining is not AI, and it's not even close. Don't think that data mining can do everything; finding a suitable business scenario and product form matters more than anything else.

Data analysis results

I see many people who work with big data basically doing statistics: presenting statistical data along a number of dimensions. The simplest and most common example is website statistics: how much PV, how much UV, where visitors come from, and the distribution of browsers, operating systems, geography, search engines, and so on.

An aside: don't think that a dozen terabytes of logs a day counts as data, and don't think that analyzing logs with Hadoop/MapReduce is data mining. To put it bluntly, you are doing nothing more than statistics. Those terabytes of Raw Data basically mean nothing; they can only be called logs, not even data. Only the statistics you derive from them are somewhat meaningful and can be called data.
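To show how little is going on in this kind of "analysis", here is a minimal sketch that turns raw log lines into PV, UV, and a browser distribution. The `LOGS` records, their three-field layout, and the `summarize` function are hypothetical; real logs are messier, but the work is the same counting.

```python
from collections import Counter

# Hypothetical log records: (visitor_id, page, browser).
LOGS = [
    ("u1", "/home", "Chrome"),
    ("u1", "/item/42", "Chrome"),
    ("u2", "/home", "Firefox"),
    ("u3", "/home", "Chrome"),
    ("u2", "/cart", "Firefox"),
]

def summarize(logs):
    """Compute page views, unique visitors, and browser counts per visitor."""
    pv = len(logs)  # PV: one page view per log line
    uv = len({visitor for visitor, _, _ in logs})  # UV: distinct visitors
    # Count each visitor's browser once, using the last one seen.
    browser_by_visitor = {visitor: browser for visitor, _, browser in logs}
    browsers = Counter(browser_by_visitor.values())
    return {"pv": pv, "uv": uv, "browsers": dict(browsers)}

stats = summarize(LOGS)
print(stats)
```

Running the same counting on terabytes with Hadoop/MapReduce changes the scale, not the nature of the work: it is still statistics.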

Consider a merchant confronted with data about their own online store: 5 in every thousand visitors place an order, 65% of visitors are men, 30% are 18-24 years old, and so on. You might even tell them that they beat 40% of merchants of the same type. Faced with these statistics, most merchants have absolutely no idea what to do. Should I make my website a bit more masculine, or a bit more appealing to young people? They are completely at a loss.

If you look around, you'll see that a great deal of data analysis looks good but leaves people with absolutely no idea what to do next.

So I think the point of data analysis is not just to present the data but to show what can be done with it. If you look at the results of an analysis and still do not know what to do, the analysis has failed.

Summary

In summary, here are what I think are the most important things in data mining or machine learning:

1) The quality of the data. This divides into the standard of the data and the accuracy of the data. Noise in the data should be eliminated as much as possible. Data quality cannot do without a lot of human work.

2) The business scenario of the data. We cannot cover every scenario, so the business scenario and product form are very important. I personally feel that the narrower the business scenario, the better.

3) The results of data analysis. People should be able to read and understand them and know what to do next, rather than producing data for data's sake.

Many people are engaged in data mining, but there are not many success stories (compared with the large number of attempts). For the moment, it seems to me that current data mining technology is a transitional technology, still in the fumbling stage. In addition, quite a few data mining teams end up being neither business nor technology, and I feel sorry for the technicians in them ......

Sorry, I only raised questions and gave no suggestions, which also shows that there are many opportunities in data analytics ......

Finally, one more thing worth mentioning is "privacy in data", which seems like that unethical black magic where you have to turn yourself dark to succeed. Yes, data is like a throne, symbolizing power and conquest, but the path to it is just as frightening.
