Big Data, is there any requirement for cultural courses to study this
Big data is now a very broad career path. In industry, big data work is broadly divided into:

Infrastructure. This position provides the underlying storage and computing infrastructure for big data; the requirement is familiarity with Hadoop, Spark, and other distributed clusters.

Data warehouse. This position is closely tied to the business; the main work is to understand the business thoroughly and design a data warehouse that scales well with it. The requirement is the ability to write SQL and an understanding of data warehouse design.

Data analysis / data mining / algorithm development. This type of position is the application layer of big data, the work that really turns data into productivity.

These three types of positions depend on one another in the order 3 -> 2 -> 1.

When we talk about the "big data direction", we generally mean 3, data analysis / data mining / algorithm development. This position requires not only its own knowledge and skills but also an understanding of the technology in positions 1 and 2. For example, suppose you are building a recommendation system: to fetch the data and do simple analysis you use SQL (generally HiveQL); for more complex logic you have to write MR (MapReduce programs, the same below) or Spark programs; and for logic or scenarios that rules cannot solve, you need machine learning. From this example, you can see that the knowledge structure you need is:

The ability to write SQL-like queries (MySQL, HiveQL, etc.).

The ability to write MR or Spark programs, plus an understanding of how distributed clusters work; knowing the MapReduce principle is a bonus.

Knowledge of probability theory and statistics.

Machine learning algorithms.

The following expands on each of these four knowledge areas.

Writing SQL-like queries

The best way to learn SQL is to write it. SQL syntax is relatively simple, and there is no deep architecture or theory behind it, so the best way to learn is simply: just write SQL!
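As a minimal sketch of the kind of HiveQL-style aggregation you end up writing constantly, here is a query run through PySpark's spark.sql. The orders table and its columns are hypothetical, and a Hive-enabled Spark installation is assumed.

```python
# Minimal sketch: run a HiveQL-style aggregation from PySpark.
# The table `orders` and its columns are hypothetical examples.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hiveql-example")
         .enableHiveSupport()
         .getOrCreate())

# Typical analysis query: total spend and order count per user, top spenders first.
top_users = spark.sql("""
    SELECT user_id,
           COUNT(*)    AS order_cnt,
           SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY user_id
    ORDER  BY total_amount DESC
    LIMIT  10
""")
top_users.show()
```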

Writing MR or Spark programs

When you do data mining and machine learning, whether training a model or fetching data from Hive, you will often write MR or Spark programs. The data warehouse people load the data into Hive, and we do analysis and mining directly on the data we pull out of Hive. What we can then do with that data depends on statistics and machine learning algorithms. Training models is computationally intensive and is generally run on a distributed cluster (Hadoop or Spark), so you need to write distributed computing programs, that is, MR or Spark programs. Learning MR and Spark is also mostly an engineering matter: write more code. It is also worth reading Google's MapReduce paper to understand how MapReduce works.
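To make "writing an MR or Spark program" concrete, here is a minimal word-count sketch in PySpark: a map step that splits lines into words, followed by a reduce-by-key step that sums the counts, mirroring the model in Google's MapReduce paper. The HDFS input path is hypothetical.

```python
# Minimal word-count sketch in PySpark; the input path is a made-up example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-example").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///tmp/input.txt")       # read lines from HDFS
counts = (lines
          .flatMap(lambda line: line.split())      # map: line -> words
          .map(lambda word: (word, 1))             # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b))        # reduce: sum counts per word

for word, count in counts.take(10):
    print(word, count)
```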

Probability theory and statistics

As mentioned above, we do analysis and mining on the data pulled out of Hive, and what we can do with it comes down to algorithms; the mathematical foundation of those algorithms is probability theory and statistics. Two books by Prof. Chen Xiru of CSU are recommended here: "Probability Theory and Mathematical Statistics" and "Tutorial on Mathematical Statistics". Prof. Chen's treatment is very thorough; my feeling is that the books are hard to chew through, but once you have worked through them your understanding is very solid.
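As a tiny illustration of the statistics you lean on in this kind of analysis work, here is a sketch that computes a sample mean and a 95% confidence interval with scipy. The data values are invented purely for illustration.

```python
# Small illustration with made-up data: sample mean and a 95% confidence
# interval for the mean, using the t-distribution from scipy.
import numpy as np
from scipy import stats

# Hypothetical daily conversion rates observed over two weeks.
sample = np.array([0.031, 0.028, 0.035, 0.030, 0.027, 0.033, 0.029,
                   0.032, 0.026, 0.034, 0.030, 0.031, 0.028, 0.033])

mean = sample.mean()
sem = stats.sem(sample)                      # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=mean, scale=sem)

print(f"mean = {mean:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```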