Do you need to learn programming for big data?

Introduction:

Chapter 1: Getting to Know Hadoop

Chapter 2: A More Efficient WordCount

Chapter 3: Getting Data from Elsewhere onto Hadoop

Chapter 4: Getting Data from Hadoop to Elsewhere

Chapter 5: Faster, My SQL

Chapter 6: Polygamy

Chapter 7: More and More Analytics Tasks

Chapter 8: My Data in Real Time

Chapter 9: My Data to the Outside World

Chapter 10: The Much-Hyped Machine Learning

I am often asked by beginners, on my blog and over QQ, what technologies they should learn and what learning route to follow if they want to move into big data, because they think big data is very hot, with good job prospects and high salaries. If you are confused and want to head in the big data direction for these reasons, that's fine, but then let me ask: what is your major, and what interests you about computers and software? Are you a computer science major interested in operating systems, hardware, networking, and servers? A software major interested in software development, programming, and writing code? Or a math and statistics major with a particular interest in data and numbers?

I ask because these correspond to the three development directions of big data: platform construction/optimization/operations/monitoring, big data development/design/architecture, and data analysis/mining. And please don't ask me which one is easy, which one has good prospects, and which one pays more.

First, let's go over the 4V characteristics of big data:

Volume: large data volumes, from TB up to PB;

Variety: many data types, including structured data, unstructured text, logs, video, images, geographic locations, and so on;

Value: high commercial value, but that value is buried in massive amounts of data and has to be mined out quickly through data analysis and machine learning;

Velocity: high timeliness requirements; the demand for processing massive data is no longer limited to offline computing.

Nowadays, to cope with these characteristics of big data, there are more and more open source big data frameworks, and they keep getting more powerful. Let's first list some common ones:

File Storage: Hadoop HDFS, Tachyon, KFS

Offline Computing: Hadoop MapReduce, Spark

Streaming, real-time computing: Storm, Spark Streaming, S4, Heron

K-V, NoSQL databases: HBase, Redis, MongoDB

Resource management: YARN, Mesos

Log collection: Flume, Scribe, Logstash, Kibana

Messaging system: Kafka, StormMQ, ZeroMQ, RabbitMQ

Query analytics: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Flink, Kylin, Druid

Distributed coordination service: ZooKeeper

Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager

Data mining, machine learning: Mahout, Spark MLlib

Data synchronization: Sqoop

Task Scheduler: Oozie

......

In the blink of an eye, more than 30 are listed above. Never mind being proficient in all of them; I suspect not many people can even use them all.

Personally, my main experience is in the second direction (development/design/architecture), so that is the perspective the following advice is written from.

Chapter 1: Getting to Know Hadoop

1.1 Learn to use Baidu and Google

No matter what the problem is, try searching and solving it yourself.

Google first; if you can't get through to it, use Baidu.

1.2 The official documentation is the preferred reference

Especially for getting started, the official documentation is always the first choice.

I believe most people who work in this field are educated enough that English documentation is not a problem. If you really can't read it, go back to 1.1.

1.3 Getting Hadoop up and running

Hadoop can be considered the granddaddy of big data storage and computing; most open source big data frameworks now either rely on Hadoop or are compatible with it.

About Hadoop, you need to figure out at least what the following are:

Hadoop 1.0, Hadoop 2.0

MapReduce, HDFS

NameNode, DataNode

JobTracker, TaskTracker

Yarn, ResourceManager, NodeManager

Build Hadoop yourself by following 1.1 and 1.2; the goal is simply to get it running.

It is recommended to install from the command line using the release packages first, rather than using a management tool.

Also: it is enough to know about Hadoop 1.0; Hadoop 2.0 is what is used now.

1.4 Try to use Hadoop

HDFS directory operation commands;

Commands for uploading and downloading files;

Submitting and running the MapReduce sample program;

Opening the Hadoop web interface to view job status and job logs;

Knowing where the Hadoop system logs are (a minimal command sketch follows this list).
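
To make 1.4 concrete, here is a minimal shell sketch. The paths and the location of the examples jar are assumptions that vary by Hadoop version and installation, so adjust them to your environment.

# list the HDFS root directory and create a working directory
hdfs dfs -ls /
hdfs dfs -mkdir -p /user/hadoop/input

# upload a local file to HDFS, then download it back
hdfs dfs -put ./words.txt /user/hadoop/input/
hdfs dfs -get /user/hadoop/input/words.txt ./words_copy.txt

# submit the bundled WordCount example (the jar path differs per release)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# view the result; job status and logs are in the YARN web UI (port 8088 by default)
hdfs dfs -cat /user/hadoop/output/part-r-00000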

1.5 It's time for you to understand how they work

MapReduce: how to divide and conquer;

HDFS: where the data is actually stored, and what a replica is;

What Yarn really is, and what it can do;

What NameNode is really doing;

What ResourceManager is doing;

1.6 Write your own MapReduce program

Following the WordCount example, write your own (it's OK to copy it) WordCount program

Package it up and submit it to Hadoop to run.

You don't know Java? Shell or Python will do; there is a thing called Hadoop Streaming.
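
For example, a Streaming job can be submitted from the shell roughly like this; mapper.py and reducer.py are hypothetical scripts you would write yourself, and the streaming jar path is an assumption that depends on your Hadoop version (newer releases prefer -files over -file).

# run a streaming WordCount with your own mapper/reducer scripts (hypothetical file names)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output_streaming \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py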

If you're serious about completing the above steps, congratulations, you've got one foot in the door.

Chapter 2: A More Efficient WordCount

2.1 Learn Some SQL

Do you know databases? Can you write SQL?

If not, learn some SQL.

2.2 SQL version of WordCount

How many lines of code did you write (or copy) for the WordCount program in 1.6?

Let me show you mine:

SELECT word,COUNT(1) FROM wordcount GROUP BY word;

This is the charm of SQL: what takes dozens or even hundreds of lines of code to program, I get done with this one statement. Using SQL to process and analyze data on Hadoop is convenient, efficient, easy to pick up, and increasingly the trend. Whether for offline or real-time computing, more and more big data processing frameworks are actively providing SQL interfaces.

2.3 Hive for SQL On Hadoop

What is Hive? The official explanation is:

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.

Why is Hive a data warehouse tool and not a database tool? Some of you may not know what a data warehouse is. A data warehouse is a logical concept; underneath, it uses databases. Data in a data warehouse has two characteristics: it is the most complete historical data (massive), and it is relatively stable. "Relatively stable" means that, unlike a business system database where data is updated frequently, once data enters the data warehouse it is rarely updated or deleted; it is only queried in large volumes. Hive has both of these characteristics, so Hive is a tool suited to building data warehouses over massive data, not a database tool.

2.4 Installing and Configuring Hive

Please refer to 1.1 and 1.2 to complete the installation and configuration of Hive, until you can enter the Hive command line normally.

2.5 Try out Hive

Please refer to 1.1 and 1.2 to create the wordcount table in Hive and run the SQL statement in 2.2.
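
Here is a minimal sketch for 2.5, run from the shell. It assumes the input file has one word per line, so the table only needs a single word column; adjust the path and layout to your data.

# create a one-column table, load local data into it, then run the query from 2.2
hive -e "CREATE TABLE IF NOT EXISTS wordcount (word STRING);"
hive -e "LOAD DATA LOCAL INPATH './words.txt' INTO TABLE wordcount;"
hive -e "SELECT word, COUNT(1) FROM wordcount GROUP BY word;"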

Find the SQL task you just ran in the Hadoop WEB interface.

See if the results of the SQL query match the results in MapReduce in 1.4.

2.6 How Hive works

Why do you see a MapReduce task in the Hadoop WEB interface when you clearly wrote SQL?

2.7 Learning basic Hive commands

Creating and deleting tables;

Loading data to a table;

Downloading data from a Hive table;

Refer to 1.2 to learn more about Hive syntax and commands.
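
As one way to "download data from a Hive table", here is a hedged sketch; the output directory, file names, and the wordcount table are assumptions carried over from 2.5.

# export query results to a local directory (Hive writes delimited text files there)
hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/wordcount_out' SELECT word, COUNT(1) FROM wordcount GROUP BY word;"

# or simply redirect the output of the hive CLI to a local file
hive -e "SELECT word, COUNT(1) FROM wordcount GROUP BY word;" > ./wordcount_result.txt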

If you have followed the process of Chapter 1 and Chapter 2 in "Words for Beginners in Big Data Development" carefully and completely, then you should already have the following skills and knowledge:

The difference between Hadoop 1.0 and Hadoop 2.0;

The principles of MapReduce (or that classic question: given a 10GB file and only 1GB of memory, how do you use a Java program to find the 10 most frequent words and their counts?);

The process of reading and writing data in HDFS; how to PUT data to HDFS and how to download data from HDFS;

Being able to write and run a simple MapReduce program of your own, and knowing where to view its logs;

Being able to write simple SQL statements with SELECT, WHERE, GROUP BY, and so on;

The general process of converting Hive SQL to MapReduce;

Common statements in Hive: creating a table, deleting a table, loading data into a table, partitioning, and downloading the data from a table locally;

From the learning above, you now know that HDFS is the distributed storage framework provided by Hadoop and can store massive amounts of data; MapReduce is the distributed computing framework provided by Hadoop and can be used to count and analyze the massive data on HDFS; and Hive is SQL On Hadoop: Hive provides a SQL interface, developers only need to write simple, easy-to-use SQL statements, and Hive translates that SQL into MapReduce and submits it to run.

At this point, your "big data platform" looks like this:

So the question is, how do you get massive amounts of data to HDFS?

Chapter 3: Getting data from elsewhere onto Hadoop

This can also be called data collection, where data from various data sources is collected onto Hadoop.

3.1 The HDFS PUT command

You should have used this one earlier.

The PUT command is also commonly used in production environments, usually in combination with a scripting language such as shell or Python.

It is recommended that you become proficient with it.
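
A minimal sketch of the kind of shell wrapper this refers to; the log paths, the date-based directory layout, and GNU date are all assumptions about your environment.

#!/bin/bash
# upload yesterday's log files into a date-partitioned HDFS directory
day=$(date -d "1 day ago" +%Y%m%d)
hdfs dfs -mkdir -p /data/logs/${day}
for f in /var/log/myapp/*.${day}.log; do
    hdfs dfs -put "$f" /data/logs/${day}/
done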

3.2 HDFS API

HDFS provides an API for writing data, so you can write data to HDFS in your own programming language; the PUT command itself uses this API.

In real environments you rarely write your own programs against the API to write data to HDFS; you usually use methods already encapsulated by other frameworks, for example Hive's INSERT statement, Spark's saveAsTextFile, and so on.

It is recommended that you understand the principle and can write a demo.

3.3 Sqoop

Sqoop is an open source framework mainly used for exchanging data between Hadoop/Hive and traditional relational databases such as Oracle, MySQL, and SQL Server.

Just as Hive translates SQL into MapReduce, Sqoop translates the parameters you specify into MapReduce, submits them to Hadoop to run, and completes the data exchange between Hadoop and other databases.

Download and configure Sqoop yourself (Sqoop1 is recommended first; Sqoop2 is more complicated).

Understand the common configuration parameters and methods of Sqoop.

Using Sqoop to complete the synchronization of data from MySQL to HDFS;

Using Sqoop to complete the synchronization of data from MySQL to Hive table;

PS: If you later decide to use Sqoop as your data exchange tool, you should become proficient with it; otherwise it is enough to understand it and be able to run a demo such as the one sketched below.
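
A hedged Sqoop1 sketch of the two tasks above; the MySQL connection string, credentials, and table names are assumptions.

# MySQL table -> HDFS directory
sqoop import \
  --connect jdbc:mysql://dbhost:3306/testdb \
  --username dbuser --password dbpass \
  --table orders \
  --target-dir /data/sqoop/orders \
  -m 1

# MySQL table -> Hive table (Sqoop creates and loads the Hive table)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/testdb \
  --username dbuser --password dbpass \
  --table orders \
  --hive-import --hive-table default.orders \
  -m 1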

3.4 Flume

Flume is a distributed framework for collecting and transferring massive logs. Because it is a "collection and transfer framework", it is not suitable for collecting and transferring data from relational databases.

Flume can collect logs in real time from network protocols, messaging systems, and file systems, and transfer them to HDFS.

So if your business has data from these data sources and needs to capture it in real time, then you should consider using Flume.

Downloading and configuring Flume.

Using Flume to monitor a file that is constantly appending data and transferring the data to HDFS;

PS: Configuring and using Flume is relatively complex; if you do not have enough interest and patience, you can skip Flume for now. A minimal configuration sketch follows for reference.
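
A minimal Flume agent sketch for 3.4: an exec source tails a file and an HDFS sink writes the events out. The file path, HDFS path, and agent name are assumptions, and exact property names can differ slightly between Flume versions.

# tail_to_hdfs.conf: one exec source, one memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/data/flume/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1

# start the agent with the config file above
flume-ng agent --conf conf --conf-file tail_to_hdfs.conf --name a1 -Dflume.root.logger=INFO,console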

3.5 Alibaba's open source DataX

The reason I introduce this is that the tool our company currently uses to exchange data between Hadoop and relational databases is a secondary development based on an earlier version of DataX, and it works very well.

You can refer to my blog post "Heterogeneous data sources massive data exchange tool-Taobao DataX download and use".

Now DataX is version 3.0 and supports many data sources.

You can also do secondary development on top of it.

PS: If you are interested, you can study and use it to compare it with Sqoop.

If you have completed the above study and practice seriously, at this time, your "big data platform" should be like this:

Chapter 4: Getting the data on Hadoop to go somewhere else

Earlier we introduced how to collect data from data sources onto Hadoop. Once the data is on Hadoop, you can analyze it with Hive and MapReduce. The next question is: how do you synchronize the analysis results from Hadoop to other systems and applications?

In fact, the approach here is basically the same as in Chapter Three.

4.1 The HDFS GET command

GET files from HDFS to the local filesystem. You should be proficient with this.
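
For example, a minimal sketch (the paths are assumptions):

# download a result file from HDFS to the local filesystem
hdfs dfs -get /user/hadoop/output/part-r-00000 ./wordcount_result.txt

# or merge all part files of a directory into one local file
hdfs dfs -getmerge /user/hadoop/output ./wordcount_merged.txt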

4.2 HDFS API

Same as 3.2.

4.3 Sqoop

Same as 3.3.

Using Sqoop to complete the synchronization of files on HDFS to MySQL;

Using Sqoop to complete the synchronization of the data in Hive tables to MySQL (both tasks are sketched below);
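
A hedged Sqoop export sketch of the two tasks above; the connection details, paths, and field delimiters are assumptions, and the target MySQL table must already exist.

# HDFS directory -> MySQL table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/testdb \
  --username dbuser --password dbpass \
  --table wordcount_result \
  --export-dir /user/hadoop/output \
  --input-fields-terminated-by '\t' \
  -m 1

# for a Hive table, point --export-dir at the table's warehouse directory (path and the '\001' default delimiter are assumptions)
sqoop export \
  --connect jdbc:mysql://dbhost:3306/testdb \
  --username dbuser --password dbpass \
  --table wordcount_result \
  --export-dir /user/hive/warehouse/wordcount_result \
  --input-fields-terminated-by '\001' \
  -m 1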

4.4 DataX

Same as 3.5.

If you have completed the above study and practice, at this point, your "big data platform" should be like this:


If you have carefully and completely followed the process in Chapters 3 and 4 of "Words for Beginners in Big Data Development 2", then you should already have the following skills and knowledge:

You know how to capture existing data to HDFS, both offline and in real time;

You know that Sqoop (or DataX) is a tool for exchanging data between HDFS and other data sources;

You know that Flume can be used for real-time log collection.

From the learning so far, you have mastered a lot of the knowledge and skills for a big data platform: building a Hadoop cluster, collecting data onto Hadoop, analyzing data with Hive and MapReduce, and synchronizing analysis results to other data sources.

The next problem arrives: the more you use Hive, the more annoyances you will find, especially its slowness. Much of the time, even though the data volume is tiny, it still has to request resources and start a MapReduce job to execute.

Chapter 5: Faster, My SQL

In fact, everyone has noticed that with MapReduce as its execution engine, Hive's backend is just a bit slow.

So there are more and more SQL On Hadoop frameworks. As far as I understand, the most commonly used, in order of popularity, are SparkSQL, Impala, and Presto.

These three frameworks are based on semi-memory or full-memory computation, and they provide SQL interfaces for quickly querying and analyzing data on Hadoop. For a comparison of the three, please refer to 1.1.

We are currently using SparkSQL, and as for why we use SparkSQL, there are probably the following reasons:

We already use Spark for other things and don't want to bring in too many frameworks;

Impala demands too much memory, and we don't have enough spare resources to deploy it;

5.1 About Spark and SparkSQL

What is Spark and what is SparkSQL.

Spark's core concepts and terminology.

What is the relationship between SparkSQL and Spark and what is the relationship between SparkSQL and Hive.

Why SparkSQL runs faster than Hive.

5.2 How to deploy and run SparkSQL

What are the deployment modes for Spark?

How to run SparkSQL on Yarn?

Querying tables in Hive using SparkSQL (a command-line sketch follows).
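
A hedged sketch of running SparkSQL on YARN from the command line; it assumes Spark has been configured to see the Hive metastore (for example, hive-site.xml placed on Spark's classpath).

# start the SparkSQL CLI on YARN and query an existing Hive table
spark-sql --master yarn -e "SELECT word, COUNT(1) FROM wordcount GROUP BY word;"

# or work interactively
spark-sql --master yarn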

PS: Spark is not a technology that can be mastered in a short period of time, so it is recommended to start with SparkSQL after understanding Spark, and work your way up.

For more information about Spark and SparkSQL, see ? /archives/category/spark

If you are serious about learning and practicing the above, at this point, your "big data platform" should look like this:

Chapter 6: Polygamy

Please don't be tempted by the name. I'm actually going to talk about data being collected once and consumed many times.

In real business scenarios, especially for monitoring logs, you often want to see certain metrics from the logs immediately (real-time computing is introduced in a later chapter). Analyzing them from HDFS is too slow for this; and even if the data is collected through Flume, Flume cannot roll files to HDFS at very short intervals, because that would produce a huge number of small files.

In order to meet the need to collect data once and consume it many times, here is Kafka.

6.1 About Kafka

What is Kafka?

The core concepts of Kafka and an explanation of the terminology.

6.2 How to deploy and use Kafka

Deploy Kafka on a single machine and successfully run the bundled producer and consumer examples.

Write and run your own producer and consumer programs in Java.

Integrate Flume and Kafka: use Flume to monitor logs and send the log data to Kafka in real time (the Kafka basics are sketched below).
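
A hedged sketch of the Kafka basics in 6.2, using the scripts shipped with Kafka. The topic name is an assumption, and older Kafka versions use --zookeeper (and --broker-list for the producer) where newer ones use --bootstrap-server, so adjust to your version.

# create a test topic
kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic app_logs

# console producer: lines you type are sent to the topic
kafka-console-producer.sh --broker-list localhost:9092 --topic app_logs

# console consumer: read the messages back from the beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic app_logs --from-beginning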

If you have seriously completed the above learning and practice, at this point, your "big data platform" should be like this:

At that point, the data collected by Flume is no longer sent directly to HDFS, but first to Kafka. The data in Kafka can be consumed by multiple consumers at the same time, and one of those consumers synchronizes the data to HDFS.

If you have carefully and completely followed the process in Chapters 5 and 6 of "Words for Beginners in Big Data Development 3", then you should already have the following skills and knowledge:

Why Spark is faster than MapReduce.

Using SparkSQL instead of Hive to run SQL faster.

Using Kafka to accomplish a once-collected, many-consumed architecture for data.

You can write your own program to complete the Kafka producer and consumer.

From the learning so far, you have mastered most of the skills of a big data platform: data collection, data storage and computation, data exchange, and so on. Each of these steps requires a task (program) to complete, and there are dependencies among the tasks: for example, the data computation task can only start after the data collection task has completed successfully. If a task fails, an alert needs to be sent to the development and operations staff, and complete logs need to be provided to make troubleshooting easier.

Chapter 7: More and More Analytics Tasks

It's not just analytics tasks, but data collection and data exchange as well. Some of these tasks are triggered at regular intervals, while others rely on other tasks to trigger them. When there are hundreds or thousands of tasks to be maintained and run on the platform, crontab alone is not enough, and a scheduling and monitoring system is needed to accomplish this. The scheduling and monitoring system is the backbone of the entire data platform, similar to the AppMaster, and is responsible for assigning and monitoring tasks.

7.1 Apache Oozie

1. What is Oozie? What are the features?

2. What types of tasks (programs) can Oozie schedule?

3. What task triggering methods can Oozie support?

4. Install and configure Oozie (a few basic CLI commands are sketched below).
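
A few hedged Oozie CLI basics; the Oozie server URL and job.properties file are assumptions, and the workflow definition (a workflow.xml on HDFS) has to be prepared separately.

# submit and run a workflow described by a local job.properties file
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# check the status of a job (the job id is printed by the command above)
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>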

Chapter 8: My Data in Real Time

In Chapter 6, when introducing Kafka, we mentioned some business scenarios that require real-time metrics. Real-time can roughly be divided into absolute real-time and quasi-real-time: absolute real-time generally requires millisecond-level latency, while quasi-real-time latency is at the second or minute level. For business scenarios that require absolute real-time, Storm is used more often; for other quasi-real-time scenarios, either Storm or Spark Streaming will do. Of course, if you are able to, you can also write your own program for it.

8.1 Storm

1. What is Storm and what are the possible application scenarios?

2. What are the core components that make up Storm, and what roles do each play?

3. Simple installation and deployment of Storm.

4. Write your own demo program that uses Storm to perform real-time computation over a data stream (topology submission is sketched below).
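
A hedged sketch of submitting a Storm topology from the shell; the jar, main class, and topology name are hypothetical stand-ins for the demo you write yourself.

# submit a topology to the cluster (jar and class names are hypothetical)
storm jar my-storm-demo.jar com.example.WordCountTopology wordcount-topology

# list running topologies and kill one when you are done
storm list
storm kill wordcount-topology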

8.2 Spark Streaming

1. What is Spark Streaming and how is it related to Spark?

2. What are the advantages and disadvantages of Spark Streaming compared to Storm?

3. Write a demo program that uses Kafka + Spark Streaming for real-time computation (a submission sketch follows).
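
A hedged sketch of submitting such a demo to YARN; the application jar, main class, broker list, and topic are hypothetical stand-ins for the program you write yourself.

# submit your own Kafka + Spark Streaming demo to YARN (jar, class, and arguments are hypothetical)
spark-submit \
  --master yarn \
  --class com.example.KafkaStreamingWordCount \
  my-streaming-demo.jar \
  localhost:9092 app_logs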

If you have seriously completed the above study and practice, at this time, your "big data platform" should be like this:

At this point, the underlying architecture of your big data platform has taken shape, including the modules for data collection, data storage and computation (offline and real-time), data synchronization, and task scheduling and monitoring. Next it is time to think about how to better provide data to the outside world.

Chapter 9: My Data to the Outside World

Often, providing data access to the outside world (the business) broadly encompasses the following:

Offline: for example, providing the previous day's data to a specified data destination (DB, FILE, FTP) every day. Offline data can be provided with offline data exchange tools such as Sqoop and DataX.

Real-time: for example, an online website's recommendation system needs to fetch recommendation data for a user from the data platform in real time, and this kind of requirement demands very low latency (within 50 milliseconds).

Given the latency requirements and the need for real-time data queries, possible solutions are: HBase, Redis, MongoDB, Elasticsearch, etc.

OLAP analysis: besides requiring a relatively standardized underlying data model, OLAP also places increasingly high demands on query response speed. Possible solutions are: Impala, Presto, SparkSQL, and Kylin. If your data model is fairly large in scale, then Kylin is the best choice.

Ad-hoc queries: ad-hoc queries against the data are fairly random, and it is generally difficult to establish a common data model for them, so possible solutions are: Impala, Presto, SparkSQL.

With so many mature frameworks and options, you need to choose the right one based on your business needs and your data platform's technical architecture. There is only one principle: the simpler and more stable, the better.

If you have mastered how to provide good external (business) data, then your "big data platform" should be like this:

Chapter 10: The Much-Hyped Machine Learning

On this topic I can only give a layman's brief introduction. As a math graduate I am rather ashamed of that, and I regret not studying math properly back then.

In our business, the problems we have encountered that can be solved with machine learning fall roughly into three categories:

Classification problems: including binary and multi-class classification. Binary classification solves prediction problems, such as predicting whether an email is spam; multi-class classification solves problems like text categorization;

Clustering problems: roughly grouping users based on the keywords they have searched for;

Recommendation problems: making relevant recommendations based on users' historical browsing and click behavior.

These are the types of problems that most industries solve with machine learning.