What tools are generally used to do big data analysis?
Java: You only need to understand some of the basics; doing big data does not require very deep Java skills. Learning Java SE is pretty much enough to get started with big data.

Linux: Because big data-related software all runs on Linux, it is worth learning Linux solidly. A good grasp of Linux will be a great help in mastering big data technologies quickly: it lets you better understand the runtime and network environment configuration of Hadoop, Hive, HBase, Spark and other big data software, helps you avoid a lot of pitfalls, and makes it much easier to read scripts and to understand and configure big data clusters. It will also let you pick up new big data technologies faster in the future.

Well, now that the basics are done, let me talk about what other big data technologies you still need to learn. You can learn them in the order I have written them.

Hadoop: This is now a popular big data processing platform and has almost become synonymous with big data, so it is a must-learn. Hadoop includes several components: HDFS, MapReduce and YARN. HDFS is where the data is stored, just as files are stored on your computer's hard disk; MapReduce does the processing and computation on that data; YARN manages the cluster's resources. MapReduce's defining characteristic is that no matter how big the data is, as long as you give it time it will get through it, but it may not be very fast, which is why it is called batch processing.
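
To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count job in Java. It assumes a Hadoop client library on the classpath; the class names and the input/output paths passed on the command line are just placeholders, not anything from the text above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of text files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

However big the input directory is, this job will eventually chew through it; that is the batch-processing trade-off described above.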

Remember: getting to this point can serve as a milestone in your big data learning.

Zookeeper: This is an all-purpose tool. Installing Hadoop's HA will use it, and HBase will also use it later. It is generally used to store small pieces of coordination information, usually no more than 1 MB, and it is the software that depends on it that uses it. For us personally, we only need to install it correctly and keep it running normally.
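
If you are curious what "storing a small piece of coordination information" looks like, here is a rough sketch with the ZooKeeper Java client; the ensemble address, znode path and payload are all placeholders for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is a placeholder).
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });

        // Store a tiny piece of shared state (well under the ~1 MB limit).
        byte[] data = "active-master=node1".getBytes("UTF-8");
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any other process connected to the same ensemble can read it back.
        byte[] read = zk.getData("/demo-config", false, null);
        System.out.println(new String(read, "UTF-8"));
        zk.close();
    }
}
```

In practice you rarely write this yourself; Hadoop HA and HBase do it for you, which is why correct installation is all you really need.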

Mysql: We have finished studying how to process big data; next we learn a tool for processing small data, the MySQL database, because you will need it shortly when installing Hive. To what level do you need to master MySQL? Being able to install it on Linux, run it, configure simple permissions, change the root password and create a database is enough. The main thing here is to learn SQL syntax, because Hive's syntax is very similar to it.
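
As a small taste of the SQL syntax you will be reusing in Hive later, here is a JDBC sketch in Java, assuming a local MySQL installation with the MySQL JDBC driver on the classpath; the host, user, password, database and table names are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MysqlDemo {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; adjust them to your own installation.
        String url = "jdbc:mysql://localhost:3306/?useSSL=false";
        try (Connection conn = DriverManager.getConnection(url, "root", "your_password");
             Statement stmt = conn.createStatement()) {

            // The same basic DDL/DML syntax carries over almost unchanged to Hive.
            stmt.execute("CREATE DATABASE IF NOT EXISTS demo");
            stmt.execute("CREATE TABLE IF NOT EXISTS demo.users (id INT, name VARCHAR(50))");
            stmt.execute("INSERT INTO demo.users VALUES (1, 'alice'), (2, 'bob')");

            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM demo.users ORDER BY id")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
                }
            }
        }
    }
}
```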

Sqoop: This is used to import data from MySQL into Hadoop. Of course, you do not have to use it; exporting a MySQL table to a file and putting it on HDFS achieves the same thing (a rough sketch of that manual route follows below). In a production environment, just pay attention to the load this puts on MySQL.
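
Here is roughly what that manual alternative looks like in Java: read the table over JDBC and write it straight to HDFS. All connection details, paths and column names are placeholders; in practice Sqoop does the same thing in parallel and handles type mapping for you.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MysqlToHdfs {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/demo?useSSL=false"; // placeholder
        FileSystem fs = FileSystem.get(new Configuration());          // picks up core-site.xml from the classpath

        try (Connection conn = DriverManager.getConnection(url, "root", "your_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM users");
             PrintWriter out = new PrintWriter(fs.create(new Path("/warehouse/users/part-00000")))) {

            // Write one tab-separated line per row, a format Hive can read directly.
            while (rs.next()) {
                out.println(rs.getInt("id") + "\t" + rs.getString("name"));
            }
        }
    }
}
```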

Hive: For anyone who knows SQL syntax, this thing is a godsend. It makes processing big data very simple, and you will no longer have to bother writing MapReduce programs. What about Pig, some people ask? It is much the same as Hive; master one of the two and you are good to go.
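
To see why it feels like a godsend, compare the word-count MapReduce program above with the same job written as one Hive query. This sketch goes through HiveServer2's JDBC interface and assumes the Hive JDBC driver on the classpath; the host, credentials, table name and column name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 usually listens on port 10000; host and credentials are placeholders.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // One HiveQL statement instead of a whole MapReduce program.
            String wordCount =
                "SELECT word, COUNT(*) AS cnt " +
                "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) t " +
                "GROUP BY word ORDER BY cnt DESC";

            try (ResultSet rs = stmt.executeQuery(wordCount)) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}
```

Behind the scenes Hive still turns this into batch jobs on the cluster; you just no longer have to write them by hand.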

Oozie: Now that you have learned Hive, I believe you will definitely need this thing. It can help you manage your Hive, MapReduce and Spark scripts, check whether your programs executed correctly, send you an alert and retry a program when it fails, and, most importantly, help you configure the dependencies between tasks. I am sure you will love it; otherwise, staring at a huge pile of scripts and a dense crontab will make you feel like shit.
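
Oozie workflows themselves are defined in XML files stored on HDFS, but you can submit and monitor them from Java with the Oozie client library. In this rough sketch the server URL, HDFS paths and property values are all placeholders.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieDemo {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL and workflow application path are placeholders.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow"); // directory holding workflow.xml
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = client.run(conf);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow " + jobId + " finished: " + client.getJobInfo(jobId).getStatus());
    }
}
```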

Hbase: This is the NoSQL database in the Hadoop ecosystem. Its data is stored in the form of keys and values, and each key is unique, so it can be used to deduplicate data, and compared with MySQL it can store far larger amounts of data. So it is often used as the storage destination after big data processing is complete.
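
A minimal sketch of writing and reading one key/value pair with the HBase Java client. It assumes a table called "results" with a column family "cf" already exists, and the ZooKeeper quorum address is a placeholder for your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HbaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // HBase clients find the cluster through ZooKeeper

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("results"))) {

            // Row keys are unique, so writing the same key twice simply overwrites the old value.
            Put put = new Put(Bytes.toBytes("user_42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("score"), Bytes.toBytes("1337"));
            table.put(put);

            // Read the value back by its key.
            Result result = table.get(new Get(Bytes.toBytes("user_42")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("score"))));
        }
    }
}
```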

Kafka: This is a handy queueing tool. What is a queue? You know queueing to buy tickets? When there is too much data, it likewise needs to be processed in a queue, so the colleagues you work with will not shout at you: why are you giving me so much data (say, hundreds of GB of files), how am I supposed to handle it? Do not blame him, because he does not work on big data; you can tell him that you have put the data in the queue and he can take it piece by piece as he needs it. Then he will stop complaining and immediately go off to optimize his program, because not getting through the processing is his problem, not a problem you handed him. Of course, we can also use this tool to load online real-time data into the database or into HDFS; for that you can pair it with a tool called Flume, which is specifically designed to provide simple processing of data and write it to various data receivers (such as Kafka).
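
Here is a minimal sketch of the producing side of that queue with the Kafka Java client; the broker address, topic name and payload are assumptions, and the colleague's program would be a matching consumer pulling records at its own pace.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
    public static void main(String[] args) {
        // Broker address and topic name are placeholders.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Instead of dumping hundreds of GB on a colleague at once,
            // push the records onto the queue and let the consumer take them one batch at a time.
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("raw-events", "key-" + i, "event payload " + i));
            }
        }
    }
}
```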

Spark: It is used to make up for the speed shortcomings of MapReduce-based data processing. Its characteristic is that it loads data into memory for computation instead of reading from hard disks, which are slow as death and evolving just as slowly. It is particularly well suited to iterative operations, so the algorithm folks are especially fond of it. It is written in Scala, but either Java or Scala can be used to operate it, because they both run on the JVM.
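
The same word count as the MapReduce version above, written against Spark's Java API; the HDFS paths are placeholders, and the call to cache() is what keeps the data in memory between passes instead of rereading the disk.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-demo"); // the master is usually set by spark-submit
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Load the data once and keep it in memory; later passes reuse it without hitting the disk again.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/docs").cache();

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///data/word-counts");
        }
    }
}
```

The whole pipeline fits in a dozen lines, and for iterative algorithms the cached data can be reused across every iteration, which is exactly where Spark leaves MapReduce behind.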