In the previous article we gave you a brief introduction to some of the basic skills a big data practitioner needs. Here we'll look at what you need to know at each stage of learning big data.
Data storage stage: vendors such as Oracle and IBM offer courses on SQL and their database products. The Changping Java course training organization suggests learning the development tools of whichever vendor your target company uses; that is usually enough to qualify for positions at this stage.
Data mining, cleaning, and screening: big data engineers need to learn Java, Linux, SQL, Hadoop, the data serialization system Avro, the Hive data warehouse, the distributed database HBase, the distributed log-collection framework Flume, the distributed messaging system Kafka, the data migration tool Sqoop, Pig development, and Storm real-time data processing. Mastering the above is enough to get started as a big data engineer; if you want a better starting point, it is recommended to learn Scala programming, Spark, and the R language early on, as these are now among the more sought-after skills inside enterprises.
Data analysis: one part of this is building a data analysis framework; settling on an analysis approach, for example, requires theoretical knowledge of marketing, management, and similar fields. The other part is turning the conclusions of the analysis into recommendations that can actually guide decisions.
Product adjustment: after the data has been analyzed, the conclusions go to the boss and the PM; once a product update has been agreed on, it is handed to the programmers to implement (for FMCG, this might mean rearranging the shelving of goods).
Next, let's look at which technologies big data requires you to master.
Hadoop core
(1) Distributed Storage Cornerstone: HDFS
An introduction to HDFS and a walkthrough of its composition and working principles: data blocks, the NameNode, DataNodes, the data write and read paths, data replication, HA schemes, file formats, commonly used HDFS settings, and a Java API code demonstration (a minimal sketch follows).
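To make the Java API demonstration concrete, here is a minimal sketch that writes a file to HDFS and reads it back. The fs.defaultFS URI and the file path are assumptions; adjust them to your cluster.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; point this at your own cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

        // Write: the client asks the NameNode for block locations,
        // then streams the data to DataNodes, which replicate it.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: data is streamed back from the DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}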
(2) Distributed Computing Fundamentals: MapReduce
An introduction to MapReduce, its programming model, the Java API, worked programming cases, and MapReduce tuning (the canonical WordCount example follows).
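For reference, here is the canonical WordCount job, close to the example in the Hadoop documentation, showing the map, combine, and reduce phases plus the driver setup.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}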
(3) Hadoop Cluster Resource Manager: YARN
YARN's basic architecture, the resource scheduling process, scheduling algorithms, and the computing frameworks that run on YARN (see the client sketch below).
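As a small illustration of talking to the ResourceManager programmatically, the following sketch uses the YarnClient API to list the applications a cluster knows about. It assumes a yarn-site.xml on the classpath that points at your ResourceManager.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // Ask the ResourceManager for every application it knows about.
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + "\t"
                    + app.getName() + "\t"
                    + app.getYarnApplicationState());
        }
        client.stop();
    }
}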
Offline computing
(1) Offline Log Collection Tool: Flume
An introduction to Flume and its core components, a worked Flume example (log collection), suitable scenarios, and common problems (an example agent configuration follows).
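A Flume agent is wired together in a properties file that names a source, a channel, and a sink. Below is a minimal sketch of one agent; the agent name, log path, HDFS path, and capacity are all assumptions. It tails an application log and lands the events in date-partitioned HDFS directories.

# Hypothetical agent "a1": tail an application log and land it in HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow the log file (path is an assumption).
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to date-partitioned HDFS directories.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true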
(2) Essential Offline Batch Processing Tool: Hive
Hive's positioning in the big data platform, its overall architecture, an AccessLog analytics use case, an introduction to Hive DDL & DML, views, functions (built-in, window, and custom), table partitioning, and bucketing and sampling optimizations (a short example follows).
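As a taste of Hive DDL and a window function, here is a hedged sketch that connects to HiveServer2 over JDBC, creates a partitioned and bucketed access-log table, and ranks URLs by hits. The connection URL, user, and the access_log schema are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAccessLogDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 address and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        try (Statement stmt = conn.createStatement()) {
            // DDL: a date-partitioned, bucketed table for access logs.
            stmt.execute("CREATE TABLE IF NOT EXISTS access_log ("
                    + "  ip STRING, url STRING, ts BIGINT) "
                    + "PARTITIONED BY (dt STRING) "
                    + "CLUSTERED BY (ip) INTO 16 BUCKETS "
                    + "STORED AS ORC");

            // DML with a window function: rank URLs by hits within one day.
            ResultSet rs = stmt.executeQuery(
                    "SELECT url, hits, RANK() OVER (ORDER BY hits DESC) AS rnk "
                    + "FROM (SELECT url, COUNT(*) AS hits "
                    + "      FROM access_log WHERE dt = '2024-01-01' "
                    + "      GROUP BY url) t");
            while (rs.next()) {
                System.out.println(rs.getInt("rnk") + "\t"
                        + rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
        conn.close();
    }
}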