What are the big data etl tools

ETL is a very important part of the data warehouse, a necessary step before and after the start.ETL is responsible for extracting data from distributed, heterogeneous data sources, such as relational data, flat data files, etc., to a temporary middle tier for cleaning, transformation, integration, and finally loading them into a data warehouse or data mart, which becomes the basis for online analytical processing and data mining.

Here is an introduction to what is ETL and the three tools commonly used for ETL -- Datastage, Informatica, Kettle.

I. What is ETL?

ETL, Extract- Transform-Load, an acronym used to describe the process of extracting, transforming, and loading data from the source to the destination.

Data Warehouse Architecture

In layman's terms, it is to extract data from the data source, clean it, process it, transform it, and then load it into a well-defined data warehouse model. The purpose is to integrate the scattered, fragmented, and standardized data in the enterprise to provide an analytical basis for enterprise decision-making.

ETL is an important part of the BI project, and its design affects the quality of the generated data, directly related to the success or failure of the BI project.

Two, why use ETL tools?

In the data processing, we sometimes encounter these problems:

When the data comes from different physical hosts, this time, such as the use of SQL statements to deal with the words, it seems to be more strenuous and greater overhead.

The source of the data can be a variety of different databases or files, which need to be organized into a uniform format before the data can be processed, this process is obviously a bit of a hassle to implement in code.

Of course, we can use stored procedures to process data in the database, but when dealing with large amounts of data, stored procedures are obviously more difficult, and will take up more database resources, which may lead to a lack of data resources, which in turn affects the performance of the database.

And the problems encountered above, we use ETL tools can be solved. ETL tools have the following advantages:

1, support for a variety of heterogeneous data sources to connect. (Part)

2, the graphical interface is very easy to operate.

3, processing massive data faster, clearer processes and so on.

Three, ETL tools

1, Datastage

IBM's commercial software, the most professional ETL tools, but at the same time the price is not expensive, suitable for large-scale ETL applications.

Difficulty of use: ★★★★

2, Informatica

Commercial software, quite a professional ETL tool. The price is a little cheaper than Datastage, but also suitable for large-scale ETL applications.

Difficulty: ★★★

3, Kettle

Free, the most famous open-source products, is written in pure java ETL tools, only JVM environment can be deployed, can be cross-platform, scalability.

Difficulty: ★★★

Four, three ETL tool comparison

Datastage, Informatica, Kettle three ETL tool features and differences in the introduction:

1, the operation

These three ETL tools are among the more simple to use, mainly to see the developers for the tool's proficiency.

Informatica has four development and management components, the development of which we need to open three of the development, Informatica does not have ctrl + z function, if the job has changed, want to undo, return to the change before it is not possible. Compared to Kettle and Datastage, which are less convenient for testing and debugging, Datastage has all the operations in the same interface, no need to switch interfaces, and you can see the source of the data and the whole job, which is more convenient than Informatica when you are looking for bugs.

Kettle between the two.

2, deployment

Kettle only JVM environment, Informatica need to server and client installation, and Datastage deployment is more time-consuming, there is a little difficult.

3, the speed of data processing

Informatica and Datastage's processing speed is faster and more stable under the large data volume.

Kettle's processing speed is a little slower in comparison.

4, service

Informatica and Datastage have good commercial technical support, while Kettle does not. Commercial software will be much better than free open source software in terms of after-sales service.

5, risk

Risk is inversely proportional to the cost, but also proportional to the technical ability.

6, expansion

Kettle's scalability is undoubtedly the best, because it is open source code, you can develop your own to expand its functionality, and Informatica and Datastage due to commercial software, basically nothing.

7, Job's monitoring

Three have monitoring and logging tools.

In the monitoring of data, I personally feel that Datastage's real-time monitoring to do a better job, you can visualize the situation of data extraction, run to which control. This is for tuning, we can faster locate the slow processing speed of the control and deal with it, while informatica also has the corresponding function, but it is not intuitive, you need to through the comparison of the two interfaces to locate the slow processing speed of the control. Sometimes you need to go through some methods to find it.

8, online technical documentation

Datastage < Informatica < kettle, relatively speaking, Datastage and Informatica in the encounter to find a solution to the problem of the probability of the Internet is relatively low, kettle is more.

V. Project experience sharing

In the project, many times we need to synchronize the production library table to the data warehouse. More than a hundred table synchronization and repetitive operations are a test of care and patience for developers. In this case, the developer's favorite tool is undoubtedly kettle, multiple table synchronization can be run with the same program, do not have to build a program for each table synchronization, and informatica, although there are tools provided to batch design, but still need to generate multiple programs to configure one by one, and datastage in this regard would seem to be more clunky.

In the case of incremental tables, we need to save the time of the latest data operation to the database after each run, and the next time we run it, we will fetch the data that is greater than this time. kettle has a control that can read this time directly from the database and set it as a variable; for informatica, which doesn't have a control with a similar function, our practice is to read the time in the database to a file, and then the main program will save this time to a file, and then the main program will read this time to a file. For informatica, which does not have a similar control, we can read the time from the database and save it to a file, and then specify the file as a parameter file when the main program is run.

Who has more users, Baidu Maps or Gaode Maps? Is there some data to support this. Also who is better to use? Dear Gods~

How to search online?

What does Taikang Star Selection mean?

Tangzhuang Village is located in which city

What's the difference between ai call and ordinary call?

The reading and writing of numbers above billion commented on the lesson

What about Chongqing Medical Group Bishan Peaceful Medicine Co.

The next 10 years of the "3 big business trends", each contains a new dividend!

Big data first love

National brand Chihuahua launched the "black pad" to create a scientific and technological sleep quality