Here is an introduction to what is ETL and the three tools commonly used for ETL -- Datastage, Informatica, Kettle.
I. What is ETL?
ETL, Extract- Transform-Load, an acronym used to describe the process of extracting, transforming, and loading data from the source to the destination.
Data Warehouse Architecture
In layman's terms, it is to extract data from the data source, clean it, process it, transform it, and then load it into a well-defined data warehouse model. The purpose is to integrate the scattered, fragmented, and standardized data in the enterprise to provide an analytical basis for enterprise decision-making.
ETL is an important part of the BI project, and its design affects the quality of the generated data, directly related to the success or failure of the BI project.
Two, why use ETL tools?
In the data processing, we sometimes encounter these problems:
When the data comes from different physical hosts, this time, such as the use of SQL statements to deal with the words, it seems to be more strenuous and greater overhead. The source of the data can be a variety of different databases or files, which need to be organized into a uniform format before the data can be processed, this process is obviously a bit of a hassle to implement in code. Of course, we can use stored procedures to process data in the database, but when dealing with large amounts of data, stored procedures are obviously more difficult, and will take up more database resources, which may lead to a lack of data resources, which in turn affects the performance of the database.And the problems encountered above, we use ETL tools can be solved. ETL tools have the following advantages:
1, support for a variety of heterogeneous data sources to connect. (Part)
2, the graphical interface is very easy to operate.
3, processing massive data faster, clearer processes and so on.
Three, ETL tools
1, Datastage
IBM's commercial software, the most professional ETL tools, but at the same time the price is not expensive, suitable for large-scale ETL applications.
Difficulty of use: ★★★★
2, Informatica
Commercial software, quite a professional ETL tool. The price is a little cheaper than Datastage, but also suitable for large-scale ETL applications.
Difficulty: ★★★
3, Kettle
Free, the most famous open-source products, is written in pure java ETL tools, only JVM environment can be deployed, can be cross-platform, scalability.
Difficulty: ★★★
Four, three ETL tool comparison
Datastage, Informatica, Kettle three ETL tool features and differences in the introduction:
1, the operation
These three ETL tools are among the more simple to use, mainly to see the developers for the tool's proficiency.
Informatica has four development and management components, the development of which we need to open three of the development, Informatica does not have ctrl + z function, if the job has changed, want to undo, return to the change before it is not possible. Compared to Kettle and Datastage, which are less convenient for testing and debugging, Datastage has all the operations in the same interface, no need to switch interfaces, and you can see the source of the data and the whole job, which is more convenient than Informatica when you are looking for bugs.
Kettle between the two.
2, deployment
Kettle only JVM environment, Informatica need to server and client installation, and Datastage deployment is more time-consuming, there is a little difficult.
3, the speed of data processing
Informatica and Datastage's processing speed is faster and more stable under the large data volume.
Kettle's processing speed is a little slower in comparison.
4, service
Informatica and Datastage have good commercial technical support, while Kettle does not. Commercial software will be much better than free open source software in terms of after-sales service.
5, risk
Risk is inversely proportional to the cost, but also proportional to the technical ability.
6, expansion
Kettle's scalability is undoubtedly the best, because it is open source code, you can develop your own to expand its functionality, and Informatica and Datastage due to commercial software, basically nothing.
7, Job's monitoring
Three have monitoring and logging tools.
In the monitoring of data, I personally feel that Datastage's real-time monitoring to do a better job, you can visualize the situation of data extraction, run to which control. This is for tuning, we can faster locate the slow processing speed of the control and deal with it, while informatica also has the corresponding function, but it is not intuitive, you need to through the comparison of the two interfaces to locate the slow processing speed of the control. Sometimes you need to go through some methods to find it.
8, online technical documentation
Datastage < Informatica < kettle, relatively speaking, Datastage and Informatica in the encounter to find a solution to the problem of the probability of the Internet is relatively low, kettle is more.
V. Project experience sharing
In the project, many times we need to synchronize the production library table to the data warehouse. More than a hundred table synchronization and repetitive operations are a test of care and patience for developers. In this case, the developer's favorite tool is undoubtedly kettle, multiple table synchronization can be run with the same program, do not have to build a program for each table synchronization, and informatica, although there are tools provided to batch design, but still need to generate multiple programs to configure one by one, and datastage in this regard would seem to be more clunky.
In the case of incremental tables, we need to save the time of the latest data operation to the database after each run, and the next time we run it, we will fetch the data that is greater than this time. kettle has a control that can read this time directly from the database and set it as a variable; for informatica, which doesn't have a control with a similar function, our practice is to read the time in the database to a file, and then the main program will save this time to a file, and then the main program will read this time to a file. For informatica, which does not have a similar control, we can read the time from the database and save it to a file, and then specify the file as a parameter file when the main program is run.