ETL processes: how to get value from data

A study by Bain & Company estimates that managers who rely on data analytics make decisions five times faster than those who do not. This is possible thanks to ETL processes, which enable them to get the maximum utility out of data.

Analytics and business intelligence are areas that are increasingly gaining momentum in business. In fact, a Gartner report predicts that the market value of these areas will reach $20 billion by 2019, which will multiply the demand for data processing-related technologies.

What are ETL processes?

ETL (extract, transform and load) processes are those by which information is extracted from one or more data sources. This information is transformed to adapt it to the needs of the business and then loaded into a shared site for consultation by all interested parties.

One of the most important features of the ETL system is that it allows data from heterogeneous sources to be integrated into a homogeneous environment. However, the workflow configuration has a certain level of complexity and a poorly designed ETL process can cause costly operational problems.

Why are ETL processes important?

ETL processes are the main source of information for the business layer of the enterprise. Sometimes the only one. These processes allow managers to have a clear, detailed and deep picture of the business management. And thanks to this information, at a later stage of analysis, patterns and trends can be detected and high impact strategic decisions can be made.

However, data alone does not provide any value, as it is collected in a raw state that is neither readable nor analysable by people. For data to be valuable, it must be processed and transformed from its raw state into a readable and actionable state. This is where the ETL process comes into play.

For example, a customer’s purchase history consists of, among other things, the following information:

Customer contact details and other relevant data.
List of items purchased.
Details of these items: descriptions, prices, quantities, etc.
Final amount of each order.
Transactional status of orders (in progress, completed, cancelled, etc).

In order for the purchase history to provide all this information in a clear and readable way, the ETL process has to intervene to obtain and process the raw data into the appropriate formats. This is because the data is not stored in the organisation’s systems as it is shown in the history.

If a manager wants to query the history of a single customer, the system would have to limit its search to that single piece of data (out of the vast amount of data the company handles), perform calculations, select the fields to be displayed in the history template and format them for presentation.

Phases of an ETL process

1. Extraction (extract)

The first phase of an ETL process is data extraction. In this phase, the “raw material” is obtained, which will be used in the following two phases, until a practical use is achieved.

The data comes from different origins and sources: system logs, CRM, point of sale and devices used, among others. These sources are likely to be heterogeneous, so the data must be standardised. Data that contain the same type of information, even if they come from different sources, must also be collated into common structures.

The aim of the extraction phase is to synthesise all the information into one or more common structures in which the normalised data are stored, leaving them completely homogenised, organised and ready for the transformation phase. It is possible that in this first stage some transformations specific to the data source are executed. The objective is to optimise the data for the transformation stage.

An example of extraction would be a sales data file aggregated by months and departments, one row for each department and one column for each month, i.e. each new month a column would be added. This format is suitable for dealing with manual files but very inefficient for a database.

On the other hand, most of the I/O (input/output) operations, which are usually quite costly, take place in this technical phase.

2. Transformation (transform)

The transformation phase is where we add value to the data obtained in the extraction phase. This is where the foundation of the ETL process lies.

Being isolated from the first phase, where technical issues are managed, the business layer can be involved in the development of this transformation, with the objective of transforming the data and adding value.

3. Load (load)

The third and final phase is the load. This is when the system receives the duly processed data and stores it in the defined destination. At this stage, the transformation phase is also isolated from possible changes in the destination.

For example, the ETL result may initially be stored in a file for use by the business layer, but later it may be decided to store those results in a database.

In certain environments, issues such as the definition, validation or implementation of a data model and ingest processes can take months. Therefore, the business layer can derive great value from the possibility of generating immediate results in files and later storing them in a database by changing only the loading phase.

Flexible and agile development methodology in ETL processes

The way companies run ETL systems is evolving. Today, everything revolves around an agile and efficient perspective. That is why, according to a Gartner study, 90% of large companies will have hired a Chief Data Officer by 2019. One of the aims of this is to have personnel specialised in ETL optimisations.

The three phases can be developed in an integrated manner, especially when the ETL processes are not complex. However, developing them flexibly and independently has the following advantages:

It is possible to assign each phase to different teams in order to shorten work times.
The development of each phase can be done with different technologies.
Changes affecting one phase (e.g. change in data source) can be isolated from the other phases.
Each stage can be tested separately, since only the stage in which the test is being introduced will be affected.
Work done for one ETL process can be reused for other ETL processes, especially the extract and load phases.
The business layer can be involved in the transformation phase, in order to define the logic of the ETL system, without dealing with technical issues that are isolated in the extraction phase and would make it difficult to understand the business layer.
Results can be generated in the short and long term without affecting the logic of the ETL process, making changes only in the loading phase.

It is important to apply a flexible development approach, as its main importance lies in being able to involve the business layer in the definition of the ETL logic in order to generate results in the short term, without the need to get involved in technical issues.