This article is based on my presentation, Creating ETL pipelines without code in the cloud: Azure Data Factory.
I am leaving links to the presentation and the script document so participants can comment and review.
In my day-to-day job, I receive many data inquiries from different units of the university. This has led me to explore more efficient ways to prepare data and serve it according to the indicators requested.
Today I am sharing one of my latest explorations, which involves the cloud, using Azure and the credits it offers to students. I will start with an introduction to the topic to explain its importance and the problems that cloud services solve.
ETL for everyone
Extract, Transform and Load are words that are widely used in the world of data engineering.
That is natural, because organizations need processes and infrastructure for managing their information.
ETL pipelines, named for their stages (Extract, Transform, Load), provide an orderly way to run these repetitive processes.
In this article I show how you can create ETL services in the Azure cloud, one of the main platforms we have access to through programs such as Azure for Students, which grants $100 in credits each year, and one that matters because of its weight in the industry.
The role of the data engineer
With the growing influx of data, organizations need processes that allow them to store and analyze it reliably and at the speed of business growth.
I like the definition of Vlad Riscutia:
Data engineering is the part of data science that deals with the practical applications of data collection and analysis. Its goal is to bring engineering rigor to the process of building and supporting reliable data systems. (Vlad Riscutia)
The cloud explained in 3 minutes
I will not go into many technical details about the cloud. I will simply say:
“The cloud is a huge set of resources that supports a variety of services.”
The cloud refers to software and services that run on the Internet. This is in contrast to software and services that run locally on your computer. The cloud is also used to store data. When you store data or use a cloud service, you can access the data from anywhere, using any device that connects to the Internet, not just from your computer’s hard drive or your company’s network server.
My definition of the cloud is the typical rental model, but computing-oriented.
For example, if you need to get around a country as a tourist, you rent a car. If you haven't cooked, you can buy a pizza and satisfy your hunger. The cloud is a service.
This relates to the terms IaaS, PaaS, and SaaS. To draw an analogy: if you want to make a dessert, you have the following options.
If you don't have an oven, a mold, or any materials at home, you can rent an oven, buy the tools, buy the ingredients, and make the dessert from scratch. This gives you more freedom but increases complexity and time; this would be Infrastructure as a Service (IaaS), a model better suited to a confectionery business operating at scale.
If you have an oven and utensils at home, you can buy the ingredients and make the dessert, the equivalent of Platform as a Service (PaaS). This might suit a special event.
Otherwise, if you don't want to deal with any of those inconveniences and it's only for a birthday, you can buy the dessert ready-made. This would be SaaS (Software as a Service).
So, as you can see, there are many ways to complete the task. The approach will always be determined by the need, not the other way around.
With this in mind we will see our service today.
Azure services for engineers and Data Analysts
Azure offers many services for data engineers and analysts.
Creating our Azure Data Factory environment
- Have an Azure account (create it here)
- Purchase a subscription and link it to your account
Creating a resource group
Create a resource group and give it a name, such as azure-data-factory.
Within this group we will create a storage account and a Data Factory.
Within the storage account we will create two containers (both can have private access):
- A container called input
- Another one called output
Within the input container we will upload the movies.csv file.
From there we can see the content of the CSV, with the id, title, and genre columns.
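To make the structure concrete, here is a minimal Python sketch of what the file looks like once parsed. The sample rows are invented for illustration; only the column names come from the dataset.

```python
import csv
import io

# Hypothetical sample of movies.csv; the real file has the same columns.
sample = """id,title,genre
1,Toy Story (1995),Animation
2,Jumanji (1995),Adventure
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0])  # {'id': '1', 'title': 'Toy Story (1995)', 'genre': 'Animation'}
```

Note that the year is embedded in the title column, which is exactly what the transformations below will deal with.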
Using the Data Factory
We enter the resource group and open the Data Factory. Here there are some tutorials and templates we could use, but for this tutorial we won't use any templates, so we go to the Author tab in the upper left corner.
To load the data, we create a new connection, select Azure Blob Storage as the connection type, assign it a name (blob), select our Azure subscription, select the storage account, and test the connection.
If everything is correct, we click on Create.
Creating the datasets
Get the dataset here
Once the connection is ready, we go to the Datasets tab and create a CSV dataset; as the linked service we select blob, browse to our input folder, enable the first row as header, and import the schema from the connection.
We will create another dataset for the output, with the same parameters except that it points to the output container and we do not import a schema.
Creating a data flow
With the data flow created, we can see the new screen where we will perform all the transformations.
Before performing any transformation, we can turn on Data flow debug so that we can preview the output of each transformation; spinning up the debug cluster takes some time (4-5 minutes).
When we create a source, we can see its different parameters, such as data types, projections, etc.
From the title column of our dataset, I will extract the year and create a new column to hold the extracted value.
Creating Year column and getting its name
It should be noted that these transformations use the data flow expression language (data flows are executed on Spark under the hood); we will use the following code:
toInteger(trim(right(title, 6), '()'))
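The expression takes the last six characters of the title (e.g. "(1995)"), strips the parentheses, and converts the result to an integer. As a rough Python equivalent, assuming the year always appears as a parenthesized suffix:

```python
def extract_year(title: str) -> int:
    # Take the last 6 characters, e.g. "(1995)", strip the parentheses,
    # and convert to an integer; mirrors toInteger(trim(right(title, 6), '()')).
    return int(title[-6:].strip("()"))

print(extract_year("Toy Story (1995)"))  # 1995
```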
We can see in the following transformation the changes and the new dataset.
Creating Title column
Now we will trim the year that appears in the title with a similar expression.
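As a hedged Python sketch of the intended logic (again assuming the parenthesized year is always a six-character suffix; the function name is mine, not Data Factory's):

```python
def extract_title(title: str) -> str:
    # Drop the trailing "(YYYY)" (6 characters) and any
    # whitespace left before it.
    return title[:-6].rstrip()

print(extract_title("Toy Story (1995)"))  # Toy Story
```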
With this, we can see the transformations and the new dataset again. If we want to save the result, we can add a new step, a Sink, and save it under a specific name.
To finish, we can set Output to single file in the Settings tab of the sink step. This is a somewhat heavy option, so I do not recommend it for large projects; for those we will have better alternatives.
We set the output file name.
So we’ll create a new data pipeline and select the data flow.
From there we click Debug to check that everything is fine; this can take some time.
After running it, we can see the result back in the output container of our blob storage.
From here we can see all the operations we have performed, the processing times, etc.
Additionally, we can apply multiple transformations to the same dataset by adding a parallel step at any of our previous steps; we do this using a new branch.
We are going to create a new step, an Aggregate; we will name the new column MoviesCount and use the simple count() function.
This way, we can see each year alongside the total number of films for that year.
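As a rough Python sketch of what this aggregate does (grouping by the extracted year and counting titles; the sample years are invented):

```python
from collections import Counter

# Hypothetical years extracted from the title column.
years = [1995, 1995, 1999, 2003, 1999, 1995]

# Count movies per year, mirroring the Aggregate step's
# group-by on Year with a count() column.
movies_by_year = Counter(years)
print(movies_by_year[1995])  # 3
```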
And we use a Sink again, MoviesByYearData. From there we can save it as MoviesByYearData.csv.
Optimization
We can use a single partition for this small dataset; this is not recommended for all data, especially very large datasets.
In the case of the aggregate, we will use a single partition, with a key.
We can do this in the previous step as well and reduce the loading time of this pipeline.
Tools like Azure Data Factory can carry out data transformation processes in a simple way.
I decided to create a demo video in case there is a failure on the day of the event.
It also serves as an aid if the reader misses any detail.