pythonLogo

Easy ETL with Python: a step by step tutorial

This is a reworked translation from an article written by Francisco Rodriguez Alfaro at Datamanagement.es

Introduction 

These days, Python seems to be the language of choice for anybody working with data, be it Business Intelligence, Data Science, Machine Learning, data integration, etc. Aside from being quite easy to learn and deploy, the main reason for such popularity perhaps lies in the huge quantity of libraries available. Said libraries are some sort of “packages” containing features and functions that avoid a lot of re-coding. Hence, it is a question of recycling code to be implemented where and when needed. 

For this article, an easy but functional ETL (extract-transform-load) process was developed, aiming to demonstrate how uncomplicated it can be to get data from any source, then prepare and store it to be analyzed. 

Objetive 

Suppose you need to analyze orders data from an e-commerce website. Since the transactional information is stored online, you need to download and save such data somewhere, then prepare it for visualization and decision-taking. 

Please download the Python scripts, sql statements and extras from this link here >>  – (Some stuff is in Spanish, but worry not: all explanations are in this article.) 

Now, assume that we can only get the data in XML format. A very common practice, since most current systems (bookings, transactional, etc.) deal with such format. Moreover, the data usually resides on a cloud-based repository like OneDrive, Dropbox, a corporate folder, etc. For our purposes, find a sample file  here >>  

This sample contains 3 .XML files with order details for an equal number of days, from a hypothetical e-commerce. What we want to do with our ETL process is: 

  • Download the .rar file 
  • Unzip the file in a local folder 
  • Parse the XML files obtained in the previous step 
  • Insert the files contains into a database. 
  • Verify that the whole process and the data copied is OK 

This last step is a good idea, since data sometimes does not behave straight! We’ll produce a check table to find out which files were processed, when and how long it took to process them, and if there was any error while processing.  

Setting up the environment  

Anaconda  

Anaconda is a great suite, with all you need to start developing Python right away, including many pre-installed libraries. Download it  here >>  

Please note that the Python version we need to run for this article’s scripts is 3.7.  

MySQL 

Although you could do everything in Python, you’ll find that in real life, almost all projects involve SQL in one way or another. Therefore, for this article we’ll combine the .py scripts with a little bit of SQL, since this tutorial requires that we store and manage data on a database. Since MySQL is one of the most widely used products, we’ll go for it.  

The download packages can be found here >>   

There are quite a lot of installation and set-up tutorials online for MySQL, the official one is here >>    

In this particular case, we used an existing server that comes with our web hosting service. Just create a new database for this tutorial, call it any name you wish. 

Table creation into MySQL 

Next, it’s time to populate the database with the required tables. You’ll find the table creation statements (DDL) in the downloaded .zip file (“AA Instrucciones.txt“).
The information from the XML files we need to convey to our database goes to the following tables: 

  • ctl_activity_process: whatever happens after the Python code runs, will be logged here.  
  • header: this is where the XML’s header tag is stored  
  • pedidos: orders’ headers information (“pedido” means order in Spanish). 
  • pedidos_detalles: orders details 

XML files’ structure 

This is an example of the XML structure:

image

Looking at this structure, we discover that there are 3 types of information: 

  • Data enclosed by the header tag defines that what’s below comes from a certain URL (“page”) and ranges between an initial date and a final date. We definitely need to store that information in the “header” table.  
  • What’s between <pedido> and <detalle> are the order head details. Of course, we need to store those too, and the chosen table for that will be “pedidos”. 
  • Information that lies between <detalle> and </detalle> tags are the order lines, that is, the items that compose the order. In this case it’s product name, unit price, number of units sold, and so on. 

Process development 

You’d better have a minimum knowledge of Python or any other object-oriented programming language to understand what’s going on under the hood here. The general idea with this small project is to define classes which contain the logic in small modules. 

These are the project’s scripts and classes: 

image-3

Config.json 

This file contains the process’ configuration variables. In order for it to successfully run, you’ll need to adjust this file with your own environment variables. Such variables are the name and access path to the database, as well as the downloading and saving paths for the XML files to be processed. 

  • RUTA_DOWNLOAD: to define the folder to save the .zip files coming from the URL_PEDIDOS_DOWNLOAD (the remote repository where the originals are stored) 
  • RUTA_XML: this is the path for the folder where the .rar file coming from RUTA_DOWNLOAD will be uncompressed. 

Here’s an example:  

image-5

Connection.py 

This is the script that takes care of the database connection. It encloses the SQL statements that insert new records and update existing ones in the tables.  

image-1
image-2

GestionArchivos.py 

This class takes care of the file management. It downloads the .rar from the specified URL on the “config.json” file, uncompresses it on the folder defined in said .json, and produces a list of the XMLs to be processed later on by the parser. 
For this class to run properly, you’ll need to install the following modules:  

  • pip install pyunpack 
  • pip install patol 

Now, have a look at the class: 

image-4
image-9

ParserXML.py 

This one processes the uncompressed XML files. It reads the list from the previous, step, gets the headers plus orders details information, then inserts everything into the database, invoking the method “insertRowsToBBDD”. Such method will obtain the connection configuration variables from the .json file. 

Now, the XML parsing process will  look like this in Python:

image-11
image-12
image-15
image-6
image-7

LogTrazabilidad.py 

As mentioned before, it is a good idea to log into a table what went right and -eventually- what went wrong during the whole process. That’s what this class does, in fact, connecting to the database and inserting (into “ctl_activity_process”) a new record stating files parsed, description, date and status, via the iniStatusActivity methodNext time the process runs, the method updateStatusActivity will update the previous record with a new date/timestap and current status. 

Check an example:  

image-13

In this case, the table will show the following result after a first run: 

image-10

Then, on a second run, you’ll see something like this: 

image-14

Wow! It took a second to download the .rar; even less than that to process it and save records into the database. 

image-16

Main.py 

This is the script to be executed, the one that orchestrates the whole process. It creates the connection to the database, download and unzip files, then parses them. Finally, it adds or updates records into the database. What we tried to do is to simplify and recycle code as much as possible, by keeping functionalities in separate modules/classes. 

image-8
image-18
image-17

Neat, right? It took me about 15 minutes to get up and running this whole project. I would have kill to have something like this a few years back!

Thanks to Francisco for another great article, more will follow soon.

> (opens in a new tab)”>Original published here >>