An Apache Airflow MVP: Complete Guide for a Basic Production Installation Using LocalExecutor


An Apache Airflow Pipeline. Source: Unsplash

### Recap

In the first post of our series, we learned a bit about Apache Airflow and how it can help us build not only Data Engineering & ETL pipelines, but also other types of relevant workflows within advanced analytics, such as MLOps workloads.

We skimmed briefly through some of its building blocks, namely Sensors, Operators, Hooks and Executors. These components provide the basic foundation for working with Apache Airflow. Back then, we worked with the SequentialExecutor, the simplest possible Airflow setup. Having support for running only one task at a time, it is used mainly for simple demonstrations. That is obviously not enough for production scenarios, where we might want to have many tasks and workflows being executed in parallel.

As we already discussed, Apache Airflow ships with support for multiple types of Executors — each of them is more suited to a particular type of scenario.

An Apache Airflow MVP

When starting up a data team or capability, evaluating cost versus benefit and complexity versus added value is a critical, time-consuming, daunting task. Agile organizations and startups usually tackle such scenarios with prototypes: sometimes working products are better at answering questions than people.

Inspired by this philosophy, we will create a basic, hypothetical setup for an Apache Airflow production environment. We will walk through how to deploy such an environment using the LocalExecutor, one of the task execution mechanisms Apache Airflow offers.

For a production prototype, the LocalExecutor is a sensible choice: it supports running tasks in parallel, yet keeps the whole deployment on a single machine, with no extra infrastructure, such as a message broker, to provision and maintain.

LocalExecutor

You might be asking — how is that possible? Well, as the name indicates, when we use the LocalExecutor we are basically running all Airflow components from the same physical environment. When we look at the Apache Airflow architecture, this is what we are talking about:

Main Airflow Components for a LocalExecutor Setup. Source: Author

We have multiple OS processes running the Web Server, Scheduler and Workers. We can think of the LocalExecutor, in abstract terms, as the layer that interfaces between the Scheduler and the Workers. Its job is to spin up Workers in order to execute the tasks from Airflow DAGs, while monitoring their status and completion.

Getting The Wheels Turning

That was our conceptual introduction to the LocalExecutor. Without further ado, let’s set up our environment. Our work will revolve around the following:

  1. PostgreSQL Installation and Configuration
  2. Apache Airflow Installation
  3. Apache Airflow Configuration
  4. Testing
  5. Setting up Airflow to run as a Service

These steps were tested on Ubuntu 18.04 LTS, but they should work with any Debian-based Linux distro. We assume that you already have Python 3.6+ configured; if that’s not the case, please refer to this post.

Note: you could also use a managed PostgreSQL instance, such as Azure Database for PostgreSQL or Amazon RDS for PostgreSQL. This is in fact recommended for a production setup, since it removes the maintenance and backup burden.

1. PostgreSQL Installation and Configuration

To install PostgreSQL, we can simply run the following in our prompt:

sudo apt-get install postgresql postgresql-contrib

In a few seconds, PostgreSQL should be installed.
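If you want a quick sanity check that the database server is up before moving on, you can inspect the service status (the exact output will vary with your PostgreSQL version):

sudo systemctl status postgresql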

Next, we need to set it up. The first step is opening a psql session as the postgres user:

sudo -u postgres psql

We proceed to setting up the required user, database and permissions:

postgres=# CREATE USER airflow PASSWORD 'airflow'; -- you might want to change this  
CREATE ROLE  
postgres=# CREATE DATABASE airflow;  
CREATE DATABASE  
postgres=# GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow;  
GRANT
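You can exit psql with \q. To confirm that the new role can connect to the database, you can open a connection with it (a quick check; it will prompt for the password we just set):

psql -h localhost -U airflow -d airflow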

Finally, we need to install libpq-dev, which provides the headers needed to build psycopg2, the PostgreSQL client library we will use:

sudo apt install libpq-dev

Optional Step 1: you can make your setup more secure by restricting connections to your database to the local machine only. To do this, you need to adjust the allowed IP addresses in the pg_hba.conf file (change the PostgreSQL version in the path below to match your installation):

sudo vim /etc/postgresql/12/main/pg_hba.conf
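For reference, the relevant host entries usually end up looking something like this. This is only a sketch: keep the existing local entries and adjust databases, users and addresses to your needs:

# TYPE  DATABASE        USER            ADDRESS                 METHOD  
host    airflow         airflow         127.0.0.1/32            md5  
host    airflow         airflow         ::1/128                 md5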

PostgreSQL configuration (pg_hba.conf)

Optional Step 2: you might want to configure PostgreSQL to start automatically whenever you boot. To do this:

sudo update-rc.d postgresql enable

2. Apache Airflow Installation

We will install Airflow and its dependencies using pip:

pip install 'apache-airflow[postgres]'  
pip install psycopg2

By now you should have Airflow installed. By default, pip places the airflow executable under ~/.local/bin. Remember to run the following command:

export PATH=$PATH:/home/your_user/.local/bin/

This is required so that the system knows where to locate Airflow’s binary.
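To make this persist across sessions, you can also append it to your shell profile (assuming bash; adapt to your shell of choice):

echo 'export PATH=$PATH:$HOME/.local/bin' >> ~/.bashrc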

Note: for this example we are not using virtualenv or Pipenv, but feel free to use one. Just make sure that environment dependencies are properly mapped when you set up Airflow to run as a service :)

3. Apache Airflow Configuration

Now we need to configure Airflow to use the LocalExecutor and our PostgreSQL database.

Go to Airflow’s home directory (~/airflow by default) and edit airflow.cfg. If the file is not there yet, running any airflow command, such as airflow version, will generate a default one.

vim airflow.cfg

Make sure that the executor parameter is set to LocalExecutor and that the SQLAlchemy connection string is set accordingly:
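The relevant lines should look roughly like the sketch below, assuming the airflow user, password and database created earlier; adjust host, port and credentials to your setup:

executor = LocalExecutor  
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow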

Airflow configuration for LocalExecutor

Finally, we need to initialize our database:

airflow initdb

Make sure that no error messages were displayed as part of initdb’s output.

4. Testing

It is time to check if Airflow is properly working. To do that, we spin up the Scheduler and the Webserver (in separate terminal sessions, since both commands keep running in the foreground):

airflow scheduler  
airflow webserver

Once you fire up your browser and point it to your machine’s IP on port 8080 (the Webserver’s default port), you should see a fresh Airflow installation.
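If you also want to confirm that tasks now run in parallel, you can drop a small DAG like the sketch below into your dags folder (~/airflow/dags by default). The DAG and task names are arbitrary, and the imports follow the Airflow 1.10 style used throughout this post:

from datetime import datetime  
from airflow import DAG  
from airflow.operators.bash_operator import BashOperator  

with DAG(  
    dag_id="parallel_test",  
    start_date=datetime(2021, 1, 1),  
    schedule_interval=None,  
    catchup=False,  
) as dag:  
    start = BashOperator(task_id="start", bash_command="echo start")  
    # task_a and task_b have no dependency on each other,  
    # so the LocalExecutor can run them at the same time  
    task_a = BashOperator(task_id="task_a", bash_command="sleep 10")  
    task_b = BashOperator(task_id="task_b", bash_command="sleep 10")  
    start >> [task_a, task_b]

After triggering it manually from the UI, you should see task_a and task_b running simultaneously, something the SequentialExecutor could not do.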

5. Setting up Airflow to Run as a Service

Our last step is to configure daemons for the Scheduler and the Webserver. This ensures that Airflow gets automatically restarted in case of a failure or after our machine is rebooted.

As an initial step, we need to configure Gunicorn. Since by default it is not installed globally, we need to create a symbolic link for it.

sudo ln -fs $(which gunicorn) /bin/gunicorn

Next, we create service files for Webserver and Scheduler:

sudo touch /etc/systemd/system/airflow-webserver.service  
sudo touch /etc/systemd/system/airflow-scheduler.service

Our airflow-webserver.service must look like the following:

#  
# Licensed to the Apache Software Foundation (ASF) under one  
# or more contributor license agreements. See the NOTICE file  
# distributed with this work for additional information  
# regarding copyright ownership. The ASF licenses this file  
# to you under the Apache License, Version 2.0 (the  
# "License"); you may not use this file except in compliance  
# with the License. You may obtain a copy of the License at  
#  
# http://www.apache.org/licenses/LICENSE-2.0  
#  
# Unless required by applicable law or agreed to in writing,  
# software distributed under the License is distributed on an  
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY  
# KIND, either express or implied. See the License for the  
# specific language governing permissions and limitations  
# under the License.
[Unit]  
Description=Airflow webserver daemon  
After=network.target postgresql.service mysql.service  
Wants=postgresql.service mysql.service  
[Service]  
EnvironmentFile=/etc/environment  
User=airflow  
Group=airflow  
Type=simple  
ExecStart=/home/airflow/.local/bin/airflow webserver  
Restart=on-failure  
RestartSec=5s  
PrivateTmp=true  
[Install]  
WantedBy=multi-user.target

Similarly, we add the following content to airflow-scheduler.service:

#  
# Licensed to the Apache Software Foundation (ASF) under one  
# or more contributor license agreements. See the NOTICE file  
# distributed with this work for additional information  
# regarding copyright ownership. The ASF licenses this file  
# to you under the Apache License, Version 2.0 (the  
# "License"); you may not use this file except in compliance  
# with the License. You may obtain a copy of the License at  
#  
# http://www.apache.org/licenses/LICENSE-2.0  
#  
# Unless required by applicable law or agreed to in writing,  
# software distributed under the License is distributed on an  
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY  
# KIND, either express or implied. See the License for the  
# specific language governing permissions and limitations  
# under the License.
[Unit]  
Description=Airflow scheduler daemon  
After=network.target postgresql.service mysql.service  
Wants=postgresql.service mysql.service  
[Service]  
EnvironmentFile=/etc/environment  
User=airflow  
Group=airflow  
Type=simple  
ExecStart=/home/airflow/.local/bin/airflow scheduler  
Restart=always  
RestartSec=5s  
[Install]  
WantedBy=multi-user.target

Note: depending on where you installed Airflow, the ExecStart path might need to be changed; likewise, User and Group should match the account that runs Airflow.

Now we just need to reload our system daemon, enable and start our services:

sudo systemctl daemon-reload  
sudo systemctl enable airflow-scheduler.service  
sudo systemctl start airflow-scheduler.service  
sudo systemctl enable airflow-webserver.service  
sudo systemctl start airflow-webserver.service

Our services should now be up and running. To confirm that:

sudo systemctl status airflow-webserver.service  
sudo systemctl status airflow-scheduler.service

You should see output stating that both services are active (running) and enabled.

That’s it. Now you have a basic production setup for Apache Airflow using the LocalExecutor, which allows you to run DAGs containing parallel tasks and/or run multiple DAGs at the same time. This is definitely a must-have for any kind of serious use case, and one I plan to showcase in a future post.

Of course, there are many possible improvements here, but we will leave those for a future article.

I’m glad you made it this far and I hope you found this post useful. Check out my other articles:

Keeping Your Machine Learning Models on the Right Track: Getting Started with MLflow, Part 2  
Learn how to use MLflow Model Registry to track, register and deploy Machine Learning Models effectively. (mlopshowto.com)

Keeping Your Machine Learning Models on the Right Track: Getting Started with MLflow, Part 1  
Learn why Model Tracking and MLflow are critical for a successful machine learning project. (mlopshowto.com)

A Journey Into Machine Learning Observability with Prometheus and Grafana, Part I  
Deploying Prometheus and Grafana on Kubernetes in 10 minutes for basic infrastructure monitoring. (mlopshowto.com)