Imagine you have five machine learning projects going on, all of which involve extracting data from a database and creating features from this raw data. Let’s say that these functions are common across your slate of projects, and in the status quo, you have to make changes to each version of your data extraction/feature creation code every time you tweak one instance of it. In this case, you might want to wrap up all the relevant code into a package, instead of including it in the scripts for each project. In this post, we’ll cover the basics and benefits of packaging your Python code.
(In this post, the code snippets will reference my prior post on building safe, easy-to-use database connections in Python — you don’t need to read it to understand what we’ll talk about here, but it couldn’t hurt!)
Packaging allows you to manage your projects independently and conveniently. It's especially valuable in larger projects with distinct process stages: ETL, cleansing, feature engineering, modeling, output, etc. Returning to our five machine learning projects: if the shared extraction and feature-creation functions are bundled into a package, you just import it into each project instead of copy/pasting the code and maintaining each instance separately.
As a practical example, let's say that your customer decides to make a change to their database, and the ETL classes you've created (customer_clean and customer_features, we'll say) have to change accordingly. Packaging can reduce the headache that this entails: instead of changing the code for five different projects (i.e. committing, pushing, and deploying changes for each), you just have to change the customer_clean class in your one and only data ETL package.
So let's package it up! With our two hypothetical ETL classes (customer_clean and customer_features) included, the codebase we want to include in our package should resemble the following, which we'll call functions.py.
Let's start the process by ensuring you have setuptools installed:
pip install setuptools
Next, check that your directory matches this structure:
client_package/
├── LICENSE        # see below
├── README.md
├── setup.py       # see below
└── client_package/
    ├── __init__.py
    └── functions.py
Below is an example version of setup.py. Each entry's meaning should be pretty straightforward; some of them can be ignored (e.g. classifiers), but I recommend you fill them out anyway. Of course, you must also give your package a name. Here, we'll keep it simple and refer to our code as client_package. (Note that we are using the cx_Oracle package for database connections, so don't forget to include it in install_requires.)
import setuptools

# if you have a README.md file
with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="client_package",
    version="0.0.1",
    description="Supporting packages for project",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=setuptools.find_packages(),
    install_requires=["cx_Oracle"],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
)
The __init__.py file tells Python that this directory is a package; you can leave it empty.
After it’s created, there are two ways to use our package:
- Install and use it locally
- Build a wheel (and potentially publish it for the world to use)
The first option is generally used for development. With a local (editable) install, when we run import client_package, Python resolves the import from your project directory rather than from a copy in site-packages.
The second option is for a tested, stable version of a package. It installs the project and makes it accessible to the entire Python environment. When building a wheel, we generate a .whl file and install it. After doing so, the package will be located in your environment's site-packages directory.
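If you want to see exactly where that directory is for your environment, you can ask Python for its purelib path:

```python
import sysconfig

# Print the directory where pip installs pure-Python packages
# (typically <Python path>/Lib/site-packages on a Windows install).
print(sysconfig.get_paths()["purelib"])
```

This is handy for confirming that an installed wheel landed where you expected.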
Let's take a look at the local approach first. In many projects, you may see people import directly from scripts in the project directory. A common practice is to run the following command in the topmost such directory:
pip install -e .
This tells Python to install your project as a package, but instead of copying it into <Python path>/Lib, it looks for the package in your project directory. You'll just need a setup.py (like the one above) alongside any code you wish to package up. You can also bundle scripts located in a subdirectory, making the resulting package available from the directory in which the command is run:
pip install -e ./example_sub_dir
Compared with adding your project directory to your PYTHONPATH, this local approach is safer (it doesn't impact our environment) and more convenient. It's also easy to use: on a new machine, just git pull your project, create a new environment, and then run the above command to automatically install all dependencies.
The second approach is standard for creating publicly available packages, but it might suit your needs for local use too. If you want to build your project into a wheel (and even publish it), you can run the following command in the project's root directory (where setup.py lives):
python setup.py bdist_wheel
And doing so will give you a structure like this:
client_package/
├── build/
├── client_package.egg-info/
├── dist/
│   └── client_package-0.0.1-py3-none-any.whl
├── setup.py
└── ...
The client_package-0.0.1-py3-none-any.whl file in dist/ is the wheel file you want. You can simply use pip install client_package-0.0.1-py3-none-any.whl to install it right into a Python environment, neatly and robustly. Usage is as simple as importing and calling the classes and methods within:
from client_package.functions import oracle_connection

with oracle_connection() as conn:
    data = pd.read_sql(sql, conn.connector)
Packaging has many benefits, from easing the burden of code maintenance across multiple projects, to increasing adoption both internally and externally. Happy packaging!