Creating Harlequin Adapter for Apache DataFusion

Demonstration on how to publish to PyPI for Harlequin DataFusion Adapter with Poetry.

tutorial
Author

Daniel Mesejo

Published

July 11, 2024

Modified

July 15, 2024

Welcome to the first post in LETSQL’s tutorial series!

In this blog post, we take a detour from our usual theme of data pipelines to demonstrate how to create and publish a Python package with Poetry

Introduction

Harlequin is a TUI client for SQL databases known for its light-weight extensive support for SQL databases. It is a versatile tool for data exploration and analysis workflows. Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets. However, Harlequin did not have a DataFusion adapter before. Thankfully, it was really easy to add one.

In this post, We’ll demonstrate these concepts by building a Harlequin adapter for DataFusion. And, by way of doing so, we will also cover Poetry’s essential features, project setup, and the steps to publish your package on PyPI.

To get the most out of this guide, you should have a basic understanding of virtual environments, Python packages and modules, and pip. Our objectives are to:

  • Introduce Poetry and its advantages
  • Set up a project using Poetry
  • Develop a Harlequin adapter for DataFusion
  • Prepare and publish the package to PyPI

By the end, you’ll have practical experience with Poetry and an understanding of modern Python package management.

The code implemented in this post is available on GitHub and available in PyPI.

Harlequin

Harlequin is a SQL IDE that runs in the terminal. It provides a powerful and feature-rich alternative to traditional command-line database tools, making it versatile for data exploration and analysis workflows.

Some key things to know about Harlequin:

  • Harlequin supports multiple database adapters, connecting you to DuckDB, SQLite, PostgreSQL, MySQL, and more.
  • Harlequin provides an interactive SQL editor with features like autocomplete, syntax highlighting, and query history. It also has a results viewer that can display large result sets.
  • Harlequin replaces traditional terminal-based database tools with a more powerful and user-friendly interface.
  • Harlequin uses adapter plug-ins as a generic interface to any database.

DataFusion

DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

It ships with it its own CLI, more information can be found here.

Poetry

Poetry is a modern, feature-rich tool that streamlines dependency management and packaging for Python projects, making development more deterministic and efficient. From the documentation:

Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on, and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs and can build your project for distribution.

Creating New Adapters for Harlequin

A Harlequin adapter is a Python package that allows Harlequin to work with a database system.

An adapter is a Python package that declares an entry point in the harlequin.adapters group. That entry point should reference a subclass of the HarlequinAdapter abstract base class. This allows Harlequin to discover installed adapters and instantiate a selected adapter at run-time

In addition to the HarlequinAdapter class, the package must also provide implementations for HarlequinConnection, and HarlequinCursor. A more detailed description can be found on this guide.

The Harlequin Adapter Template

The first step for developing a Harlequin adapter is to generate a new repo from the existing harlequin-adapter-template

GitHub templates are repositories that serve as starting points for new projects. They provide pre-configured files, structures, and settings that are copied to new repositories, allowing for quick project setup without the overhead of forking. This feature streamlines the process of creating consistent, well-structured projects based on established patterns.

The harlequin-adapter-template comes with a poetry.lock file and a pyproject.toml file, in addition to some boilerplate code for defining the required classes.

Coding the Adapter

Let’s explore the essential files needed for package distribution before we get into the specifics of coding.

Package configuration

The pyproject.toml file is now the standard for configuring Python packages for publication and other tools. Introduced in PEP 518 and PEP 621, this TOML-formatted file consolidates multiple configuration files into one. It enhances dependency management by making it more robust and standardized.

Poetry, utilizes pyproject.toml to handle the project’s virtual environment, resolve dependencies, and create packages.

The pyproject.toml of the template is as follows:

[tool.poetry]
name = "harlequin-myadapter"
version = "0.1.0"
description = "A Harlequin adapter for <my favorite database>."
authors = ["Ted Conbeer <tconbeer@users.noreply.github.com>"]
license = "MIT"
readme = "README.md"
packages = [
    { include = "harlequin_myadapter", from = "src" },
]

[tool.poetry.plugins."harlequin.adapter"]
my-adapter = "harlequin_myadapter:MyAdapter"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
harlequin = "^1.7"

[tool.poetry.group.dev.dependencies]
ruff = "^0.1.6"
pytest = "^7.4.3"
mypy = "^1.7.0"
pre-commit = "^3.5.0"
importlib_metadata = { version = ">=4.6.0", python = "<3.10.0" }

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

As it can be seen:

  • The [tool.poetry] section of the pyproject.toml file is where you define the metadata for your Python package, such as the name, version, description, authors, etc.

  • The [tool.poetry.dependencies] subsection is where you declare the runtime dependencies your project requires. Running poetry add <package> will automatically update this section.

  • The [tool.poetry.dev-dependencies] subsection is where you declare development-only dependencies, like testing frameworks, linters, etc.

  • The [build-system] section is used to store build-related data. In this case, it specifies the build-backend as "poetry.core.masonry.api". In a narrow sense, the core responsibility of a build-backend is to build wheels and sdist.

The repository also includes a poetry.lock file, a Poetry-specific component generated by running poetry install or poetry update. This lock file specifies the exact versions of all dependencies and sub-dependencies for your project, ensuring reproducible installations across different environments.

It’s crucial to avoid manual edits to the poetry.lock file, as this can cause inconsistencies and installation issues. Instead, make changes to your pyproject.toml file and allow Poetry to automatically update the lock file by running poetry lock.

Getting Poetry

Per Poetry’s installation warning

Poetry should always be installed in a dedicated virtual environment to isolate it from the rest of your system. It should in no case be installed in the environment of the project that is to be managed by Poetry.

Here we will presume you have access to Poetry by running pipx install poetry

Developing in the virtual environment

With our file structure clarified, let’s begin the development process by setting up our environment. Since our project already includes pyproject.toml and poetry.lock files, we can initiate our environment using the poetry shell command.

This command activates the virtual environment linked to the current Poetry project, ensuring all subsequent operations occur within the project’s dependency context. If no virtual environment exists, poetry shell automatically creates and activates one.

poetry shell detects your current shell and launches a new instance within the virtual environment. As Poetry centralizes virtual environments by default, this command eliminates the need to locate or recall the specific path to the activate script.

To verify which Python environment is currently in use with Poetry, you can use the following commands:

poetry env list --full-path

This will show all the virtual environments associated with your project and indicate which one is currently active. As an alternative, you can get the full path of only the current environment:

poetry env info -p

With the environment activated, use poetry install to install the required dependencies. The command works as follows

  1. If a poetry.lock file is present, poetry install will use the exact versions specified in that file rather than resolving the dependencies dynamically. This ensures consistent, repeatable installations across different environments.
  1. If you run poetry install and it doesn’t seem to be progressing, you may need to run export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring in the shell you’re installing in
  1. Otherwise, it reads the pyproject.toml file in the current project, resolves the dependencies listed there, and installs them.
  2. If no poetry.lock file exists, poetry install will create one after resolving the dependencies, otherwise it will update the existing one.

To complete the environment setup, we need to add the datafusion library to our dependencies. Execute the following command:

poetry add datafusion

This command updates your pyproject.toml file with the datafusion package and installs it. If you don’t specify a version, Poetry will automatically select an appropriate one based on available package versions.

Implementing the Interfaces

To create a Harlequin Adapter, you need to implement three interfaces defined as abstract classes in the harlequin.adapter module.

The first one is the HarlequinAdapter.

class DataFusionAdapter(HarlequinAdapter):
    def __init__(self, conn_str: Sequence[str], **options: Any) -> None:
        self.conn_str = conn_str
        self.options = options

    def connect(self) -> DataFusionConnection:
        conn = DataFusionConnection(self.conn_str, self.options)
        return conn

The second one is the HarlequinConnection, particularly the methods execute and get_catalog.

 def execute(self, query: str) -> HarlequinCursor | None:
     try:
         cur = self.conn.sql(query)  # type: ignore
         if str(cur.logical_plan()) == "EmptyRelation":
             return None
     except Exception as e:
         raise HarlequinQueryError(
             msg=str(e),
             title="Harlequin encountered an error while executing your query.",
         ) from e
     else:
         if cur is not None:
             return DataFusionCursor(cur)
         else:
             return None

For brevity, we’ve omitted the implementation of the get_catalog function. You can find the full code in the adapter.py file within our GitHub repository.

Finally, a HarlequinCursor implementation must be provided as well:

class DataFusionCursor(HarlequinCursor):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.cur = args[0]
        self._limit: int | None = None

    def columns(self) -> list[tuple[str, str]]:
        return [
            (field.name, _mapping.get(field.type, "?")) for field in self.cur.schema()
        ]

    def set_limit(self, limit: int) -> DataFusionCursor:
        self._limit = limit
        return self

    def fetchall(self) -> AutoBackendType:
        try:
            if self._limit is None:
                return self.cur.to_arrow_table()
            else:
                return self.cur.limit(self._limit).to_arrow_table()
        except Exception as e:
            raise HarlequinQueryError(
                msg=str(e),
                title="Harlequin encountered an error while executing your query.",
            ) from e

Making the plugin discoverable

Your adapter must register an entry point in the harlequin.adapters group using the packaging software you use to build your project. If you use Poetry, you can define the entry point in your pyproject.toml file:

[tool.poetry.plugins."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"

An entry point is a mechanism for code to advertise components it provides to be discovered and used by other code.

Notice that registering a plugin with Poetry is equivalent to the following pyproject.toml specification for entry points:

[project.entry-points."harlequin.adapter"]
datafusion = "harlequin_datafusion:DataFusionAdapter"

Testing

The template provides a set of pre-configured tests, some of which are applicable to DataFusion while others may not be relevant. One test that’s pretty cool checks if the plugin can be discovered, which is crucial for ensuring proper integration:

if sys.version_info < (3, 10):
    from importlib_metadata import entry_points
else:
    from importlib.metadata import entry_points


def test_plugin_discovery() -> None:
    PLUGIN_NAME = "datafusion"
    eps = entry_points(group="harlequin.adapter")
    assert eps[PLUGIN_NAME]
    adapter_cls = eps[PLUGIN_NAME].load()
    assert issubclass(adapter_cls, HarlequinAdapter)
    assert adapter_cls == DataFusionAdapter

To make sure the tests are passing, run:

poetry run pytest

The run command executes the given command inside the project’s virtualenv.

Building and Publishing to PyPI

With the tests passing, we’re nearly ready to publish our project. Let’s enhance our pyproject.toml file to make our package more discoverable and appealing on PyPI. We’ll add key metadata including:

  1. A link to the GitHub repository
  2. A path to the README file
  3. A list of relevant classifiers

These additions will help potential users find and understand our package more easily.

classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Developers",
    "Topic :: Software Development :: User Interfaces",
    "Topic :: Database :: Database Engines/Servers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: Implementation :: CPython"
]
readme = "README.md"
repository = "https://github.com/mesejo/datafusion-adapter"

For reference:

  • The complete list of classifiers is available on PyPI’s website.
  • For a detailed guide on writing pyproject.toml, check out this resource.
  • The formal, technical specification for pyproject.toml can be found on packaging.python.org.

Building

We’re now ready to build our library and verify its functionality by installing it in a clean virtual environment. Let’s start with the build process:

poetry build

This command will create distribution packages (both source and wheel) in the dist directory.

The wheel file should have a name like harlequin_datafusion-0.1.1-py3-none-any.whl. This follows the standard naming convention:

  • harlequin_datafusion is the package (or distribution) name
  • 0.1.1 is the version number
  • py3 indicates it’s compatible with Python 3
  • none compatible with any CPU architecture
  • any with no ABI (pure Python)

To test the installation, create a new directory called test_install. Then, set up a fresh virtual environment with the following command:

python -m venv .venv

To activate the virtual environment on MacOS or Linux:

source .venv/bin/activate

After running this command, you should see the name of your virtual environment (.venv) prepended to your command prompt, indicating that the virtual environment is now active.

To install the wheel file we just built, use pip as follows:

pip install /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl

Replace /path/to/harlequin_datafusion-0.1.1-py3-none-any.whl with the actual path to the wheel file you want to install.

If everything works fined, you should see some dependencies installed, and you should be able to do:

harlequin -a datafusion

Congrats! You have built a Python library. Now it is time to share it with the world.

Publishing to PyPI

The best practice before publishing to PyPI is to actually publish to the Test Python Package Index (TestPyPI)

To publish a package to TestPyPI using Poetry, follow these steps:

  1. Create an account at TestPyPI if you haven’t already.

  2. Generate an API token on your TestPyPI account page.

  3. Register the TestPyPI repository with Poetry by running:

    poetry config repositories.test-pypi https://test.pypi.org/legacy/
  4. To publish your package, run:

    poetry publish -r testpypi --username __token__ --password <token>

Replace <token> with the actual token value you generated in step 2. To verify the publishing process, use the following command:

python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>

This command uses two key arguments:

  • --index-url: Directs pip to find your package on TestPyPI.
  • --extra-index-url: Allows pip to fetch any dependencies from the main PyPI repository.

Replace <package-name> with your specific package name (e.g., harlequin-datafusion if following this post). For additional details, consult the information provided in this post.

To publish to the actual Python Package Index (PyPI) instead:

  1. Create an account at https://pypi.org/ if you haven’t already.

  2. Generate an API token on your PyPI account page.

  3. Run:

    poetry publish --username __token__ --password <token>

The default repository is PyPI, so there’s no need to specify it.

Is worth noting that Poetry only supports the Legacy Upload API when publishing your project.

Automated Publishing on GitHub release

Manually publishing each time is repetitive and error-prone, so to fix this problem, let us create a GitHub Action to publish each time we create a release.

Here are the key steps to publish a Python package to PyPI using GitHub Actions and Poetry:

  1. Set up PyPI authentication: You must provide your PyPI credentials (the API token) as GitHub secrets so the GitHub Actions workflow can access them. Name these secrets something like PYPI_TOKEN.

  2. Create a GitHub Actions workflow file: In your project’s .github/workflows directory, create a new file like publish.yml with the following content:

    name: Build and publish python package
    
    on:
      release:
        types: [ published ]
    
    jobs:
      publish-package:
        runs-on: ubuntu-latest
        permissions:
          contents: write
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: '3.10'
    
          - name: Install Poetry
            uses: snok/install-poetry@v1
    
          - run: poetry config pypi-token.pypi "${{ secrets.PYPI_TOKEN }}"
    
          - name: Publish package
            run: poetry publish --build --username __token__

The key is to leverage GitHub Actions to automate the publishing process and use Poetry to manage your package’s dependencies and metadata.

Conclusion

Poetry is a user-friendly Python package management tool that simplifies project setup and publication. Its intuitive command-line interface streamlines environment management and dependency installation. It supports plugin development, integrates with other tools, and emphasizes testing for robust code. With straightforward commands for building and publishing packages, Poetry makes it easier for developers to share their work with the Python community.

At LETSQL, we’re committed to contributing to the developer community. We hope this blog post serves as a straightforward guide to developing and publishing Python packages, emphasizing best practices and providing valuable resources. To subscribe to our newsletter, visit letsql.com. ## Future Work As we continue to refine the adapter, we would like to provide better autocompletion and direct reading from files (parquet, csv) as in the DataFusion-cli. This requires a tighter integration with the Rust library without going through the Python bindings.

Your thoughts and feedback are invaluable as we navigate this journey. Share your experiences, questions, or suggestions in the comments below or on our community forum. Let’s redefine the boundaries of data science and machine learning integration.

Acknowledgements

Thanks to Dan Lovell and Hussain Sultan for the comments and the thorough review.