Peter Hoffmann

Using Docker multi-stage builds to compile turbodbc with PyArrow support on Debian 11

Turbodbc is a Python module for accessing relational databases via the Open Database Connectivity (ODBC) interface. For maximum performance, turbodbc offers built-in NumPy and Apache Arrow support and transfers data in batches, rather than record by record as many other ODBC modules do.
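
To make the Arrow support concrete, here is a minimal sketch of fetching a complete result set as an Arrow table. It assumes a connection obtained from turbodbc.connect(); the helper name fetch_arrow_table, the DSN, and the table name are hypothetical placeholders.

```python
def fetch_arrow_table(connection, query):
    """Run `query` and return the complete result set as an Arrow table.

    `connection` is expected to be a turbodbc connection. The call to
    fetchallarrow() only succeeds when turbodbc was compiled with Arrow
    support, which is what the build described in this post is about.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchallarrow()
    finally:
        # Release the cursor even if the query fails.
        cursor.close()


# Usage (hypothetical DSN and table name):
#   import turbodbc
#   connection = turbodbc.connect(dsn="my_dsn")
#   table = fetch_arrow_table(connection, "SELECT * FROM my_table")
```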

Building turbodbc with PyArrow support has some caveats: the build checks at compile time whether PyArrow is installed, so PyArrow must already be present when turbodbc is compiled, and the C++ build requires pybind11 and several Debian development packages.

By using Docker multi-stage builds, we can build turbodbc with PyArrow support natively, without adding the development packages to the final image.

The first step is the base image that includes all necessary Debian packages to run turbodbc later:

# syntax=docker/dockerfile:1

FROM debian:bullseye AS base

# Create a non-root user. UID must be greater than 1000.
RUN useradd --uid 1100 app --create-home

# Run update and install in the same layer so the install never uses a stale cached package index.
RUN --mount=type=cache,target=/var/cache/apt \
    apt-get update \
    && apt-get install --yes \
        python3 python3-venv git \
        libodbc1 odbcinst odbcinst1debian2 binutils-x86-64-linux-gnu
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
WORKDIR /app/
ENV PYTHONPATH=/app/

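In the second stage, we install the build requirements that are needed only to compile turbodbc with Arrow support. Two details of the final RUN instruction are important: pyarrow.create_library_symlinks() has to run first so that the linker can find the Arrow shared libraries bundled with the PyArrow wheel, and pip needs --no-build-isolation so that the build sees the pybind11, NumPy, and PyArrow packages already installed in the venv: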

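FROM base AS builder
# Refresh the package index here as well, so this stage does not depend on the index cached in the base layer.
RUN --mount=type=cache,target=/var/cache/apt \
    apt-get update \
    && apt-get -yq install \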
    build-essential \
    gdb \
    lcov \
    libbz2-dev \
    libffi-dev \
    libgdbm-dev \
    liblzma-dev \
    libboost-dev \
    libncurses5-dev \
    libreadline6-dev \
    libsqlite3-dev \
    libssl-dev \
    lzma \
    lzma-dev \
    python3-dev \
    tk-dev \
    unixodbc-dev \
    uuid-dev \
    xvfb \
    zlib1g-dev


RUN pip3 install -U pip==22.0.4 setuptools==45.2.0 wheel==0.37.1

RUN pip3 install -U pybind11==2.10.1 numpy==1.23.5 pandas==1.5.2 six==1.16.0 pyarrow==5.0.0

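# -D_GLIBCXX_USE_CXX11_ABI=0 compiles turbodbc against libstdc++'s pre-C++11 ABI,
# presumably to match the Arrow libraries shipped in the PyArrow wheel.
RUN python3 -c "import pyarrow; pyarrow.create_library_symlinks()" \
    && CPPFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" pip3 install --no-build-isolation turbodbc==4.5.5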
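
Because Arrow support is detected at build time, a build that did not pick up PyArrow may only show up later, when the first Arrow fetch fails. A minimal sketch of a check that could run as an extra step at the end of the builder stage follows. It assumes that turbodbc ships its Arrow bindings as a separate extension module named turbodbc_arrow_support; that module name and the helper names missing_modules and check are assumptions, not something the steps above confirm.

```python
import importlib.util
import sys

# Assumed module names: the turbodbc package itself plus the optional
# extension module expected to exist only in Arrow-enabled builds.
REQUIRED_MODULES = ["turbodbc", "turbodbc_arrow_support"]


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check(names):
    """Print a short report and return a process exit code (0 = all found)."""
    missing = missing_modules(names)
    if missing:
        print("missing modules: " + ", ".join(missing), file=sys.stderr)
        return 1
    print("all modules found: " + ", ".join(names))
    return 0


# To use this as a build step, end the script with
#   sys.exit(check(REQUIRED_MODULES))
# so that a non-zero exit code aborts the Docker build.
```

The check only locates the modules and does not import them, so it will not catch a library that fails to load at import time.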

In the third stage, we start again from the base image and copy over only the venv that contains the compiled turbodbc, leaving the build dependencies behind:

FROM base AS runner
COPY --from=builder /opt/venv /opt/venv

COPY requirements.txt /app/requirements.txt

RUN --mount=type=cache,target=/root/.cache pip install --requirement /app/requirements.txt

# Set the user created above
USER 1100

CMD []
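
One thing to watch in this last stage: if requirements.txt pins a different version of a package that turbodbc was compiled against, pip replaces that package in the copied venv, and the compiled extension may then fail to load. A minimal sketch of a guard follows. The pins are taken from the builder stage above; the helper name conflicting_pins is hypothetical, and the parser deliberately handles only exact `name==version` lines.

```python
import re

# Versions installed in the builder stage before turbodbc was compiled.
BUILD_PINS = {
    "numpy": "1.23.5",
    "pyarrow": "5.0.0",
    "pybind11": "2.10.1",
    "turbodbc": "4.5.5",
}

# Matches exact pins such as "pyarrow==5.0.0"; other specifiers are ignored.
_PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;#]+)")


def conflicting_pins(requirement_lines, build_pins=BUILD_PINS):
    """Return (name, requested, built_against) for every conflicting pin."""
    conflicts = []
    for line in requirement_lines:
        match = _PIN.match(line)
        if match is None:
            continue  # comment, blank line, or a non-exact specifier
        name = match.group(1).lower().replace("_", "-")
        requested = match.group(2)
        built_against = build_pins.get(name)
        if built_against is not None and requested != built_against:
            conflicts.append((name, requested, built_against))
    return conflicts


# Usage:
#   with open("requirements.txt") as handle:
#       print(conflicting_pins(handle))
```

Range specifiers such as `pyarrow>=6` are not detected by this sketch, so it complements a review of requirements.txt rather than replacing one.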