Using Docker multi-stage builds to compile turbodbc with PyArrow support on Debian 11
Turbodbc is a Python module for accessing relational databases via the Open Database Connectivity (ODBC) interface. For maximum performance, turbodbc offers built-in NumPy and Apache Arrow support and transfers data in batches, rather than one record at a time as many other ODBC modules do.
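As a rough illustration of what Arrow support provides once the build succeeds, the sketch below fetches a whole result set as an Arrow table. It is a minimal example, not taken from the original setup: the helper name `fetch_arrow_table`, the DSN `my_dsn`, and the query are placeholders, and the connection is passed in so the helper itself does not depend on a particular database.

```python
from typing import Any


def fetch_arrow_table(connection: Any, sql: str) -> Any:
    """Run a query and return the full result set via cursor.fetchallarrow().

    `connection` is expected to be a turbodbc connection (or anything with
    the same cursor()/execute()/fetchallarrow()/close() interface).
    """
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        # With Arrow support compiled in, this returns a pyarrow.Table.
        return cursor.fetchallarrow()
    finally:
        cursor.close()


# Usage sketch (needs turbodbc built with Arrow support and a configured
# ODBC data source; "my_dsn" and the query are placeholders):
#
#   from turbodbc import connect
#   table = fetch_arrow_table(connect(dsn="my_dsn"), "SELECT 1 AS x")
#   df = table.to_pandas()
```

If turbodbc was built without Arrow support, the `fetchallarrow()` call is expected to fail, which is the situation the build steps in this post are meant to avoid.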
Building turbodbc with PyArrow support has some caveats: it uses build-time detection to check if PyArrow is installed and requires pybind11 and several Debian development packages for the C++ build.
By using Docker multi-stage builds, we can build turbodbc with PyArrow support natively, without adding the development packages to the final image.
The first stage is the base image, which includes all Debian packages needed to run turbodbc later:
# syntax=docker/dockerfile:1
FROM debian:bullseye as base
# Create a non-root user. UID must be greater than 1000.
RUN useradd --uid 1100 app --create-home
RUN apt-get update
RUN --mount=type=cache,target=/var/cache/apt apt-get install --yes python3 python3-venv git
RUN --mount=type=cache,target=/var/cache/apt apt-get install --yes libodbc1 odbcinst odbcinst1debian2 binutils-x86-64-linux-gnu
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
WORKDIR /app/
ENV PYTHONPATH=/app/
In the second stage, we install the build requirements needed only to compile turbodbc with Arrow support. There are two important notes:
- First, PyArrow must be installed before turbodbc is built, because the turbodbc build process automatically detects if PyArrow is available.
- Second, to make the detection work, pass --no-build-isolation to the turbodbc installation and ensure the Arrow libraries are linked correctly.
FROM base as builder
RUN --mount=type=cache,target=/var/cache/apt apt-get -yq install \
    build-essential \
    gdb \
    lcov \
    libbz2-dev \
    libffi-dev \
    libgdbm-dev \
    liblzma-dev \
    libboost-dev \
    libncurses5-dev \
    libreadline6-dev \
    libsqlite3-dev \
    libssl-dev \
    lzma \
    lzma-dev \
    python3-dev \
    tk-dev \
    unixodbc-dev \
    uuid-dev \
    xvfb \
    zlib1g-dev
RUN pip3 install -U pip==22.0.4 setuptools==45.2.0 wheel==0.37.1
RUN pip3 install -U pybind11==2.10.1 numpy==1.23.5 pandas==1.5.2 six==1.16.0 pyarrow==5.0.0
RUN python3 -c "import pyarrow; pyarrow.create_library_symlinks()" \
    && CPPFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" pip3 install --no-build-isolation turbodbc==4.5.5
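Since the turbodbc build decides on Arrow support by detecting PyArrow, it can help to confirm that the build-time modules are visible in the venv before the turbodbc install step runs. The following is a minimal Python sketch of such a check, not part of the original Dockerfile; the helper name `missing_build_modules` and the default module list are assumptions chosen for illustration.

```python
import importlib.util


def missing_build_modules(required=("pyarrow", "pybind11", "numpy")):
    """Return the top-level module names from `required` that cannot be found.

    find_spec() only locates the module; it does not import it, so the
    check is cheap and has no side effects.
    """
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_build_modules()
    if missing:
        print("Missing build-time modules: " + ", ".join(missing))
    else:
        print("All build-time modules found")
```

Run with the venv's interpreter inside the builder stage, an empty result suggests that a build started with --no-build-isolation can see the same packages.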
In the third stage, we start again from the base image and copy over only the venv that contains the compiled turbodbc package, leaving the build dependencies behind:
FROM base as runner
COPY --from=builder /opt/venv /opt/venv
COPY requirements.txt /app/requirements.txt
RUN --mount=type=cache,target=/root/.cache pip install --requirement /app/requirements.txt
# Set the user created above
USER 1100
CMD []
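To check that the copied venv still works without the development packages, a small smoke test can be run inside the final image. The sketch below is a hedged example rather than part of the original setup: the helper name `import_status` is made up for illustration, and the module name `turbodbc_arrow_support` is an assumption about how turbodbc packages its Arrow extension, so adjust it if your version differs.

```python
import importlib


def import_status(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # Covers a missing package as well as a compiled extension
            # that cannot load one of its shared libraries.
            status[name] = False
    return status


# Assumption: the Arrow bindings live in a separate extension module
# named "turbodbc_arrow_support".
MODULES_TO_CHECK = ("turbodbc", "pyarrow", "turbodbc_arrow_support")

if __name__ == "__main__":
    for name, ok in import_status(MODULES_TO_CHECK).items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Unlike a pure lookup, this actually imports the modules, so it would also surface a runtime library that is missing from the base stage.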
