<h1><a href="http://peter-hoffmann.com/2023/exploring-mountain-huts-with-sparql-and-wikidata.html">Exploring Mountain Huts with SPARQL and Wikidata</a> (2023-12-23)</h1>
<p>For outdoor enthusiasts and avid hikers, searching for mountain huts or shelters
is a common part of tour planning. In this blog post,
we'll embark on a journey to retrieve information about mountain huts around a
specific latitude and longitude using the powerful combination of SPARQL and
Wikidata.</p>
<h2>Understanding SPARQL</h2>
<p>SPARQL (SPARQL Protocol and RDF Query Language) is a query language designed for
querying data stored in Resource Description Framework (RDF) format. RDF
provides a standardized way of representing information, making it an ideal
choice for querying diverse datasets.</p>
<h2>Accessing Wikidata</h2>
<p>Wikidata, a collaborative knowledge base, hosts a wealth of information on
various topics, including geographical features like mountain huts. By employing
SPARQL queries on the Wikidata platform, we can extract specific details about
these huts based on their geographic coordinates.</p>
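<p>Before writing any SPARQL, it helps to look at what a single Wikidata item exposes. The following sketch is my own illustration (not part of the original post); it fetches the entity for Interlaken (Q68103) via the public <code>Special:EntityData</code> endpoint and prints its coordinate location claim (P625):</p>
<pre class="highlight"><code>import requests

# Fetch the raw JSON representation of a single Wikidata item.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q68103.json"
entity = requests.get(url, timeout=30).json()["entities"]["Q68103"]

# P625 is the "coordinate location" property used throughout this post.
coordinate = entity["claims"]["P625"][0]["mainsnak"]["datavalue"]["value"]
print(coordinate["latitude"], coordinate["longitude"])
</code></pre>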
<h2>Retrieving Mountain Huts</h2>
<p>Let's delve into a SPARQL query to retrieve information about mountain huts
around a given latitude and longitude. The following query can be used as a
starting point:</p>
<pre class="highlight"><code><span class="n">SELECT</span> <span class="n">DISTINCT</span> <span class="err">?</span><span class="n">distance</span> <span class="err">?</span><span class="n">place</span> <span class="err">?</span><span class="n">placeLabel</span> <span class="err">?</span><span class="n">lat</span> <span class="err">?</span><span class="n">long</span> <span class="err">?</span><span class="n">elevation</span> <span class="n">WHERE</span> <span class="p">{</span>
<span class="n">SERVICE</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">around</span> <span class="p">{</span>
<span class="c1"># Looking for items with coordinate locations(P625)</span>
<span class="err">?</span><span class="n">place</span> <span class="n">wdt</span><span class="p">:</span><span class="n">P625</span> <span class="err">?</span><span class="n">location</span> <span class="o">.</span>
<span class="c1"># That are in a circle with a centre of with a point</span>
<span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">center</span> <span class="s2">"Point(8.114444,46.521944)"</span><span class="o">^^</span><span class="n">geo</span><span class="p">:</span><span class="n">wktLiteral</span> <span class="o">.</span>
<span class="c1"># Where the circle has a radius of km</span>
<span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">radius</span> <span class="s2">"10"</span> <span class="o">.</span>
<span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">distance</span> <span class="err">?</span><span class="n">distance</span> <span class="o">.</span>
<span class="p">}</span>
<span class="err">?</span><span class="n">place</span> <span class="n">p</span><span class="p">:</span><span class="n">P625</span> <span class="err">?</span><span class="n">coordinataes</span> <span class="o">.</span>
<span class="err">?</span><span class="n">coordinataes</span> <span class="n">psv</span><span class="p">:</span><span class="n">P625</span> <span class="p">[</span>
<span class="n">wikibase</span><span class="p">:</span><span class="n">geoLatitude</span> <span class="err">?</span><span class="n">lat</span><span class="p">;</span>
<span class="n">wikibase</span><span class="p">:</span><span class="n">geoLongitude</span> <span class="err">?</span><span class="n">long</span>
<span class="p">]</span> <span class="o">.</span>
<span class="err">?</span><span class="n">place</span> <span class="n">wdt</span><span class="p">:</span><span class="n">P31</span> <span class="err">?</span><span class="n">subclassOf</span> <span class="o">.</span>
<span class="c1"># bivouac shelter (Q879208) or mountain hut (Q182676) </span>
<span class="n">VALUES</span> <span class="err">?</span><span class="n">subclassOf</span> <span class="p">{</span> <span class="n">wd</span><span class="p">:</span><span class="n">Q879208</span> <span class="n">wd</span><span class="p">:</span><span class="n">Q182676</span><span class="p">}</span> <span class="o">.</span>
<span class="c1"># Use the label service to get the label with fallback languages</span>
<span class="n">SERVICE</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">label</span> <span class="p">{</span> <span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">language</span> <span class="s2">"de,ch,en,fr,it"</span> <span class="o">.</span> <span class="p">}</span>
<span class="n">OPTIONAL</span> <span class="p">{</span> <span class="err">?</span><span class="n">place</span> <span class="n">wdt</span><span class="p">:</span><span class="n">P2044</span> <span class="err">?</span><span class="n">elevation</span><span class="o">.</span> <span class="p">}</span>
<span class="p">}</span>
<span class="n">ORDER</span> <span class="n">BY</span> <span class="err">?</span><span class="n">distance</span>
</code></pre><p>Replace the longitude and latitude in the <code>Point(8.114444,46.521944)</code> literal with the desired coordinates. This query retrieves mountain huts within a specified radius (in this example, 10 kilometers) of the given location.</p>
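<p>If you call the query from Python, it is convenient to template the centre point and radius. A minimal sketch (the helper name and its parameters are my own, not from the original post):</p>
<pre class="highlight"><code>def build_hut_query(lat: float, lon: float, radius_km: int = 10) -> str:
    """Template the query above for an arbitrary centre point and radius."""
    return f"""
    SELECT DISTINCT ?distance ?place ?placeLabel ?lat ?long ?elevation WHERE {{
      SERVICE wikibase:around {{
        ?place wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point({lon},{lat})"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "{radius_km}" .
        bd:serviceParam wikibase:distance ?distance .
      }}
      ?place p:P625 ?coordinates .
      ?coordinates psv:P625 [ wikibase:geoLatitude ?lat; wikibase:geoLongitude ?long ] .
      ?place wdt:P31 ?subclassOf .
      VALUES ?subclassOf {{ wd:Q879208 wd:Q182676 }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "de,en,fr,it" . }}
      OPTIONAL {{ ?place wdt:P2044 ?elevation. }}
    }}
    ORDER BY ?distance
    """

# Same centre point as above (note the longitude, latitude order in the WKT point).
query = build_hut_query(lat=46.521944, lon=8.114444, radius_km=10)
</code></pre>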
<p>You can also search around a Wikidata location by using a Wikidata id (e.g. Q68103 for Interlaken) as a reference:</p>
<pre class="highlight"><code><span class="n">wd</span><span class="p">:</span><span class="n">Q68103</span> <span class="n">wdt</span><span class="p">:</span><span class="n">P625</span> <span class="err">?</span><span class="n">mainLoc</span> <span class="o">.</span>
<span class="c1"># Use the around service</span>
<span class="n">SERVICE</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">around</span> <span class="p">{</span>
<span class="c1"># Looking for items with coordinate locations(P625)</span>
<span class="err">?</span><span class="n">place</span> <span class="n">wdt</span><span class="p">:</span><span class="n">P625</span> <span class="err">?</span><span class="n">location</span> <span class="o">.</span>
<span class="c1"># That are in a circle with a centre of ?mainLoc(The coordinate location)</span>
<span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">center</span> <span class="err">?</span><span class="n">mainLoc</span> <span class="o">.</span>
<span class="c1"># Where the circle has a radius of km</span>
<span class="n">bd</span><span class="p">:</span><span class="n">serviceParam</span> <span class="n">wikibase</span><span class="p">:</span><span class="n">radius</span> <span class="s2">"40"</span> <span class="o">.</span>
<span class="p">}</span>
</code></pre><p>To try it out, just copy/paste the example into <a href="https://query.wikidata.org">https://query.wikidata.org</a>.</p>
<p>You can use the python library SPARQLWrapper to retrieve the results via the API in JSON format:</p>
<pre class="highlight"><code><span class="kn">from</span> <span class="nn">SPARQLWrapper</span> <span class="kn">import</span> <span class="n">SPARQLWrapper</span><span class="p">,</span> <span class="n">JSON</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="n">query</span> <span class="o">=</span>
</code></pre><pre class="highlight"><code><span class="n">user_agent</span> <span class="o">=</span> <span class="s2">"python test"</span>
<span class="n">endpoint_url</span> <span class="o">=</span> <span class="s2">"https://query.wikidata.org/sparql"</span>
<span class="n">sparql</span> <span class="o">=</span> <span class="n">SPARQLWrapper</span><span class="p">(</span><span class="n">endpoint_url</span><span class="p">,</span> <span class="n">agent</span><span class="o">=</span><span class="n">user_agent</span><span class="p">)</span>
<span class="n">sparql</span><span class="o">.</span><span class="n">setQuery</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="n">sparql</span><span class="o">.</span><span class="n">setReturnFormat</span><span class="p">(</span><span class="n">JSON</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">sparql</span><span class="o">.</span><span class="n">queryAndConvert</span><span class="p">()[</span><span class="s1">'results'</span><span class="p">][</span><span class="s1">'bindings'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">4</span><span class="p">))</span>
</code></pre>
<h1><a href="http://peter-hoffmann.com/2023/blueyonder-at-pyconde-2023.html">BlueYonder at PyCon.DE 2023</a> (2023-04-25)</h1>
<p><strong>Blue Yonder History</strong></p>
<p>It's now been 10 years since <a href="https://blueyonder.com">Blue Yonder</a> first sponsored a python
conference, at EuroPython in Florence. Since then we have been sponsoring
and/or organizing at least one python event per year. For me personally this
has always been part of my internal mission to convince leadership and fellow
team leads that participating in the open source community greatly benefits
employee development and the overall corporate culture. Young engineers
learn to represent the company and connect with other open source developers.</p>
<p>This year PyCon.DE was hosted in Berlin. With 1500 attendees, the conference has
grown tremendously over the past years. The Berlin Congress Center is of
course a very professional venue, and the orga committee did an awesome job
running a smooth conference overall. Still, I hope <a href="https://pycon.de">PyCon.de</a> will be hosted
in another city next year (maybe even in Switzerland or Austria). Leipzig, Frankfurt,
Hamburg, Basel, Bern or Wien would be awesome locations for 2024.</p>
<p><strong>Cyclic Boosting</strong></p>
<p>The main topic of this year's Blue Yonder conference booth was around the open
sourcing of <a href="https://cyclicboosting.org">Cyclic Boosting</a>. Cyclic Boosting has
been the core ML algorithm of Blue Yonder for many years. Felix Wick gave a
talk about exploring the power of <a href="https://pretalx.com/pyconde-pydata-berlin-2023/talk/MYARJG/">cyclic boosting: a python-pure, explainable,
and efficient ml
method</a>.</p>
<p>We also hosted a small Kaggle <a href="https://www.kaggle.com/competitions/blueyonder-pyconpydata-2023/overview">ML/Retail
Challenge</a>
where the open source community can apply their ML algorithms to a typical
retail challenge and benchmark them against a baseline Cyclic Boosting algorithm.</p>
<p>The challenge is to accurately forecast the demand of 300 retail products
across 20 retail stores. Accurately forecasting demand is crucial for retailers
because it allows them to make informed decisions regarding inventory
management, pricing strategies, and sales projections, all of which can have a
significant impact on a retailer's bottom line. In short, demand forecasting is
an essential tool for retailers looking to optimize their operations and
maximize profits so they can stay competitive in a crowded retail landscape.</p>
<p>It was quite fun to discuss solutions and technical approaches at our booth,
and people were really keen to beat Felix's reference implementation.</p>
<p><strong>Notable PyCon.DE talks</strong></p>
<p><a href="https://2023.pycon.de/program/TP7ABB/">Wald: A Modern and Sustainable Analytics Stack</a> from Florian Wilhelm.</p>
<p>The name WALD-stack stems from the four technologies it is composed of, i.e. a
cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source
data integration engine Airbyte, the open-source full-stack BI platform
Lightdash, and the open-source data transformation tool DBT.</p>
<p><a href="https://pretalx.com/pyconde-pydata-berlin-2023/talk/MQHTHY/">Pragmatic ways of using Rust in your data project</a> from Christopher Prohm</p>
<p>Writing efficient data pipelines in Python can be tricky. The standard
recommendation is to use vectorized functions implemented in Numpy, Pandas, or
the like. However, what to do, when the processing task does not fit these
libraries? Using plain Python for processing can result in lacking performance,
in particular when handling large data sets.</p>
<p><a href="https://pretalx.com/pyconde-pydata-berlin-2023/talk/9Q38VT/">Actionable Machine Learning in the Browser with PyScript</a> from Valero Maggio</p>
<p>PyScript brings the full PyData stack into the browser, opening up
unprecedented use cases for interactive data-intensive applications. In this
scenario, the web browser becomes a ubiquitous computing platform, operating
within a (nearly) zero-installation & server-less environment.</p>
<p><strong>Hiring</strong></p>
<p>We are looking for new talent for our AI/ML teams at Blue Yonder. Take a look if you are
interested in one of the following positions:</p>
<ul>
<li><a href="https://2023.pycon.de/job-board/14/">https://2023.pycon.de/job-board/14/</a> Senior Machine Learning Engineer</li>
<li><a href="https://2023.pycon.de/job-board/15/">https://2023.pycon.de/job-board/15/</a> Full Stack Developer</li>
<li><a href="https://2023.pycon.de/job-board/16/">https://2023.pycon.de/job-board/16/</a> Senior Data Engineer</li>
</ul>
<p>Or just reach out to me; you can of course also apply directly at <a href="https://jda.wd5.myworkdayjobs.com/JDA_Careers">https://jda.wd5.myworkdayjobs.com/JDA_Careers</a>.</p>
<h1><a href="http://peter-hoffmann.com/2022/using-docker-multistage-build-to-build-turbodbc-with-pyarrow-support-on-debian-11.html">Using docker multistage build to build turbodbc with pyarrow support on Debian 11</a> (2022-12-16)</h1>
<p><a href="https://turbodbc.readthedocs.io/en/latest/">Turbodbc</a> is a Python module to
access relational databases via the Open Database Connectivity (ODBC) interface.
For maximum performance, turbodbc offers built-in NumPy and Apache Arrow support
and internally relies on batched data transfer instead of single-record
communication as other popular ODBC modules do.</p>
<p>Building turbodbc with pyarrow support has some caveats: the build detects at
build time whether pyarrow is installed, and it needs pybind11 and several debian
dev packages for the C++ compilation.</p>
<p>By using <a href="https://docs.docker.com/build/building/multi-stage/">docker multistage builds</a>
we can natively build turbodbc with pyarrow support without getting the dev packages
into the final image.</p>
<p>The first step is the base image that has all necessary debian packages to run turbodbc later on:</p>
<pre class="highlight"><code><span class="c1"># syntax=docker/dockerfile:1</span>
<span class="n">FROM</span> <span class="n">debian</span><span class="p">:</span><span class="n">bullseye</span> <span class="k">as</span> <span class="n">base</span>
<span class="c1"># Create user, must not be ROOT and UID should be greater than 1000</span>
<span class="n">RUN</span> <span class="n">useradd</span> <span class="o">--</span><span class="n">uid</span> <span class="mi">1100</span> <span class="n">app</span> <span class="o">--</span><span class="n">create</span><span class="o">-</span><span class="n">home</span>
<span class="n">RUN</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">update</span>
<span class="n">RUN</span> <span class="o">--</span><span class="n">mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="n">cache</span><span class="p">,</span><span class="n">target</span><span class="o">=/</span><span class="n">var</span><span class="o">/</span><span class="n">cache</span><span class="o">/</span><span class="n">apt</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">install</span> <span class="o">--</span><span class="n">yes</span> <span class="n">python3</span> <span class="n">python3</span><span class="o">-</span><span class="n">venv</span> <span class="n">git</span>
<span class="n">RUN</span> <span class="o">--</span><span class="n">mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="n">cache</span><span class="p">,</span><span class="n">target</span><span class="o">=/</span><span class="n">var</span><span class="o">/</span><span class="n">cache</span><span class="o">/</span><span class="n">apt</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">install</span> <span class="o">--</span><span class="n">yes</span> <span class="n">libodbc1</span> <span class="n">odbcinst</span> <span class="n">odbcinst1debian2</span> <span class="n">binutils</span><span class="o">-</span><span class="n">x86</span><span class="o">-</span><span class="mi">64</span><span class="o">-</span><span class="n">linux</span><span class="o">-</span><span class="n">gnu</span>
<span class="n">RUN</span> <span class="n">python3</span> <span class="o">-</span><span class="n">m</span> <span class="n">venv</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">venv</span>
<span class="n">ENV</span> <span class="n">PATH</span><span class="o">=</span><span class="s2">"/opt/venv/bin:$</span><span class="si">{PATH}</span><span class="s2">"</span>
<span class="n">WORKDIR</span> <span class="o">/</span><span class="n">app</span><span class="o">/</span>
<span class="n">ENV</span> <span class="n">PYTHONPATH</span><span class="o">=/</span><span class="n">app</span><span class="o">/</span>
</code></pre><p>In the second stage we install the build requirements that are only needed to
compile turbodbc with arrow support. There are two important notes:</p>
<p>First, pyarrow has to be installed before turbodbc is built, as the
turbodbc build process automatically detects whether pyarrow is available.</p>
<p>Second, to make the detection work you need to pass <code>--no-build-isolation</code> to
the turbodbc install and make sure the arrow libraries are linked correctly.</p>
<pre class="highlight"><code><span class="n">FROM</span> <span class="n">base</span> <span class="k">as</span> <span class="n">builder</span>
<span class="n">RUN</span> <span class="o">--</span><span class="n">mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="n">cache</span><span class="p">,</span><span class="n">target</span><span class="o">=/</span><span class="n">var</span><span class="o">/</span><span class="n">cache</span><span class="o">/</span><span class="n">apt</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="o">-</span><span class="n">yq</span> <span class="n">install</span> \
<span class="n">build</span><span class="o">-</span><span class="n">essential</span> \
<span class="n">gdb</span> \
<span class="n">lcov</span> \
<span class="n">libbz2</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libffi</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libgdbm</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">liblzma</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libboost</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libncurses5</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libreadline6</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libsqlite3</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">libssl</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">lzma</span> \
<span class="n">lzma</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">python3</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">tk</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">unixodbc</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">uuid</span><span class="o">-</span><span class="n">dev</span> \
<span class="n">xvfb</span> \
<span class="n">zlib1g</span><span class="o">-</span><span class="n">dev</span>
<span class="n">RUN</span> <span class="n">pip3</span> <span class="n">install</span> <span class="o">-</span><span class="n">U</span> <span class="n">pip</span><span class="o">==</span><span class="mf">22.0</span><span class="o">.</span><span class="mi">4</span> <span class="n">setuptools</span><span class="o">==</span><span class="mf">45.2</span><span class="o">.</span><span class="mi">0</span> <span class="n">wheel</span><span class="o">==</span><span class="mf">0.37</span><span class="o">.</span><span class="mi">1</span>
<span class="n">RUN</span> <span class="n">pip3</span> <span class="n">install</span> <span class="o">-</span><span class="n">U</span> <span class="n">pybind11</span><span class="o">==</span><span class="mf">2.10</span><span class="o">.</span><span class="mi">1</span> <span class="n">numpy</span><span class="o">==</span><span class="mf">1.23</span><span class="o">.</span><span class="mi">5</span> <span class="n">pandas</span><span class="o">==</span><span class="mf">1.5</span><span class="o">.</span><span class="mi">2</span> <span class="n">six</span><span class="o">==</span><span class="mf">1.16</span><span class="o">.</span><span class="mi">0</span> <span class="n">pyarrow</span><span class="o">==</span><span class="mf">5.0</span><span class="o">.</span><span class="mi">0</span>
<span class="n">RUN</span> <span class="n">python3</span> <span class="o">-</span><span class="n">c</span> <span class="s2">"import pyarrow; pyarrow.create_library_symlinks()"</span> \
<span class="o">&&</span> <span class="n">CPPFLAGS</span><span class="o">=</span><span class="s2">"-D_GLIBCXX_USE_CXX11_ABI=0"</span> <span class="n">pip3</span> <span class="n">install</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">build</span><span class="o">-</span><span class="n">isolation</span> <span class="n">turbodbc</span><span class="o">==</span><span class="mf">4.5</span><span class="o">.</span><span class="mi">5</span>
</code></pre><p>In the third stage we start again from the base image and only copy over the venv that contains the compiled turbodbc packages:</p>
<pre class="highlight"><code><span class="n">FROM</span> <span class="n">base</span> <span class="k">as</span> <span class="n">runner</span>
<span class="n">COPY</span> <span class="o">--</span><span class="n">from</span><span class="o">=</span><span class="n">builder</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">venv</span> <span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">venv</span>
<span class="n">COPY</span> <span class="n">requirements</span><span class="o">.</span><span class="n">txt</span> <span class="o">/</span><span class="n">app</span><span class="o">/</span><span class="n">requirements</span><span class="o">.</span><span class="n">txt</span>
<span class="n">RUN</span> <span class="o">--</span><span class="n">mount</span><span class="o">=</span><span class="nb">type</span><span class="o">=</span><span class="n">cache</span><span class="p">,</span><span class="n">target</span><span class="o">=/</span><span class="n">root</span><span class="o">/.</span><span class="n">cache</span> <span class="n">pip</span> <span class="n">install</span> <span class="o">--</span><span class="n">requirement</span> <span class="o">/</span><span class="n">app</span><span class="o">/</span><span class="n">requirements</span><span class="o">.</span><span class="n">txt</span>
<span class="c1"># Set the User we created above</span>
<span class="n">USER</span> <span class="mi">1100</span>
<span class="n">CMD</span> <span class="p">[]</span>
</code></pre>
<h1><a href="http://peter-hoffmann.com/2022/beautiful-leaflet-markers-with-folium-and-fontawesome.html">Beautiful leaflet markers with folium and fontawesome</a> (2022-12-04)</h1>
<p><a href="https://python-visualization.github.io/folium/">Folium</a> is a Python library
that allows users to create and display interactive maps. The library uses the
Leaflet.js library and is capable of creating powerful and visually appealing
maps. Folium can be used to visualize geographical data by adding markers,
polygons, heatmaps, and other geographical elements onto a map. The library is
easy to use and offers a range of options for customizing the maps and the
elements that are displayed on them.</p>
<h4>Minimal marker</h4>
<p>The minimal example just adds a marker at a specific location:</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">folium</span>
<span class="n">loc</span> <span class="o">=</span> <span class="p">[</span><span class="mf">45.957916666667</span><span class="p">,</span> <span class="mf">7.8123888888889</span><span class="p">]</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">folium</span><span class="o">.</span><span class="n">Map</span><span class="p">(</span><span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span> <span class="n">zoom_start</span><span class="o">=</span><span class="mi">13</span><span class="p">)</span>
<span class="n">folium</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span>
<span class="n">location</span><span class="o">=</span><span class="n">loc</span>
<span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
<span class="n">m</span>
</code></pre><p><img src="/static/2022/folium-marker-1.png" alt=""></p>
<h4>Marker with a bootstrap icon</h4>
<p>Markers can be customized by providing an Icon instance. By default you can use <a href="https://getbootstrap.com/docs/3.3/components/">bootstrap</a> glyphicons, which provide over 250 glyphs for free:</p>
<p><img src="/static/2022/folium-glyphicons.png" alt=""></p>
<p>In addition you can colorize the marker. Available color names are <code>red blue green purple orange darkred lightred beige darkblue darkgreen cadetblue darkpurple white pink lightblue lightgreen gray black lightgray</code> .</p>
<pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">folium</span><span class="o">.</span><span class="n">Map</span><span class="p">(</span><span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">)</span>
<span class="n">folium</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span>
<span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span>
<span class="n">icon</span><span class="o">=</span><span class="n">folium</span><span class="o">.</span><span class="n">Icon</span><span class="p">(</span><span class="n">icon</span><span class="o">=</span><span class="s2">"home"</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s2">"purple"</span><span class="p">,</span>
<span class="n">icon_color</span><span class="o">=</span><span class="s2">"blue"</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
</code></pre><p><img src="/static/2022/folium-marker-2.png" alt=""></p>
<h4>Marker with a Font Awesome icon</h4>
<p><a href="https://fontawesome.com/search?q=tent&o=r">Font Awesome</a> is a collection of scalable vector icons that can be customized and
used in a variety of ways, such as in graphic design projects, websites, and
applications. The icons are available in different styles, including Solid,
Regular, and Brands, and can be easily integrated by adding the <code>fa</code> prefix.</p>
<p><img src="/static/2022/folium-fontawesome.png" alt=""></p>
<pre class="highlight"><code><span class="n">m</span> <span class="o">=</span> <span class="n">folium</span><span class="o">.</span><span class="n">Map</span><span class="p">(</span><span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">)</span>
<span class="n">folium</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span>
<span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span>
<span class="n">icon</span><span class="o">=</span><span class="n">folium</span><span class="o">.</span><span class="n">Icon</span><span class="p">(</span><span class="n">icon</span><span class="o">=</span><span class="s2">"tents"</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s1">'fa'</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
</code></pre><p><img src="/static/2022/folium-marker-3.png" alt=""></p>
<h3>Extended Marker Customization with BeautifyIcon</h3>
<p>The <a href="https://github.com/masajid390/BeautifyMarker">Leaflet Beautiful Icons</a>
is lightweight plugin that adds colorful iconic markers
without images for Leaflet by giving full control of style to end user ( i.e.
unlimited colors and many more...).</p>
<p><img src="/static/2022/folium-beautiful-icons.png" alt=""></p>
<p>It is exposed in folium via the
<a href="https://python-visualization.github.io/folium/plugins.html#folium.plugins.BeautifyIcon">BeautifyIcon plugin</a>.</p>
<p>Supported icon shapes are <code>circle circle-dot doughnut rectangle rectangle-dot marker</code>, and the color
can be either one of the predefined names or any valid hex code.</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">folium.plugins</span> <span class="k">as</span> <span class="nn">plugins</span>
<span class="n">folium</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span>
<span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span>
<span class="n">icon</span><span class="o">=</span><span class="n">plugins</span><span class="o">.</span><span class="n">BeautifyIcon</span><span class="p">(</span>
<span class="n">icon</span><span class="o">=</span><span class="s2">"tent"</span><span class="p">,</span>
<span class="n">icon_shape</span><span class="o">=</span><span class="s2">"circle"</span><span class="p">,</span>
<span class="n">border_color</span><span class="o">=</span><span class="s1">'purple'</span><span class="p">,</span>
<span class="n">text_color</span><span class="o">=</span><span class="s2">"#007799"</span><span class="p">,</span>
<span class="n">background_color</span><span class="o">=</span><span class="s1">'yellow'</span>
<span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
</code></pre><p><img src="/static/2022/folium-marker-4.png" alt=""></p>
<h3>Custom Markers with a DivIcon</h3>
<p>If you want to add more textual information, you can always use plain html with a <a href="https://python-visualization.github.io/folium/modules.html#folium.features.DivIcon">DivIcon</a>. A DivIcon represents a lightweight icon for markers that uses a simple div element instead of an image.</p>
<pre class="highlight"><code><span class="n">html</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1"><span style="size: 10px; background-color: lightblue; "></span>
<span class="s1"><i class="fa-solid fa-tents"> </i></span>
<span class="s1">Monte Rosa Hut</span>
<span class="s1"></span></span>
<span class="s1">'''</span>
<span class="n">folium</span><span class="o">.</span><span class="n">map</span><span class="o">.</span><span class="n">Marker</span><span class="p">(</span><span class="n">location</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span>
<span class="n">icon</span><span class="o">=</span><span class="n">DivIcon</span><span class="p">(</span>
<span class="n">icon_size</span><span class="o">=</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span><span class="mi">20</span><span class="p">),</span>
<span class="n">icon_anchor</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">),</span>
<span class="n">html</span><span class="o">=</span><span class="n">html</span> <span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">add_to</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
</code></pre><p><img src="/static/2022/folim-marker-5.png" alt=""></p>
<h1><a href="http://peter-hoffmann.com/2022/scaling-aware-rating-of-count-forecasts.html">Scaling-aware rating of count forecasts</a> (2022-12-01)</h1>
<p>Our ML group at <a href="https://blueyonder.com">Blue Yonder</a> around Malte Tichy
has published a new paper:</p>
<blockquote><p>Granular forecasts in the regime of low count rates - as they often occur in
retail, for which an intermittent demand of a handful might be observed per
product, day, and location - are dominated by the inevitable statistical
uncertainty of the Poisson distribution. This makes it hard to judge whether a
certain metric value is dominated by Poisson noise or truly indicates a bad
prediction model. To make things worse, every evaluation metric suffers from
scaling: Its value is mostly defined by the predicted selling rate and the
resulting rate-dependent Poisson noise, and only secondarily by the quality of
the forecast. For any metric, comparing two groups of forecasted products often
yields "the slow movers are performing worse than the fast movers" or vice versa -
the naïve scaling trap. To distill the intrinsic quality of a forecast, we
stratify predictions into buckets of approximately equal rate and evaluate
metrics for each bucket separately. By comparing the achieved value per bucket
to benchmarks, we obtain a scaling-aware rating of count forecasts. Our
procedure avoids the naïve scaling trap, provides an immediate intuitive
judgment of forecast quality, and allows to compare forecasts for different
products or even industries. <a href="https://arxiv.org/abs/2211.16313">https://arxiv.org/abs/2211.16313</a></p>
</blockquote>
<p>Malte will also talk about this at <a href="https://global2022.pydata.org/cfp/talk/3HJRZM/">PyData Global</a>:</p>
<blockquote><p>Meaningful probabilistic models do not only produce a “best guess” for the
target, but also convey their uncertainty, i.e., a belief in how the target is
distributed around the predicted estimate. Business evaluation metrics such as
mean absolute error, a priori, neglect that unavoidable uncertainty. This talk
discusses why and how to account for uncertainty when evaluating models using
traditional business metrics, using python standard tooling. The resulting
uncertainty-aware model rating satisfies the requirements of statisticians
because it accounts for the probabilistic process that generates the target. It
should please practitioners because it is based on established business metrics.
It appeases executives because it allows concrete quantitative goals and
non-defensive judgements.</p>
</blockquote>
<h1><a href="http://peter-hoffmann.com/2022/snowflake-python-support.html">Python Support in Snowflake</a> (2022-11-16)</h1>
<p>The <a href="https://www.snowflake.com/en/">snowflake</a> cloud-based data storage and
analytics service has python capabilities as part of its offering. Snowpark
offers native python integration into snowflake's execution engine, so python can
be used to extend, call and trigger data pipelines inside the managed virtual
warehouse infrastructure.</p>
<p>The snowpark python integration offers three ways to interact with python inside
snowflake:</p>
<p><strong>Python User defined functions:</strong> A user defined function in Snowflake is called as
part of a SQL statement to extend functionality that is not part of the
standard SQL interface. To address the performance issues that come with
row-wise execution, snowpark also offers a vectorized mini-batch interface for
user defined functions.</p>
<p><strong>Python Stored Procedures:</strong> Stored procedures in Snowflake are called as
independent statements; you cannot call a stored procedure as part of an
expression. A stored procedure can return a value, but this cannot be passed
to another operation. It is possible to execute multiple statements within a
stored procedure.</p>
<p><strong>Snowpark Python Dataframe API:</strong> A dataframe/pyspark like API to query snowflake
data and execute data pipelines. Snowflake transparently transforms the
dataframe statements to SQL expressions at execution time and heavily benefits
from the SQL query optimizer.</p>
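<p>As a rough sketch of what the dataframe API looks like (the connection parameters and table name below are placeholders, not from this post), a small pipeline might read:</p>
<pre class="highlight"><code>from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder credentials -- fill in your own account details.
session = Session.builder.configs({
    "account": "my_account",
    "user": "my_user",
    "password": "my_password",
    "warehouse": "my_wh",
    "database": "my_db",
    "schema": "public",
}).create()

# The chained dataframe operations are translated into a single SQL
# statement that runs inside the Snowflake warehouse.
df = (
    session.table("sample_product_data")
    .filter(col("category_id") == 10)
    .select("name", "serial_number")
    .limit(10)
)
print(df.collect())
</code></pre>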
<h2>Creating a scalar user defined function in python</h2>
<p>You can define <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-creating.html">user defined python functions</a> and call them like normal sql
functions from snowflake. These UDFs are scalar functions where each row is passed
into the udf and a single value is returned. Compared to built-in sql functions
or UDFs in javascript the runtime performance is rather poor, as snowflake has
to convert every value into a python type and do the same on the output side.
For performance critical statements, snowflake offers a batch UDF api that works
with pandas dataframes (see below).</p>
<p>Still, python scalar UDFs are incredibly useful if you want to extend
your sql statements with the power of python code.</p>
<pre class="highlight"><code><span class="n">CREATE</span> <span class="n">OR</span> <span class="n">REPLACE</span> <span class="n">FUNCTION</span> <span class="n">sizeof_fmt</span><span class="p">(</span><span class="n">val</span> <span class="n">number</span><span class="p">)</span>
<span class="n">returns</span> <span class="n">text</span>
<span class="n">language</span> <span class="n">python</span>
<span class="n">runtime_version</span> <span class="o">=</span> <span class="mf">3.8</span>
<span class="n">handler</span> <span class="o">=</span> <span class="s1">'fn'</span>
<span class="n">AS</span>
<span class="err">$$</span>
<span class="k">def</span> <span class="nf">fn</span><span class="p">(</span><span class="n">val</span><span class="p">):</span>
<span class="k">for</span> <span class="n">unit</span> <span class="ow">in</span> <span class="p">[</span><span class="s1">''</span><span class="p">,</span><span class="s1">'Ki'</span><span class="p">,</span><span class="s1">'Mi'</span><span class="p">,</span><span class="s1">'Gi'</span><span class="p">,</span><span class="s1">'Ti'</span><span class="p">,</span><span class="s1">'Pi'</span><span class="p">,</span><span class="s1">'Ei'</span><span class="p">,</span><span class="s1">'Zi'</span><span class="p">]:</span>
<span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">val</span><span class="p">)</span> <span class="o"><</span> <span class="mf">1024.0</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"</span><span class="si">{:3.1f}{}</span><span class="s2">B"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="n">unit</span><span class="p">)</span>
<span class="n">val</span> <span class="o">/=</span> <span class="mf">1024.0</span>
<span class="k">return</span> <span class="s2">"</span><span class="si">{:.1f}{}</span><span class="s2">B"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="s1">'Yi'</span><span class="p">)</span>
<span class="err">$$</span>
<span class="p">;</span>
</code></pre><p>The example below calculates a human readable version for large numbers and
uses it within the query to get the database sizes from the information schema:</p>
<pre class="highlight"><code><span class="n">select</span>
<span class="n">usage_date</span><span class="p">,</span>
<span class="n">database_name</span><span class="p">,</span>
<span class="n">average_database_bytes</span><span class="p">,</span>
<span class="n">sizeof_fmt</span><span class="p">(</span><span class="n">average_database_bytes</span><span class="p">)</span>
<span class="kn">from</span>
<span class="nn">table</span><span class="p">(</span><span class="n">snowflake</span><span class="o">.</span><span class="n">information_schema</span><span class="o">.</span><span class="n">database_storage_usage_history</span><span class="p">(</span>
<span class="n">dateadd</span><span class="p">(</span><span class="s1">'days'</span><span class="p">,</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span><span class="n">current_date</span><span class="p">()),</span><span class="n">current_date</span><span class="p">()));</span>
</code></pre><p>This then gives us a nice representation as a resultset:</p>
<p><img src="/static/2022/snowflake-function.png" alt=""></p>
<h2>Creating a user defined function with the python UDF batch API</h2>
<p>The <a href="https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html">Python UDF Batch API</a>
offers a much more performant way to process batches of rows. This is achieved
by exposing an interface that works directly on Pandas DataFrames or numpy arrays.</p>
<p>The following example is a very trivial one that just uses arithmetic in pandas. In a follow-up blog post
we will use this functionality to do online scoring with a logistic regression from sklearn.</p>
<pre class="highlight"><code><span class="n">create</span> <span class="n">function</span> <span class="n">add_one_to_inputs</span><span class="p">(</span><span class="n">x</span> <span class="n">number</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">y</span> <span class="n">number</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="n">returns</span> <span class="n">number</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">language</span> <span class="n">python</span>
<span class="n">runtime_version</span> <span class="o">=</span> <span class="mf">3.8</span>
<span class="n">packages</span> <span class="o">=</span> <span class="p">(</span><span class="s1">'pandas'</span><span class="p">)</span>
<span class="n">handler</span> <span class="o">=</span> <span class="s1">'add_one_to_inputs'</span>
<span class="k">as</span> <span class="err">$$</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="kn">from</span> <span class="nn">_snowflake</span> <span class="kn">import</span> <span class="n">vectorized</span>
<span class="nd">@vectorized</span><span class="p">(</span><span class="nb">input</span><span class="o">=</span><span class="n">pandas</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">max_batch_size</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">add_one_to_inputs</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="k">return</span> <span class="n">df</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
<span class="err">$$</span><span class="p">;</span>
</code></pre><p>The pandas user defined function can then be used as usual within sql statements:</p>
<pre class="highlight"><code><span class="k">with</span> <span class="n">features</span> <span class="k">as</span> <span class="p">(</span>
<span class="n">select</span>
<span class="n">row_number</span><span class="p">()</span> <span class="n">over</span> <span class="p">(</span><span class="n">order</span> <span class="n">by</span> <span class="n">false</span><span class="p">)</span> <span class="k">as</span> <span class="n">a</span><span class="p">,</span>
<span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">row_number</span><span class="p">()</span> <span class="n">over</span> <span class="p">(</span><span class="n">order</span> <span class="n">by</span> <span class="n">false</span><span class="p">))</span> <span class="k">as</span> <span class="n">b</span><span class="p">,</span>
<span class="n">uniform</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="n">random</span><span class="p">())</span> <span class="k">as</span> <span class="n">c</span>
<span class="kn">from</span> <span class="nn">table</span><span class="p">(</span><span class="n">generator</span><span class="p">(</span><span class="n">rowcount</span> <span class="o">=></span> <span class="mi">10</span><span class="p">))</span>
<span class="p">)</span>
<span class="n">select</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">add_one_to_inputs</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="kn">from</span> <span class="nn">features</span><span class="p">;</span>
</code></pre><p><img src="/static/2022/snowflake-pandas-udf.png" alt=""></p>
<h2>Python Stored Procedures</h2>
<p><a href="https://docs.snowflake.com/en/sql-reference/stored-procedures-python.html">Stored procedures in Snowflake</a>
are called as independent statements; you cannot call a stored procedure as
part of an expression. A stored procedure can return a value, but this cannot
be passed to another operation.</p>
<p>It's possible to execute multiple statements within a stored procedure. Inside
a stored procedure you have access to the same session object as within the
<a href="https://docs.snowflake.com/en/developer-guide/snowpark/index.html">python snowpark api</a>.</p>
<p>The session object is passed implicitly into the execution function.</p>
<pre class="highlight"><code><span class="n">CREATE</span> <span class="n">OR</span> <span class="n">REPLACE</span> <span class="n">PROCEDURE</span> <span class="n">MYPROC</span><span class="p">()</span>
<span class="n">RETURNS</span> <span class="n">STRING</span>
<span class="n">LANGUAGE</span> <span class="n">PYTHON</span>
<span class="n">RUNTIME_VERSION</span> <span class="o">=</span> <span class="s1">'3.8'</span>
<span class="n">PACKAGES</span> <span class="o">=</span> <span class="p">(</span><span class="s1">'snowflake-snowpark-python'</span><span class="p">)</span>
<span class="n">HANDLER</span> <span class="o">=</span> <span class="s1">'run'</span>
<span class="n">AS</span>
<span class="err">$$</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">session</span><span class="p">):</span>
<span class="n">stm</span> <span class="o">=</span> <span class="s1">'CREATE OR REPLACE TABLE sample_product_data (id INT, parent_id INT, category_id INT, name VARCHAR, serial_number VARCHAR, key INT, "3rd" INT)'</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">session</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="n">stm</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">res</span><span class="p">)</span>
<span class="err">$$</span><span class="p">;</span>
</code></pre><p>It is also possible to execute multiple statements within the stored procedure,
which makes it useful for db maintenance tasks and similar jobs.</p>
<p>The stored procedure can be called as any other native stored procedure:</p>
<pre class="highlight"><code><span class="n">call</span> <span class="n">MYPROC</span><span class="p">();</span>
</code></pre><p>It will create the sample_product_data table and yield the following output:</p>
<p><img src="/static/2022/snowflake-stored-procedure.png" alt=""></p>
<h1><a href="http://peter-hoffmann.com/2021/convert-the-himalayan-database-to-sqlite.html">Convert the Himalayan Database to SQLite</a> (2021-01-10)</h1>
<p>The Himalayan database is a record of expeditions in the Nepalese Himalayas
and a unique source of knowledge about the history of Himalayan
mountaineering. The database is based on the expedition archives of Elizabeth
Hawley, a longtime journalist based in Kathmandu, and it is supplemented by
information gathered from books, alpine journals and correspondence with
Himalayan climbers. The records go back to 1903.</p>
<p>The database was maintained by the legendary Elizabeth Hawley from Kathmandu
until her retirement. If you are interested in some more details of the
fascinating life of Elizabeth Hawley, I can recommend the book <a href="">I'll
Call You in Kathmandu: The Elizabeth Hawley Story</a> by <a href="">Bernadette
McDonald</a> about the early days of Himalaya expeditions and her life in
Kathmandu.</p>
<p>In 2017, a new non-profit organization (<a href="">The Himalayan Database</a>) was
established to continue the work of Elizabeth Hawley who retired in 2016.
Elizabeth's long-term assistant <a href="">Billi Bierling</a> has taken over the
role of Managing Director and continues to maintain and update the
database with a team of record collectors in Kathmandu and around the world.
As a result, version 2 of the Himalayan Database has now been released to the
general public at no charge via Internet download.</p>
<p>The Himalayan database is a FoxPro application developed and maintained by Richard Salisbury who
worked as a computer programmer at the University of Michigan and travelled to Nepal more than 50
times for trekking and expeditions. In 1991, after his encounter with Elizabeth Hawley, they
started to digitize Elizabeth's notes and created the first version of the Himalayan database.</p>
<p>While there is documentation available on how to run the FoxPro
application with <a href="">crossover</a> on OSX, I was more interested in
directly querying the contents from python, so I have written a small tool to
convert it to a <a href="">SQLite</a> database.</p>
<p>The current Himalayan database (version 2.3 with Autumn 2019-Winter 2019-Spring 2020
update) can be downloaded from <a href="">Himalayan database download page</a>.</p>
<pre class="highlight"><code><span class="err">$</span> <span class="n">mkdir</span> <span class="n">download</span>
<span class="err">$</span> <span class="n">wget</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">www</span><span class="o">.</span><span class="n">himalayandatabase</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">downloads</span><span class="o">/</span><span class="n">Himalayan</span><span class="o">%</span><span class="mi">20</span><span class="n">Database</span><span class="o">.</span><span class="n">zip</span> <span class="o">-</span><span class="n">O</span> <span class="n">download</span><span class="o">/</span><span class="n">Himalayan_Database</span><span class="o">.</span><span class="n">zip</span>
<span class="err">$</span> <span class="n">unzip</span> <span class="n">download</span><span class="o">/</span><span class="n">Himalayan_Database</span><span class="o">.</span><span class="n">zip</span> <span class="o">-</span><span class="n">d</span> <span class="n">download</span><span class="o">/</span>
</code></pre><p>The zip file includes the application to run the FoxPro version, and the <code>HIMDATA</code> folder includes the necessary
database <code>.DBF</code> files.</p>
<pre class="highlight"><code><span class="err">$</span> <span class="n">tree</span> <span class="n">download</span><span class="o">/</span><span class="n">Himalayan</span>\ <span class="n">Database</span><span class="o">/</span>
<span class="n">download</span><span class="o">/</span><span class="n">Himalayan</span>\ <span class="n">Database</span><span class="o">/</span>
<span class="err">├──</span> <span class="n">HIMDATA</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">FILTERS</span><span class="o">.</span><span class="n">FPT</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">SETUP</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">exped</span><span class="o">.</span><span class="n">CDX</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">exped</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">exped</span><span class="o">.</span><span class="n">FPT</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">filters</span><span class="o">.</span><span class="n">CDX</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">filters</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">members</span><span class="o">.</span><span class="n">CDX</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">members</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">members</span><span class="o">.</span><span class="n">FPT</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">peaks</span><span class="o">.</span><span class="n">CDX</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">peaks</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">peaks</span><span class="o">.</span><span class="n">FPT</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">refer</span><span class="o">.</span><span class="n">CDX</span>
<span class="err">│</span> <span class="err">├──</span> <span class="n">refer</span><span class="o">.</span><span class="n">DBF</span>
<span class="err">│</span> <span class="err">└──</span> <span class="n">refer</span><span class="o">.</span><span class="n">FPT</span>
<span class="err">├──</span> <span class="n">Himal</span>\ <span class="mf">2.3</span><span class="o">.</span><span class="n">exe</span>
<span class="err">├──</span> <span class="n">MSVCR71</span><span class="o">.</span><span class="n">DLL</span>
<span class="err">├──</span> <span class="n">VFP9R</span><span class="o">.</span><span class="n">DLL</span>
<span class="err">└──</span> <span class="n">VFP9RENU</span><span class="o">.</span><span class="n">DLL</span>
</code></pre><p>The following script uses the python library <a href="https://dbfread.readthedocs.io/en/latest/">dbfread</a> to access the
dbf file format and convert it to a more convenient sqlite database. To run
the script you need to install it with <code>pip install dbfread</code> into your
virtual environment.</p>
<pre class="highlight"><code><span class="ch">#!/usr/bin/env python</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="kn">from</span> <span class="nn">dbfread</span> <span class="kn">import</span> <span class="n">DBF</span>
<span class="k">def</span> <span class="nf">get_fields</span><span class="p">(</span><span class="n">table</span><span class="p">):</span>
<span class="sd">"""get the fields and sqlite types for a dbf table"""</span>
<span class="n">typemap</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"F"</span><span class="p">:</span> <span class="s2">"FLOAT"</span><span class="p">,</span>
<span class="s2">"L"</span><span class="p">:</span> <span class="s2">"BOOLEAN"</span><span class="p">,</span>
<span class="s2">"I"</span><span class="p">:</span> <span class="s2">"INTEGER"</span><span class="p">,</span>
<span class="s2">"C"</span><span class="p">:</span> <span class="s2">"TEXT"</span><span class="p">,</span>
<span class="s2">"N"</span><span class="p">:</span> <span class="s2">"REAL"</span><span class="p">,</span> <span class="c1"># because it can be integer or float</span>
<span class="s2">"M"</span><span class="p">:</span> <span class="s2">"TEXT"</span><span class="p">,</span>
<span class="s2">"D"</span><span class="p">:</span> <span class="s2">"DATE"</span><span class="p">,</span>
<span class="s2">"T"</span><span class="p">:</span> <span class="s2">"DATETIME"</span><span class="p">,</span>
<span class="s2">"0"</span><span class="p">:</span> <span class="s2">"INTEGER"</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">fields</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">table</span><span class="o">.</span><span class="n">fields</span><span class="p">:</span>
<span class="n">fields</span><span class="p">[</span><span class="n">f</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">typemap</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">type</span><span class="p">,</span> <span class="s2">"TEXT"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">fields</span>
<span class="k">def</span> <span class="nf">create_table_statement</span><span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">fields</span><span class="p">):</span>
<span class="n">defs</span> <span class="o">=</span> <span class="s2">", "</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="s1">'"</span><span class="si">%s</span><span class="s1">" </span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">fname</span><span class="p">,</span> <span class="n">ftype</span><span class="p">)</span> <span class="k">for</span> <span class="p">(</span><span class="n">fname</span><span class="p">,</span> <span class="n">ftype</span><span class="p">)</span> <span class="ow">in</span> <span class="n">fields</span><span class="o">.</span><span class="n">items</span><span class="p">()])</span>
<span class="n">sql</span> <span class="o">=</span> <span class="s1">'create table "</span><span class="si">%s</span><span class="s1">" (</span><span class="si">%s</span><span class="s1">)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">defs</span><span class="p">)</span>
<span class="k">return</span> <span class="n">sql</span>
<span class="k">def</span> <span class="nf">insert_table_statement</span><span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">fields</span><span class="p">):</span>
<span class="n">refs</span> <span class="o">=</span> <span class="s2">", "</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="s2">":"</span> <span class="o">+</span> <span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">fields</span><span class="o">.</span><span class="n">keys</span><span class="p">()])</span>
<span class="n">sql</span> <span class="o">=</span> <span class="s1">'insert into "</span><span class="si">%s</span><span class="s1">" values (</span><span class="si">%s</span><span class="s1">)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">table_name</span><span class="p">,</span> <span class="n">refs</span><span class="p">)</span>
<span class="k">return</span> <span class="n">sql</span>
<span class="k">def</span> <span class="nf">copy_table</span><span class="p">(</span><span class="n">cursor</span><span class="p">,</span> <span class="n">table</span><span class="p">):</span>
<span class="sd">"""Add a dbase table to an open sqlite database"""</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"drop table if exists </span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="n">fields</span> <span class="o">=</span> <span class="n">get_fields</span><span class="p">(</span><span class="n">table</span><span class="p">)</span>
<span class="n">sql</span> <span class="o">=</span> <span class="n">create_table_statement</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">fields</span><span class="p">)</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span>
<span class="n">sql</span> <span class="o">=</span> <span class="n">insert_table_statement</span><span class="p">(</span><span class="n">table</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">fields</span><span class="p">)</span>
<span class="k">for</span> <span class="n">rec</span> <span class="ow">in</span> <span class="n">table</span><span class="p">:</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span> <span class="nb">list</span><span class="p">(</span><span class="n">rec</span><span class="o">.</span><span class="n">values</span><span class="p">()))</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">output_file</span> <span class="o">=</span> <span class="s2">"himalayan_database.sqlite"</span>
<span class="n">tables</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"exped"</span><span class="p">,</span> <span class="s2">"members"</span><span class="p">,</span> <span class="s2">"peaks"</span><span class="p">,</span> <span class="s2">"refer"</span><span class="p">]</span>
<span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">output_file</span><span class="p">)</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">for</span> <span class="n">table_name</span> <span class="ow">in</span> <span class="n">tables</span><span class="p">:</span>
<span class="n">table_file</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"download/Himalayan Database/HIMDATA/</span><span class="si">{table_name}</span><span class="s2">.DBF"</span>
<span class="n">dbf_table</span> <span class="o">=</span> <span class="n">DBF</span><span class="p">(</span>
<span class="n">table_file</span><span class="p">,</span> <span class="n">lowernames</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">char_decode_errors</span><span class="o">=</span><span class="s2">"strict"</span>
<span class="p">)</span>
<span class="n">copy_table</span><span class="p">(</span><span class="n">cursor</span><span class="p">,</span> <span class="n">dbf_table</span><span class="p">)</span>
<span class="n">conn</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre><p>The database has four tables (a quick way to inspect their columns is sketched after the list below):</p>
<p><img src="/static/2021/himalayan-database-schema.png" alt=""></p>
<ul>
<li><p>The <strong>peaks</strong> table has one record for each mountaineering peak of Nepal.</p>
</li>
<li><p>The <strong>exped</strong> table has one record describing each of the climbing expeditions.</p>
</li>
<li><p>The <strong>members</strong> table describes each of the members on the climbing team and hired personnel who were significantly involved in the expedition, one record for each member.</p>
</li>
<li><p>The <strong>refer</strong> table describes the literature references for each expedition, primarily major books, journal and magazine articles, and website links, one record for each reference.</p>
</li>
</ul>
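<p>A quick way to double check what ended up in the converted database is to ask SQLite for the
column names of each table. This is a minimal sketch; it assumes the conversion script above has
already produced <code>himalayan_database.sqlite</code>:</p>
<pre class="highlight"><code>import sqlite3

# Sketch: list the columns of each converted table via PRAGMA table_info.
conn = sqlite3.connect("himalayan_database.sqlite")
for table_name in ["exped", "members", "peaks", "refer"]:
    columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
    # each row is (cid, name, type, notnull, dflt_value, pk)
    print(table_name, [col[1] for col in columns])
</code></pre>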
<p>You can now use <a href="https://sqlitebrowser.org/">DB Browser for SQLite</a> to inspect the data. The <a href="https://www.himalayandatabase.com/downloads/Appendix%20J%20-%20SQL%20Searches.pdf">Appendix
J: SQL Searches</a> of the Himalayan Database documentation gives some ideas for
interesting queries on the data (in FoxPro SQL syntax). In a follow-up blog
post I'm going to describe the database schema and field contents in a bit
more detail and also show some insights into historic Himalayan
expeditions.</p>
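<p>As a small appetizer in the spirit of Appendix J, the sketch below counts expeditions per peak
and year directly on the converted SQLite file. The column names <code>peakid</code> and
<code>year</code> are assumptions based on the Himalayan Database field naming and may differ in
your download:</p>
<pre class="highlight"><code>import sqlite3

# Sketch: top ten peak/year combinations by number of expeditions.
# peakid and year are assumed field names and may need adjusting.
conn = sqlite3.connect("himalayan_database.sqlite")
sql = """
    SELECT peakid, year, COUNT(*) AS expeditions
    FROM exped
    GROUP BY peakid, year
    ORDER BY expeditions DESC
    LIMIT 10
"""
for row in conn.execute(sql):
    print(row)
</code></pre>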
<p><img src="/static/2021/sqlitebrowser.png" alt=""></p>
Azure Synapse SQL-on-Demand Openrowset Common Table Expression with SQLAlchemyhttp://peter-hoffmann.com/2020/azure-synapse-sql-on-demand-openrowset-common-table-expression-with-sqlalchemy.html2020-09-27T00:00:00Z2020-09-27T00:00:00ZPeter Hoffmannhttp://peter-hoffmann.com<p>In a previous post I have shown how to use <a href="http://peter-hoffmann.com/2020/clients-and-data-access-with-turbodbc-to-azure-synapse-sql-on-demand.html">turbodbc to access Azure Synapse SQL-on-Demand endpoints</a>. A common pattern is to use the <a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset">openrowset</a> function to query parquet data from an external data source like the azure blob storage:</p>
<pre class="highlight"><code><span class="n">select</span>
<span class="n">result</span><span class="o">.</span><span class="n">filepath</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">c_date</span><span class="p">],</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/c_date=*/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">sales_euro</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="n">where</span> <span class="n">c_date</span><span class="o">=</span><span class="s1">'2020-09-01'</span>
</code></pre><p><a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-ctas?toc=/azure/synapse-analytics/toc.json&bc=/azure/synapse-analytics/breadcrumb/toc.json">Common table expressions</a> help to make the sql code more readable, especially if more than one external data source is queried. Once you have defined the CTE statements at the top you can use them like normal tables
inside your queries:</p>
<pre class="highlight"><code><span class="n">WITH</span> <span class="n">location</span> <span class="n">AS</span>
<span class="p">(</span><span class="n">SELECT</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">l_name</span><span class="p">]</span> <span class="n">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
<span class="p">[</span><span class="n">latitude</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">[</span><span class="n">longitude</span><span class="p">]</span> <span class="nb">float</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="p">),</span>
<span class="n">sales</span> <span class="n">AS</span>
<span class="p">(</span><span class="n">SELECT</span>
<span class="n">result</span><span class="o">.</span><span class="n">filepath</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">c_date</span><span class="p">],</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/c_date=*/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">sales_euro</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">SELECT</span> <span class="n">location</span><span class="o">.</span><span class="n">l_id</span><span class="p">,</span> <span class="n">sales</span><span class="o">.</span><span class="n">sales_euro</span>
<span class="n">FROM</span> <span class="n">sales</span> <span class="n">JOIN</span> <span class="n">location</span> <span class="n">ON</span> <span class="n">sales</span><span class="o">.</span><span class="n">l_id</span> <span class="o">=</span> <span class="n">location</span><span class="o">.</span><span class="n">l_id</span>
<span class="n">where</span> <span class="n">c_date</span> <span class="o">=</span> <span class="s1">'2020-01-01'</span>
</code></pre><p>Still, writing such queries in data pipelines soon becomes cumbersome and error prone. So once we moved from
writing the queries in the Azure Synapse Workbench to using them in our daily workflows with Python, we wanted
a better way to programmatically generate the SQL statements.</p>
<p><a href="https://www.sqlalchemy.org">SQLAlchemy</a> is still our library of choice to work with SQL in python. SQLAlchemy
already has support for <a href="https://docs.sqlalchemy.org/en/13/dialects/mssql.html">Microsoft SQL Server</a> so
most of the Azure Synapse SQL-on-Demand features are covered. I have not yet found a native way
to work with <code>openrowset</code> queries, but it's quite easy to use the <code>text()</code> feature to inject the missing statement</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlalchemy</span> <span class="k">as</span> <span class="nn">sa</span>
<span class="n">cte_location_raw</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">*</span>
<span class="s1">FROM</span>
<span class="s1"> OPENROWSET(</span>
<span class="s1"> BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',</span>
<span class="s1"> FORMAT='PARQUET'</span>
<span class="s1"> ) with(</span>
<span class="s1"> [l_id] bigint,</span>
<span class="s1"> [l_name] varchar(100),</span>
<span class="s1"> [latitude] float,</span>
<span class="s1"> [longitude] float</span>
<span class="s1"> ) as [result]</span>
<span class="s1">'''</span>
<span class="n">cte</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">sa</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">cte_location_raw</span><span class="p">)])</span><span class="o">.</span><span class="n">cte</span><span class="p">(</span><span class="s1">'location'</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">sa</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'l_id'</span><span class="p">),</span> <span class="n">sa</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'l_code'</span><span class="p">),</span> <span class="n">sa</span><span class="o">.</span><span class="n">column</span><span class="p">(</span><span class="s1">'l_name'</span><span class="p">)])</span><span class="o">.</span><span class="n">select_from</span><span class="p">(</span><span class="n">cte</span><span class="p">)</span>
</code></pre><p>The <a href="https://docs.sqlalchemy.org/en/13/core/selectable.html#sqlalchemy.sql.expression.cte">cte</a> returns a
Common Table Expression instance which is a subclass of the BaseSelect SELECT statement and can be used
as such in other statements to generate the following code:</p>
<pre class="highlight"><code><span class="n">WITH</span> <span class="n">location</span> <span class="n">AS</span>
<span class="p">(</span><span class="n">SELECT</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">l_name</span><span class="p">]</span> <span class="n">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
<span class="p">[</span><span class="n">latitude</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">[</span><span class="n">longitude</span><span class="p">]</span> <span class="nb">float</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">SELECT</span> <span class="n">l_id</span><span class="p">,</span> <span class="n">l_code</span><span class="p">,</span> <span class="n">l_name</span> <span class="n">FROM</span> <span class="n">location</span>
</code></pre><p>The CTE statement does not know about its columns because it only gets passed the raw SQL text. But you can
annotate the <code>sa.text</code> statement with a <code>typemap</code> dictionary so that it exposes
which columns are available from the statement. By annotating the CTE we can later use the <code>table.c.column</code> notation
to reference the columns instead of using <code>sa.column('l_code')</code> as above.</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlalchemy</span> <span class="k">as</span> <span class="nn">sa</span>
<span class="n">cte_location_raw</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">*</span>
<span class="s1">FROM</span>
<span class="s1"> OPENROWSET(</span>
<span class="s1"> BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',</span>
<span class="s1"> FORMAT='PARQUET'</span>
<span class="s1"> ) with(</span>
<span class="s1"> [l_id] bigint,</span>
<span class="s1"> [l_name] varchar(100),</span>
<span class="s1"> [latitude] float,</span>
<span class="s1"> [longitude] float</span>
<span class="s1"> ) as [result]</span>
<span class="s1">'''</span>
<span class="n">typemap</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"l_id"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Integer</span><span class="p">,</span> <span class="s2">"l_code"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">String</span><span class="p">,</span> <span class="s2">"l_name"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">String</span><span class="p">,</span> <span class="s2">"latitude"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Float</span><span class="p">,</span> <span class="s2">"longitude"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Float</span><span class="p">}</span>
<span class="n">cte</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">sa</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">cte_location_raw</span><span class="p">,</span> <span class="n">typemap</span><span class="o">=</span><span class="n">typemap</span><span class="p">)])</span><span class="o">.</span><span class="n">cte</span><span class="p">(</span><span class="s1">'location'</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">cte</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_id</span><span class="p">,</span> <span class="n">cte</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_name</span><span class="p">])</span><span class="o">.</span><span class="n">select_from</span><span class="p">(</span><span class="n">cte</span><span class="p">)</span>
</code></pre><p>So, putting everything together, you can define and test your CTEs in Python:</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">sqlalchemy</span> <span class="k">as</span> <span class="nn">sa</span>
<span class="n">cte_sales_raw</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">SELECT</span>
<span class="s1"> result.filepath(1) as [c_date],</span>
<span class="s1"> *</span>
<span class="s1">FROM</span>
<span class="s1"> OPENROWSET(</span>
<span class="s1"> BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/*.parquet',</span>
<span class="s1"> FORMAT='PARQUET'</span>
<span class="s1"> ) with(</span>
<span class="s1"> [l_id] bigint,</span>
<span class="s1"> [sales_euro] float,</span>
<span class="s1"> ) as [result]</span>
<span class="s1">'''</span>
<span class="n">cte_location_raw</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">SELECT</span>
<span class="s1"> *</span>
<span class="s1">FROM</span>
<span class="s1"> OPENROWSET(</span>
<span class="s1"> BULK 'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet',</span>
<span class="s1"> FORMAT='PARQUET'</span>
<span class="s1"> ) with(</span>
<span class="s1"> [l_id] bigint,</span>
<span class="s1"> [l_name] varchar(100),</span>
<span class="s1"> [latitude] float,</span>
<span class="s1"> [longitude] float</span>
<span class="s1"> ) as [result]</span>
<span class="s1">'''</span>
<span class="n">typemap_location</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"l_id"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Integer</span><span class="p">,</span> <span class="s2">"l_name"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">String</span><span class="p">,</span> <span class="s2">"latitude"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Float</span><span class="p">,</span> <span class="s2">"longitude"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Float</span><span class="p">}</span>
<span class="n">location</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">sa</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">cte_location_raw</span><span class="p">,</span> <span class="n">typemap</span><span class="o">=</span><span class="n">typemap_location</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"tmp1"</span><span class="p">)])</span><span class="o">.</span><span class="n">cte</span><span class="p">(</span><span class="s1">'location'</span><span class="p">)</span>
<span class="n">typemap_sales</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"l_id"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Integer</span><span class="p">,</span> <span class="s2">"c_date"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Date</span><span class="p">,</span> <span class="s2">"sales_euro"</span><span class="p">:</span> <span class="n">sa</span><span class="o">.</span><span class="n">Float</span><span class="p">}</span>
<span class="n">sales</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">([</span><span class="n">sa</span><span class="o">.</span><span class="n">text</span><span class="p">(</span><span class="n">cte_sales_raw</span><span class="p">,</span> <span class="n">typemap</span><span class="o">=</span><span class="n">typemap_sales</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"tmp2"</span><span class="p">)])</span><span class="o">.</span><span class="n">cte</span><span class="p">(</span><span class="s1">'sales'</span><span class="p">)</span>
</code></pre><p>and then compose more complex statements just as with any other SQLAlchemy table definitions:</p>
<pre class="highlight"><code><span class="n">cols</span> <span class="o">=</span> <span class="p">[</span><span class="n">sales</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">c_date</span><span class="p">,</span> <span class="n">sales</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_id</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_name</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">latitude</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">longitude</span><span class="p">]</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">sa</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span><span class="o">.</span><span class="n">select_from</span><span class="p">(</span><span class="n">sales</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">location</span><span class="p">,</span> <span class="n">sales</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_id</span> <span class="o">==</span> <span class="n">location</span><span class="o">.</span><span class="n">c</span><span class="o">.</span><span class="n">l_id</span> <span class="p">))</span>
</code></pre><p>In our production data pipelines at <a href="https://blueyonder.com">Blue Yonder</a> we typically provide the
building blocks to create complex queries in libraries that are maintained by a central team.
Testing smaller parts with SQLAlchemy works much better, and it's easier for data scientists to
plug them together and concentrate on the high-level model logic.</p>
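<p>A trick that helps with testing these building blocks is to render a composed query to T-SQL
without connecting to the endpoint. A minimal sketch, assuming <code>q</code> is the joined select
from the block above:</p>
<pre class="highlight"><code>from sqlalchemy.dialects import mssql

# Render the composed statement with the MSSQL dialect to inspect (or assert on)
# the generated T-SQL before running it against the SQL-on-Demand endpoint.
print(q.compile(dialect=mssql.dialect()))
</code></pre>
<p>This is also a convenient hook for unit tests: the generated SQL can be compared against an
expected statement without any network access.</p>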
<p>We like the power of Azure SQL-on-Demand, but managing and testing complex SQL statements is
still a challenge, as you can see from the statement generated by the code above. But at least
SQLAlchemy and Python make it easier:</p>
<pre class="highlight"><code><span class="n">WITH</span> <span class="n">sales</span> <span class="n">AS</span>
<span class="p">(</span><span class="n">SELECT</span> <span class="n">l_id</span> <span class="n">AS</span> <span class="n">l_id</span><span class="p">,</span> <span class="n">c_date</span> <span class="n">AS</span> <span class="n">c_date</span><span class="p">,</span> <span class="n">sales_euro</span> <span class="n">AS</span> <span class="n">sales_euro</span>
<span class="n">FROM</span> <span class="p">(</span>
<span class="n">SELECT</span>
<span class="n">result</span><span class="o">.</span><span class="n">filepath</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">c_date</span><span class="p">],</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/sales/table/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">sales_euro</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="p">)</span><span class="k">as</span> <span class="n">tmp1</span><span class="p">),</span>
<span class="n">location</span> <span class="n">AS</span>
<span class="p">(</span><span class="n">SELECT</span> <span class="n">l_id</span> <span class="n">AS</span> <span class="n">l_id</span><span class="p">,</span> <span class="n">l_name</span> <span class="n">AS</span> <span class="n">l_name</span><span class="p">,</span> <span class="n">latitude</span> <span class="n">AS</span> <span class="n">latitude</span><span class="p">,</span> <span class="n">longitude</span> <span class="n">AS</span> <span class="n">longitude</span>
<span class="n">FROM</span> <span class="p">(</span>
<span class="n">SELECT</span>
<span class="o">*</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://<storage_account>.dfs.core.windows.net/<filesystem>/location/table/*.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="k">with</span><span class="p">(</span>
<span class="p">[</span><span class="n">l_id</span><span class="p">]</span> <span class="n">bigint</span><span class="p">,</span>
<span class="p">[</span><span class="n">l_name</span><span class="p">]</span> <span class="n">varchar</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
<span class="p">[</span><span class="n">latitude</span><span class="p">]</span> <span class="nb">float</span><span class="p">,</span>
<span class="p">[</span><span class="n">longitude</span><span class="p">]</span> <span class="nb">float</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">result</span><span class="p">]</span>
<span class="p">)</span> <span class="k">as</span> <span class="n">tmp2</span><span class="p">)</span>
<span class="n">SELECT</span> <span class="n">sales</span><span class="o">.</span><span class="n">c_date</span><span class="p">,</span> <span class="n">sales</span><span class="o">.</span><span class="n">l_id</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">l_name</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">latitude</span><span class="p">,</span> <span class="n">location</span><span class="o">.</span><span class="n">longitude</span>
<span class="n">FROM</span> <span class="n">sales</span> <span class="n">JOIN</span> <span class="n">location</span> <span class="n">ON</span> <span class="n">sales</span><span class="o">.</span><span class="n">l_id</span> <span class="o">=</span> <span class="n">location</span><span class="o">.</span><span class="n">l_id</span>
</code></pre>Using turbodbc to access Azure Synapse SQL-on-Demand endpointshttp://peter-hoffmann.com/2020/clients-and-data-access-with-turbodbc-to-azure-synapse-sql-on-demand.html2020-05-25T00:00:00Z2020-05-25T00:00:00ZPeter Hoffmannhttp://peter-hoffmann.com<h1>ODBC access via turbodbc/python</h1>
<p><a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview">Azure Synapse SQL-on-Demand</a>
pools can be accessed though an odbc compatible client from python.</p>
<p>First you need to grant access to the SQL endpoint for an external DB user:</p>
<pre class="highlight"><code><span class="n">CREATE</span> <span class="n">LOGIN</span> <span class="n">testuser</span> <span class="n">WITH</span> <span class="n">password</span><span class="o">=</span><span class="s1">'xxx'</span><span class="p">;</span>
</code></pre><p>An SQL Analytics on-demand query reads files directly from Azure Storage. Since
the storage account is an object that is external to SQL Analytics on-demand,
appropriate credentials are required, and a user needs to be granted the
appropriate permissions to use the requisite credential.</p>
<p>Delegation of access to Azure Blob Storage accounts can be done with <a href="https://github.com/Azure/azure-synapse-analytics/blob/2e1e440d3ffd3007155b1658118779cbc1e59b73/sql-analytics/development-storage-files-storage-access-control.md">AAD pass-through or manually provided credentials</a>:</p>
<pre class="highlight"><code><span class="n">CREATE</span> <span class="n">CREDENTIAL</span> <span class="p">[</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">blob</span><span class="o">.</span><span class="n">dfs</span><span class="o">.</span><span class="n">core</span><span class="o">.</span><span class="n">windows</span><span class="o">.</span><span class="n">net</span><span class="o">/</span><span class="n">benchmark</span><span class="p">]</span>
<span class="n">WITH</span> <span class="n">IDENTITY</span><span class="o">=</span><span class="s1">'SHARED ACCESS SIGNATURE'</span>
<span class="p">,</span> <span class="n">SECRET</span> <span class="o">=</span> <span class="s1">'sv=2018-03-28xxxx'</span>
<span class="n">GO</span>
<span class="n">GRANT</span> <span class="n">REFERENCES</span> <span class="n">ON</span> <span class="n">CREDENTIAL</span><span class="p">::[</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">blob</span><span class="o">.</span><span class="n">dfs</span><span class="o">.</span><span class="n">core</span><span class="o">.</span><span class="n">windows</span><span class="o">.</span><span class="n">net</span><span class="o">/</span><span class="n">benchmark</span><span class="p">]</span> <span class="n">TO</span> <span class="p">[</span><span class="n">testuser</span><span class="p">];</span>
</code></pre><p>To connect to an Azure SQL-on-Demand endpoint you need to follow the installation instructions for the <a href="https://docs.microsoft.com/de-de/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15#debian17">ODBC driver for Debian</a>.</p>
<pre class="highlight"><code><span class="n">curl</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">packages</span><span class="o">.</span><span class="n">microsoft</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">keys</span><span class="o">/</span><span class="n">microsoft</span><span class="o">.</span><span class="n">asc</span> <span class="o">|</span> <span class="n">apt</span><span class="o">-</span><span class="n">key</span> <span class="n">add</span> <span class="o">-</span>
<span class="n">curl</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">packages</span><span class="o">.</span><span class="n">microsoft</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">config</span><span class="o">/</span><span class="n">debian</span><span class="o">/</span><span class="mi">9</span><span class="o">/</span><span class="n">prod</span><span class="o">.</span><span class="n">list</span> <span class="o">></span> <span class="o">/</span><span class="n">etc</span><span class="o">/</span><span class="n">apt</span><span class="o">/</span><span class="n">sources</span><span class="o">.</span><span class="n">list</span><span class="o">.</span><span class="n">d</span><span class="o">/</span><span class="n">mssql</span><span class="o">-</span><span class="n">release</span><span class="o">.</span><span class="n">list</span>
<span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">update</span>
<span class="n">ACCEPT_EULA</span><span class="o">=</span><span class="n">Y</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">install</span> <span class="n">msodbcsql17</span>
<span class="n">ACCEPT_EULA</span><span class="o">=</span><span class="n">Y</span> <span class="n">apt</span><span class="o">-</span><span class="n">get</span> <span class="n">install</span> <span class="n">mssql</span><span class="o">-</span><span class="n">tools</span>
<span class="n">echo</span> <span class="s1">'export PATH="$PATH:/opt/mssql-tools/bin"'</span> <span class="o">>></span> <span class="o">~/.</span><span class="n">bash_profile</span>
<span class="n">echo</span> <span class="s1">'export PATH="$PATH:/opt/mssql-tools/bin"'</span> <span class="o">>></span> <span class="o">~/.</span><span class="n">bashrc</span>
</code></pre><p>Now you can connect with <a href="https://turbodbc.readthedocs.io/en/latest/">turbodbc</a> to the SQL-on-demand pool to execute your queries:</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">turbodbc</span>
<span class="n">server</span> <span class="o">=</span><span class="s1">'mysynapse-ondemand.sql.azuresynapse.net'</span>
<span class="n">port</span> <span class="o">=</span> <span class="mi">1433</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"master"</span>
<span class="n">uid</span><span class="o">=</span><span class="s2">"testuser"</span>
<span class="n">pwd</span><span class="o">=</span><span class="s2">"xxx"</span>
<span class="n">con</span> <span class="o">=</span> <span class="n">turbodbc</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">driver</span><span class="o">=</span><span class="s1">'ODBC Driver 17 for SQL Server'</span><span class="p">,</span>
<span class="n">server</span><span class="o">=</span><span class="n">server</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="n">database</span><span class="p">,</span>
<span class="n">uid</span><span class="o">=</span><span class="n">uid</span><span class="p">,</span>
<span class="n">pwd</span><span class="o">=</span><span class="n">pwd</span><span class="p">)</span>
<span class="n">stm</span> <span class="o">=</span> <span class="s1">'''</span>
<span class="s1">SELECT</span>
<span class="s1"> TOP 100 *</span>
<span class="s1">FROM</span>
<span class="s1"> OPENROWSET(</span>
<span class="s1"> BULK 'https://blob.dfs.core.windows.net/benchmark/*/01.parquet',</span>
<span class="s1"> FORMAT='PARQUET'</span>
<span class="s1"> ) AS [r];</span>
<span class="s1">'''</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="n">cur</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">stm</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">cur</span><span class="o">.</span><span class="n">fetchall</span><span class="p">())</span>
</code></pre><p>You can also use all the PyArrow/Pandas features in turbodbc to efficiently run workflows for data-intensive machine learning applications.</p>
<pre class="highlight"><code><span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">stm</span><span class="p">)</span>
<span class="n">table</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="n">to_pandas</span><span class="p">()</span>
</code></pre><h1>Clients</h1>
<p>In addition to the ODBC interface, Azure offers a web client and a desktop client to run ad hoc queries on the SQL service.</p>
<h2>Azure Synapse Studio</h2>
<p>Azure Synapse Studio is the integrated web client to interact with an Azure Synapse
Workspace. It offers an online SQL script editor and a browser for Azure Blob Storage
accounts. By inspecting parquet files it can infer schemata and generate
CREATE EXTERNAL TABLE statements for parquet data in the storage accounts.</p>
<p>Access to the workspace is based on Azure managed identities (AAD). Permissions
can be granted on the SQL pools in the workspace. During creation of the
workspace one can grant the managed identity CONTROL permissions on the SQL
pools.</p>
<p><img src="/static/2020/azure-synapse-studio.png" alt="Azure Synapse Studio"></p>
<p>Azure Synapse Studio offers keyword completion, syntax highlighting and some
keyboard shortcuts. You can run on-demand SQL queries, view the results and save them
as a CSV export.</p>
<h2>Azure Data Studio</h2>
<p><a href="https://docs.microsoft.com/de-de/sql/azure-data-studio/">Azure Data Studio</a>
is a cross platform sql editor and database tool from Microsoft. It supports
connecting to a Azure Synapse SQL on Demand server through the managed azure
identities (AAD).</p>
<p><img src="/static/2020/azure-data-studio.png" alt="Azure Data Studio"></p>
<p>Azure Data Studio offers multiple tab windows, a rich SQL editor,
IntelliSense, keyword completion, code snippets, code navigation, and source
control integration (Git). It can run on-demand SQL queries and lets you view and save
the results as CSV, JSON, or Excel.</p>
DuckDB vs Azure Synapse SQL-on-Demand with parquethttp://peter-hoffmann.com/2020/duckdb-vs-azure-synapse-sql-on-demand.html2020-05-25T00:00:00Z2020-05-25T00:00:00ZPeter Hoffmannhttp://peter-hoffmann.com<p><strong>Short Disclaimer:</strong> This post is comparing apples to oranges, because
<a href="https://www.duckdb.org">DuckDB</a> is an embedded database designed to execute
analytical SQL queries on your local machine, whereas <a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview">Azure Synapse
SQL-on-Demand</a>
is Microsoft's new cloud offering for a serverless SQL-on-demand (preview)
endpoint to query data in the Azure Blob Store/Data Lake.</p>
<p>But in the end I was still triggered by <a href="https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html">Uwe Korn's post: Taking DuckDB for a
spin</a> to do a
comparison, because even if they are fundamentally different offerings, from
a high level both solutions offer a simple way to run analytical queries on
datasets without the need to install/manage a server.</p>
<h3>Load the data as parquet data</h3>
<p>To get started you need to convert the yellow tripdata sample data to a parquet
file and upload it to Azure Blob Storage:</p>
<pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">simplekv.net.azurestore</span> <span class="kn">import</span> <span class="n">AzureBlockBlobStore</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s2">"yellow_tripdata_2016-01.csv"</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span>
<span class="n">filename</span><span class="p">,</span>
<span class="n">dtype</span><span class="o">=</span><span class="p">{</span><span class="s2">"store_and_fwd_flag"</span><span class="p">:</span> <span class="s2">"bool"</span><span class="p">},</span>
<span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s2">"tpep_pickup_datetime"</span><span class="p">,</span> <span class="s2">"tpep_dropoff_datetime"</span><span class="p">],</span>
<span class="n">index_col</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">infer_datetime_format</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">true_values</span><span class="o">=</span><span class="p">[</span><span class="s2">"Y"</span><span class="p">],</span>
<span class="n">false_values</span><span class="o">=</span><span class="p">[</span><span class="s2">"N"</span><span class="p">],</span>
<span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="s2">"yellow_tripdata_2016-01.parquet"</span><span class="p">)</span>
<span class="n">conn_string</span> <span class="o">=</span> <span class="s1">'DefaultEndpointsProtocol=https;AccountName=blob;AccountKey=xxx;'</span>
<span class="n">store</span> <span class="o">=</span> <span class="n">AzureBlockBlobStore</span><span class="p">(</span><span class="n">conn_string</span><span class="o">=</span><span class="n">conn_string</span><span class="p">,</span> <span class="n">container</span><span class="o">=</span><span class="s1">'benchmark'</span><span class="p">,</span> <span class="n">public</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">store</span><span class="o">.</span><span class="n">putfile</span><span class="p">(</span><span class="s2">"yellow_tripdata_2016-01.parquet"</span><span class="p">,</span> <span class="s2">"yellow_tripdata_2016-01.parquet"</span><span class="p">)</span>
</code></pre><p>Once the data is available in the Azure Blob Storage/Data Lake, you can use the <a href="https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset">openrowset</a> function
to read the remote dataset in Azure SQL-on-Demand.</p>
<h2>Count Distinct</h2>
<p>The minimal count distinct example has to read all selected columns from the Azure Blob Storage once and execute the count on the data.</p>
<p><img src="/static/2020/azure-sql-on-demand.png" alt=""></p>
<p>The query takes roughly <code>4s</code>, which is not too bad and noticeably faster than the <code>5.58s</code> for DuckDB and the <code>25s</code> for SQLite.</p>
<h2>Frequency of events</h2>
<p>This type of query is a nice and simple way to test the aggregation performance.</p>
<pre class="highlight"><code><span class="n">SELECT</span>
<span class="n">MIN</span><span class="p">(</span><span class="n">cnt</span><span class="p">),</span>
<span class="n">AVG</span><span class="p">(</span><span class="n">cnt</span><span class="p">),</span>
<span class="o">--</span> <span class="n">MEDIAN</span><span class="p">(</span><span class="n">cnt</span><span class="p">),</span>
<span class="n">MAX</span><span class="p">(</span><span class="n">cnt</span><span class="p">)</span>
<span class="n">FROM</span>
<span class="p">(</span>
<span class="n">SELECT</span>
<span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">as</span> <span class="n">cnt</span>
<span class="n">FROM</span>
<span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="n">AS</span> <span class="p">[</span><span class="n">r</span><span class="p">]</span>
<span class="n">GROUP</span> <span class="n">BY</span>
<span class="n">DATEPART</span><span class="p">(</span><span class="n">day</span><span class="p">,</span> <span class="n">tpep_pickup_datetime</span><span class="p">),</span>
<span class="n">DATEPART</span><span class="p">(</span><span class="n">hour</span><span class="p">,</span> <span class="n">tpep_pickup_datetime</span><span class="p">)</span>
<span class="p">)</span> <span class="k">as</span> <span class="p">[</span><span class="n">stats</span><span class="p">];</span>
</code></pre><p>As the query only has to read a subset of the columns, it executes very fast, in under <code>1s</code>, which again beats DuckDB (<code>2.05s</code>) and SQLite (<code>10.2s</code>).</p>
<h2>Simple fare regression</h2>
<pre class="highlight"><code><span class="k">with</span> <span class="n">yellow_tripdata_2016_01</span>
<span class="k">as</span> <span class="p">(</span><span class="n">select</span> <span class="o">*</span>
<span class="n">FROM</span> <span class="n">OPENROWSET</span><span class="p">(</span>
<span class="n">BULK</span> <span class="s1">'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet'</span><span class="p">,</span>
<span class="n">FORMAT</span><span class="o">=</span><span class="s1">'PARQUET'</span>
<span class="p">)</span> <span class="n">AS</span> <span class="p">[</span><span class="n">r</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">SELECT</span>
<span class="p">(</span><span class="n">SUM</span><span class="p">(</span><span class="n">trip_distance</span> <span class="o">*</span> <span class="n">fare_amount</span><span class="p">)</span> <span class="o">-</span> <span class="n">SUM</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="o">*</span> <span class="n">SUM</span><span class="p">(</span><span class="n">fare_amount</span><span class="p">)</span> <span class="o">/</span> <span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">))</span> <span class="o">/</span>
<span class="p">(</span><span class="n">SUM</span><span class="p">(</span><span class="n">trip_distance</span> <span class="o">*</span> <span class="n">trip_distance</span><span class="p">)</span> <span class="o">-</span> <span class="n">SUM</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="o">*</span> <span class="n">SUM</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="o">/</span> <span class="n">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">))</span> <span class="n">AS</span> <span class="n">beta</span><span class="p">,</span>
<span class="n">AVG</span><span class="p">(</span><span class="n">fare_amount</span><span class="p">)</span> <span class="n">AS</span> <span class="n">avg_fare_amount</span><span class="p">,</span>
<span class="n">AVG</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="n">AS</span> <span class="n">avg_trip_distance</span>
<span class="n">FROM</span>
<span class="n">yellow_tripdata_2016_01</span><span class="p">,</span>
<span class="p">(</span>
<span class="n">SELECT</span>
<span class="n">AVG</span><span class="p">(</span><span class="n">fare_amount</span><span class="p">)</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">STDEV</span><span class="p">(</span><span class="n">fare_amount</span><span class="p">)</span> <span class="k">as</span> <span class="n">max_fare</span><span class="p">,</span>
<span class="n">AVG</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="o">+</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">STDEV</span><span class="p">(</span><span class="n">trip_distance</span><span class="p">)</span> <span class="k">as</span> <span class="n">max_distance</span>
<span class="n">FROM</span> <span class="n">yellow_tripdata_2016_01</span>
<span class="p">)</span> <span class="n">AS</span> <span class="p">[</span><span class="n">sub</span><span class="p">]</span>
<span class="n">WHERE</span>
<span class="n">fare_amount</span> <span class="o">></span> <span class="mi">0</span> <span class="n">AND</span>
<span class="n">fare_amount</span> <span class="o"><</span> <span class="n">sub</span><span class="o">.</span><span class="n">max_fare</span> <span class="n">AND</span>
<span class="n">trip_distance</span> <span class="o">></span> <span class="mi">0</span> <span class="n">AND</span>
<span class="n">trip_distance</span> <span class="o"><</span> <span class="n">sub</span><span class="o">.</span><span class="n">max_distance</span>
<span class="p">;</span>
</code></pre><p>The result is fetched again in less than <code>1s</code>, which is on par with DuckDB (<code>972 ms</code>) and beats SQLite (<code>11.7 s</code>).</p>
<h2>Conclusion</h2>
<p>The performance comparison result that SQL-on-Demand beats DuckDB is just
irrelevant, because the two database solutions are completely different.</p>
<p>What's really interesting is that while DuckDB provides the comfort of an
easily embeddable database on your local machine, Azure Synapse
SQL-on-Demand offers the same comfort for running T-SQL statements on parquet
files as a managed, elastic service in the cloud. You can just click your SQL
endpoint in Azure and start querying parquet files without the need to run or manage
a server, and you only pay for data scanned ($5 per TB). If you are doing
data engineering in the Azure Cloud this is a cool new tool in your toolbelt.</p>
<p>And one that scales horizontally. While the above examples have been done on
a tiny <code>1.7MB</code> parquet file, we have done some scaling tests at
<a href="https://blueyonder.com">Blue Yonder</a> up to terabytes of data and thousands of
parquet files. Stay tuned for a post on the results and an example of how to use <a href="https://turbodbc.readthedocs.io">turbodbc</a> as a high-performance interface for data pipelines...</p>