Turbodbc: Turbocharged database access for data scientists
Michael is a senior software engineer at Blue Yonder GmbH. He holds a PhD in physics, practices test-driven development, and digs Clean Code in C++ and Python. In the last five years, he invested more money in table tennis gear than in smartphones.
Tags: numpy database python data-science analytics
Python's database API 2.0 is well suited for transactional database workflows, but not so much for column-heavy data science. This talk explains how the ODBC-based turbodbc database module extends this API with first-class, efficient support for familiar NumPy and Apache Arrow data structures.
This talk introduces the open source Python database module turbodbc. It uses standard ODBC drivers to connect with virtually any database and is a viable (and often faster) alternative to "native" Python drivers.
Briefly recounting the painful story of how data scientists previously used our analytics database, I explain why turbodbc was created and what distinguishes it from other ODBC modules. Sketching the flow of data from databases via drivers and Python modules to consumable Python objects, I motivate a few extensions to the standard database API 2.0 that turbodbc has made. These extensions heavily use NumPy arrays and Apache Arrow tables to provide data scientists with both familiar and efficient binary data structures they can further work on. I conclude my talk with benchmark results for a few databases.