Bridging the gap between databases and data science
Relational databases let you store data without losing the relationships between those data, which makes them useful tools for computer scientists. There is, however, a gap between the relational database research community and data scientists, and as a result databases are used inefficiently in data science. PhD student Mark Raasveldt tried to bridge this gap. PhD defense on 9 June 2020.
Integration with analytical tools
Most data scientists use analytical tools, such as R, Python and C/C++, for their research. These tools are difficult to integrate with current database systems, resulting in slow and cumbersome data analysis. ‘Data scientists have opted to reinvent database systems by developing a zoo of data management alternatives that perform similar tasks to classical database management systems, but have many of the problems that were solved in the database field decades ago,’ says Raasveldt.
‘The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing.’ Raasveldt tried to combine these innovations in database science with the analytical tools that are mostly used by data scientists. ‘We investigate how we can facilitate efficient and painless integration of analytical tools and relational database management systems,’ says Raasveldt.
Large datasets
Another issue with the use of standard database systems in data science is the size of the data involved. Most database systems are not optimised for large datasets or for large-scale data analysis on remote servers. To optimise the systems, Raasveldt considered three methods.
‘We focused on the three primary methods for database-client integration: client-server connections, in-database processing and embedding the database inside the client application,’ Raasveldt explains. For every method, he studied the implementations in existing database systems and he evaluated how efficient they are for the large datasets and workloads that are common in data science.
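To give a rough idea of the third method, embedding the database inside the client application, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module as a stand-in, not DuckDB or any system evaluated in the thesis; the table name, column names and sample values are made up for illustration.

```python
import sqlite3


def average_by_sensor(rows):
    """Load rows into an in-process database and aggregate them with SQL."""
    # ":memory:" creates the database inside this Python process:
    # there is no separate server and no network connection.
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and columns, chosen only for this example.
        conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT sensor, AVG(value) FROM measurements "
            "GROUP BY sensor ORDER BY sensor"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
result = average_by_sensor(rows)
print(result)
```

Because the database runs in the same process as the analysis code, query results do not have to be sent over a client-server connection before they can be used.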
DuckDB
Raasveldt's final result was a new data management system, called DuckDB, that was purpose-built for efficient and painless integration with R, Python and other analytical tools. It is intended to be a mature database system for practical use, not one that serves research purposes only.
‘In DuckDB we take all the lessons that we have learned investigating database-client integrations and create an easy-to-use and highly efficient embedded database.’ Raasveldt will continue his work as a postdoc at the CWI, where he will further develop DuckDB.