InferDB: In-Database Machine Learning Inference Using Indexes (2024)

research-article

Artifacts Available / v1.1

Authors:
Ricardo Salazar-Díaz, Hasso Plattner Institute, University of Potsdam
Boris Glavic, University of Illinois Chicago
Tilmann Rabl, Hasso Plattner Institute, University of Potsdam

Proceedings of the VLDB Endowment, Volume 17, Issue 8, pp. 1830–1842. https://doi.org/10.14778/3659437.3659441

Published: 31 May 2024


Abstract

The performance of inference with machine learning (ML) models and its integration with analytical query processing have become critical bottlenecks for data analysis in many organizations. An ML inference pipeline typically consists of a preprocessing workflow followed by prediction with an ML model. Current approaches for in-database inference implement preprocessing operators and ML algorithms in the database either natively, by transpiling code to SQL, or by executing user-defined functions in guest languages such as Python. In this work, we present a radically different approach that approximates an end-to-end inference pipeline (preprocessing plus prediction) using a light-weight embedding that discretizes a carefully selected subset of the input features and an index that maps data points in the embedding space to aggregated predictions of an ML model. We replace a complex preprocessing workflow and model-based inference with a simple feature transformation and an index lookup. Our framework improves inference latency by several orders of magnitude while maintaining similar prediction accuracy compared to the pipeline it approximates.
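The idea in the abstract can be illustrated with a minimal sketch: discretize a subset of the input features, group the pipeline's predictions by the resulting key, and answer new requests with a lookup. This is not the authors' implementation. The equal-frequency binning, the mean aggregate, the hand-picked feature subset, the fallback default, and every name here (`make_bin_edges`, `embed`, `build_index`, `lookup`, `slow_pipeline`) are assumptions made for illustration; a Python dict stands in for a database index.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def make_bin_edges(values, n_bins):
    """Interior cut points for equal-frequency bins (an assumed discretization)."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def embed(row, selected, edges):
    """Map a raw row to a tuple of bin ids over the selected feature positions."""
    return tuple(bisect_right(edges[f], row[f]) for f in selected)


def build_index(rows, predictions, selected, edges):
    """Group pipeline predictions by embedding key and aggregate them (mean here)."""
    buckets = defaultdict(list)
    for row, pred in zip(rows, predictions):
        buckets[embed(row, selected, edges)].append(pred)
    return {key: mean(preds) for key, preds in buckets.items()}


def lookup(index, row, selected, edges, default):
    """Approximate inference: one feature transformation plus one index lookup."""
    return index.get(embed(row, selected, edges), default)


def slow_pipeline(row):
    """Hypothetical stand-in for an expensive preprocessing + model pipeline."""
    return 2.0 * row[0] + 0.5 * row[2]


# Synthetic training rows with three features; feature 1 is irrelevant to the output.
rows = [(float(i % 10), float(i % 7), float(i % 4)) for i in range(200)]
predictions = [slow_pipeline(r) for r in rows]

selected = (0, 2)  # assumed feature subset; chosen by hand for this sketch
edges = {f: make_bin_edges([r[f] for r in rows], 4) for f in selected}
index = build_index(rows, predictions, selected, edges)
default = mean(predictions)  # fallback for keys that were never observed

approx = lookup(index, (3.0, 1.0, 2.0), selected, edges, default)
exact = slow_pipeline((3.0, 1.0, 2.0))
print(f"approximate={approx:.2f} exact={exact:.2f}")
```

In a database setting the dict would correspond to an indexed table keyed on the bin ids, so that inference becomes an ordinary index lookup or join. The lookup returns the aggregate for a whole bin, so coarser bins trade accuracy for a smaller index.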


Published in
Proceedings of the VLDB Endowment, Volume 17, Issue 8
April 2024, 335 pages
ISSN: 2150-8097
Editors: Meihui Zhang (Beijing Institute of Technology), Cyrus Shahabi (University of Southern California)
Publisher
VLDB Endowment