research-article Artifacts Available / v1.1
- Authors:
- Ricardo Salazar-Díaz, Hasso Plattner Institute, University of Potsdam
- Boris Glavic, University of Illinois Chicago
- Tilmann Rabl, Hasso Plattner Institute, University of Potsdam
Proceedings of the VLDB Endowment, Volume 17, Issue 8, pp. 1830–1842. https://doi.org/10.14778/3659437.3659441
Abstract
The performance of inference with machine learning (ML) models and its integration with analytical query processing have become critical bottlenecks for data analysis in many organizations. An ML inference pipeline typically consists of a preprocessing workflow followed by prediction with an ML model. Current approaches for in-database inference implement preprocessing operators and ML algorithms in the database either natively, by transpiling code to SQL, or by executing user-defined functions in guest languages such as Python. In this work, we present a radically different approach that approximates an end-to-end inference pipeline (preprocessing plus prediction) using a light-weight embedding that discretizes a carefully selected subset of the input features and an index that maps data points in the embedding space to aggregated predictions of an ML model. We replace a complex preprocessing workflow and model-based inference with a simple feature transformation and an index lookup. Our framework improves inference latency by several orders of magnitude while maintaining similar prediction accuracy compared to the pipeline it approximates.
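The idea in the abstract, replacing the pipeline with a feature discretization and an index lookup, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: `pipeline` is a hypothetical stand-in for preprocessing plus prediction, the selected features are fixed by hand, the discretization is equal-width binning, the index is a plain Python dictionary, and predictions are aggregated by their mean.

```python
import numpy as np


def pipeline(X):
    # Hypothetical stand-in for "preprocessing plus model prediction".
    return 3.0 * X[:, 0] + np.sin(X[:, 1]) + 0.01 * X[:, 2]


def embed(X, features, edges):
    # Discretize each selected feature into a bin id using its bin edges.
    cols = [np.digitize(X[:, f], edges[i]) for i, f in enumerate(features)]
    return np.stack(cols, axis=1)


def build_index(X, y_pred, features, n_bins=16):
    # Equal-width interior bin edges per selected feature (an assumption).
    edges = [
        np.linspace(X[:, f].min(), X[:, f].max(), n_bins + 1)[1:-1]
        for f in features
    ]
    sums, counts = {}, {}
    for key, y in zip(map(tuple, embed(X, features, edges)), y_pred):
        sums[key] = sums.get(key, 0.0) + float(y)
        counts[key] = counts.get(key, 0) + 1
    # Each embedding cell maps to the mean prediction of the points in it.
    index = {key: sums[key] / counts[key] for key in sums}
    # Fallback for cells never seen while building: the global mean.
    return edges, index, float(np.mean(y_pred))


def lookup(X, features, edges, index, default):
    # Inference is now a cheap transformation plus a dictionary lookup.
    keys = embed(X, features, edges)
    return np.array([index.get(tuple(k), default) for k in keys])


rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(20000, 3))
features = [0, 1]  # hand-picked subset of the input features
edges, index, default = build_index(X_train, pipeline(X_train), features)

X_new = rng.uniform(0.0, 1.0, size=(1000, 3))
approx = lookup(X_new, features, edges, index, default)
max_err = float(np.max(np.abs(approx - pipeline(X_new))))
print(f"cells: {len(index)}, max abs error: {max_err:.3f}")
```

In this toy setting, the number of bins trades index size against approximation error: more bins give finer cells and smaller error, but require more data to populate every cell.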
Published in
Proceedings of the VLDB Endowment, Volume 17, Issue 8
April 2024
335 pages
ISSN: 2150-8097
- Editors:
- Meihui Zhang, Beijing Institute of Technology
- Cyrus Shahabi, University of Southern California
Publisher
VLDB Endowment
Publication History
- Published: 31 May 2024