InferDB: In-Database Machine Learning Inference Using Indexes (2024)

  • Authors:
  • Ricardo Salazar-Díaz, Hasso Plattner Institute, University of Potsdam
  • Boris Glavic, University of Illinois Chicago
  • Tilmann Rabl, Hasso Plattner Institute, University of Potsdam

Proceedings of the VLDB Endowment, Volume 17, Issue 8, pp. 1830–1842. https://doi.org/10.14778/3659437.3659441

Published: 31 May 2024

Abstract

The performance of inference with machine learning (ML) models and its integration with analytical query processing have become critical bottlenecks for data analysis in many organizations. An ML inference pipeline typically consists of a preprocessing workflow followed by prediction with an ML model. Current approaches for in-database inference implement preprocessing operators and ML algorithms in the database either natively, by transpiling code to SQL, or by executing user-defined functions in guest languages such as Python. In this work, we present a radically different approach that approximates an end-to-end inference pipeline (preprocessing plus prediction) using a light-weight embedding that discretizes a carefully selected subset of the input features and an index that maps data points in the embedding space to aggregated predictions of an ML model. We replace a complex preprocessing workflow and model-based inference with a simple feature transformation and an index lookup. Our framework improves inference latency by several orders of magnitude while maintaining similar prediction accuracy compared to the pipeline it approximates.
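
The abstract describes the approach only at a high level. As a rough illustration of the general idea, and not the authors' implementation, the sketch below discretizes two hand-picked features into bins and builds a lookup structure from each bin combination to an aggregated prediction. Everything specific here is an assumption made for the example: the `expensive_pipeline` function is a hypothetical stand-in for preprocessing plus model prediction, equal-width binning is used for discretization, majority vote is used for aggregation, and a Python dictionary plays the role of the index.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def expensive_pipeline(row):
    """Hypothetical stand-in for preprocessing plus model prediction."""
    x0, x1, _unused = row
    return 1 if 2 * x0 + x1 > 15 else 0


def equal_width_edges(values, n_bins):
    """Interior cut points splitting [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def embed(row, selected, edges):
    """Discretize the selected features into a tuple of bin ids."""
    return tuple(bisect_right(edges[f], row[f]) for f in selected)


def build_index(rows, selected, edges, predict):
    """Map each occupied embedding cell to the majority pipeline prediction."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[embed(row, selected, edges)][predict(row)] += 1
    return {cell: c.most_common(1)[0][0] for cell, c in votes.items()}


def lookup(index, row, selected, edges, default=0):
    """Inference: one cheap feature transformation and one key lookup."""
    return index.get(embed(row, selected, edges), default)


# Toy data: two informative integer features and one irrelevant feature.
rows = [(i, j, (i * j) % 7) for i in range(11) for j in range(11)]
selected = [0, 1]  # assumption: chosen by hand, standing in for feature selection
edges = {f: equal_width_edges([r[f] for r in rows], 4) for f in selected}
index = build_index(rows, selected, edges, expensive_pipeline)

# Fraction of rows where the lookup agrees with the pipeline it approximates.
hits = sum(lookup(index, r, selected, edges) == expensive_pipeline(r) for r in rows)
agreement = hits / len(rows)
print(f"cells={len(index)} agreement={agreement:.3f}")
```

In this toy setting the lookup disagrees with the pipeline only in cells that straddle the decision boundary, which hints at why the choice of features and bin boundaries matters for the accuracy of such an approximation.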

Published in

Proceedings of the VLDB Endowment, Volume 17, Issue 8

April 2024

335 pages

ISSN: 2150-8097

  • Editors:
  • Meihui Zhang, Beijing Institute of Technology
  • Cyrus Shahabi, University of Southern California

Publisher: VLDB Endowment

Qualifiers: research-article
