Title: Efficient Locality Sensitive Hashing Approaches, Primitives, and Applications
Jingfan Meng
Ph.D. Student
School of Computer Science, Colleague of Computing
Georgia Tech
Date: Wednesday, November 20th, 2024
Time: 10:00 AM -- 11:00 AM EST
Location: Klaus 2108
Microsoft Teams Meeting: Link
Committee:
Dr. Jun (Jim) Xu (Ph.D. advisor), School of Computer Science, Georgia Tech
Dr. Kexin Rong, School of Computer Science, Georgia Tech
Dr. Joy Arulraj, School of Computer Science, Georgia Tech
Dr. Santosh Vempala, School of Computer Science, Georgia Tech
Dr. Mitsunori Ogihara, Department of Computer Science, University of Miami
Abstract:
Approximate nearest neighbor search (ANNS) is a fundamental algorithmic problem with numerous applications in many areas of computer science, especially databases and machine learning. It is an intriguing question how to build index data structures in support for efficient ANNS under various useful distance metrics. Locality sensitive hashing (LSH) is a classical solution approach to ANNS that offers a small index size and reduced costs for index construction and data changes. With numerous LSH approaches in the literature, however, most research has been limited to the collision probability of LSH functions, and many important aspects, such as the fundamental time complexities of evaluating these randomized LSH functions, have been overlooked for long.
My research spans both the design of new LSH approaches for ANNS under hard-to-query distance metrics and the design of new algorithms that speed up the evaluation of these LSH functions. In this proposal, I will cover novel LSH-based ANNS solutions to two such distance metrics -- point-to-subspace metric in L1 (P2SL1), and generalized point-to-subspace metric in L2 (Mahalanobis distance) -- both of which are useful but currently lack efficient solutions. I will also cover two algorithmic techniques -- efficient range summation (ERS) and fast Gaussian orthogonal ensemble quadratic form (FGoeQF) -- that help build up LSH primitives that are faster than existing versions by orders of magnitudes. Finally, I will cover candidate-based density estimation (CanDE), which is an LSH-based application for getting important data analytics in the neighborhood of the query, using our novel estimators that approximate the actual on-table collision rate for the first time ever.