Berlin Buzzwords 2019: Michael Gibney – Precise graph-based phrase query with SpanNearQuery

Опубликовано: 03 Ноябрь 2024
на канале: Plain Schwarz
238
4

The current implementation of SpanNearQuery in Apache Lucene sacrifices precision and completeness in favor of performance. This tradeoff significantly limits Lucene's usefulness for some potential high-value use cases. This talk introduces a stable patch that makes SpanQuery matching precise and complete, while maintaining performance comparable to that of the extant implementation.

The patch will be discussed in the context of the issue LUCENE-7398, describing the central problem and how it differs from the problem addressed by the new IntervalsSource API.

We will discuss details of the patch implementation, in an effort to familiarize the audience with techniques involved (and hopefully inspire further innovation). Details covered will include:

1. The main word-lattice data structure (linked, circular, resizable-array-backed, 2-dimensional queue with stable node references and support for efficient binary seek, serial traversal, and arbitrary node removal).
2. Performance/GC management for linked data structures (as opposed to the array-based data structures that are more common in the Lucene codebase)
3. Recording in the index: positionLength, nextStartPosition lookahead, and information about possible decrease of endPosition (de-"sausagization" of the index)

Read more:
https://2019.berlinbuzzwords.de/19/se...

About Michael Gibney:
https://2019.berlinbuzzwords.de/users...

Website: https://berlinbuzzwords.de/
Twitter:   / berlinbuzzwords  
LinkedIn:   / berlin-buzzwords  
Reddit:   / berlinbuzzwords