feat: enhance search capabilities and indexing across collections

2026-05-17 20:33:22 +00:00
parent 8cbf0db14f
commit 2d52272b2e
8 changed files with 182 additions and 12 deletions
+58 -11
@@ -79,6 +79,7 @@ Tibi supports multiple search modes via collection `search:` config:
- `filter`
- `ngram`
- `vector`
- `combined`
Use explicit search configs when search is a real product feature. Auto-fallback is useful, but it is not a substitute for a deliberate retrieval model.
@@ -92,7 +93,7 @@ Use when:
- exact field ownership of the text index is clear
- keyword search is enough
Requires a text index.
Requires a MongoDB text index, either a wildcard index (`$text: $**`) or one on specific fields.
### `regex`
@@ -100,9 +101,16 @@ Use when:
- the searchable fields are explicit
- case-insensitive matching is enough
- weighted field scoring is useful
- weighted field scoring is useful (via `regex.weights: { "meta.title": 10, path: 5 }`)
Good for smaller datasets or precise keyed fields.
Good for smaller datasets or precise keyed fields. It is easy to configure and needs no external dependencies. Example:
```yaml
search:
  - name: default
    mode: regex
    fields: [title, "alt.de", description]
```
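To make weighted field scoring concrete, here is a minimal Python sketch of one way per-field weights could turn regex matches into a score. It is an illustration under assumptions, not Tibi's actual scoring: the helper names `get_path` and `weighted_regex_score` are hypothetical, and summing the weights of matching fields is only one plausible scheme.

```python
import re


def get_path(doc, dotted):
    """Resolve a dotted field path such as "meta.title" in a nested dict."""
    value = doc
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value


def weighted_regex_score(doc, query, weights):
    """Sum the weights of all fields whose text matches the query.

    Assumption: matching is case-insensitive and the query is treated
    as a literal string (it is escaped before compiling).
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return sum(
        weight
        for field, weight in weights.items()
        if pattern.search(str(get_path(doc, field)))
    )


# Weights taken from the `regex.weights` example above.
weights = {"meta.title": 10, "path": 5}

title_and_path = {"meta": {"title": "Pricing overview"}, "path": "/pricing"}
path_only = {"meta": {"title": "About"}, "path": "/pricing-faq"}

score_both = weighted_regex_score(title_and_path, "PRICING", weights)
score_path = weighted_regex_score(path_only, "PRICING", weights)
```

With these weights, a document that matches in both the title and the path outranks one that matches in the path alone.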
### `filter` or `eval`
@@ -121,23 +129,62 @@ Use when:
- users search codes, names, transliterated terms, or partial inputs
This is enrichment-based search. It stores generated `_search` data and benefits from clear regeneration expectations.
_Note:_ Field weighting is not natively supported within a single `ngram` block, because all of its `fields` are concatenated into one large ngram index block per document.
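The following Python sketch illustrates why concatenation removes field weighting. It assumes character trigrams and a simple overlap score; Tibi's actual ngram size, normalization, and `_search` storage format are not specified here, and the function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_index(doc, fields, n=3):
    """Build one n-gram set per document from all configured fields.

    The fields are joined into a single string first, so the index no
    longer records which field a given n-gram came from.
    """
    combined = " ".join(str(doc.get(field, "")) for field in fields)
    return char_ngrams(combined, n)


def overlap_score(query, index, n=3):
    """Fraction of the query's n-grams that are present in the index."""
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & index) / len(query_grams)


fields = ["title", "body"]
title_hit = {"title": "Espresso machine", "body": "Cleaning instructions"}
body_hit = {"title": "Cleaning instructions", "body": "Espresso machine"}

score_title = overlap_score("espresso", ngram_index(title_hit, fields))
score_body = overlap_score("espresso", ngram_index(body_hit, fields))
score_typo = overlap_score("expresso", ngram_index(title_hit, fields))
```

Both documents receive the same score for `espresso` whether the term sits in the title or the body, which is the limitation the note describes. The misspelled `expresso` still shares most of its trigrams with the index, which is why ngram search tolerates partial and mistyped input.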
### `vector`
Use when:
- semantic similarity matters more than literal keyword overlap
- the project can support embedding-provider setup and operator cost expectations
- the project can support embedding-provider setup (e.g. `bge-m3` in `api/config.yml`)
- search quality justifies added complexity
Vector mode can use:
Vector mode requires a registered provider.
- `fields`
- custom `eval` transformation
- `documentPrefix`
- `queryPrefix`
- `overflow: truncate|chunk`
- `rrf` tuning for hybrid scoring
### `combined` (RRF)
Use when:
- hybrid search is required (e.g. `vector` + `ngram` to catch both typos and semantic meaning)
- you need to simulate field weighting for `vector` or `ngram` by splitting them into multiple search blocks and fusing them with different weights
`mode: combined` uses Reciprocal Rank Fusion (RRF). It delegates execution to other configured search blocks, which should be hidden in the admin UI via `meta.hide: true`.
**Field-Weighting Workaround with combined:**
Because `vector` and `ngram` each concatenate all of their fields, you can weight important fields (such as titles) higher than deep content fields by creating multiple `ngram`/`vector` blocks and boosting the important ones in the `combined` weights:
```yaml
search:
  - name: main_search
    mode: combined
    rrf:
      k: 60
      topK: 100
      weights:
        semantic: 1.5
        fuzzy_important: 2.0 # Boosts matches in title/headline
        fuzzy_content: 0.5 # Lowers weight for deep text matches
    meta:
      label: { de: "Suche", en: "Search" }
  - name: fuzzy_important
    mode: ngram
    fields: [name, "meta.title", "blocks.headline"]
    autoRegenerate: true
    meta: { hide: true }
  - name: fuzzy_content
    mode: ngram
    fields: ["blocks.text", "blocks.items.answer"]
    autoRegenerate: true
    meta: { hide: true }
  - name: semantic
    mode: vector
    fields: [name, "meta.title", "blocks.text"]
    vector: { provider: bge-m3 }
    autoRegenerate: true
```
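As a rough illustration of how such weights affect the fused result, the sketch below implements the standard weighted RRF formula, where each block contributes `weight / (k + rank)` for every document it returns. This is a generic sketch, not Tibi's implementation: the function name `weighted_rrf`, the tie-breaking rule, and the sample rankings are assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60, top_k=100):
    """Fuse ranked ID lists with weighted Reciprocal Rank Fusion.

    rankings: mapping of block name -> list of document IDs, best first.
    weights:  mapping of block name -> weight (assumed to default to 1.0).
    Returns a list of (doc_id, score) tuples, best first, cut at top_k.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical per-block results, named after the blocks configured above.
rankings = {
    "semantic": ["a", "b", "c"],
    "fuzzy_important": ["b", "a"],
    "fuzzy_content": ["c", "b"],
}
weights = {"semantic": 1.5, "fuzzy_important": 2.0, "fuzzy_content": 0.5}

fused = weighted_rrf(rankings, weights, k=60, top_k=100)
fused_ids = [doc_id for doc_id, _ in fused]
```

In this sample, document `b` ends up first even though `semantic` ranks `a` higher, because `b` is the top hit in the heavily weighted `fuzzy_important` block and also appears in `fuzzy_content`.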
## Auto-regeneration and admin flows