✨ feat: enhance search capabilities and indexing across collections

2026-05-17 20:33:22 +00:00
parent 8cbf0db14f
commit 2d52272b2e
8 changed files with 182 additions and 12 deletions
@@ -79,6 +79,7 @@ Tibi supports multiple search modes via collection `search:` config:
 - `filter`
 - `ngram`
 - `vector`
+- `combined`

 Use explicit search configs when search is a real product feature. Auto-fallback is useful, but it is not a substitute for a deliberate retrieval model.

@@ -92,7 +93,7 @@ Use when:
 - exact field ownership of the text index is clear
 - keyword search is enough

-Requires a text index.
+Requires a MongoDB text index (wildcard `{ "$**": "text" }` or on specific fields).

 ### `regex`

@@ -100,9 +101,16 @@ Use when:

 - the searchable fields are explicit
 - case-insensitive matching is enough
-- weighted field scoring is useful
+- weighted field scoring is useful (via `regex.weights: { "meta.title": 10, path: 5 }`)

-Good for smaller datasets or precise keyed fields.
+Good for smaller datasets or precise keyed fields. It is easy to configure and needs no external dependencies. Example:
+
+```yaml
+search:
+    - name: default
+      mode: regex
+      fields: [title, "alt.de", description]
+```

 ### `filter` or `eval`

@@ -121,23 +129,62 @@ Use when:
 - users search codes, names, transliterated terms, or partial inputs

 This is enrichment-based search. It stores generated `_search` data and benefits from clear regeneration expectations.
+_Note:_ A single `ngram` search block does not natively support field weighting, because all of its `fields` are concatenated into one ngram index block per document.

 ### `vector`

 Use when:

 - semantic similarity matters more than literal keyword overlap
-- the project can support embedding-provider setup and operator cost expectations
+- the project can support embedding-provider setup (e.g. `bge-m3` in `api/config.yml`)
 - search quality justifies added complexity

-Vector mode can use:
+Vector mode requires a registered provider.

-- `fields`
-- custom `eval` transformation
-- `documentPrefix`
-- `queryPrefix`
-- `overflow: truncate|chunk`
-- `rrf` tuning for hybrid scoring
+### `combined` (RRF)
+
+Use when:
+
+- hybrid search is required (e.g. `vector` + `ngram` to catch typos and semantic meaning)
+- you need to simulate field weighting for `vector` or `ngram` by splitting them into multiple search blocks and fusing them with different weights
+
+`mode: combined` uses Reciprocal Rank Fusion (RRF). It delegates execution to other configured search blocks, which should be hidden in the admin UI via `meta.hide: true`.
+
+**Field-weighting workaround with `combined`:**
+Because `vector` and `ngram` each concatenate all of their fields, a single block cannot rank a title match above a match in deep content. To approximate field weighting, create multiple ngram/vector blocks and give the block that covers the important fields (like titles) a higher weight in the `combined` weights:
+
+```yaml
+search:
+    - name: main_search
+      mode: combined
+      rrf:
+          k: 60
+          topK: 100
+          weights:
+              semantic: 1.5
+              fuzzy_important: 2.0 # Boosts matches in title/headline
+              fuzzy_content: 0.5 # Lowers weight for deep text matches
+      meta:
+          label: { de: "Suche", en: "Search" }
+
+    - name: fuzzy_important
+      mode: ngram
+      fields: [name, "meta.title", "blocks.headline"]
+      autoRegenerate: true
+      meta: { hide: true }
+
+    - name: fuzzy_content
+      mode: ngram
+      fields: ["blocks.text", "blocks.items.answer"]
+      autoRegenerate: true
+      meta: { hide: true }
+
+    - name: semantic
+      mode: vector
+      fields: [name, "meta.title", "blocks.text"]
+      vector: { provider: bge-m3 }
+      autoRegenerate: true
+```
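
To make the effect of `k`, `topK`, and `weights` concrete, here is a minimal Python sketch of weighted Reciprocal Rank Fusion. It is an illustration under stated assumptions, not Tibi's implementation: it assumes each block contributes `weight / (k + rank)` per document, that `topK` caps how many results from each block are considered, and that a block without a configured weight defaults to `1.0`. The function name `weighted_rrf` and the document IDs are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60, top_k=100):
    """Fuse ranked ID lists from several search blocks with weighted RRF.

    rankings: block name -> list of document IDs, best match first.
    weights:  block name -> weight (assumed default 1.0 if missing).
    """
    scores = defaultdict(float)
    for block, ranked_ids in rankings.items():
        weight = weights.get(block, 1.0)
        # Ranks are 1-based; only the first top_k hits of each block count
        # (assumed meaning of `topK`).
        for rank, doc_id in enumerate(ranked_ids[:top_k], start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-block results, using the weights from the config above.
rankings = {
    "semantic": ["a", "b", "c"],
    "fuzzy_important": ["b", "c"],
    "fuzzy_content": ["c", "a"],
}
weights = {"semantic": 1.5, "fuzzy_important": 2.0, "fuzzy_content": 0.5}

fused = weighted_rrf(rankings, weights)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

In this sketch, document `a` ranks first in `semantic` but ends up last after fusion, because `b` and `c` both appear in the higher-weighted `fuzzy_important` block. That is the intended effect of boosting the block that covers titles and headlines.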

 ## Auto-regeneration and admin flows