---
name: search-and-embeddings
description: Model search and semantic retrieval for tibi website projects. Covers embedding provider configuration, collection search modes, auto-regeneration, regenerate-search admin flows, and how later agents should decide between no search, classic search, ngram search, and vector search.
---

# search-and-embeddings

## When to use this skill

Use this skill when:

- a project needs explicit search behavior beyond generic CRUD filtering
- search should be typo-tolerant, weighted, or semantic
- embedding providers must be configured
- later agents need a clear yes/no decision for search instead of vague optionality

## Goal

Give later agents a practical workflow for deciding whether search is needed and, if yes, which search mode belongs to the project.

This skill is separate from editor AI features. Search and embeddings affect content retrieval, operational setup, and index/regeneration behavior, not just editor assistance.

## Source of truth

Use these sources when implementing or reviewing search behavior:

- `tibi-server/docs/02-configuration.md`
- `tibi-server/docs/04-collections.md`
- `tibi-server/docs/09-llm-integration.md`
- `.agents/skills/nova-ai-editor-features/SKILL.md`
- `.agents/skills/mongodb-and-indexes/SKILL.md`

## First decision: no search vs explicit search

Do not leave search in an implied state.

Make one explicit decision:

- no search in this project
- classic keyword search only
- fuzzy substring search (`ngram`)
- semantic/vector search
- hybrid search with deliberate ranking behavior

If the answer is “not used”, document that clearly so later agents do not accidentally wire providers or regress into half-configured search.

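One way to keep the decision explicit is to record it as data rather than prose. The TypeScript sketch below is illustrative only: `SearchDecision`, `SearchRequirements`, and `requirementsFor` are hypothetical names, not tibi APIs, and the `hybrid` row assumes that hybrid search combines keyword and vector retrieval.

```typescript
// Hypothetical helper: maps the project's explicit search decision to the
// follow-up work this skill describes. None of these names come from tibi.
type SearchDecision = "none" | "keyword" | "ngram" | "vector" | "hybrid";

interface SearchRequirements {
  mayNeedTextIndex: boolean;   // `text` mode requires a MongoDB text index
  embeddingProvider: boolean;  // vector search needs a server-side provider
  regenerationPlan: boolean;   // enrichment-based modes need a regeneration story
}

function requirementsFor(decision: SearchDecision): SearchRequirements {
  switch (decision) {
    case "none":
      return { mayNeedTextIndex: false, embeddingProvider: false, regenerationPlan: false };
    case "keyword":
      return { mayNeedTextIndex: true, embeddingProvider: false, regenerationPlan: false };
    case "ngram":
      return { mayNeedTextIndex: false, embeddingProvider: false, regenerationPlan: true };
    case "vector":
      return { mayNeedTextIndex: false, embeddingProvider: true, regenerationPlan: true };
    case "hybrid":
      // Assumption: hybrid means keyword plus vector, so it inherits both sets.
      return { mayNeedTextIndex: true, embeddingProvider: true, regenerationPlan: true };
  }
}

console.log(requirementsFor("vector"));
```

Writing the decision down in this form makes a “not used” answer as visible as a vector setup, which is the point of the rule above.
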
## Server-level provider setup

Embedding providers are configured server-side:

```yaml
embedding:
  providers:
    # `name` is what collection search config refers to
    - name: bge-m3
      type: native
      modelPath: /models/bge-m3
      dimensions: 1024
    - name: openai-embed
      type: openai
      model: text-embedding-3-small
      # secret supplied through an environment variable
      apiKey: ${EMBEDDING_OPENAI-EMBED_APIKEY}
      baseURL: https://api.openai.com/v1
      dimensions: 1536
```

Important:

- collection search config references the provider by name
- embedding secrets and model paths can come from environment variables
- vector search is not only a collection concern; the server must actually provide the embedding backend

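Because collections reference providers by name, a typo or a missing server entry can leave vector search configured but not functional. A minimal check could look like the sketch below. The `CollectionSearchRef` shape, including its `provider` field, is an assumption made for illustration; consult `tibi-server/docs/04-collections.md` for the real collection schema.

```typescript
// Mirrors the provider fields shown in the YAML above.
interface EmbeddingProvider {
  name: string;
  type: string;
  dimensions: number;
}

// Hypothetical, simplified view of a collection's search config.
interface CollectionSearchRef {
  collection: string;
  mode: string;
  provider?: string; // assumed field name for the provider reference
}

// Returns one message per vector search config whose provider is not configured.
function findMissingProviders(
  providers: EmbeddingProvider[],
  refs: CollectionSearchRef[],
): string[] {
  return refs
    .filter((ref) => ref.mode === "vector")
    .filter((ref) => !providers.some((p) => p.name === ref.provider))
    .map((ref) => `${ref.collection}: ${ref.provider || "(no provider set)"}`);
}

const configuredProviders: EmbeddingProvider[] = [
  { name: "bge-m3", type: "native", dimensions: 1024 },
  { name: "openai-embed", type: "openai", dimensions: 1536 },
];

const collectionRefs: CollectionSearchRef[] = [
  { collection: "articles", mode: "vector", provider: "bge-m3" },
  { collection: "products", mode: "vector", provider: "bge-m3-large" },
  { collection: "pages", mode: "text" },
];

console.log(findMissingProviders(configuredProviders, collectionRefs));
// [ 'products: bge-m3-large' ]
```
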
## Collection search modes

Tibi supports multiple search modes via collection `search:` config:

- `text`
- `regex`
- `eval`
- `filter`
- `ngram`
- `vector`

Use explicit search configs when search is a real product feature. Auto-fallback is useful, but it is not a substitute for a deliberate retrieval model.

## Choosing the right mode

### `text`

Use when:

- MongoDB text indexing is sufficient
- exact field ownership of the text index is clear
- keyword search is enough

Requires a text index.

### `regex`

Use when:

- the searchable fields are explicit
- case-insensitive matching is enough
- weighted field scoring is useful

Good for smaller datasets or precise keyed fields.

### `filter` or `eval`

Use when:

- search logic depends on auth, project context, or business-specific filtering
- plain keyword matching is not the full contract

Treat these as controlled power tools. The resulting filters are still sanitized against blocked operators.

### `ngram`

Use when:

- typo tolerance or substring matching is needed
- users search codes, names, transliterated terms, or partial inputs

This is enrichment-based search. It stores generated `_search` data and benefits from clear regeneration expectations.

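To see why n-gram enrichment tolerates typos and partial input, it helps to look at the general technique. The sketch below is generic and not tibi's implementation: the trigram size, the normalization, and the Dice overlap score are assumptions chosen for illustration, and the format of the stored `_search` data is not described here.

```typescript
// Splits text into overlapping character n-grams (each distinct n-gram once).
function ngrams(text: string, n: number = 3): string[] {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams: string[] = [];
  for (let i = 0; i + n <= normalized.length; i++) {
    const gram = normalized.slice(i, i + n);
    if (grams.indexOf(gram) === -1) grams.push(gram);
  }
  return grams;
}

// Dice coefficient over the two n-gram sets: 0 means no overlap, 1 means identical sets.
function diceOverlap(a: string, b: string, n: number = 3): number {
  const gramsA = ngrams(a, n);
  const gramsB = ngrams(b, n);
  if (gramsA.length === 0 || gramsB.length === 0) return 0;
  const shared = gramsA.filter((gram) => gramsB.indexOf(gram) !== -1).length;
  return (2 * shared) / (gramsA.length + gramsB.length);
}

console.log(ngrams("search"));                     // [ 'sea', 'ear', 'arc', 'rch' ]
console.log(diceOverlap("embedding", "embeding")); // about 0.77: one dropped letter
console.log(diceOverlap("embedding", "vector"));   // 0: nothing in common
```

A dropped letter only disturbs the n-grams around it, so most of the stored n-grams still match. That is also why changes to the n-gram configuration make previously generated data stale.
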
### `vector`

Use when:

- semantic similarity matters more than literal keyword overlap
- the project can support embedding-provider setup and operator cost expectations
- search quality justifies added complexity

Vector mode can use:

- `fields`
- custom `eval` transformation
- `documentPrefix`
- `queryPrefix`
- `overflow: truncate|chunk`
- `rrf` tuning for hybrid scoring

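The `rrf` option refers to reciprocal rank fusion, a common way to merge a keyword ranking and a vector ranking into one result list. The sketch below shows the general algorithm only: the function name, the constant `k = 60` (a frequently used default), and the input shape are assumptions, and tibi's actual `rrf` parameters are not documented in this skill.

```typescript
// Reciprocal rank fusion: every ranking contributes 1 / (k + rank) for each
// document it contains, where rank is the 1-based position in that ranking.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  }
  // Highest fused score first.
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}

const keywordRanking = ["a", "b", "c"];
const vectorRanking = ["c", "a", "d"];

console.log(reciprocalRankFusion([keywordRanking, vectorRanking]));
// [ 'a', 'c', 'b', 'd' ]
```

Documents that appear in both rankings (`a` and `c`) rise above documents found by only one mode. A larger `k` flattens the difference between top and lower ranks, which is the kind of behavior hybrid tuning adjusts.
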
## Auto-regeneration and admin flows

For `ngram` and `vector`, `autoRegenerate: true` can refresh stale enrichment data after config changes.

If regeneration has to be triggered manually, the admin flow depends on project admin tokens that have:

- `allowRegenerateSearch: true`

Treat regeneration as part of the search contract, not as an implementation footnote.

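“Stale” here means that stored enrichment data was generated under a different search config than the current one. How tibi detects this is not described in this skill. The sketch below only illustrates the idea with a hypothetical config fingerprint: `fingerprint` and `needsRegeneration` are made-up helpers, and only top-level keys are order-normalized.

```typescript
// Builds a comparable string from a search config. Top-level keys are sorted
// so that key order alone does not change the fingerprint; nested objects are
// serialized as-is.
function fingerprint(config: Record<string, unknown>): string {
  const keys = Object.keys(config).sort();
  return JSON.stringify(keys.map((key) => [key, config[key]]));
}

// Enrichment data is stale when no fingerprint was stored with it, or when
// the stored fingerprint no longer matches the current config.
function needsRegeneration(
  current: Record<string, unknown>,
  storedFingerprint: string | undefined,
): boolean {
  return storedFingerprint === undefined || storedFingerprint !== fingerprint(current);
}

const previousConfig = { mode: "ngram", fields: ["title"] };
const currentConfig = { mode: "ngram", fields: ["title", "sku"] };

console.log(needsRegeneration(currentConfig, fingerprint(previousConfig))); // true
console.log(needsRegeneration(previousConfig, fingerprint(previousConfig))); // false
```
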
## Search and LLM are related but not identical

The LLM system and the embedding system are adjacent, but they are not the same thing.

- `llm.providers` drive chat/completion features
- `embedding.providers` drive vector search enrichment
- org/user budgets affect LLM usage workflows
- search design still needs its own retrieval and operator decisions

Do not assume that enabling editor AI automatically defines a sound search architecture.

## Anti-patterns

- leaving search unspecified and hoping auto-fallback is “good enough”
- enabling vector search without a real provider/runtime plan
- forgetting text indexes for `mode: text`
- enabling enrichment modes without a regeneration story
- mixing editor AI decisions with search decisions until neither is clear

## Verification checklist

After search-related changes, verify all of these:

1. the project has an explicit yes/no search decision
2. server-side embedding providers exist when vector search is configured
3. required text or search indexes exist
4. `?q=` and `?qName=` behavior matches the intended search contract
5. regeneration behavior is defined for enrichment-based modes

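For item 4, a small helper can make manual or scripted checks repeatable. The sketch below only builds the request URL. The endpoint path is a placeholder, and the reading of `qName` as the name of a specific search config is an assumption; confirm both against `tibi-server/docs/04-collections.md`.

```typescript
// Builds a search URL with `q` and, optionally, `qName`.
// `endpoint` is whatever collection URL the project exposes (placeholder here).
function buildSearchUrl(endpoint: string, query: string, searchName?: string): string {
  const params = [`q=${encodeURIComponent(query)}`];
  if (searchName !== undefined) {
    // Assumption: `qName` picks one named search config for this request.
    params.push(`qName=${encodeURIComponent(searchName)}`);
  }
  return `${endpoint}?${params.join("&")}`;
}

console.log(buildSearchUrl("/api/example/articles", "solar panels"));
// /api/example/articles?q=solar%20panels
console.log(buildSearchUrl("/api/example/articles", "solar panels", "semantic"));
// /api/example/articles?q=solar%20panels&qName=semantic
```

Running the same query with and without `qName` against a seeded collection is a quick way to confirm that each configured mode returns what the search contract promises.
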
## What an LLM should inspect first

When asked to add or review search on this starter, inspect in this order:

1. `tibi-server/docs/04-collections.md`
2. `tibi-server/docs/02-configuration.md`
3. existing collection `search:` config
4. whether the project needs keyword, fuzzy, semantic, or no search
5. operator expectations for regeneration and provider secrets

This prevents over-engineered vector setups and under-specified search behavior.