喂饭级教程 —— 基于 OceanBase seekdb 构建 RAG 应用

靳谷雪 发表于 2025-12-5 16:10:08

<p>本文又是一篇喂饭级教程，为大家展示通过 OceanBase seekdb 构建 RAG（检索增强生成）系统的详细步骤。</p>
<p></p>
<p>RAG 系统结合了检索系统和生成模型，可根据给定提示生成新文本。系统首先使用 seekdb 的原生向量搜索功能从语料库中检索相关文档，然后使用生成模型根据检索到的文档生成新文本。</p>
<h2 id="前提条件"><strong>前提条件</strong></h2>
<ul>
<li>已安装 Python 3.11 或以上版本</li>
<li>已安装 uv</li>
<li>已准备好 LLM API Key</li>
</ul>
<h2 id="准备工作"><strong>准备工作</strong></h2>
<h3 id="克隆代码"><strong>克隆代码</strong></h3>
git clone https://github.com/oceanbase/pyseekdb.git
cd pyseekdb/demo/rag
<h3 id="设置环境"><strong>设置环境</strong></h3>
<h4 id="安装依赖"><strong>安装依赖</strong></h4>
<p>基础安装（适用于 <code>default</code> 或 <code>api</code> embedding 类型）：</p>
uv sync
<p>本地模型（适用于 <code>local</code> embedding 类型）：</p>
uv sync --extra local
<p>提示：</p>
<ul>
<li><code>local</code> 额外依赖包含 <code>sentence-transformers</code> 及相关依赖（约 2-3 GB）。</li>
<li>如果您在中国大陆，可以使用国内镜像源加速下载：
<ul>
<li>基础安装（清华源）：<code>uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple</code></li>
<li>基础安装（阿里源）：<code>uv sync --index-url https://mirrors.aliyun.com/pypi/simple</code></li>
<li>本地模型（清华源）：<code>uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple</code></li>
<li>本地模型（阿里源）：<code>uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple</code></li>
</ul>
</li>
</ul>
<h4 id="设置环境变量"><strong>设置环境变量</strong></h4>
<p>步骤一：复制环境变量模板</p>
<p>cp .env.example .env</p>
<p>步骤二：编辑 <code>.env</code> 文件，设置环境变量</p>
<p>本系统支持三种 Embedding 函数类型，您可以根据需求选择：</p>
<ol>
<li><code>default</code>（默认，推荐新手使用）</li>
</ol>
<ul>
<li>使用 pyseekdb 自带的 <code>DefaultEmbeddingFunction</code>（基于 ONNX）</li>
<li>首次使用会自动下载模型，无需配置 API Key</li>
<li>适合本地开发和测试</li>
</ul>
<ol start="2">
<li><code>local</code>（本地模型）</li>
</ol>
<ul>
<li>使用自定义的 <code>sentence-transformers</code> 模型</li>
<li>需要安装 <code>sentence-transformers</code> 库</li>
<li>可配置模型名称和设备（CPU/GPU）</li>
</ul>
<ol start="3">
<li><code>api</code>（API 服务）</li>
</ol>
<ul>
<li>使用 OpenAI 兼容的 Embedding API（如 DashScope、OpenAI 等）</li>
<li>需要配置 API Key 和模型名称</li>
<li>适合生产环境</li>
</ul>
<p>以下使用通义千问作为示例（使用 <code>api</code> 类型）：</p>
# Embedding Function 类型：api, local, default
EMBEDDING_FUNCTION_TYPE=api

# LLM 配置（用于生成答案）
OPENAI_API_KEY=sk-your-dashscope-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus

# Embedding API 配置（仅在 EMBEDDING_FUNCTION_TYPE=api 时需要）
EMBEDDING_API_KEY=sk-your-dashscope-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL_NAME=text-embedding-v4

# 本地模型配置（仅在 EMBEDDING_FUNCTION_TYPE=local 时需要）
SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2
SENTENCE_TRANSFORMERS_DEVICE=cpu

# seekdb 配置
SEEKDB_DIR=./data/seekdb_rag
SEEKDB_NAME=test
COLLECTION_NAME=embeddings
<p>环境变量说明：</p>
<table>
<thead>
<tr>
<th ><strong>变量名</strong></th>
<th ><strong>说明</strong></th>
<th ><strong>默认值/示例值</strong></th>
<th ><strong>必需条件</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td >EMBEDDING_FUNCTION_TYPE</td>
<td >Embedding 函数类型</td>
<td ><code>default</code> （可选：<code>api</code> , <code>local</code> , <code>default</code> ）</td>
<td >必须设置</td>
</tr>
<tr>
<td >OPENAI_API_KEY</td>
<td >LLM API Key（支持 OpenAI、通义千问等兼容服务）</td>
<td >必须设置</td>
<td >必须设置（用于生成答案）</td>
</tr>
<tr>
<td >OPENAI_BASE_URL</td>
<td >LLM API 基础 URL</td>
<td >https://dashscope.aliyuncs.com/compatible-mode/v1</td>
<td >可选</td>
</tr>
<tr>
<td >OPENAI_MODEL_NAME</td>
<td >语言模型名称</td>
<td >qwen-plus</td>
<td >可选</td>
</tr>
<tr>
<td >EMBEDDING_API_KEY</td>
<td >Embedding API Key</td>
<td >-</td>
<td ><code>EMBEDDING_FUNCTION_TYPE=api</code> 时必需</td>
</tr>
<tr>
<td >EMBEDDING_BASE_URL</td>
<td >Embedding API 基础 URL</td>
<td >https://dashscope.aliyuncs.com/compatible-mode/v1</td>
<td ><code>EMBEDDING_FUNCTION_TYPE=api</code> 时可选</td>
</tr>
<tr>
<td >EMBEDDING_MODEL_NAME</td>
<td >Embedding 模型名称</td>
<td >text-embedding-v4</td>
<td ><code>EMBEDDING_FUNCTION_TYPE=api</code> 时必需</td>
</tr>
<tr>
<td >SENTENCE_TRANSFORMERS_MODEL_NAME</td>
<td >本地模型名称</td>
<td >all-mpnet-base-v2</td>
<td ><code>EMBEDDING_FUNCTION_TYPE=local</code> 时可选</td>
</tr>
<tr>
<td >SENTENCE_TRANSFORMERS_DEVICE</td>
<td >运行设备</td>
<td >cpu</td>
<td ><code>EMBEDDING_FUNCTION_TYPE=local</code> 时可选</td>
</tr>
<tr>
<td >SEEKDB_DIR</td>
<td >seekdb 数据库目录</td>
<td >./data/seekdb_rag</td>
<td >可选</td>
</tr>
<tr>
<td >SEEKDB_NAME</td>
<td >数据库名称</td>
<td >test</td>
<td >可选</td>
</tr>
<tr>
<td >COLLECTION_NAME</td>
<td >嵌入表名称</td>
<td >embeddings</td>
<td >可选</td>
</tr>
</tbody>
</table>
<p>提示：</p>
<ul>
<li>如果使用 <code>default</code> 类型，只需配置 <code>EMBEDDING_FUNCTION_TYPE=default</code> 和 LLM 相关变量即可。</li>
<li>如果使用 <code>api</code> 类型，需要额外配置 Embedding API 相关变量。</li>
<li>如果使用 <code>local</code> 类型，需要安装 <code>sentence-transformers</code> 库，并可选择配置模型名称。</li>
</ul>
<h2 id="主要使用的模块"><strong>主要使用的模块</strong></h2>
<h3 id="初始化-llm-客户端"><strong>初始化 LLM 客户端</strong></h3>
<p>我们通过加载环境变量来初始化 LLM 客户端：</p>
def get_llm_client() -> OpenAI:
"""Initialize LLM client using OpenAI-compatible API."""
return OpenAI(
   api_key=os.getenv("OPENAI_API_KEY"),
   base_url=os.getenv("OPENAI_BASE_URL"),
)
<h3 id="创建数据库连接"><strong>创建数据库连接</strong></h3>
def get_seekdb_client(db_dir: str = "./seekdb_rag", db_name: str = "test"):
"""Initialize seekdb client (embedded mode)."""
cache_key = (db_dir, db_name)
if cache_key not in _client_cache:
   print(f"Connecting to seekdb: path={db_dir}, database={db_name}")
   _client_cache = Client(path=db_dir, database=db_name)
   print("seekdb client connected successfully")
return _client_cache
<h3 id="自定义嵌入模型的工厂模式"><strong>自定义嵌入模型的工厂模式</strong></h3>
<p>在 <code>.env</code> 文件中可以通过配置 <code>EMBEDDING_FUNCTION_TYPE</code> 使用不同的 <code>embedding_function</code>。您也可以参考这个例子自定义您的 <code>embedding_function</code>。</p>
from pyseekdb import EmbeddingFunction, DefaultEmbeddingFunction
from typing import List, Union
import os
from openai import OpenAI

Documents = Union]
Embeddings = List]

class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction):
"""
A custom embedding function using sentence-transformers with a specific model.
"""

def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):# TODO: your own model name and device
   """
   Initialize the sentence-transformer embedding function.

   Args:
         model_name: Name of the sentence-transformers model to use
         device: Device to run the model on ('cpu' or 'cuda')
   """
   self.model_name = model_name or os.environ.get('SENTENCE_TRANSFORMERS_MODEL_NAME')
   self.device = device or os.environ.get('SENTENCE_TRANSFORMERS_DEVICE')
   self._model = None
   self._dimension = None

def _ensure_model_loaded(self):
   """Lazy load the embedding model"""
   if self._model isNone:
         try:
            from sentence_transformers import SentenceTransformer
            self._model = SentenceTransformer(self.model_name, device=self.device)
            # Get dimension from model
            test_embedding = self._model.encode(["test"], convert_to_numpy=True)
            self._dimension = len(test_embedding)
         except ImportError:
            raise ImportError(
               "sentence-transformers is not installed. "
               "Please install it with: pip install sentence-transformers"
            )

@property
def dimension(self) -> int:
   """Get the dimension of embeddings produced by this function"""
   self._ensure_model_loaded()
   return self._dimension

def __call__(self, input: Documents) -> Embeddings:
   """
   Generate embeddings for the given documents.

   Args:
         input: Single document (str) or list of documents (List)

   Returns:
         List of embedding vectors
   """
   self._ensure_model_loaded()

   # Handle single string input
   if isinstance(input, str):
         input =

   # Handle empty input
   ifnot input:
         return []

   # Generate embeddings
   embeddings = self._model.encode(
         input,
         convert_to_numpy=True,
         show_progress_bar=False
   )

   # Convert numpy arrays to lists
   return

class OpenAIEmbeddingFunction(EmbeddingFunction):
"""
A custom embedding function using Embedding API.
"""

def __init__(self, model_name: str = "", api_key: str = "", base_url: str = ""):
   """
   Initialize the Embedding API embedding function.

   Args:
         model_name: Name of the Embedding API embedding model
         api_key: Embedding API key (if not provided, uses EMBEDDING_API_KEY env var)
   """
   self.model_name = model_name or os.environ.get('EMBEDDING_MODEL_NAME')
   self.api_key = api_key or os.environ.get('EMBEDDING_API_KEY')
   self.base_url = base_url or os.environ.get('EMBEDDING_BASE_URL')
   self._dimension = None
   ifnot self.api_key:
         raise ValueError("Embedding API key is required")

def _ensure_model_loaded(self):
   """Lazy load the Embedding API model"""
   try:
         client = OpenAI(
            api_key=self.api_key,
            base_url=self.base_url
         )
         response = client.embeddings.create(
            model=self.model_name,
            input=["test"]
         )
         self._dimension = len(response.data.embedding)
   except Exception as e:
         raise ValueError(f"Failed to load Embedding API model: {e}")

@property
def dimension(self) -> int:
   """Get the dimension of embeddings produced by this function"""
   self._ensure_model_loaded()
   return self._dimension

def __call__(self, input: Documents) -> Embeddings:
   """
   Generate embeddings using Embedding API.

   Args:
         input: Single document (str) or list of documents (List)

   Returns:
         List of embedding vectors
   """
   # Handle single string input
   if isinstance(input, str):
         input =

   # Handle empty input
   ifnot input:
         return []

   # Call Embedding API
   client = OpenAI(
         api_key=self.api_key,
         base_url=self.base_url
   )
   response = client.embeddings.create(
         model=self.model_name,
         input=input
   )

   # Extract Embedding API embeddings
   embeddings =
   return embeddings

def create_embedding_function() -> EmbeddingFunction:
embedding_function_type = os.environ.get('EMBEDDING_FUNCTION_TYPE')
if embedding_function_type == "api":
   print("Using OpenAI Embedding API embedding function")
   return OpenAIEmbeddingFunction()
elif embedding_function_type == "local":
   print("Using SentenceTransformer embedding function")
   return SentenceTransformerCustomEmbeddingFunction()
elif embedding_function_type == "default":
   print("Using Default embedding function")
   return DefaultEmbeddingFunction()
else:
   raise ValueError(f"Unsupported embedding function type: {embedding_function_type}")
<h3 id="创建-collection"><strong>创建 Collection</strong></h3>
<p>在 <code>get_or_create_collection()</code> 方法中我们传入了 <code>embedding_function</code>，之后使用这个 collection 的 <code>add()</code> 和 <code>query()</code> 方法的时候就不需要传入向量了，只需传入文本，向量会由 <code>embedding_function</code> 自动生成。</p>
def get_seekdb_collection(client, collection_name: str = "embeddings",
               embedding_function: Optional = DefaultEmbeddingFunction(),
               drop_if_exists: bool = True):
"""
Get or create a collection using pyseekdb's get_or_create_collection.

Args:
   client: seekdb client instance
   collection_name: Name of the collection
   embedding_function: Embedding function (required for automatic embedding generation)
   drop_if_exists: Whether to drop existing collection if it exists

Returns:
   Collection object
"""
if drop_if_exists and client.has_collection(collection_name):
   print(f"Collection '{collection_name}' already exists, deleting old data...")
   client.delete_collection(collection_name)

if embedding_function isNone:
   raise ValueError("embedding_function is required")

# Use pyseekdb's native get_or_create_collection
collection = client.get_or_create_collection(
   name=collection_name,
   embedding_function=embedding_function
)

print(f"Collection '{collection_name}' ready!")
return collection
<h3 id="核心插入数据函数"><strong>核心插入数据函数</strong></h3>
def insert_embeddings(collection, data: List]):
"""
Insert data into collection. Embeddings are automatically generated by collection's embedding_function.

Args:
   collection: Collection object (must have embedding_function configured)
   data: List of data dictionaries containing 'text', 'source_file', 'chunk_index'
"""
try:
   ids = }_{item.get('chunk_index', 0)}"for item in data]
   documents = for item in data]
   metadatas = [{'source_file': item['source_file'],
                  'chunk_index': item.get('chunk_index', 0)} for item in data]

   # Collection's embedding_function will automatically generate embeddings from documents
   collection.add(
         ids=ids,
         documents=documents,
         metadatas=metadatas
   )

   print(f"Inserted {len(data)} items successfully")
except Exception as e:
   print(f"Error inserting data: {e}")
   raise
<h3 id="向量相似度搜索"><strong>向量相似度搜索</strong></h3>
results = collection.query(
               query_texts=,
               n_results=3,
               include=["documents", "metadatas", "distances"]
            )
<h3 id="统计-collection-中的数据情况"><strong>统计 Collection 中的数据情况</strong></h3>
def get_database_stats(collection) -> Dict:
"""Get statistics about the collection."""
try:
   results = collection.get(limit=10000, include=["metadatas"])
   ids = results.get('ids', []) if isinstance(results, dict) else []
   metadatas = results.get('metadatas', []) if isinstance(results, dict) else []

   unique_files = {m.get('source_file') for m in metadatas if m and m.get('source_file')}

   return {
         "total_embeddings": len(ids),
         "unique_source_files": len(unique_files)
   }
except Exception as e:
   print(f"Error getting database stats: {e}")
   return {"total_embeddings": 0, "unique_source_files": 0}
<h2 id="构建-rag-系统"><strong>构建 RAG 系统</strong></h2>
<p>本模块实现了 RAG 系统的检索功能。通过将用户提出的问题转换为嵌入向量，利用 seekdb 提供的原生向量搜索能力，快速检索出与问题最相关的文档片段，为后续的生成模型提供必要的上下文信息。</p>
<h3 id="导入数据"><strong>导入数据</strong></h3>
<p>我们使用 pyseekdb 的 SDK 文档作为示例，您也可以使用自己的 Markdown 文档或者目录。</p>
<p>运行数据导入脚本：</p>
# 导入单个文档
uv run python seekdb_insert.py ../../README.md

# 或导入目录下的所有 Markdown 文档
uv run python seekdb_insert.py path/to/your_dir
<h3 id="启动应用"><strong>启动应用</strong></h3>
<p>在 <code>pyseekdb/demo/rag</code> 路径下执行如下命令，通过 Streamlit 启动应用：</p>
uv run streamlit run seekdb_app.py --server.port your_port
<p>使用 IP 和端口号（默认为 <code>8501</code>，可通过 <code>--server.port</code> 选项自定义）即可在浏览器中打开 RAG 界面。</p>
<p>提示：如果使用 <code>uv</code> 作为包管理器，请在命令前加上 <code>uv run</code> 前缀，以确保使用正确的 Python 环境和依赖。</p>
<p></p>
<h2 id="实际运行示例"><strong>实际运行示例</strong></h2>
<p>以下是一个真实的查询示例，展示了 seekdb RAG 系统的实际效果：</p>
<p>用户问题：“What is seekdb's vector search?”</p>
<p>检索到的相关文档：</p>
1. 30.experience-vector-retrieval.md
Vector search overview In today's era of information explosion, users often need to quickly retrieve the information they need from massive amounts o...

2. 20.using-seekdb-in-python-mode.md
Experience vector search SeekDB supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance ca...

3. 70.experience-hybrid-vector-index.md
Summary Through this tutorial, you have mastered the core features of SeekDB's hybrid vector index: Simplified usage process: Achieve semantic ret...
<p>使用 LLM 生成的答案：</p>
<p>SeekDB's vector search is a powerful feature that enables efficient and accurate retrieval of data by encoding various types of data—such as text, images, and audio—into mathematical vectors and performing searches in a vector space. This approach captures deep semantic relationships within the data, going beyond traditional keyword-based search methods to deliver more relevant results.</p>
<p>Key capabilities of SeekDB's vector search include:</p>
<ul>
<li>High-dimensional vector support: It supports up to 16,000 dimensions for float-type dense vectors and sparse vectors.</li>
<li>Multiple distance metrics: It supports various similarity or distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance.</li>
<li>Advanced indexing methods: Vector indexes can be created using HNSW (Hierarchical Navigable Small World) or IVF (Iterative Virtual File), enabling fast approximate nearest neighbor searches.</li>
<li>Incremental updates: The system allows for real-time insertion and deletion of vectors without compromising search accuracy or recall.</li>
<li>Hybrid search with scalar filtering: Users can combine vector similarity search with traditional attribute-based (scalar) filtering for more precise results.</li>
<li>Flexible access interfaces: SeekDB supports SQL access via MySQL protocol clients in multiple programming languages, as well as a Python SDK.</li>
<li>Automatic embedding and hybrid indexing: With hybrid vector index features, users can store raw text directly—the system automatically converts it into vectors and builds indexes.</li>
</ul>
<p>In summary, SeekDB's vector search provides a comprehensive, high-performance solution for semantic search, particularly valuable in AI applications involving large-scale unstructured data.</p>
<p>这个示例展示了：</p>
<ul>
<li>✅ 准确的信息检索：系统成功从文档中找到了相关信息</li>
<li>✅ 多文档整合：从 3 个不同文档中提取和整合信息</li>
<li>✅ 语义匹配：准确匹配了“vector search”相关的文档</li>
<li>✅ 结构化回答：AI 将检索到的信息整理成清晰的结构</li>
<li>✅ 完整性：涵盖了 seekdb 向量搜索的主要特性</li>
<li>✅ 专业性：回答包含了技术细节和实际应用价值</li>
</ul>
<p>检索质量分析：</p>
<ul>
<li>最相关文档 : <code>experience-vector-retrieval.md</code> - 向量搜索概览</li>
<li>技术细节 : <code>using-seekdb-in-python-mode.md</code> - 具体的技术规格</li>
<li>高级特性 : <code>experience-hybrid-vector-index.md</code> - 混合向量索引功能</li>
</ul>
<h2 id="快速体验"><strong>快速体验</strong></h2>
<p>如需快速体验 seekdb RAG 系统，请参考 <strong>快速部署</strong>。</p>
<p><strong>参考资料</strong></p>
<p></p>
<p>https://dashscope.aliyuncs.com/compatible-mode/v1: <em>https://dashscope.aliyuncs.com/compatible-mode/v1</em></p>
<p></p>
<p>https://dashscope.aliyuncs.com/compatible-mode/v1: <em>https://dashscope.aliyuncs.com/compatible-mode/v1</em></p>
<p><br>
快速部署: <em>https://github.com/oceanbase/pyseekdb/blob/main/demo/rag/README_CN.md</em></p>
<p><br>
seekdb 项目地址：https://github.com/oceanbase/seekdb</p><br>来源：程序园用户自行投稿发布，如果侵权，请联系站长删除<br>免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！

柏雅云 发表于 2025-12-9 06:08:01

东西不错很实用谢谢分享

祝娜娜 发表于 2025-12-16 10:02:57

感谢分享，下载保存了，貌似很强大

挫莉虻 发表于 2026-1-12 09:09:46

感谢分享，学习下。

榷另辑 发表于 2026-1-18 00:12:23

感谢分享

缢闸发表于 2026-1-18 10:43:59

感谢，下载保存了

尹心菱 发表于 5 天前

谢谢分享，试用一下

万妙音 发表于 4 天前

鼓励转贴优秀软件安全工具和文档！

呈步发表于 3 天前

感谢分享

莘度发表于前天 03:24

喜欢鼓捣这些软件，现在用得少，谢谢分享！

澹台忆然 发表于前天 04:24

前排留名，哈哈哈

捡嫌发表于 16 小时前

谢谢分享，辛苦了

仁夹篇 发表于 6 小时前

这个有用。

页: [1]

程序园's Archiver

喂饭级教程 —— 基于 OceanBase seekdb 构建 RAG 应用