
Step-by-Step Tutorial: Building a RAG Application with OceanBase seekdb

靳谷雪 · 3 days ago

This is another hand-holding tutorial that walks through, step by step, how to build a RAG (Retrieval-Augmented Generation) system with OceanBase seekdb.

A RAG system combines a retrieval system with a generative model to produce new text from a given prompt. The system first uses seekdb's native vector search to retrieve relevant documents from the corpus, then feeds the retrieved documents to a generative model to produce the answer.

Prerequisites

  • Python 3.11 or later installed
  • uv installed
  • An LLM API key ready

Preparation

Clone the code

git clone https://github.com/oceanbase/pyseekdb.git
cd pyseekdb/demo/rag

Set up the environment

Install dependencies

Base installation (for the default or api embedding types):

uv sync

Local model (for the local embedding type):

uv sync --extra local

Tips:

  • The local extra includes sentence-transformers and related dependencies (roughly 2-3 GB).
  • If you are in mainland China, you can speed up downloads with a domestic mirror:
    • Base install (Tsinghua mirror): uv sync --index-url https://pypi.tuna.tsinghua.edu.cn/simple
    • Base install (Aliyun mirror): uv sync --index-url https://mirrors.aliyun.com/pypi/simple
    • Local model (Tsinghua mirror): uv sync --extra local --index-url https://pypi.tuna.tsinghua.edu.cn/simple
    • Local model (Aliyun mirror): uv sync --extra local --index-url https://mirrors.aliyun.com/pypi/simple

Set environment variables

Step 1: Copy the environment variable template

cp .env.example .env

Step 2: Edit the .env file and set the environment variables

The system supports three embedding function types; pick whichever fits your needs:

  1. default (the default; recommended for beginners)
     • Uses pyseekdb's built-in DefaultEmbeddingFunction (ONNX-based)
     • Downloads its model automatically on first use; no API key needed
     • Good for local development and testing
  2. local (local model)
     • Uses a custom sentence-transformers model
     • Requires installing sentence-transformers
     • Model name and device (CPU/GPU) are configurable
  3. api (API service)
     • Uses an OpenAI-compatible embedding API (e.g. DashScope, OpenAI)
     • Requires an API key and a model name
     • Suited to production environments

The following uses Tongyi Qianwen as an example (the api type):

# Embedding function type: api, local, default
EMBEDDING_FUNCTION_TYPE=api

# LLM configuration (used to generate answers)
OPENAI_API_KEY=sk-your-dashscope-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus

# Embedding API configuration (only needed when EMBEDDING_FUNCTION_TYPE=api)
EMBEDDING_API_KEY=sk-your-dashscope-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL_NAME=text-embedding-v4

# Local model configuration (only needed when EMBEDDING_FUNCTION_TYPE=local)
SENTENCE_TRANSFORMERS_MODEL_NAME=all-mpnet-base-v2
SENTENCE_TRANSFORMERS_DEVICE=cpu

# seekdb configuration
SEEKDB_DIR=./data/seekdb_rag
SEEKDB_NAME=test
COLLECTION_NAME=embeddings

Environment variable reference:

  • EMBEDDING_FUNCTION_TYPE: embedding function type. Default: default (options: api, local, default). Required.
  • OPENAI_API_KEY: LLM API key (works with OpenAI, Tongyi Qianwen, and other compatible services). No default; required (used for answer generation).
  • OPENAI_BASE_URL: LLM API base URL. Example: https://dashscope.aliyuncs.com/compatible-mode/v1 [1]. Optional.
  • OPENAI_MODEL_NAME: language model name. Example: qwen-plus. Optional.
  • EMBEDDING_API_KEY: embedding API key. Required when EMBEDDING_FUNCTION_TYPE=api.
  • EMBEDDING_BASE_URL: embedding API base URL. Example: https://dashscope.aliyuncs.com/compatible-mode/v1 [2]. Optional when EMBEDDING_FUNCTION_TYPE=api.
  • EMBEDDING_MODEL_NAME: embedding model name. Example: text-embedding-v4. Required when EMBEDDING_FUNCTION_TYPE=api.
  • SENTENCE_TRANSFORMERS_MODEL_NAME: local model name. Default: all-mpnet-base-v2. Optional when EMBEDDING_FUNCTION_TYPE=local.
  • SENTENCE_TRANSFORMERS_DEVICE: device the local model runs on. Default: cpu. Optional when EMBEDDING_FUNCTION_TYPE=local.
  • SEEKDB_DIR: seekdb database directory. Default: ./data/seekdb_rag. Optional.
  • SEEKDB_NAME: database name. Default: test. Optional.
  • COLLECTION_NAME: name of the embeddings collection. Default: embeddings. Optional.

Tips:

  • With the default type, you only need EMBEDDING_FUNCTION_TYPE=default plus the LLM-related variables.
  • With the api type, you additionally need to set the Embedding API variables.
  • With the local type, you need the sentence-transformers library installed, and may optionally configure the model name. A minimal configuration example for the default type follows below.
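
For reference, a minimal .env for the default type might look like the following sketch (the key is a placeholder; the base URL and model name reuse the DashScope examples above):

# Minimal configuration for the default embedding type (illustrative values)
EMBEDDING_FUNCTION_TYPE=default

# LLM configuration (used to generate answers)
OPENAI_API_KEY=sk-your-llm-key
OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENAI_MODEL_NAME=qwen-plus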

Key modules

Initializing the LLM client

We initialize the LLM client from environment variables:

import os
from openai import OpenAI

def get_llm_client() -> OpenAI:
    """Initialize LLM client using OpenAI-compatible API."""
    return OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_BASE_URL"),
    )
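As a minimal usage sketch (assuming the environment variables above are already loaded, e.g. via python-dotenv), the returned client is a standard OpenAI-compatible client and can be called directly; the question text here is made up:

# Hypothetical usage: ask the configured chat model a question directly.
client = get_llm_client()
response = client.chat.completions.create(
    model=os.getenv("OPENAI_MODEL_NAME", "qwen-plus"),
    messages=[{"role": "user", "content": "What is seekdb?"}],
)
print(response.choices[0].message.content)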

Creating the database connection

# NOTE: the import path for Client is assumed here; adjust to how your pyseekdb version exposes it.
from pyseekdb import Client

# Simple cache so repeated calls reuse the same embedded connection per (path, database)
_client_cache = {}

def get_seekdb_client(db_dir: str = "./seekdb_rag", db_name: str = "test"):
    """Initialize seekdb client (embedded mode)."""
    cache_key = (db_dir, db_name)
    if cache_key not in _client_cache:
        print(f"Connecting to seekdb: path={db_dir}, database={db_name}")
        _client_cache[cache_key] = Client(path=db_dir, database=db_name)
        print("seekdb client connected successfully")
    return _client_cache[cache_key]
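A quick usage sketch, reading the location from the same environment variables as the demo (the fallback values are illustrative):

import os

# Repeated calls with the same (path, database) pair reuse the cached embedded connection.
client = get_seekdb_client(
    db_dir=os.getenv("SEEKDB_DIR", "./data/seekdb_rag"),
    db_name=os.getenv("SEEKDB_NAME", "test"),
)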

A factory for custom embedding models

By setting EMBEDDING_FUNCTION_TYPE in the .env file you can switch between different embedding_function implementations. You can also use this example as a reference for writing your own embedding_function:

from pyseekdb import EmbeddingFunction, DefaultEmbeddingFunction
from typing import List, Union
import os
from openai import OpenAI

Documents = Union[str, List[str]]
Embeddings = List[List[float]]


class SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using sentence-transformers with a specific model.
    """
    def __init__(self, model_name: str = "all-mpnet-base-v2", device: str = "cpu"):  # TODO: your own model name and device
        """
        Initialize the sentence-transformer embedding function.

        Args:
            model_name: Name of the sentence-transformers model to use
            device: Device to run the model on ('cpu' or 'cuda')
        """
        self.model_name = model_name or os.environ.get('SENTENCE_TRANSFORMERS_MODEL_NAME')
        self.device = device or os.environ.get('SENTENCE_TRANSFORMERS_DEVICE')
        self._model = None
        self._dimension = None

    def _ensure_model_loaded(self):
        """Lazy load the embedding model"""
        if self._model is None:
            try:
                from sentence_transformers import SentenceTransformer
                self._model = SentenceTransformer(self.model_name, device=self.device)
                # Get dimension from model
                test_embedding = self._model.encode(["test"], convert_to_numpy=True)
                self._dimension = len(test_embedding[0])
            except ImportError:
                raise ImportError(
                    "sentence-transformers is not installed. "
                    "Please install it with: pip install sentence-transformers"
                )

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the given documents.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        self._ensure_model_loaded()
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Generate embeddings
        embeddings = self._model.encode(
            input,
            convert_to_numpy=True,
            show_progress_bar=False
        )
        # Convert numpy arrays to lists
        return [embedding.tolist() for embedding in embeddings]


class OpenAIEmbeddingFunction(EmbeddingFunction[Documents]):
    """
    A custom embedding function using Embedding API.
    """
    def __init__(self, model_name: str = "", api_key: str = "", base_url: str = ""):
        """
        Initialize the Embedding API embedding function.

        Args:
            model_name: Name of the Embedding API embedding model
            api_key: Embedding API key (if not provided, uses EMBEDDING_API_KEY env var)
        """
        self.model_name = model_name or os.environ.get('EMBEDDING_MODEL_NAME')
        self.api_key = api_key or os.environ.get('EMBEDDING_API_KEY')
        self.base_url = base_url or os.environ.get('EMBEDDING_BASE_URL')
        self._dimension = None
        if not self.api_key:
            raise ValueError("Embedding API key is required")

    def _ensure_model_loaded(self):
        """Lazy load the Embedding API model"""
        try:
            client = OpenAI(
                api_key=self.api_key,
                base_url=self.base_url
            )
            response = client.embeddings.create(
                model=self.model_name,
                input=["test"]
            )
            self._dimension = len(response.data[0].embedding)
        except Exception as e:
            raise ValueError(f"Failed to load Embedding API model: {e}")

    @property
    def dimension(self) -> int:
        """Get the dimension of embeddings produced by this function"""
        self._ensure_model_loaded()
        return self._dimension

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings using Embedding API.

        Args:
            input: Single document (str) or list of documents (List[str])

        Returns:
            List of embedding vectors
        """
        # Handle single string input
        if isinstance(input, str):
            input = [input]
        # Handle empty input
        if not input:
            return []
        # Call Embedding API
        client = OpenAI(
            api_key=self.api_key,
            base_url=self.base_url
        )
        response = client.embeddings.create(
            model=self.model_name,
            input=input
        )
        # Extract Embedding API embeddings
        embeddings = [item.embedding for item in response.data]
        return embeddings


def create_embedding_function() -> EmbeddingFunction:
    embedding_function_type = os.environ.get('EMBEDDING_FUNCTION_TYPE')
    if embedding_function_type == "api":
        print("Using OpenAI Embedding API embedding function")
        return OpenAIEmbeddingFunction()
    elif embedding_function_type == "local":
        print("Using SentenceTransformer embedding function")
        return SentenceTransformerCustomEmbeddingFunction()
    elif embedding_function_type == "default":
        print("Using Default embedding function")
        return DefaultEmbeddingFunction()
    else:
        raise ValueError(f"Unsupported embedding function type: {embedding_function_type}")

Creating the Collection

In get_or_create_collection() we pass in the embedding_function. After that, the collection's add() and query() methods no longer need vectors passed in; you only pass text, and the vectors are generated automatically by the embedding_function.

def get_seekdb_collection(client, collection_name: str = "embeddings",
                          embedding_function: Optional[EmbeddingFunction] = DefaultEmbeddingFunction(),
                          drop_if_exists: bool = True):
    """
    Get or create a collection using pyseekdb's get_or_create_collection.

    Args:
        client: seekdb client instance
        collection_name: Name of the collection
        embedding_function: Embedding function (required for automatic embedding generation)
        drop_if_exists: Whether to drop existing collection if it exists

    Returns:
        Collection object
    """
    if drop_if_exists and client.has_collection(collection_name):
        print(f"Collection '{collection_name}' already exists, deleting old data...")
        client.delete_collection(collection_name)
    if embedding_function is None:
        raise ValueError("embedding_function is required")
    # Use pyseekdb's native get_or_create_collection
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embedding_function
    )
    print(f"Collection '{collection_name}' ready!")
    return collection
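To make the point concrete, here is a minimal sketch (document text, ids, and paths are made up) showing that only text is handed to the collection once an embedding function is attached:

# Build the embedding function selected via EMBEDDING_FUNCTION_TYPE and create the collection.
embedding_fn = create_embedding_function()
client = get_seekdb_client(db_dir="./data/seekdb_rag", db_name="test")
collection = get_seekdb_collection(client, "embeddings", embedding_function=embedding_fn)

# Only text is passed in; the collection's embedding_function produces the vectors.
collection.add(
    ids=["example.md_0"],
    documents=["seekdb provides native vector search."],
    metadatas=[{"source_file": "example.md", "chunk_index": 0}],
)
results = collection.query(query_texts=["What does seekdb provide?"], n_results=1)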

The core insert function

def insert_embeddings(collection, data: List[Dict[str, Any]]):
    """
    Insert data into collection. Embeddings are automatically generated by collection's embedding_function.

    Args:
        collection: Collection object (must have embedding_function configured)
        data: List of data dictionaries containing 'text', 'source_file', 'chunk_index'
    """
    try:
        ids = [f"{item['source_file']}_{item.get('chunk_index', 0)}" for item in data]
        documents = [item['text'] for item in data]
        metadatas = [{'source_file': item['source_file'],
                      'chunk_index': item.get('chunk_index', 0)} for item in data]
        # Collection's embedding_function will automatically generate embeddings from documents
        collection.add(
            ids=ids,
            documents=documents,
            metadatas=metadatas
        )
        print(f"Inserted {len(data)} items successfully")
    except Exception as e:
        print(f"Error inserting data: {e}")
        raise
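For reference, a hypothetical sketch of how input data for insert_embeddings might be prepared from a Markdown file; the actual demo does this in seekdb_insert.py, and the fixed-size chunking below is a simplification:

from pathlib import Path
from typing import Any, Dict, List

def load_markdown_chunks(path: str, chunk_size: int = 1000) -> List[Dict[str, Any]]:
    """Split one Markdown file into fixed-size chunks in the shape insert_embeddings expects."""
    text = Path(path).read_text(encoding="utf-8")
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {"text": chunk, "source_file": Path(path).name, "chunk_index": i}
        for i, chunk in enumerate(chunks)
    ]

insert_embeddings(collection, load_markdown_chunks("README.md"))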

Vector similarity search

results = collection.query(
    query_texts=[question],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)
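The retrieved chunks then become the context for the generation step. A hedged sketch of how the retrieval result and the LLM client from earlier might be combined; the result structure and prompt wording here are assumptions, not the demo's exact code:

import os

# Flatten the retrieved chunks into a single context string.
# (Assumes query() returns a dict-style result keyed by "documents",
#  similar to how collection.get() is consumed in the stats helper below.)
retrieved_docs = results["documents"][0]
context = "\n\n".join(retrieved_docs)

# Ask the LLM to answer using only the retrieved context; the prompt wording is illustrative.
llm = get_llm_client()
completion = llm.chat.completions.create(
    model=os.getenv("OPENAI_MODEL_NAME", "qwen-plus"),
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)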

Collection statistics

def get_database_stats(collection) -> Dict[str, Any]:
    """Get statistics about the collection."""
    try:
        results = collection.get(limit=10000, include=["metadatas"])
        ids = results.get('ids', []) if isinstance(results, dict) else []
        metadatas = results.get('metadatas', []) if isinstance(results, dict) else []
        unique_files = {m.get('source_file') for m in metadatas if m and m.get('source_file')}
        return {
            "total_embeddings": len(ids),
            "unique_source_files": len(unique_files)
        }
    except Exception as e:
        print(f"Error getting database stats: {e}")
        return {"total_embeddings": 0, "unique_source_files": 0}

Building the RAG system

This part implements the retrieval side of the RAG system. The user's question is converted into an embedding vector, and seekdb's native vector search is used to quickly retrieve the document chunks most relevant to the question, which then provide the context the generation model needs.

Importing data

We use the pyseekdb SDK documentation as the example corpus; you can also use your own Markdown files or a directory of them.

Run the data import script:

# Import a single document
uv run python seekdb_insert.py ../../README.md

# Or import all Markdown documents in a directory
uv run python seekdb_insert.py path/to/your_dir

Launching the application

From the pyseekdb/demo/rag directory, run the following command to start the app with Streamlit:

uv run streamlit run seekdb_app.py --server.port your_port

You can then open the RAG interface in your browser using the host IP and port (8501 by default; customizable via the --server.port option).

Tip: if you use uv as the package manager, prefix commands with uv run so that the correct Python environment and dependencies are used.

A real-world example

Here is a real query that shows what the seekdb RAG system produces in practice.

User question: "What is seekdb's vector search?"

Retrieved documents:

1. 30.experience-vector-retrieval.md
   Vector search overview In today's era of information explosion, users often need to quickly retrieve the information they need from massive amounts o...
2. 20.using-seekdb-in-python-mode.md
   Experience vector search SeekDB supports up to 16,000 dimensions of float-type dense vectors, sparse vectors, and various types of vector distance ca...
3. 70.experience-hybrid-vector-index.md
   Summary Through this tutorial, you have mastered the core features of SeekDB's hybrid vector index: Simplified usage process: Achieve semantic ret...

Answer generated by the LLM:

SeekDB's vector search is a powerful feature that enables efficient and accurate retrieval of data by encoding various types of data—such as text, images, and audio—into mathematical vectors and performing searches in a vector space. This approach captures deep semantic relationships within the data, going beyond traditional keyword-based search methods to deliver more relevant results.

Key capabilities of SeekDB's vector search include:

  • High-dimensional vector support: It supports up to 16,000 dimensions for float-type dense vectors and sparse vectors.
  • Multiple distance metrics: It supports various similarity or distance calculations, including Manhattan distance, Euclidean distance, inner product, and cosine distance.
  • Advanced indexing methods: Vector indexes can be created using HNSW (Hierarchical Navigable Small World) or IVF (Iterative Virtual File), enabling fast approximate nearest neighbor searches.
  • Incremental updates: The system allows for real-time insertion and deletion of vectors without compromising search accuracy or recall.
  • Hybrid search with scalar filtering: Users can combine vector similarity search with traditional attribute-based (scalar) filtering for more precise results.
  • Flexible access interfaces: SeekDB supports SQL access via MySQL protocol clients in multiple programming languages, as well as a Python SDK.
  • Automatic embedding and hybrid indexing: With hybrid vector index features, users can store raw text directly—the system automatically converts it into vectors and builds indexes.

In summary, SeekDB's vector search provides a comprehensive, high-performance solution for semantic search, particularly valuable in AI applications involving large-scale unstructured data.

This example demonstrates:

  • ✅ Accurate retrieval: the system found the relevant information in the corpus
  • ✅ Multi-document synthesis: information was extracted and combined from 3 different documents
  • ✅ Semantic matching: documents related to "vector search" were matched accurately
  • ✅ Structured answers: the AI organized the retrieved information into a clear structure
  • ✅ Completeness: the answer covers the main features of seekdb's vector search
  • ✅ Technical depth: the answer includes technical details and practical value

Retrieval quality at a glance:

  • Most relevant document: experience-vector-retrieval.md, the vector search overview
  • Technical details: using-seekdb-in-python-mode.md, the concrete technical specifications
  • Advanced features: experience-hybrid-vector-index.md, the hybrid vector index capabilities

Quick start

To try the seekdb RAG system quickly, see Quick Deployment [3].

References

[1] https://dashscope.aliyuncs.com/compatible-mode/v1
[2] https://dashscope.aliyuncs.com/compatible-mode/v1
[3] Quick Deployment: https://github.com/oceanbase/pyseekdb/blob/main/demo/rag/README_CN.md
[4] seekdb project repository: https://github.com/oceanbase/seekdb

