The Mystery of the 'Invisible' Field in Easysearch: The source_reuse and ignore_above Trap

Posted by 貊淀 6 days ago

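Background

A while ago, a community member using Easysearch's data compression features found that after enabling source_reuse and ZSTD, the content of one field could no longer be seen.
The index settings were as follows: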
{
......
"settings": {
   "index": {
   "codec": "ZSTD",
   "source_reuse": "true"
   }
},
"mappings": {
   "dynamic_templates": [
   {
      "message_field": {
         "path_match": "message",
         "mapping": {
         "norms": false,
         "type": "text"
         },
         "match_mapping_type": "string"
      }
   },
   {
      "string_fields": {
         "mapping": {
         "norms": false,
         "type": "text",
         "fields": {
            "keyword": {
               "ignore_above": 256,
               "type": "keyword"
            }
         }
         },
         "match_mapping_type": "string",
         "match": "*"
      }
   }
   ]
   ......
}

The resulting multi-field content could be found by search, but it was not visible in the returned document.

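Root Cause Analysis

Let's first walk through the stages a field goes through before it is displayed:

1. When the field is written to the index, both the text field and the keyword sub-field are written.
2. When the inverted index is built for the keyword field, content longer than ignore_above is ignored.
3. Because source_reuse is enabled, the parts of _source that duplicate doc_values or the inverted index are removed.
4. The resulting data files are compressed with ZSTD, which further improves the compression ratio.
5. A query against the inverted index or doc values finds the document.
6. To display it, the content of _source or docvalue_fields is fetched by document ID, but the text content comes back empty.

The ZSTD compression in step 4 operates on the data files and does not modify the data content, so we focus on the other steps.

Reproducing the Problem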

First, this field mapping is a common Elasticsearch setup and does not by itself cause content to go missing from the display.
         "mapping": {
         "type": "text",
         "fields": {
            "keyword": {
               "ignore_above": 256,
               "type": "keyword"
            }
         }
         },

That leaves source_reuse as the main suspect.
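What Happens to _source

The source_reuse feature is described as follows:
source_reuse: enabling the source_reuse setting removes the parts of the _source field that duplicate doc_values or the inverted index, effectively reducing the overall index size. The effect is especially noticeable for log indices.

source_reuse can compress the following data types: keyword, integer, long, short, boolean, float, half_float, double, geo_point, ip. For the text type, the default keyword multi-field mapping must be enabled. All of these types must have doc_values enabled (the default) to be compressed.

In effect, this is a productized treatment of the _source field. The blunt way to shrink an index is to disable _source outright, store the data in other formats, and rebuild the displayed content from doc values or indexed data at query time.
So, does disabling _source lead to the same problem?
The test steps are as follows:
# 1. Create a multi-field (text + keyword) index with _source disabled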

PUT test_source
{
"mappings": {
"_source": {
   "enabled": false
},
"properties": {
   "msg": {
   "type": "text",
   "fields": {
      "keyword": {
         "ignore_above": 256,
         "type": "keyword"
      }
   }
   }
}
}
}

# 2. Write test data

POST test_source/_doc/1
{"msg":""" config contain variables, try to parse with environments
load config files: []
creating pipeline: pipeline_logging_merge
creating pipeline: ingest_pipeline_logging
creating pipeline: async_messages_merge
creating pipeline: metrics_merge
creating pipeline: request_logging_merge
creating pipeline: ingest_merged_requests
creating pipeline: async_ingest_bulk_requests
started module: pipeline
all system module are started
setup floating_ip, root privilege are required
init new queue config:e60457c6eae50a4eabbb62fc1001dccc,bulk_requests
init new queue config:e60457c6eae50a4eabbb62fc1001dccc,bulk_requests
init new queue config:e60457c6eae50a4eabbb62fc1001dccc,bulk_requests
generated new processors: indexing_merge
processing pipeline_v2: metrics_merge
generated new processors: when
processing pipeline_v2: ingest_merged_requests
generated new processors: indexing_merge
processing pipeline_v2: request_logging_merge
generated new processors: indexing_merge
processing pipeline_v2: async_messages_merge
generated new processors: bulk_indexing
processing pipeline_v2: ingest_pipeline_logging
init new queue config:1216c96eb876eee5b177d45436d0a362,gateway-pipeline-logs
generated new processors: bulk_indexing
generated new processors: indexing_merge
processing pipeline_v2: pipeline_logging_merge
processing pipeline_v2: async_ingest_bulk_requests
init badger database
floating_ip entering standby mode
init badger database
refresh low precision time in background
elasticsearch metadata was not found
metadata for is nil
started plugin: floating_ip
started plugin: force_merge
network io stats will be included for map[]
started plugin: metrics
started plugin: statsd
reuse port 0.0.0.0:7005
collecting network metrics
collecting instance metrics
init elasticsearch proxy instance: prod
generated new filters: when, elasticsearch
apply filter flow: [*] [ filters ]
apply filter flow: [*] [/{any_index}/_bulk] [ filters ]
init elasticsearch proxy instance: prod
generated new filters: request_path_limiter, elasticsearch
started plugin: gateway
all user plugin are started
all modules are started
gateway is up and running now.
elasticsearch metadata was not found
metadata for is nil
elasticsearch metadata was not found
metadata for is nil
collecting network metrics
collecting instance metrics
elasticsearch metadata was not found
metadata for is nil
elasticsearch metadata was not found
metadata for is nil
collecting network metrics
collecting instance metrics
elasticsearch metadata was not found"""}

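# 3. Query the data
GET test_source/_search

At this point, the stored document comes back with no content.

The _source field holds the original JSON document body passed at index time. It is not itself indexed into the inverted index (so it plays no part in the query phase); it is only used to fetch document content when a search is executed.
For a text field, disabling _source naturally means the content can no longer be viewed.
A keyword field cannot be read back from _source either once _source is disabled. But a keyword field is not stored only in _source; it also has doc_values. So for keyword fields, you can disable _source and use docvalue_fields to view the content.
The test is as follows:
# 1. Create the test index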
PUT test_source2
{
"mappings": {
"_source": {
   "enabled": false
},
"properties": {
   "msg": {
   "type": "keyword"

   }
}
}
}

# 2. Write data
POST test_source2/_doc
{"msg":"1111111"}

# 3. Query the data with docvalue_fields
POST test_source2/_search
{"docvalue_fields": ["msg"]}

# Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
   "value": 1,
   "relation": "eq"
},
"max_score": 1,
"hits": [
   {
   "_index": "test_source2",
   "_type": "_doc",
   "_id": "yBvTj5kBvrlGDwP29avf",
   "_score": 1,
   "fields": {
      "msg": [
         "1111111"
      ]
   }
   }
]
}
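}

The feature description quoted earlier ("For the text type, the default keyword multi-field mapping must be enabled. All of these types must have doc_values enabled (the default) to be compressed.") also shows that source_reuse depends on doc_values. So does it use doc_values to display content in the same way? And if doc values are what gets displayed, why is the content still invisible?

The ignore_above Setting of keyword

Look closely at the keyword configuration in the problem scenario: it uses ignore_above. Could that be the cause?
We bring the ignore_above setting into the test above. To keep the test simple, ignore_above is set to 3. To make the effect easy to see, we write two values of different lengths, 11 and 1111111, and compare the results.
# 1. Create the test index with ignore_above set to 3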
PUT test_source3
{
"mappings": {
"_source": {
   "enabled": false
},
"properties": {
   "msg": {
   "type": "keyword",
   "ignore_above": 3
   }
}
}
}

# 2. Write data
POST test_source3/_doc
{"msg":"1111111"}

POST test_source3/_doc
{"msg":"11"}

# 3. Query the data with docvalue_fields
POST test_source3/_search
{"docvalue_fields": ["msg"]}

# Response
{
"took": 363,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
   "value": 2,
   "relation": "eq"
},
"max_score": 1,
"hits": [
   {
   "_index": "test_source3",
   "_type": "_doc",
   "_id": "yhvjj5kBvrlGDwP22KsG",
   "_score": 1
   },
   {
   "_index": "test_source3",
   "_type": "_doc",
   "_id": "yxvzj5kBvrlGDwP2Nav6",
   "_score": 1,
   "fields": {
      "msg": [
         "11"
      ]
   }
   }
]
}
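}

OK, the problem is finally reproduced. Now let's look at what the key factor, the ignore_above parameter, is for.
ignore_above: any string longer than this integer value should not be indexed. The default is 2147483647. Default dynamic mapping creates a keyword sub-field with ignore_above set to 256.

In other words, ignore_above keeps over-long values out of the (inverted) index so that indexed terms do not grow too long. The value is skipped as a whole rather than truncated.
Judging from the two test values, however, doc values are created normally for a value within the limit and are skipped for a value that exceeds it (the source-code details behind this are left for another post).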
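
With only two documents the gap is easy to see by eye. For a larger result set, a small helper can list the hits that came back without a value for the requested field. The following is a minimal sketch in Python, assuming the search response has already been parsed into a dict; the function name `hits_missing_field` is hypothetical, and the sample data is a trimmed copy of the test_source3 response.

```python
def hits_missing_field(response, field):
    """Return the _id of every hit whose `fields` section lacks `field`."""
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        # A hit with no doc value for the field has no entry under "fields"
        # (or no "fields" section at all).
        if field not in hit.get("fields", {}):
            missing.append(hit["_id"])
    return missing


# Trimmed copy of the test_source3 response.
response = {
    "hits": {
        "hits": [
            {"_index": "test_source3", "_id": "yhvjj5kBvrlGDwP22KsG", "_score": 1},
            {
                "_index": "test_source3",
                "_id": "yxvzj5kBvrlGDwP2Nav6",
                "_score": 1,
                "fields": {"msg": ["11"]},
            },
        ]
    }
}

print(hits_missing_field(response, "msg"))  # ['yhvjj5kBvrlGDwP22KsG']
```

The one ID reported is the document whose value, 1111111, exceeded ignore_above.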
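So, does ignore_above on the keyword sub-field have the same effect under source_reuse?
Let's go back to the problem scenario, keep source_reuse enabled, set ignore_above on the keyword sub-field, and test:
# 1. Create the test index with source_reuse enabled and ignore_above set to 3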
PUT test_source4
{
"settings": {
"index": {
   "source_reuse": "true"
}
},
"mappings": {
"properties": {
   "msg": {
   "type": "text",
   "fields": {
      "keyword": {
         "ignore_above": 3,
         "type": "keyword"
      }
   }
   }
}
}
}

# 2. Write data
POST test_source4/_doc
{"msg":"1111111"}

POST test_source4/_doc
{"msg":"11"}

# 3. Query the data (plain search; _source is returned)
POST test_source4/_search

# Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
   "value": 2,
   "relation": "eq"
},
"max_score": 1,
"hits": [
   {
   "_index": "test_source4",
   "_type": "_doc",
   "_id": "zBv2j5kBvrlGDwP2_au-",
   "_score": 1,
   "_source": {}
   },
   {
   "_index": "test_source4",
   "_type": "_doc",
   "_id": "zRv2j5kBvrlGDwP2_qsO",
   "_score": 1,
   "_source": {
      "msg": "11"
   }
   }
]
}
}

As you can see, the "invisible" data problem is fully reproduced.
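
The interaction can be summed up in a toy model. The Python sketch below is not Easysearch's implementation; it only encodes two rules assumed from the test results (a value longer than ignore_above never reaches doc values, and source_reuse drops the field from the stored _source regardless). The names `index_doc` and `fetch_source` are hypothetical.

```python
def index_doc(doc, ignore_above, source_reuse=True):
    """Toy write path: return (doc_values, stored_source) for one document."""
    doc_values = {}
    stored_source = dict(doc)
    for field, value in doc.items():
        # Assumed rule 1: a value longer than ignore_above is skipped
        # entirely, so it never reaches doc values.
        if len(value) <= ignore_above:
            doc_values[field] = value
        if source_reuse:
            # Assumed rule 2: the field is dropped from the stored _source
            # on the expectation that doc values can restore it later.
            stored_source.pop(field, None)
    return doc_values, stored_source


def fetch_source(doc_values, stored_source):
    """Toy fetch path: rebuild _source from stored data plus doc values."""
    rebuilt = dict(stored_source)
    rebuilt.update(doc_values)
    return rebuilt


for msg in ("1111111", "11"):
    print(fetch_source(*index_doc({"msg": msg}, ignore_above=3)))
# {}
# {'msg': '11'}
```

Under these assumptions the long value is lost on both paths, which matches the empty _source returned for the first document in test_source4.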
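Summary

From the series of tests above on the "invisible" data problem, we can draw the following conclusions:

1. When using source_reuse compression, keep the keyword field's ignore_above at its default where possible and do not set it too low (this tip has been added to the Easysearch documentation).
2. source_reuse is a productized version of a common compression technique, disabling the _source field. It is effective and convenient for log compression and is worth using more widely.
3. The ignore_above parameter of keyword affects more than the inverted index: a value that exceeds the limit is not written to doc values either.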
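
As a practical follow-up to the first point, an index definition can be checked before it is created. The sketch below is one possible approach in Python, not an Easysearch tool: the function names are hypothetical, it only understands the nested `settings.index.source_reuse` form used in this article, and it inspects only top-level properties and dynamic templates.

```python
def keyword_subfields_with_ignore_above(field_mapping):
    """Yield (subfield_name, ignore_above) for keyword sub-fields of a text field."""
    if field_mapping.get("type") != "text":
        return
    for name, sub in field_mapping.get("fields", {}).items():
        if sub.get("type") == "keyword" and "ignore_above" in sub:
            yield name, sub["ignore_above"]


def risky_fields(index_body):
    """List text fields whose keyword sub-field sets ignore_above while
    source_reuse is enabled. Returns [] when source_reuse is off."""
    index_settings = index_body.get("settings", {}).get("index", {})
    if str(index_settings.get("source_reuse", "false")).lower() != "true":
        return []
    mappings = index_body.get("mappings", {})
    found = []
    # Explicitly mapped top-level fields.
    for field, field_mapping in mappings.get("properties", {}).items():
        for sub, limit in keyword_subfields_with_ignore_above(field_mapping):
            found.append((f"{field}.{sub}", limit))
    # Fields that dynamic templates would create.
    for template in mappings.get("dynamic_templates", []):
        for template_name, body in template.items():
            mapping = body.get("mapping", {})
            for sub, limit in keyword_subfields_with_ignore_above(mapping):
                found.append((f"dynamic_template:{template_name}.{sub}", limit))
    return found


# The test_source4 index definition from this article.
test_source4 = {
    "settings": {"index": {"source_reuse": "true"}},
    "mappings": {
        "properties": {
            "msg": {
                "type": "text",
                "fields": {"keyword": {"ignore_above": 3, "type": "keyword"}},
            }
        }
    },
}

print(risky_fields(test_source4))  # [('msg.keyword', 3)]
```

Any entry in the result is a field where values longer than the reported limit would come back empty, so the limit should be compared against the longest value the field is expected to hold.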
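Special thanks: @牛牪犇群 from the community
For more Easysearch material, see the official documentation.
Author: 金多安, search operations expert at INFINI Labs (极限科技), Elastic Certified, and managing editor of the 搜索客 community daily. He has long worked in search operations, regularly digs into the principles behind ES / Lucene search technology, and keeps a close eye on developments in the field.
Original: https://infinilabs.cn/blog/2025/invisibility-in-easysearch-field/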


Posted by 热琢 5 days ago

Thanks for sharing!
