How to understand span-not query in Elasticsearch

Posted by Echo Yuan on May 26, 2021

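I have been learning Elasticsearch recently, and the span_not query left me completely lost; the official documentation does not give a detailed example either. It nagged at me like a bone stuck in my throat.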

After some searching and hands-on experimenting, I worked out a few things.

First, define the mapping:
PUT /span_not_query_test
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}
Then index two documents:
PUT /span_not_query_test/_doc/1
{
  "content":"the quick red fox jumps over the sleepy cat"
}

PUT /span_not_query_test/_doc/2
{
  "content":"the quick brown fox jumps over the lazy dog"
}
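Span queries work on term positions, so it helps to see where each term sits in the two documents. The following is a minimal sketch, assuming that for these two punctuation-free sentences the default standard analyzer behaves like lowercasing plus splitting on whitespace. The `positions` helper is hypothetical and is not part of any Elasticsearch client.

```python
def positions(text):
    """Map each term to the 0-based positions where it occurs.

    Assumption: lowercasing + whitespace splitting approximates the
    standard analyzer for these simple sentences.
    """
    result = {}
    for pos, term in enumerate(text.lower().split()):
        result.setdefault(term, []).append(pos)
    return result


doc1 = "the quick red fox jumps over the sleepy cat"
doc2 = "the quick brown fox jumps over the lazy dog"

for name, text in (("doc1", doc1), ("doc2", doc2)):
    print(name, positions(text))
```

In both documents quick is at position 1 and over is at position 5, while the occurs twice, at positions 0 and 6.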
Example 1
POST /span_not_query_test/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_term": {
          "content": {
            "value": "quick"
          }
        }
      },
      "exclude": {
        "span_term": {
          "content": {
            "value": "the"
          }
        }
      }
    }
  }
}

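Conclusion:

If exclude.span_term.content.value == quick, no documents are returned; for any other value, both documents are returned, because no other term occupies the position matched by include.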

Example 2
POST /span_not_query_test/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "content": {
                  "value": "quick"
                }
              }
            },
            {
              "span_term": {
                "content": {
                  "value": "over"
                }
              }
            }
          ],
          "slop": 3,
          "in_order": true
        }
      },
      "exclude": {
        "span_term": {
          "content": {
            "value": "lazy"
          }
        }
      }
    }
  }
}

The results of the experiment, varying only the exclude term:

  1. exclude.span_term.content.value in [quick, fox, jumps, over]: no documents are returned;
  2. exclude.span_term.content.value == red: only the document "the quick brown fox jumps over the lazy dog" is returned;
  3. exclude.span_term.content.value == brown: only the document "the quick red fox jumps over the sleepy cat" is returned;
  4. exclude.span_term.content.value in [the, lazy, dog, sleepy, cat], that is, any term in content that comes before quick or after over: both documents are returned.

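Conclusion:

exclude does not work the way must_not does. It does not first find the docs that match its own condition and then remove them from the results of include. Instead, it only checks, at the level of the match itself, whether the excluded term falls inside the range of positions matched by include.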
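The experiments above can be reproduced with a small position-based simulation. This is a simplified sketch, not the actual Lucene SpanNotQuery implementation: it assumes whitespace tokenization, half-open (start, end) spans, a two-clause ordered span_near, and an exclude that removes any include span it overlaps. All function names are hypothetical.

```python
def tokenize(text):
    # Assumption: lowercasing + whitespace split stands in for the analyzer.
    return text.lower().split()


def term_spans(tokens, term):
    """Half-open (start, end) spans for every occurrence of `term`."""
    return [(i, i + 1) for i, tok in enumerate(tokens) if tok == term]


def near_spans_ordered(tokens, first, second, slop):
    """Spans where `first` precedes `second` with at most `slop` positions between."""
    spans = []
    for s1, e1 in term_spans(tokens, first):
        for s2, e2 in term_spans(tokens, second):
            if s2 >= e1 and s2 - e1 <= slop:
                spans.append((s1, e2))
    return spans


def span_not(include, exclude):
    """Keep the include spans that overlap no exclude span."""
    return [
        (s, e)
        for s, e in include
        if not any(s < xe and xs < e for xs, xe in exclude)
    ]


def matches(text, exclude_term):
    """True if the document would still match Example 2 with this exclude term."""
    tokens = tokenize(text)
    include = near_spans_ordered(tokens, "quick", "over", slop=3)
    exclude = term_spans(tokens, exclude_term)
    return bool(span_not(include, exclude))


docs = {
    "doc1": "the quick red fox jumps over the sleepy cat",
    "doc2": "the quick brown fox jumps over the lazy dog",
}

for term in ["quick", "red", "brown", "over", "the", "lazy"]:
    hits = [name for name, text in docs.items() if matches(text, term)]
    print(term, hits)
```

In this model the include span covers positions 1 through 5 (quick to over) in both documents, so an exclude term only removes a document when it occurs inside that range in that same document. A term such as lazy, which sits outside the range, has no effect even though it appears in doc2.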

Of course, the best approach is still to read the SpanNotQuery source code in Elasticsearch and Lucene.

References
  1. https://stackoverflow.com/questions/24260103/spannotquery-giving-unexpected-results-exclude-is-ignored
  2. https://elasticsearch.cn/article/13677