Elasticsearch 搜索打分计算原理浅析
admin
2023-02-16 13:40:04
0

搜索打分计算几个关键词

  • TF: token frequency ,某个搜索字段分词后再document中字段(待搜索的字段)中出现的次数

  • IDF:inverse document frequency,逆文档频率,某个搜索的字段在所有document中出现的次数取反

  • TFNORM:token frequency normalized,词频归一化
  • BM25:算法:(freq + k1 * (1 - b + b * dl / avgdl))

两个文档如下:

{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "321697",
        "_score" : 6.6273837,
        "_source" : {
          "title" : "Steve Jobs"
      }
}
{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "23706",
        "_score" : 6.0948296,
        "_source" : {
          "title" : "All About Steve"
      }
}

如果我们通过titlematch查询

GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

那么从打分结果就可以看出第一个文档打分高于第二个,这个具体原因是:

TF方面看在带搜索字段上出现的频率一致

IDF方面看在整个文档中出现的频率一致

TFNORM方面则不一样了,第一个文档中该词占比为1/2,第二个文档中该词占比为1/3,故而第一个文档在该搜索下打分比第二个索引高,所以ES算法时使用了TFNORM计算方式freq / (freq + k1 * (1 - b + b * dl / avgdl))

最后的ES中的TF算法融合了词频归一化BM25

如果我们要查看具体Elasticsearch一个打分算法,则可以通过如下命令展示

GET /movies/_search
{
  // 和MySQL的执行计划类似
  "explain": true, 
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

执行结果,查看其中一个

{
    "_shard": "[movies][1]",
    "_node": "pqNhgutvQfqcLqLEzIDnbQ",
    "_index": "movies",
    "_type": "_doc",
    "_id": "321697",
    "_score": 6.6273837,
    "_source": {
        "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
        "voteAverage": 6.8,
        "keywords": [
            {
                "id": 5565,
                "name": "biography"
            },
            {
                "id": 6104,
                "name": "computer"
            },
            {
                "id": 15300,
                "name": "father daughter relationship"
            },
            {
                "id": 157935,
                "name": "apple computer"
            },
            {
                "id": 161160,
                "name": "steve jobs"
            },
            {
                "id": 185722,
                "name": "based on true events"
            }
        ],
        "releaseDate": "2015-01-01T00:00:00.000Z",
        "runtime": 122,
        "originalLanguage": "en",
        "title": "Steve Jobs",
        "productionCountries": [
            {
                "iso_3166_1": "US",
                "name": "United States of America"
            }
        ],
        "revenue": 34441873,
        "genres": [
            {
                "id": 18,
                "name": "Drama"
            },
            {
                "id": 36,
                "name": "History"
            }
        ],
        "originalTitle": "Steve Jobs",
        "popularity": 53.670525,
        "tagline": "Can a great man be a good man?",
        "spokenLanguages": [
            {
                "iso_639_1": "en",
                "name": "English"
            }
        ],
        "id": 321697,
        "voteCount": 1573,
        "productionCompanies": [
            {
                "name": "Universal Pictures",
                "id": 33
            },
            {
                "name": "Scott Rudin Productions",
                "id": 258
            },
            {
                "name": "Legendary Pictures",
                "id": 923
            },
            {
                "name": "The Mark Gordon Company",
                "id": 1557
            },
            {
                "name": "Management 360",
                "id": 4220
            },
            {
                "name": "Cloud Eight Films",
                "id": 6708
            }
        ],
        "budget": 30000000,
        "homepage": "http://www.stevejobsthefilm.com",
        "status": "Released"
    },
    -          }
                ]
            }
        ]
    }
}

此时可以看到结果多出了以下的一组数据(执行计划)

{
    "_explanation": {
        "value": 6.6273837,
        // title字段值steve在所有匹配的1526个文档中的权重
        "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
        "details": [
            {
                // value = idf.value * tf.value * 2.2
                // 6.6273837 = 6.4412656 * 0.46767938 * 2.2
                "value": 6.6273837,
                "description": "score(freq=1.0), product of:",
                "details": [
                    {
                        "value": 2.2,
                        // 放大因子,这个数值可以在创建索引的时候指定,默认值是2.2
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 6.4412656,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1567,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.46767938,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            // 这块提现了BM25算法((freq + k1 * (1 - b + b * dl / avgdl)))
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            // 这块就可以提现出一个归一化的操作算法
                            {
                                "value": 2,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 2.1474154,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

相关内容

热门资讯

美官员:美商船穿越霍尔木兹海峡... 当地时间5月5日,央视记者获悉,两艘搭载美军安全队员的美国商船在通过霍尔木兹海峡期间曾遭伊朗袭击。美...
日本参议员:对俄制裁损害日本国... 正在俄罗斯访问的日本国会参议员铃木宗男5月5日对媒体表示,日本对俄制裁同样损害了日本国家利益。铃木说...
美国务卿称美国正推进对伊朗“极... △美国国务卿鲁比奥(资料图)当地时间5月5日,美国国务卿鲁比奥在媒体简报会上称,美军正在霍尔木兹海峡...
伊朗外交部:敦促美方在外交问题... △伊朗外交部发言人巴加埃(资料图)据伊朗方面5月5日消息,伊朗外交部发言人巴加埃就当前伊美谈判进程表...
就在明晚,“极大雨”要来了! 据新华社消息,拥有哈雷彗星“血统”的宝瓶座η流星雨将于5月6日迎来极大,流星雨爱好者可在6日、7日夜...
原创 O... OPPO新机继续丰富,前有OPPO Find X9 Ultra、旗舰平板、小屏幕平板等,现有OPPO...
馆校合作丨南充科技馆走进仪陇县... 馆校合作 南充科技馆走进 NCSTM 仪陇县实验学校 天府科普研学游 4月29日上午,南充科技馆科普...
我国本土发现的首块月球陨石有重... 我国本土发现的首块月球陨石揭示了月球两次关键地质事件,并发现一种月球新矿物。 2026年世界地球日,...
马斯克的GPU也在摸鱼?狂囤几... 新智元报道 编辑:元宇 【新智元导读】马斯克囤了几十万张卡,结果只跑了11%?据媒体报道,xAI的...
原创 特... 4月24日,白宫以总统人事办公室的名义,向美国国家科学委员会的22名在任委员群发了一封冷冰冰的电子邮...