A Brief Look at How Elasticsearch Computes Search Scores
admin
2023-02-16 13:40:04

Key terms in search scoring

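  • TF: term frequency, the number of times a term from the analyzed query appears in the searched field of a document

  • IDF: inverse document frequency, a weight that shrinks as the number of documents containing the term grows, so rare terms count for more than common ones

  • TFNORM: term frequency normalized, i.e. term frequency adjusted for the length of the field
  • BM25: the ranking algorithm Elasticsearch uses by default; its tf component is freq / (freq + k1 * (1 - b + b * dl / avgdl))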
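The definitions above can be written out as a minimal Python sketch. The function names `bm25_idf` and `bm25_tf` are illustrative only and are not Elasticsearch APIs. The formulas are the ones Elasticsearch prints in its explain output, and the defaults k1 = 1.2 and b = 0.75 are the values reported there.

```python
import math


def bm25_idf(N: int, n: int) -> float:
    """IDF: log(1 + (N - n + 0.5) / (n + 0.5)).

    N: number of documents that have the field
    n: number of documents containing the term
    """
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Length-normalized TF: freq / (freq + k1 * (1 - b + b * dl / avgdl)).

    dl: length of the field in this document (number of terms)
    avgdl: average length of the field across documents
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


if __name__ == "__main__":
    # A rare term gets a higher IDF than a common one.
    print(bm25_idf(N=1000, n=2), bm25_idf(N=1000, n=500))
    # The same single occurrence counts for less in a longer field.
    print(bm25_tf(freq=1, dl=2, avgdl=2.5), bm25_tf(freq=1, dl=10, avgdl=2.5))
```

Because freq appears in both numerator and denominator, the tf value saturates below 1 however often the term repeats, which is what the k1 "term saturation parameter" controls.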

Consider the following two search hits:

{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "321697",
        "_score" : 6.6273837,
        "_source" : {
          "title" : "Steve Jobs"
      }
}
{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "23706",
        "_score" : 6.0948296,
        "_source" : {
          "title" : "All About Steve"
      }
}

Suppose they come from a match query on the title field:

GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

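The scores show that the first document ranks higher than the second. The reasons are as follows:

TF: the term appears the same number of times (once) in the searched field of both documents.

IDF: the same for both documents, because it depends only on how many documents contain the term and not on the individual document (strictly, this holds when both are scored with the same shard statistics).

TFNORM: this is where they differ. The term makes up 1/2 of the first document's title but only 1/3 of the second's, so the first document scores higher for this query. This is the effect of the length-normalized tf that Elasticsearch uses: freq / (freq + k1 * (1 - b + b * dl / avgdl)).

In other words, the tf component in Elasticsearch is BM25's tf, which combines term frequency with length normalization.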
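To make the 1/2 versus 1/3 comparison concrete, the sketch below evaluates the tf formula for a two-term and a three-term title. It assumes k1 = 1.2, b = 0.75, and an average field length of 2.1474154, the figure Elasticsearch reports in the explain output for the first document. `tf_norm` is an illustrative name. The second document may have been scored on a different shard with its own statistics, so this only illustrates the direction of the difference and is not meant to reproduce its exact score.

```python
def tf_norm(freq, dl, avgdl, k1=1.2, b=0.75):
    """BM25 tf: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


AVGDL = 2.1474154  # assumed: average title length taken from the explain output

# "Steve Jobs" has 2 terms, "All About Steve" has 3; "steve" occurs once in each.
tf_steve_jobs = tf_norm(freq=1, dl=2, avgdl=AVGDL)
tf_all_about_steve = tf_norm(freq=1, dl=3, avgdl=AVGDL)

print(f"dl=2: {tf_steve_jobs:.6f}")
print(f"dl=3: {tf_all_about_steve:.6f}")

# With b = 0 length normalization is switched off and both titles tie.
print(tf_norm(1, 2, AVGDL, b=0) == tf_norm(1, 3, AVGDL, b=0))
```

The shorter title yields the larger tf, and setting b to 0 removes the difference entirely, which shows that the gap comes from length normalization alone.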

To see exactly how Elasticsearch arrived at a score, add the explain flag to the request:

GET /movies/_search
{
  // similar to an execution plan (EXPLAIN) in MySQL
  "explain": true, 
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

The response looks like this for one of the hits:

{
    "_shard": "[movies][1]",
    "_node": "pqNhgutvQfqcLqLEzIDnbQ",
    "_index": "movies",
    "_type": "_doc",
    "_id": "321697",
    "_score": 6.6273837,
    "_source": {
        "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
        "voteAverage": 6.8,
        "keywords": [
            {
                "id": 5565,
                "name": "biography"
            },
            {
                "id": 6104,
                "name": "computer"
            },
            {
                "id": 15300,
                "name": "father daughter relationship"
            },
            {
                "id": 157935,
                "name": "apple computer"
            },
            {
                "id": 161160,
                "name": "steve jobs"
            },
            {
                "id": 185722,
                "name": "based on true events"
            }
        ],
        "releaseDate": "2015-01-01T00:00:00.000Z",
        "runtime": 122,
        "originalLanguage": "en",
        "title": "Steve Jobs",
        "productionCountries": [
            {
                "iso_3166_1": "US",
                "name": "United States of America"
            }
        ],
        "revenue": 34441873,
        "genres": [
            {
                "id": 18,
                "name": "Drama"
            },
            {
                "id": 36,
                "name": "History"
            }
        ],
        "originalTitle": "Steve Jobs",
        "popularity": 53.670525,
        "tagline": "Can a great man be a good man?",
        "spokenLanguages": [
            {
                "iso_639_1": "en",
                "name": "English"
            }
        ],
        "id": 321697,
        "voteCount": 1573,
        "productionCompanies": [
            {
                "name": "Universal Pictures",
                "id": 33
            },
            {
                "name": "Scott Rudin Productions",
                "id": 258
            },
            {
                "name": "Legendary Pictures",
                "id": 923
            },
            {
                "name": "The Mark Gordon Company",
                "id": 1557
            },
            {
                "name": "Management 360",
                "id": 4220
            },
            {
                "name": "Cloud Eight Films",
                "id": 6708
            }
        ],
        "budget": 30000000,
        "homepage": "http://www.stevejobsthefilm.com",
        "status": "Released"
    },
    "_explanation": {
        // collapsed here; expanded in the next listing
    }
}

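Compared with a normal search, the hit now carries an additional block of data (the scoring explanation, comparable to an execution plan):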

{
    "_explanation": {
        "value": 6.6273837,
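        // weight of the term "steve" in the title field for the document with internal doc ID 1526 (1526 is a document ID, not a document count)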
        "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
        "details": [
            {
                // value = idf.value * tf.value * 2.2
                // 6.6273837 = 6.4412656 * 0.46767938 * 2.2
                "value": 6.6273837,
                "description": "score(freq=1.0), product of:",
                "details": [
                    {
                        "value": 2.2,
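                        // boost factor: the query boost (1.0 by default) multiplied by (k1 + 1), so 1.0 * (1.2 + 1) = 2.2; k1 can be changed through the index's similarity settings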
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 6.4412656,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1567,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.46767938,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            // k1 and b below are the BM25 parameters used in freq / (freq + k1 * (1 - b + b * dl / avgdl))
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            // dl / avgdl is where the length normalization comes in
                            {
                                "value": 2,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 2.1474154,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
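The numbers in the explanation above can be checked by hand. The following is a minimal sketch, assuming the explain response has already been parsed into a Python dict (the // comments in the listing are annotations and would have to be stripped before `json.loads`). The `leaves` helper is hypothetical, and the two long "computed as" descriptions are abbreviated here.

```python
import math

# Trimmed copy of the _explanation shown above.
explanation = {
    "value": 6.6273837,
    "description": "score(freq=1.0), product of:",
    "details": [
        {"value": 2.2, "description": "boost", "details": []},
        {"value": 6.4412656, "description": "idf, computed as ...", "details": [
            {"value": 2, "description": "n, number of documents containing term", "details": []},
            {"value": 1567, "description": "N, total number of documents with field", "details": []},
        ]},
        {"value": 0.46767938, "description": "tf, computed as ...", "details": [
            {"value": 1, "description": "freq, occurrences of term within document", "details": []},
            {"value": 1.2, "description": "k1, term saturation parameter", "details": []},
            {"value": 0.75, "description": "b, length normalization parameter", "details": []},
            {"value": 2, "description": "dl, length of field", "details": []},
            {"value": 2.1474154, "description": "avgdl, average length of field", "details": []},
        ]},
    ],
}


def leaves(node):
    """Yield (name, value) for every leaf; name is the text before the first comma."""
    if not node["details"]:
        yield node["description"].split(",")[0], node["value"]
    for child in node["details"]:
        yield from leaves(child)


p = dict(leaves(explanation))

# Recompute each factor from the leaf values using the formulas in the descriptions.
idf = math.log(1 + (p["N"] - p["n"] + 0.5) / (p["n"] + 0.5))
tf = p["freq"] / (p["freq"] + p["k1"] * (1 - p["b"] + p["b"] * p["dl"] / p["avgdl"]))
score = p["boost"] * idf * tf

print(f"idf={idf:.7f} tf={tf:.8f} score={score:.7f}")
```

The recomputed values agree with the reported idf, tf and _score up to floating-point rounding, since Elasticsearch reports these values in single precision.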
