Tutorial 4: BGE Embedding

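This section uses the bge-m3 model to produce vector representations of sentences and to compute their semantic similarity.

The work is split into the following steps:

  1. Environment setup and application creation

  2. Downloading the model

  3. Using the model

    3.1 Dense retrieval

    3.2 Sparse retrieval

    3.3 Multi-vector retrieval

    3.4 Weighted semantic similarity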

1. Environment setup and application creation

First, create the conda environment from a command line with internet access:

conda create -n tutorial4 python=3.9
conda activate tutorial4
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install peft numpy==1.26.4 matplotlib==3.8.4 ipykernel==6.29.5 transformers==4.42.4

(The PyTorch version must match the CUDA version; see the compatibility list at https://pytorch.org/get-started/previous-versions . Run nvidia-smi to check the CUDA version supported by your driver.)

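Next, create a JupyterLab application. Enter tutorial4 as the Conda environment name; one GPU is the recommended hardware allocation. Once the application has been created, open it and then open this file.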

CUDA Version: 12.1; Torch Version: 2.3.1
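
Before moving on, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of the original tutorial: the helper names `installed_version` and `report_environment` are hypothetical, and it assumes the packages expose standard installation metadata (a package installed without metadata is reported as missing).

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_environment(packages=("torch", "transformers", "numpy", "peft")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report_environment().items():
        print(f"{name}: {version or 'not installed'}")

    # Import torch only when it is present, so the check also runs elsewhere.
    if find_spec("torch") is not None:
        import torch
        print("CUDA available:", torch.cuda.is_available())
        print("CUDA version built into torch:", torch.version.cuda)
```

If `CUDA available` prints False on a GPU node, the PyTorch build and the driver's CUDA version are the first things to compare.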

2. Downloading the model

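We recommend downloading the model from a command line with internet access. Run the commands from the folder that contains this file.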

# If the following directory exists, copy it directly:
cp -r /lustre/public/tutorial/models/models--BAAI--bge-m3/ ./

# Otherwise, download it yourself:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download BAAI/bge-m3 --local-dir models--BAAI--bge-m3

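(We recommend running the download inside tmux. tmux (Terminal Multiplexer) lets you run several terminal sessions inside a single terminal, and its sessions can be detached and reattached at a different time or from a different place, which makes it well suited to remote work and long-running tasks. For installation and an introduction, see https://tmuxcheatsheet.com/how-to-install-tmux ; for usage, see https://tmuxcheatsheet.com)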
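
After the copy or download finishes, a quick check that the directory is in place can save a confusing error later. The sketch below is an illustration only: `missing_model_files` is a hypothetical helper, and the assumption that the directory contains a `config.json` file is mine, not something the tutorial states. Adjust the list of required files to whatever the downloaded directory actually contains.

```python
from pathlib import Path


def missing_model_files(model_dir, required=("config.json",)):
    """Return the names in `required` that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# The directory name matches the --local-dir used in the download command.
missing = missing_model_files("models--BAAI--bge-m3")
if missing:
    print("Model directory is incomplete; missing:", missing)
else:
    print("Model directory looks complete.")
```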

3. Using the model

Run the following in the data shell:

pip install -U FlagEmbedding

3.1 Dense retrieval

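Dense retrieval represents text as low-dimensional, dense vectors by embedding it in a continuous vector space. Because texts with similar meanings map to nearby vectors, it captures semantic similarity and suits fuzzy queries and complex semantic relationships in natural language processing (NLP) tasks. Run the code below.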

from FlagEmbedding import BGEM3FlagModel

# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3',  
                       use_fp16=True)

# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]

# Compute the embeddings
embeddings_1 = model.encode(sentences_1, 
                            batch_size=12, 
                            max_length=8192, 
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']

# Compute the similarity matrix
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# The result should be approximately:
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
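
The matrix product above equals cosine similarity only when the embeddings have unit length. The sketch below illustrates that idea with toy vectors in plain Python. It assumes the dense vectors returned by the model are L2-normalized, which I believe is the FlagEmbedding default but is worth verifying for your installed version. The helper names and the vector values are hypothetical and are not model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(rows_a, rows_b):
    """Dot product of every vector in rows_a with every vector in rows_b."""
    return [[sum(x * y for x, y in zip(a, b)) for b in rows_b] for a in rows_a]


# Toy 2-dimensional "embeddings" (hypothetical values, not model output).
queries = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0])]
passages = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 2.0])]

scores = similarity_matrix(queries, passages)

# For each query, report the passage with the highest similarity.
for i, row in enumerate(scores):
    best = max(range(len(row)), key=row.__getitem__)
    print(f"query {i}: scores={[round(s, 4) for s in row]}, best passage={best}")
```

In the tutorial's own output, the diagonal entries are the largest in each row, which is the same pattern: each question is closest to its matching answer.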

3.2 Sparse retrieval

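Sparse retrieval represents text as high-dimensional, sparse vectors in which most entries are zero. It is computationally efficient and easy to interpret, and it suits short texts and keyword matching. Run the code below.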

from FlagEmbedding import BGEM3FlagModel

# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3',  use_fp16=True) 

# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]

# Compute similarity via lexical matching
output_1 = model.encode(
    sentences_1, return_dense=True, return_sparse=True,
    return_colbert_vecs=False)
output_2 = model.encode(
    sentences_2, return_dense=True, return_sparse=True,
    return_colbert_vecs=False)

lexical_scores = model.compute_lexical_matching_score(
    output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875
print(model.compute_lexical_matching_score(
    output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0

# Inspect the weight of each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252,
# 'M': 0.1702, '3': 0.2695, '?': 0.04092}, 
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633,
# 'BM': 0.2515, '25': 0.3335}]
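
To make the lexical matching score concrete, here is a minimal sketch of how such a score can be computed from two token-weight dictionaries: multiply the weights of every token the two texts share and sum the products. This is my reading of the technique, not a copy of the library code. The function name `lexical_matching_score` is hypothetical, and the `passage` weights are invented for illustration. The two query dictionaries reuse the token weights printed above.

```python
def lexical_matching_score(weights_1, weights_2):
    """Sum of weight products over the tokens that appear in both texts."""
    return sum(w * weights_2[tok] for tok, w in weights_1.items() if tok in weights_2)


# Token weights copied from the output printed above.
query_1 = {"What": 0.08356, "is": 0.0814, "B": 0.1296, "GE": 0.252,
           "M": 0.1702, "3": 0.2695, "?": 0.04092}
query_2 = {"De": 0.05005, "fin": 0.1368, "ation": 0.04498, "of": 0.0633,
           "BM": 0.2515, "25": 0.3335}

# Hypothetical passage weights, invented purely for illustration.
passage = {"B": 0.2, "GE": 0.3, "M": 0.1, "3": 0.25, "is": 0.05}

# The two queries share no tokens, so their score is zero.
print(lexical_matching_score(query_1, query_2))
# The passage shares five tokens with the first query.
print(lexical_matching_score(query_1, passage))
```

The zero score for the two queries is consistent with the 0.0 reported by the tutorial code: texts with no token in common cannot match lexically, however close they are in meaning.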

3.3 Multi-vector retrieval

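Multi-vector retrieval is a hybrid approach that combines the strengths of dense and sparse retrieval: it represents a document or query with several vectors rather than a single one. Run the code below.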

from FlagEmbedding import BGEM3FlagModel

# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3',  use_fp16=True) 

# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]

# Compute similarity with ColBERT-style multi-vector scoring
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True,
                        return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True,
                        return_colbert_vecs=True)

print(model.colbert_score(
    output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(
    output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7797
# 0.4620
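
The ColBERT-style score can be understood as "late interaction": each query token vector is compared with every passage token vector, the best match per query token is kept, and the results are averaged. The sketch below shows that computation in plain Python. It is an illustration under my own assumptions (unit-length token vectors, averaging over query tokens), the name `maxsim_score` is hypothetical, and the vectors are toy values rather than model output.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Mean over query tokens of the best dot product against any passage token."""
    best_per_query_token = [
        max(dot(q, p) for p in passage_vecs) for q in query_vecs
    ]
    return sum(best_per_query_token) / len(query_vecs)


# Hypothetical unit-length token vectors (2 query tokens, 2 passage tokens).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[1.0, 0.0], [0.6, 0.8]]

# First query token matches best with 1.0, second with 0.8, so the mean is 0.9.
print(maxsim_score(query_vecs, passage_vecs))
```

Unlike the single dot product used for dense retrieval, this score is not symmetric: swapping the query and the passage changes which side the maximum is taken over.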

3.4 Weighted semantic similarity

Compute a weighted average of the scores from the three retrieval modes:

from FlagEmbedding import BGEM3FlagModel

# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3',  use_fp16=True) 

# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]

sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]

# Compute the hybrid similarity
# w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
print(model.compute_score(sentence_pairs, 
                          max_passage_length=128, 
                          weights_for_different_modes=[0.4, 0.2, 0.4])) 
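
As a sanity check on the weighting formula in the comment above, the combination can be reproduced by hand from the scores reported in the earlier sections. The sketch below is an illustration only: `weighted_score` is a hypothetical helper, and dividing by the sum of the weights is my assumption (it makes no difference here, because the weights [0.4, 0.2, 0.4] already sum to 1).

```python
def weighted_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of three scores; weights are ordered [dense, sparse, colbert]."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense * dense + w_sparse * sparse + w_colbert * colbert
    return total / (w_dense + w_sparse + w_colbert)


# Scores for the first query/passage pair, taken from sections 3.1 to 3.3.
combined = weighted_score(dense=0.6265, sparse=0.19554901123046875, colbert=0.7797)
print(round(combined, 4))
```

The value printed by `model.compute_score` for the same pair may differ slightly, since the scores quoted in the earlier sections are rounded and half-precision inference introduces small variations.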

Authors: 黎颖; 龙汀汀

Contact: yingliclaire@pku.edu.cn; l.tingting@pku.edu.cn