1. Environment Setup and Application Creation
First, create a conda environment from a command line with internet access:
conda create -n tutorial4 python=3.9
conda activate tutorial4
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install peft numpy==1.26.4 matplotlib==3.8.4 ipykernel==6.29.5 transformers==4.42.4 FlagEmbedding
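(The pytorch version must match the cuda version; see the list of compatible versions at https://pytorch.org/get-started/previous-versions . Run nvidia-smi to check the cuda version; the value it reports is the highest cuda version the installed driver supports.)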
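Then create a JupyterLab application and set the Conda environment name to tutorial4. We recommend running on 1 GPU. After the application is created, enter it and open this file.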
CUDA Version: 12.1; Torch Version: 2.3.1
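Before moving on, it can help to confirm from inside the notebook that the kernel sees the expected packages. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the tutorial.

```python
from importlib import metadata

# Packages this tutorial relies on (illustrative list; extend as needed).
PACKAGES = ["torch", "transformers", "peft", "numpy", "FlagEmbedding"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows as not installed, the notebook kernel is probably not using the tutorial4 environment.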
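2. Download the Model
We recommend downloading the model from a command line with internet access. Run the commands in the folder that contains this file.
# If the following directory exists, you can copy it directly: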
cp -r /lustre/public/tutorial/models/models--BAAI--bge-m3/ ./
# Otherwise, download it yourself:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download BAAI/bge-m3 --local-dir models--BAAI--bge-m3
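(We recommend running the download inside tmux. tmux (Terminal Multiplexer) lets you run multiple terminal sessions within a single terminal, and its sessions can be detached and reattached later, even from a different location, which makes it well suited to remote work and long-running tasks. For installation and an introduction, see https://tmuxcheatsheet.com/how-to-install-tmux ; for usage, see https://tmuxcheatsheet.com)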
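Once the copy or download has finished, a quick check that the model directory exists and is not empty can save a confusing error later. This is a minimal sketch; `list_model_files` is a hypothetical helper, and it only lists what is present rather than validating specific files.

```python
from pathlib import Path


def list_model_files(model_dir):
    """Return the sorted top-level entries of model_dir, or raise if missing/empty."""
    path = Path(model_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"model directory not found: {path.resolve()}")
    files = sorted(p.name for p in path.iterdir())
    if not files:
        raise FileNotFoundError(f"model directory is empty: {path.resolve()}")
    return files


# Only inspect the tutorial's directory if it is present next to this file.
if Path("models--BAAI--bge-m3").is_dir():
    print(list_model_files("models--BAAI--bge-m3"))
```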
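3.1 Dense Retrieval
Dense retrieval represents text as relatively low-dimensional dense vectors, embedding it in a continuous vector space. It captures semantic similarity, which makes it suitable for fuzzy queries and complex semantic relationships in natural language processing (NLP) tasks. Run the code below.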
from FlagEmbedding import BGEM3FlagModel
# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3',
                       use_fp16=True)
# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]
# Compute the embeddings
embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
# Compute the similarity matrix (rows: sentences_1, columns: sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# Expected output (approximately):
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
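The matrix product used above equals cosine similarity only when every embedding has unit length. The sketch below makes the normalization explicit with NumPy on toy vectors; the function name `cosine_similarity_matrix` and the 3-dimensional vectors are illustrative assumptions and do not come from the model.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalize each row
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Toy "embeddings" (made up for illustration only).
queries = np.array([[1.0, 0.0, 1.0],
                    [0.0, 2.0, 0.0]])
passages = np.array([[2.0, 0.0, 2.0],
                     [0.0, 1.0, 1.0]])

sim = cosine_similarity_matrix(queries, passages)
print(sim.round(4))
```

Entry `sim[i, j]` compares query `i` with passage `j`, the same layout as the `similarity` matrix printed above.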
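3.2 Sparse Retrieval
Sparse retrieval represents text as high-dimensional sparse vectors in which most entries are zero. It is computationally efficient and easy to interpret, and it suits short texts and keyword matching. Run the code below.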
from FlagEmbedding import BGEM3FlagModel
# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3', use_fp16=True)
# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]
# Compute similarity through lexical matching
output_1 = model.encode(
    sentences_1, return_dense=True, return_sparse=True,
    return_colbert_vecs=False)
output_2 = model.encode(
    sentences_2, return_dense=True, return_sparse=True,
    return_colbert_vecs=False)
# Score of the first query against the first passage
lexical_scores = model.compute_lexical_matching_score(
    output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875
# Score of the first query against the second query
print(model.compute_lexical_matching_score(
    output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
# Inspect the weight of each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252,
#   'M': 0.1702, '3': 0.2695, '?': 0.04092},
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633,
#   'BM': 0.2515, '25': 0.3335}]
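To see why the two queries score 0.0 against each other, the lexical matching score can be sketched as a sum, over the tokens that both texts share, of the product of their weights. The function below is a simplified stand-in written for illustration (it is not the library call itself) and is keyed by token strings for readability; `toy_passage` and its weights are made up.

```python
def lexical_matching_score(weights_1, weights_2):
    """Sum of weight products over the tokens present in both dictionaries."""
    return sum(w * weights_2[token]
               for token, w in weights_1.items()
               if token in weights_2)


# Token weights copied from the printed output above.
query_1 = {'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252,
           'M': 0.1702, '3': 0.2695, '?': 0.04092}
query_2 = {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633,
           'BM': 0.2515, '25': 0.3335}

# The two queries share no tokens, so nothing contributes to the sum.
print(lexical_matching_score(query_1, query_2))

# Hypothetical passage weights that share three tokens with query_1.
toy_passage = {'B': 0.2, 'GE': 0.3, 'is': 0.1}
print(lexical_matching_score(query_1, toy_passage))
```

Only exact token overlap counts here, which is what makes sparse scores easy to interpret and also why they miss paraphrases that dense retrieval can catch.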
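3.3 Multi-Vector Retrieval
Multi-vector retrieval represents a document or query with multiple vectors (one per token, in the ColBERT style) rather than a single vector. It is a hybrid approach that combines the semantic matching of dense retrieval with the token-level granularity of sparse retrieval. Run the code below.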
from FlagEmbedding import BGEM3FlagModel
# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3', use_fp16=True)
# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]
# Compute similarity with the colbert (multi-vector) score
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True,
                        return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True,
                        return_colbert_vecs=True)
# First query against the first passage, then against the second passage
print(model.colbert_score(
    output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(
    output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7797
# 0.4620
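The idea behind a ColBERT-style score is late interaction: each query token vector is matched to its most similar passage token vector, and the per-token maxima are averaged. The NumPy sketch below illustrates this on toy vectors; `late_interaction_score` is an illustrative name, the 2-dimensional vectors are made up, and the model's own implementation may differ in detail.

```python
import numpy as np


def late_interaction_score(q_vecs, p_vecs):
    """Mean over query tokens of the best dot product with any passage token."""
    token_scores = q_vecs @ p_vecs.T            # (num_q_tokens, num_p_tokens)
    best_per_query_token = token_scores.max(axis=1)
    return float(best_per_query_token.mean())


# Toy token vectors: 2 query tokens, 3 passage tokens.
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, -1.0]])

print(late_interaction_score(q, p))
```

Because every query token picks its own best match, a passage can score well even when the matching tokens appear in a different order or are spread far apart.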
3.4 Weighted Semantic Similarity
Compute a weighted average of the scores from the three retrieval modes:
from FlagEmbedding import BGEM3FlagModel
# Set the model path
# VAR_PLACEHOLDER
model = BGEM3FlagModel('models--BAAI--bge-m3', use_fp16=True)
# Sentences to compare
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval,"
               " lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a "
               "set of documents based on the query terms "
               "appearing in each document"]
# Build every (query, passage) pair
sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]
# Compute the hybrid similarity:
# w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
print(model.compute_score(sentence_pairs,
                          max_passage_length=128,
                          weights_for_different_modes=[0.4, 0.2, 0.4]))
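The weighting formula in the comment above can be checked by hand. The sketch below plugs in the scores printed in sections 3.1 to 3.3 for the first query and first passage; `weighted_score` is an illustrative helper, and dividing by the weight sum is an assumption that changes nothing when the weights already sum to 1. Because `compute_score` above truncates passages with `max_passage_length=128`, its output may differ slightly from this hand calculation.

```python
def weighted_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the dense, sparse and colbert scores."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense + w_sparse + w_colbert
    return (w_dense * dense + w_sparse * sparse + w_colbert * colbert) / total


# Scores for ("What is BGE M3?", first passage) taken from the earlier sections.
combined = weighted_score(dense=0.6265, sparse=0.1955, colbert=0.7797)
print(round(combined, 4))
```

Raising the sparse weight favors exact keyword overlap, while raising the dense or colbert weight favors semantic matches.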
Authors: 黎颖; 龙汀汀
Contact: yingliclaire@pku.edu.cn; l.tingting@pku.edu.cn