Integrating Baichuan Embeddings with LangChain / ai #34

As domestic large models mature, some are starting to match OpenAI feature for feature, and embeddings are one such area. Last year, while testing and deploying a vector database, I couldn't find a domestic embedding model good enough for the job, so I had to fall back on OpenAI's. That's no longer necessary: you can now switch entirely to a domestic option, and Baichuan's embedding model works very well.

(Image: baichuan.jpg, the Baichuan AI homepage)

https://www.baichuan-ai.com/home

In practice you'll usually integrate it with LangChain to build the application or agent you need. Below is the integrated vector-store and query code for reference.

pip install langchain chromadb python-dotenv -i https://pypi.tuna.tsinghua.edu.cn/simple  # inside China, be sure to switch to a domestic mirror
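The script reads the API key from a .env file via python-dotenv, so create one next to the script. A minimal sketch (the BAICHUAN_API_KEY name matches what the code reads below; the value is a placeholder for a real key from the Baichuan platform):

BAICHUAN_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx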

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import BaichuanTextEmbeddings
from dotenv import dotenv_values


env_vars = dotenv_values('.env')
# Baichuan embedding model
embedding = BaichuanTextEmbeddings(baichuan_api_key=env_vars['BAICHUAN_API_KEY'])
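# Optional sanity check: embed_query is LangChain's standard Embeddings method
# and returns one vector per query string; uncomment to verify the key works.
# print(len(embedding.embed_query("你好,百川")))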

# Connect to the vector database (persisted on local disk)
chroma_client = chromadb.PersistentClient(path="./chromac")

# Text splitting
text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)
# chunk_size=400 caps each chunk at 400 characters (not bytes), i.e. up to
# roughly 400 Chinese characters per chunk
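# Illustration: CharacterTextSplitter splits on "\n\n" by default, then merges
# the pieces back together up to chunk_size characters, so short paragraphs
# end up sharing a chunk. For example:
# text_splitter.split_text("第一段。\n\n第二段。")
# -> ['第一段。\n\n第二段。']   (both paragraphs fit well within 400 characters)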

# Save a document into the database
def saveToVectorDB(file, collection):
    try:
        # Load the text file; pass the file's actual encoding
        # (gbk here; use utf-8 if that is how your Chinese text is saved)
        loader = TextLoader(file, encoding='gbk')
        documents = loader.load()
        docs = text_splitter.split_documents(documents)
        # Instantiate the store and write the chunks
        vectordb = Chroma(collection_name=collection, embedding_function=embedding, client=chroma_client)
        vectordb.add_documents(documents=docs)
        return "ok", 200
    except Exception as e:
        print("saveToVectorDB error:", e)
        return "error", 500


# Query for similar text, filtering matches by relevance score
def queryVectorDB(ask, collection):
    vectordb = Chroma(collection_name=collection, embedding_function=embedding, client=chroma_client)
    s = vectordb.similarity_search_with_score(query=ask, k=1) 
    if len(s) == 0:
        return ""
    else:
        if s[0][1] < 1.35:   # lower distance = stronger relevance; empirical cutoffs: < 0.385 for OpenAI, < 1.35 for Baichuan
            return s[0][0].page_content
        else:
            return ""


# Query for similar text and return the matches without score filtering
def queryText(ask, collection):
    vectordb = Chroma(collection_name=collection, embedding_function=embedding, client=chroma_client)
    res = vectordb.similarity_search(query=ask, k=1)
    return res

# Delete text by id
def deleteText(ids, collection):
    vectordb = Chroma(collection_name=collection, embedding_function=embedding, client=chroma_client)
    vectordb.delete(ids=[ids])
    return "ok", 200

# Add a single piece of text to the database; returns the generated ids
def addText(text, collection):
    vectordb = Chroma(collection_name=collection, embedding_function=embedding, client=chroma_client)
    res = vectordb.add_texts(texts=[text])
    return res

That covers adding, deleting, and querying the database. All set!
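To put it all together, here's a minimal usage sketch (the file name, collection name, and sample text are made-up placeholders):

saveToVectorDB('novel.txt', 'demo_docs')          # chunk, embed, and store a local text file
ids = addText('这是一段测试文本。', 'demo_docs')   # insert one snippet; returns its ids
print(queryVectorDB('测试文本', 'demo_docs'))      # prints the snippet if the distance beats the cutoff
deleteText(ids[0], 'demo_docs')                    # delete the snippet again by id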