LangChain Study Notes

Why learn LangChain

I want to build an Agent that can read PDF papers and output an assessment of each paper's strengths and weaknesses.

Advisor 2024-10-12 14:30

Build a large model that reads papers.

2024-10-12 14:45 Me

OK, professor.

I have heard that LangChain makes this fairly convenient.

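What is LangChain for?

LangChain is a framework for developing applications powered by LLMs. In other words, we can treat the LLM as the core and LangChain as the shell, and build a program around it.

LangChain provides:

Components: abstractions for working with LLMs;
Custom chains: components assembled together to implement a specific use case.

For reading PDFs, I currently have two ideas:

Convert the PDF to JSON and feed it to the LLM;
Build a RAG pipeline. LangChain makes this fairly easy to implement, and ZLB told me it is not very hard. I was probably too intimidated before, so I am writing this document to motivate myself and keep a record of what I learn.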

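What is RAG?

Although LLMs are very powerful, they know nothing about information they were not trained on. If you want an LLM to answer questions about documents it has not been trained on, you need to give it the information from those documents. The most common way to do this is "retrieval augmented generation" (RAG).

The idea of retrieval augmented generation is that, given a question, you first run a retrieval step to fetch any relevant documents. You then pass those documents, together with the original question, to the language model and have it generate an answer. To make this possible, the documents first need to be put into a format suitable for this kind of query.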
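To make the retrieve-then-generate flow concrete for myself, here is a minimal sketch in plain Python. It is not how LangChain does it: the word-overlap scoring, the function names (`overlap_score`, `retrieve`, `build_prompt`) and the sample documents are all assumptions made up for illustration, and a real pipeline would use embeddings for retrieval and send the prompt to an LLM.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question: str, document: str) -> int:
    """Toy relevance: how many question tokens also appear in the document."""
    doc_tokens = set(tokenize(document))
    return sum(1 for tok in tokenize(question) if tok in doc_tokens)


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Retrieval step: return the k documents with the highest toy score."""
    ranked = sorted(documents, key=lambda d: overlap_score(question, d), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: List[str]) -> str:
    """Augmentation step: put the retrieved text and the question into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical sample documents, standing in for chunks of a paper
documents = [
    "Diffusion models generate images by gradually removing noise.",
    "The forward process adds Gaussian noise to an image step by step.",
    "LangChain is a framework for building LLM applications.",
]
question = "How do diffusion models generate images?"

top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string is what would be sent to the LLM (generation step)
```

The generation step itself is left out on purpose; the point is only that the model sees the retrieved text next to the question.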

Building a semantic search engine

Build a semantic search engine | 🦜️🔗 LangChain

Loading PDFs

How to load PDFs | 🦜️🔗 LangChain

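The documentation recommends the pypdf library here.

In practice, you can use other libraries that extract text better. LangChain supports many PDF loaders, so pick the one that fits.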

Document Loader	Description	Package/API
PyPDF	Uses `pypdf` to load and parse PDFs	Package
Unstructured	Uses Unstructured’s open source library to load PDFs	Package
Amazon Textract	Uses AWS API to load PDFs	API
MathPix	Uses MathPix to load PDFs	Package
PDFPlumber	Load PDF files using PDFPlumber	Package
PyPDFDirectory	Load a directory with PDF files	Package
PyPDFium2	Load PDF files using PyPDFium2	Package
PyMuPDF	Load PDF files using PyMuPDF	Package
PDFMiner	Load PDF files using PDFMiner	Package

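In addition, my advisor previously recommended the titipata/scipdf_parser library, which handles images and scanned text better and runs on Docker, making it easy to deploy.

Introduction to pypdf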

Welcome to pypdf — pypdf 5.1.0 documentation

pypdf is a Python library for working with PDF files. It provides a set of tools for reading, parsing, and manipulating the contents of PDF files.

Splitting

Original text

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not “washed out” by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as metadata attribute “start_index”.

See this guide for more detail about working with PDFs, including how to extract text from specific sections and images.

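For question answering over text, returning a whole page as the answer is clearly too coarse. Our end goal is to retrieve Document objects that answer the input query, and splitting the PDF further helps ensure that the meaning of the relevant parts of the document is not "washed out" by the surrounding text.

So the next step is to split the text with a text splitter. Here we use RecursiveCharacterTextSplitter, which splits the document on common separators and suits generic text.

RecursiveCharacterTextSplitter works only on text that has already been extracted; it cannot read images or pull text from specific regions of the PDF.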
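To see what a chunk size of 1000 with an overlap of 200 and a recorded start index means in practice, here is a simplified sketch in plain Python. It uses a fixed sliding window, so it is not the recursive, separator-aware algorithm of RecursiveCharacterTextSplitter; the function name `split_with_overlap` and the dictionary layout are assumptions chosen for illustration.

```python
from typing import Dict, List


def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[Dict]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {
                "page_content": text[start:start + chunk_size],
                "start_index": start,  # position of the chunk in the original text
            }
        )
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical 2500-character "page" made of repeating digits
page_text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(page_text)
for chunk in chunks:
    print(chunk["start_index"], len(chunk["page_content"]))
```

The last 200 characters of one chunk are the first 200 of the next, which is what keeps a statement from being cut off from its surrounding context.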

Embeddings

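Next, embed the text as vectors so that similarity metrics can be used to identify related text.

LangChain supports dozens of embedding providers. I chose Hugging Face, where you can either download the model locally or call the Hugging Face Inference API. HuggingFaceEmbeddings can be used directly, which is very convenient.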

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# all_splits is the list of Document chunks produced by the text splitter in the Splitting step
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Vector Stores

LangChain's VectorStore objects provide methods for adding text and Document objects to the store, and for querying them by similarity.

from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

ids = vector_store.add_documents(documents=all_splits)

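At this point the documents are embedded and stored.

In general, a vector store object can also be connected to an existing store instead of being populated from scratch.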
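What a vector store does can be sketched in a few lines of plain Python. This is a toy for my own understanding, not the implementation of InMemoryVectorStore: the class `ToyVectorStore`, the bag-of-words `embed` function and the sample texts are assumptions, and a real store would use a model such as the Hugging Face one above to produce dense vectors.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: sparse word-count vector (stand-in for a real embedding model)."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them against a query."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Dict[str, float]]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._entries.append((text, embed(text)))

    def similarity_search_with_score(self, query: str, k: int = 4) -> List[Tuple[str, float]]:
        query_vector = embed(query)
        scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in self._entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]


store = ToyVectorStore()
store.add_texts(
    [
        "Diffusion is an image generation method.",
        "The forward process adds noise to the data.",
        "LangChain is a framework for LLM applications.",
    ]
)
results = store.similarity_search_with_score("What is diffusion?", k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In this sketch the score is a similarity, so higher means closer. Some stores report a distance instead, where lower means closer, so it is worth checking which one a given provider returns.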

Usage

Find sentences similar to this one

results = vector_store.similarity_search(
    "Diffusion is an image generation method."
)

print(results[0])

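Async query (for flow control)

# Top-level await works in a notebook; in a script, call this inside an async function run with asyncio.run().
results = await vector_store.asimilarity_search("What is diffusion?")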

print(results[0])
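One reason to use the async variant is to run several queries concurrently. The sketch below shows the pattern with the standard library only; `fake_asimilarity_search` is a hypothetical stand-in for the real vector store call, so that the example runs without a model.

```python
import asyncio
from typing import List


async def fake_asimilarity_search(query: str) -> List[str]:
    """Hypothetical stand-in for an async vector store query."""
    await asyncio.sleep(0.01)  # simulates I/O latency
    return [f"top document for: {query}"]


async def main() -> List[List[str]]:
    queries = ["What is diffusion?", "What is forward process?"]
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_asimilarity_search(q) for q in queries))


all_results = asyncio.run(main())
print(all_results)
```

In a script, `asyncio.run` is what starts the event loop; in a notebook the loop is already running, which is why a bare `await` works there.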

Return scores

# Note that providers implement different scores; 
# the score here is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What is Diffusion?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Query by similarity to an embedded query

embedding = embeddings.embed_query("What is diffusion")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

Retrievers

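A Retriever can be built from a vector store, but it can also interface with data sources that are not vector stores. To build a method that retrieves documents, we can create a runnable retriever.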

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "What is diffusion?",
        "What is forward process?",
    ],
)

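At this point, we have built a semantic search engine that can read multiple PDF articles and run queries over them.

Chat Models and Prompt templates

Here the LLM is launched through vLLM, using the Qwen2.5-7B-Instruct model as an example.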

from langchain_community.llms import VLLM

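llm = VLLM(
    model="/home/ubuntu/jjq/Qwen/Qwen2.5-7B-Instruct/",
    trust_remote_code=True,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
    # max_model_len is an argument of the vLLM engine rather than a field of this wrapper,
    # so it is passed through vllm_kwargs (assumption: a recent langchain_community version)
    vllm_kwargs={"max_model_len": 30000},
)

print(llm.invoke("What is the capital of France ?"))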

Next, design the Prompt template.

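from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# {context} receives the retrieved chunks and {question} the user's question.
# Single braces mark template variables; doubled braces would be emitted literally.
template = '''
        [Task]
        Read the paper carefully and answer the user's question, being as critical as possible.

        [Paper]
        {context}

        -----------
        {question}
        '''
prompt = PromptTemplate.from_template(template)

# Retriever built from the vector store created earlier
retriever = vector_store.as_retriever()
# Conversation memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
# Build the conversational retrieval chain and pass in the custom prompt
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": prompt},
)

qa.invoke({"question": "Can you describe, in Chinese, the strengths or prospects of the paper?"})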
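One detail worth checking in a prompt template is brace escaping. Assuming the template follows Python's `str.format` rules (as far as I know, LangChain's default f-string template format behaves the same way), single braces mark a variable to fill in, while doubled braces are an escape that produces a literal brace. The plain-Python sketch below shows the difference; the sample paper text and question are made up.

```python
# Doubled braces: {{context}} is an escape, so no variable is filled in here
template_doubled = "[Paper]\n{{context}}\n-----------\n{question}"

# Single braces: {context} is a real placeholder
template_single = "[Paper]\n{context}\n-----------\n{question}"

question = "What are the weaknesses of this paper?"
paper_text = "Diffusion models generate images by removing noise."  # hypothetical chunk

rendered_doubled = template_doubled.format(question=question)
rendered_single = template_single.format(context=paper_text, question=question)

print(rendered_doubled)  # the paper slot stays as the literal text {context}
print(rendered_single)   # the paper slot is replaced by paper_text
```

So if the retrieved chunks never show up in the model's answer, a doubled brace around the context variable is one thing to look for.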