Langchain组件介绍(一)：搭建ChatBot，掌握这些组件就够了

本文最后更新于240 天前，其中的信息可能已经过时，如有错误请发送邮件到lvlvko233@qq.com

引言

Chatbot本质上是围绕模型层IO进行封装，这里我们把ChatBot相关的组件抽象为三部分：模型输入、模型、模型输出三部分。大纲如下：

模型
- LLM
- Chat Models
模型输入
- Prompt Template
- Memory
- Chat Loader(略)
模型输出
- Output Parser

模型

LLM

LLM(Large Language Model)实际上已经是非常少用的组件了，在Langchain里的功能是封装文本生成模型，接收一段文本，完成补全(Completion)的任务，即输入输出都为str，当然Langchain这里做了一定的适配，允许LLM接收Message类型的参数，但是这里和Chat Model同样存在一定的区别，这里后面会讲。

作用：文本补全
输入：LanguageModelInput = Union[PromptValue, str, Sequence[MessageLikeRepresentation]]
输出：str
底层方法：_stream
相关基础组件：
- BaseMessage
常用实现：
- OpenAI

以下是一个例子：

from langchain_community.llms.fake import FakeStreamingListLLM

response = "Hello, how can I help you today?"

llm = FakeStreamingListLLM(responses=[response], sleep=0.03)

for chunk in llm.stream(""):
    print(chunk, end="", flush=True)
print()

for chunk in llm.stream([("human", "")]):
    print(chunk, end="", flush=True)

# Hello, how can I help you today?
# Hello, how can I help you today?

Chat Model

Chat Model是Langchain中封装对话模型的组件，使用方法和输入参数基本上是一样的，从原理上说，Chat Model本质是基于对话格式的文本生成，这里画了个表来帮助理解它们对输入处理的细节

模型类型	输入	应用层	模型层	输出
LLM	str	透传	透传	str
LLM	messages	纯字符串拼接转换成str	透传	str
Chat Model	str	转换成human message	对话格式构造成str	str
Chat Model	messages	透传	对话格式构造成str	str

作用：对话生成
输入：LanguageModelInput = Union[PromptValue, str, Sequence[MessageLikeRepresentation]]
输出：
- invoke: BaseMessage
- stream: BaseMessageChunk
底层方法：_chat
相关基础组件：
- BaseMessage
常用实现：
- ChatOpenAI

以下是一个例子

from langchain_openai.chat_models import ChatOpenAI

_input = "50个字介绍一下python"

llm = ChatOpenAI(model="gpt-4o-mini")

print(llm.invoke(_input))

for chunk in llm.stream(_input):
    print(chunk.content, end="", flush=True)
print()

for chunk in llm.stream([("human", _input)]):
    print(chunk.content, end="", flush=True)

# content='Python是一种高级编程语言，以简洁和易读闻名。它支持多种编程范式，包括面向对象和函数式编程。广泛应用于数据分析、人工智能、Web开发等领域，拥有强大的社区和丰富的第三方库，适合初学者和专业开发者。' response_metadata={'token_usage': {'completion_tokens': 104, 'prompt_tokens': 16, 'total_tokens': 120}, 'model_name': 'gpt-4o-mini', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-fd16183a-b0a1-440c-928a-c07d4f564a8c-0' usage_metadata={'input_tokens': 16, 'output_tokens': 104, 'total_tokens': 120}
# Python是一种高级编程语言，以简洁易读的语法著称。它支持多种编程范式，包括面向对象和函数式编程，广泛应用于数据分析、人工智能、Web开发等领域，拥有强大的社区和丰富的库支持。
# Python是一种高级编程语言，因其简洁的语法和强大的标准库而广受欢迎。它支持多种编程范式，广泛应用于数据分析、人工智能、Web开发和自动化等领域，适合初学者和专业开发者。

模型输入

Prompt Template

Prompt是引导模型生成响应的关键部分。它通常由用户输入的文本构成，在Langchain中进一步提供了PromptTemplate，将复杂的Prompt模版化，用户无需改变整个提示词框架，只需要提供关键的输入即可。

作用：模版化模型输入，变量化用户输入
输入：Dict[str, Any]
输出：PromptValue
底层方法：format
相关基础组件：
- BaseMessage
- MessagePlaceholder
常用实现：
- PromptTemplate
- ChatPromptTemplate

以下是两个常用的子类：PromptTemplate 和ChatPromptTemplate的使用示例，前者用于构造单轮Message，后者可以构造多轮Message，更加灵活

from langchain.prompts import PromptTemplate

template = "请介绍一下{year}关于{topic}的信息。"

prompt = PromptTemplate(
    template=template
)

partial_prompt = prompt.partial(year="2024年")

print(prompt.invoke({"year": "2024年", "topic": "Agent"}))
print(partial_prompt.invoke({"topic": "Agent"}))
# text='请介绍一下2024年关于Agent的信息。'
# text='请介绍一下2024年关于Agent的信息。'

from langchain.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    messages=[
        ("system", "你是一个{role}大师"),
        MessagesPlaceholder(variable_name="placeholder", optional=True),
        ("human", "请你逐步进行分析，得出最终的结论，必要时可以通过mermaid去可视化地呈现，以下是我的问题: {question}"),
    ],
)

print(prompt.invoke({"role": "数据结构", "question": "深度为4的平衡二叉树最少有几个节点？"}) )
print(prompt.invoke({"role": "数据结构", "question": "深度为4的平衡二叉树最少有几个节点？", "placeholder": [("system", "给出golang的实现作为示例")]}) )

# messages=[SystemMessage(content='你是一个数据结构大师'), HumanMessage(content='请你逐步进行分析，得出最终的结论，必要时可以通过mermaid去可视化地呈现，以下是我的问题: 深度为4的平衡二叉树最少有几个节点？')]
# messages=[SystemMessage(content='你是一个数据结构大师'), SystemMessage(content='给出golang的实现作为示例'), HumanMessage(content='请你逐步进行分析，得出最终的结论，必要时可以通过mermaid去可视化地呈现，以下是我的问题: 深度为4的平衡二叉树最少有几个节点？')]

Memory&History

Memory和History是Langchain用于存储上下文的组件，在ChatBot中，模型主要依靠在应用层保存上下文来实现多轮对话，这里History为底层实现，负责存取多轮的消息，Memory为顶层封装，负责在存取消息的时候实现更复杂的控制如轮次控制、token数控制、压缩上下文等，两者关系类似代理模式。

作用：保存对话消息，从中获取上下文作为模型输入
输入：取决于你存储消息的逻辑，一般用session id来对应相应的会话
输出：Dict[str, Union[str, List[BaseMessage]]
相关基础组件：
常用实现

结合前面所学的内容，以下是一个多轮对话的ChatBot示例：

from langchain_openai.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.prompts import MessagesPlaceholder
from langchain.memory.buffer_window import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(memory_key="history", return_messages=True, k=10)


llm = ChatOpenAI(model="gpt-4o-mini")

prompt = ChatPromptTemplate.from_messages(
    messages=[
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ],
)

while True:
    prompt = prompt.partial(**memory.load_memory_variables(inputs={}))

    human_input = input("User: ")
    print("AI: ", end="", flush=True)

    chain = prompt | llm

    ai_output = ""
    for chunk in chain.stream({"input": human_input}):
        content = chunk.content
        print(content, end="", flush=True)
        ai_output += content
    print()
    
    memory.save_context(inputs={"_": human_input}, outputs={"_": ai_output})

	User: 你可以使用Hadoop Java SDK来实现MapReduce吗
	AI: 是的，可以使用Hadoop Java SDK来实现MapReduce。Hadoop是一个开源框架，用于处理大规模数据集，MapReduce是Hadoop的核心计算模型。以下是实现MapReduce的一般步骤：
	
	1. **设置开发环境**：确保你已经安装了Hadoop并配置了Java开发环境。你需要将Hadoop的库添加到你的Java项目中。
	
	2. **创建MapReduce程序**：
	   - **Mapper类**：实现`Mapper`接口，重写`map`方法。这个方法定义了如何处理输入数据并产生中间键值对。
	   - **Reducer类**：实现`Reducer`接口，重写`reduce`方法。这个方法用于处理Mapper输出的中间结果，并生成最终结果。
	   - **Driver类**：创建一个主类（通常称为Driver），在这个类中配置作业设置，包括输入输出格式、Mapper和Reducer类等。
	
	3. **编写代码**：
	   下面是一个简单的WordCount示例，统计文本中单词的出现次数。
	
	```java
	import org.apache.hadoop.conf.Configuration;
	import org.apache.hadoop.fs.Path;
	import org.apache.hadoop.io.IntWritable;
	import org.apache.hadoop.io.Text;
	import org.apache.hadoop.mapreduce.Job;
	import org.apache.hadoop.mapreduce.Mapper;
	import org.apache.hadoop.mapreduce.Reducer;
	import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
	import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
	
	import java.io.IOException;
	
	public class WordCount {
	
	    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
	        private final static IntWritable one = new IntWritable(1);
	        private Text word = new Text();
	
	        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
	            String[] words = value.toString().split("\\s+");
	            for (String w : words) {
	                word.set(w);
	                context.write(word, one);
	            }
	        }
	    }
	
	    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	        private IntWritable result = new IntWritable();
	
	        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
	            int sum = 0;
	            for (IntWritable val : values) {
	                sum += val.get();
	            }
	            result.set(sum);
	            context.write(key, result);
	        }
	    }
	
	    public static void main(String[] args) throws Exception {
	        Configuration conf = new Configuration();
	        Job job = Job.getInstance(conf, "word count");
	        job.setJarByClass(WordCount.class);
	        job.setMapperClass(TokenizerMapper.class);
	        job.setCombinerClass(IntSumReducer.class);
	        job.setReducerClass(IntSumReducer.class);
	        job.setOutputKeyClass(Text.class);
	        job.setOutputValueClass(IntWritable.class);
	        FileInputFormat.addInputPath(job, new Path(args[0]));
	        FileOutputFormat.setOutputPath(job,
	User: 有python版本的吗？
	AI: 是的，Hadoop 也可以通过 Python 来实现 MapReduce 程序，通常使用 `Hadoop Streaming`。Hadoop Streaming 允许用户以任意可执行文件作为 Mapper 和 Reducer，Python 脚本可以直接用作这些可执行文件。以下是一个使用 Python 实现的 WordCount 示例。
	
	### 1. Mapper 脚本（mapper.py）
	
	这个 Mapper 脚本将每一行拆分成单词，并输出每个单词及其计数（1）。
	
	```python
	#!/usr/bin/env python3
	import sys
	
	for line in sys.stdin:
	    # 去掉行末的换行符
	    line = line.strip()
	    # 拆分成单词
	    words = line.split()
	    for word in words:
	        # 输出单词和计数
	        print(f"{word}\t1")
	```
	
	### 2. Reducer 脚本（reducer.py）
	
	这个 Reducer 脚本将接收到的单词及其计数汇总，输出每个单词的总计数。
	
	```python
	#!/usr/bin/env python3
	import sys
	
	current_word = None
	current_count = 0
	
	for line in sys.stdin:
	    line = line.strip()
	    word, count = line.split('\t', 1)
	    count = int(count)
	
	    if current_word == word:
	        current_count += count
	    else:
	        if current_word:
	            # 输出当前单词及其计数
	            print(f"{current_word}\t{current_count}")
	        current_word = word
	        current_count = count
	
	# 输出最后一个单词及其计数
	if current_word == word:
	    print(f"{current_word}\t{current_count}")
	```
	
	### 3. 运行 MapReduce 作业
	
	将这两个脚本上传到 Hadoop 集群中，并确保它们具有可执行权限。然后，使用 Hadoop Streaming 来运行这个 MapReduce 作业。假设输入文件位于 HDFS 的 `/input` 目录中，输出目录为 `/output`，可以使用以下命令：
	
	```bash
	hadoop jar /path/to/hadoop-streaming.jar \
	    -input /input \
	    -output /output \
	    -mapper mapper.py \
	    -reducer reducer.py \
	    -file mapper.py \
	    -file reducer.py
	```
	
	### 注意事项
	
	1. 确保在 Hadoop 集群上安装了 Python，并且脚本的第一行指向正确的 Python 解释器。
	2. 输入和输出路径在 HDFS 中必须是有效的。
	3. 输出路径必须是不存在的，因为 Hadoop 不允许覆盖已经存在的输出目录。
	
	通过这种方式，你可以使用 Python 编写 MapReduce 程序，并利用 Hadoop 的分布式计算能力来处理大数据集。
	User:

Chat Loader(略)

Chat Loader是Langchain中用于加载对话历史的组件。虽然它不如Memory和Prompt Template那么常用，但在某些场景下非常有用，特别是当需要从外部源加载预先存在的对话历史时。

作用：从各种数据源加载对话历史
输入：取决于具体的实现，通常是文件路径或数据库连接信息
输出：List[ChatSession]
相关基础组件：
- BaseMessage
- ChatSession

模型输出

Output Parser

Output Parser是Langchain中用于解析模型输出的组件。它可以将模型的原始输出转换成更易于处理的结构化数据，使得后续的处理和分析变得更加方便。

作用：将模型的原始输出转换为结构化数据
输入：str 或 BaseMessage
输出：取决于具体的实现，可以是Python对象、字典、列表等
底层方法：parse
相关基础组件：
- BaseOutputParser
常用实现：

以下是一个使用JsonOutputParser的示例：

from typing import List
from langchain_core.output_parsers.json import JsonOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.pydantic_v1 import BaseModel

class Person(BaseModel):
    name: str
    age: int
    experience: List[str]

output_parser = JsonOutputParser(pydantic_object=Person)

prompt = ChatPromptTemplate.from_messages(
    messages=[
        ("system", "{format_instructions}"),
        ("human", "Tell me about {person}.")
    ]
)

prompt = prompt.partial(format_instructions=output_parser.get_format_instructions())


llm = ChatOpenAI(model="gpt-4o-mini")


chain = prompt | llm | output_parser

for chunk in chain.stream({"person": "Donald Trump"}):
    print(chunk, flush=True)

# {}
# {'name': ''}
# {'name': 'Donald'}
# {'name': 'Donald Trump'}
# {'name': 'Donald Trump', 'age': 77}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', '']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Business']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', '']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Tele']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television personality']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television personality', '']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television personality', 'Real']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television personality', 'Real estate']}
# {'name': 'Donald Trump', 'age': 77, 'experience': ['45th President of the United States', 'Businessman', 'Television personality', 'Real estate developer']}

Claude一键总结

🌟 本文详细介绍了Langchain中搭建ChatBot所需的核心组件，包括：

📊 模型：
- LLM：用于文本补全任务
- Chat Models：专门用于对话生成
📥 模型输入：
- Prompt Template：模板化和变量化用户输入
- Memory：存储对话历史，提供上下文
- Chat Loader：从外部源加载对话历史（简要提及）
📤 模型输出：
- Output Parser：将模型输出转换为结构化数据

🔧 这些组件共同构成了ChatBot的基础框架，每个组件都有其特定的作用和实现方式。通过灵活组合这些组件，开发者可以构建出功能强大、定制化的聊天机器人。

🚀 掌握这些核心组件，您就能够开始使用Langchain创建自己的ChatBot了！记住，实践是最好的学习方式，所以不要犹豫，开始动手尝试吧~ ٩(◕‿◕｡)۶

引言

模型

LLM