In RAG (Retrieval-Augmented Generation) applications, one of the most challenging problems is handling complex document content, such as images and tables embedded in PDF files, because such content is not as easy to parse and retrieve as plain text. In this article, we introduce several RAG approaches for embedded tables, explain the technical details of parsing and retrieval, walk through code examples to illustrate how they work, and compare the approaches to highlight their strengths and weaknesses.
Embedded Table Parsing and Retrieval
Parsing tables embedded in PDF files has long been a technical challenge. Tables in a PDF may use different encodings and fonts, or even exist as images that require OCR to recognize, where poor image quality and blurry fonts can hurt recognition accuracy. In addition, PDF tables can have complex formats and layouts, including merged cells, nested tables, and multi-column arrangements, which complicates identifying and extracting table data. Complex table structures, tables that span pages, and inconsistent formatting further increase the difficulty of parsing.
\begin{table} \begin{tabular}{l c c c} \hline \hline Layer Type & Complexity per Layer & Sequential Operations & Maximum Path Length \\ \hline Self-Attention & \(O(n^{2}\cdot d)\) & \(O(1)\) & \(O(1)\) \\ Recurrent & \(O(n\cdot d^{2})\) & \(O(n)\) & \(O(n)\) \\ Convolutional & \(O(k\cdot n\cdot d^{2})\) & \(O(1)\) & \(O(log_{k}(n))\) \\ Self-Attention (restricted) & \(O(r\cdot n\cdot d)\) & \(O(1)\) & \(O(n/r)\) \\ \hline \hline \end{tabular} \end{table} Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. \(n\) is the sequence length, \(d\) is the representation dimension, \(k\) is the kernel size of convolutions and \(r\) the size of the neighborhood in restricted self-attention.
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

question = "when layer type is Convolutional, what is the Maximum Path Length?"
print(f"question: {question}")

nodes = [TextNode(text=t) for t in tables]
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(similarity_top_k=2)
response = query_engine.query(question)
print(f"answer: {response}")
print("Source nodes: ")
for node in response.source_nodes:
    print(f"node text: {node.text}")
```
- We first convert the table content in the `tables` list into `TextNode` objects
- Then we build an index from the `TextNode` objects with `VectorStoreIndex`
- Finally we call the `query` method to retrieve an answer to the question
The RAG retrieval result is as follows:
```
question: when layer type is Convolutional, what is the Maximum Path Length?
answer: The Maximum Path Length for the Convolutional layer type is \(O(log_{k}(n))\).
Source nodes: 
node text: \begin{tabular}{l c c c} \hline \hline Layer Type & Complexity per Layer & Sequential Operations & Maximum Path Length \\ \hline Self-Attention & \(O(n^{2}\cdot d)\) & \(O(1)\) & \(O(1)\) \\ Recurrent & \(O(n\cdot d^{2})\) & \(O(n)\) & \(O(n)\) \\ Convolutional & \(O(k\cdot n\cdot d^{2})\) & \(O(1)\) & \(O(log_{k}(n))\) \\ Self-Attention (restricted) & \(O(r\cdot n\cdot d)\) & \(O(1)\) & \(O(n/r)\) \\ \hline \hline \end{tabular} Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. \(n\) is the sequence length, \(d\) is the representation dimension, \(k\) is the kernel size of convolutions and \(r\) the size of the neighborhood in restricted self-attention.
```
However, Nougat also has some limitations:

- Nougat was trained on academic papers, so it parses academic PDFs very well, but results on other kinds of PDF documents may be unsatisfactory
- It mainly supports English documents; support for other languages is limited
- It requires a GPU machine to accelerate parsing
The UnstructuredIO Approach
This approach first converts the PDF file into an HTML file and then uses UnstructuredIO to parse the HTML. LlamaIndex already integrates UnstructuredIO, so it is easy to run the full RAG pipeline over the HTML file, including indexing, storage, and retrieval.
Why convert to HTML? Table content is hard to recognize inside a PDF file, whereas in an HTML file tables are normally represented with the `table` tag, which makes parsing and extracting table data straightforward. LlamaIndex's UnstructuredIO integration only implements HTML parsing; my guess is that this is because HTML parsing is relatively simple. Although UnstructuredIO itself also supports parsing PDF files, PDF parsing depends on third-party models and tools, which makes the overall implementation more complex.
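To see why the `table` tag makes extraction so much easier, here is a minimal sketch using only Python's standard library (this is an illustration of the idea, not what UnstructuredIO does internally): it walks an HTML fragment and collects the text of every cell, grouped by row.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.tables = []    # list of tables, each a list of rows
        self._rows = None   # rows of the table currently open, if any
        self._row = None    # cells of the row currently open, if any
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None

html = """
<p>Some surrounding text.</p>
<table>
  <tr><th>Layer Type</th><th>Maximum Path Length</th></tr>
  <tr><td>Convolutional</td><td>O(log_k(n))</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
print(parser.tables)
# [[['Layer Type', 'Maximum Path Length'], ['Convolutional', 'O(log_k(n))']]]
```

With the table boundaries made explicit by markup, recovering rows and columns is a few dozen lines of code; the same table rendered into PDF drawing operations offers no such structure.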
Converting PDF to HTML
There are many open-source tools that convert PDF files to HTML, the best known being pdf2htmlEX. In testing, however, the HTML produced by pdf2htmlEX does not represent table content with `table` tags but with `div` tags (as shown in the figure below), which prevents UnstructuredIO from parsing the table content, so we need a different conversion tool.
Here I recommend a document toolkit called WebViewer, which provides common document-editing features, including the PDF-to-HTML conversion we need, and ships SDKs for multiple programming languages, making it easy to integrate into all kinds of projects. Below we use Python as an example to show how to convert a PDF file to an HTML file with this tool.
- Use `HTMLOutputOptions` to set the HTML output options; the settings here mean the output HTML is assembled into a single complete page
- Finally call the `Convert.ToHtml` function to convert the PDF file; the converted HTML file is saved in the `output` directory
In the converted HTML file we can see that table content is represented with `table` tags. More information on converting PDF files to HTML with WebViewer can be found here.
Processing the HTML File
With the HTML file in hand, we can use the UnstructuredIO parsing integrated in LlamaIndex to parse the table content in the HTML. A code example is shown below:
```python
import os
import pickle
from pathlib import Path

from llama_index.readers.file import FlatReader
from llama_index.core.node_parser import UnstructuredElementNodeParser

# Load the converted HTML file
reader = FlatReader()
docs = reader.load_data(Path("Qwen-VL.html"))

# Parse the HTML, extracting table elements; cache the result,
# since table extraction and summarization are slow
node_parser = UnstructuredElementNodeParser()
if not os.path.exists("qwen_vl_raw_nodes.pkl"):
    raw_nodes = node_parser.get_nodes_from_documents(docs)
    pickle.dump(raw_nodes, open("qwen_vl_raw_nodes.pkl", "wb"))
else:
    raw_nodes = pickle.load(open("qwen_vl_raw_nodes.pkl", "rb"))

# Separate the base text nodes from the table node mappings
base_nodes, node_mappings = node_parser.get_base_nodes_and_mappings(raw_nodes)
```
```
# Table fields
--------
col_schema: Column: Model
Type: string
Summary: Names of the AI models compared

...other columns...

filename: Qwen-VL.html
extension: .html

# Table summary
Comparison of performance metrics for different AI models across various tasks such as DocVQA, ChartQA, AI2D, TextVQA, MMMU, MathVista, and MM-Bench-CN., with the following table title: AI Model Performance Comparison, with the following columns:
- Model: Names of the AI models compared
...other columns...
--------

# Table node ID
--------
Index ID: 41edc9a6-30ed-44cf-967e-685f7dfce8df
--------

# Table data in node_mappings
--------
Comparison of performance metrics for different AI models across various tasks such as DocVQA, ChartQA, AI2D, TextVQA, MMMU, MathVista, and MM-Bench-CN., with the following table title: AI Model Performance Comparison, with the following columns:
- Model: Names of the AI models compared
...other columns...
--------
```
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Index the base nodes; the retriever resolves table nodes via node_mappings
vector_index = VectorStoreIndex(base_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    node_dict=node_mappings,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(recursive_retriever)

question = "In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Qwen-VL-Plus' in task 'MMMU'? Tell me the exact number."
response = query_engine.query(question)
print(f"answer: {str(response)}")
```
```
# Output
Retrieving with query id None: In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Qwen-VL-Plus' in task 'MMMU'? Tell me the exact number.
Retrieved node with id, entering: 41edc9a6-30ed-44cf-967e-685f7dfce8df
Retrieving with query id 41edc9a6-30ed-44cf-967e-685f7dfce8df: In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Qwen-VL-Plus' in task 'MMMU'? Tell me the exact number.

answer: The performance metric of the model 'Qwen-VL-Plus' in the task 'MMMU' is 45.2%.
```
```python
models = [
    "Other BestOpen-source LVLM",
    "Gemini Pro",
    "Gemini Ultra",
    "GPT-4V",
    "Qwen-VL-Plus",
    "Qwen-VL-Max",
]
metrics = ["DocVQA", "ChartQA", "AI2D", "TextVQA", "MMMU", "MathVista", "MM-Bench-CN"]
questions = []
for model in models:
    for metric in metrics:
        questions.append(
            f"In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model '{model}' in task '{metric}'? Tell me the exact number."
        )

# actual_answers maps each question to its ground-truth value (built elsewhere
# from the source table); an answer counts as correct if the ground-truth
# value appears verbatim in the response text
result = {}
for q in questions:
    response = query_engine.query(q)
    answer = str(response)
    result[q] = str(actual_answers[q]) in answer
    print(f"question: {q}\nresponse: {answer}\nactual:{actual_answers[q]}\nresult:{result[q]}\n\n")

# Compute the accuracy
correct = sum(result.values())
total = len(result)
print(f"Percentage of True values: {correct / total * 100}%")
```
- In the code we construct 42 questions (6 models × 7 metrics), each asking for the performance metric of one AI model on one task in the table
- We then run each question through the query engine to retrieve an answer
- Finally we compare each retrieved answer against the actual metric value and compute the accuracy
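The scoring step can be isolated into a tiny self-contained sketch (the mock responses below are made up and stand in for real query-engine calls): an answer counts as correct when the ground-truth number appears verbatim in the response text.

```python
def score(responses, actual_answers):
    """Return accuracy (%) where a response is correct if the
    ground-truth value appears verbatim in the response text."""
    result = {
        q: str(actual_answers[q]) in responses[q]
        for q in actual_answers
    }
    return sum(result.values()) / len(result) * 100

# Mock data standing in for real query-engine responses
actual = {"q1": 81.6, "q2": 45.2}
responses = {"q1": "The metric is 81.6%.", "q2": "I could not find it."}
print(score(responses, actual))  # 50.0
```

Note that substring matching is a deliberately loose criterion: it tolerates the model adding units or surrounding prose, but it can also produce false positives when the ground-truth number happens to appear inside a longer number, so for a stricter benchmark you would want numeric extraction and exact comparison.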
The computed results are as follows:
```
Retrieving with query id None: In comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Other BestOpen-source LVLM' in task 'DocVQA'? Tell me the exact number.
Retrieved node with id, entering: 41edc9a6-30ed-44cf-967e-685f7dfce8df
Retrieving with query id 41edc9a6-30ed-44cf-967e-685f7dfce8df: In comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Other BestOpen-source LVLM' in task 'DocVQA'? Tell me the exact number.

question: In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Other BestOpen-source LVLM' in task 'DocVQA'? Tell me the exact number.
response: 81.6%
actual:81.6
result:True
```
```python
from copy import deepcopy

from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

def get_nodes(docs):
    """Split docs into nodes, by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split("\n---\n")
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)
    return nodes

nodes = get_nodes(documents_gpt4o)
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine(similarity_top_k=2)
question = "In the comparison of performance metrics for different AI models across various tasks. What is the performance metric of the model 'Qwen-VL-Plus' in task 'MMMU'? Tell me the exact number."
response = query_engine.query(question)
print(f"answer: {str(response)}")
```
```
# Output
answer: The performance metric of the model 'Qwen-VL-Plus' in the task 'MMMU' is 45.2%.
```
When LlamaParse parses a PDF file, it inserts `---` page-break markers into the Markdown content. We split the Markdown on this marker into multiple chunks and then convert each chunk into a `TextNode` object.
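As a minimal illustration of this splitting logic (with a made-up two-page Markdown string standing in for real LlamaParse output):

```python
# Hypothetical LlamaParse-style output: two pages separated by a bare "---" line
markdown = (
    "# Page 1\nSome text.\n"
    "---\n"
    "# Page 2\n| Model | MMMU |\n| --- | --- |\n| Qwen-VL-Plus | 45.2 |"
)

# Split on the page-break marker, newlines included
chunks = markdown.split("\n---\n")
print(len(chunks))                    # 2
print(chunks[1].splitlines()[0])      # # Page 2
```

Because the split pattern includes the surrounding newlines, only a line consisting of exactly `---` acts as a page break; Markdown table separator rows such as `| --- | --- |` are left intact inside their chunk.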