Processing of larger PDF documents

complete

Brandon McIntosh

During testing, large PDF documents (NIST IR 8428 or other research articles, for example) exceed the token output window of the LLM in use. I suggest one of two recursive solutions:
1. Separate large documents into smaller parts and create individual cards for each part.
   a. Bad: May create section headers that don't make sense when trying to work out which original document they came from. Goes against what appears to be the general theory behind Recall.
   b. Good: Each section's summary gets the full maximum output token window to itself.
2. Separate the document by page, summarize each section, and ultimately compress everything into one summary within the maximum output token window.
   a. Bad: Staying within the maximum output token window requires compressing all previous summaries into one condensed summary. Substantial context may be lost because the compression step works from summaries and never refers back to the original source material.
   b. Good: Document is kept in one card.
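For comparison, option 1 could be sketched roughly as below. This is only an illustration under assumptions: `Card`, `count_tokens`, and `split_into_cards` are hypothetical names, the whitespace word count stands in for a real tokenizer, and putting the document title and page range into each card title is just one possible way to soften the "headers that don't make sense" problem.

```python
from dataclasses import dataclass


@dataclass
class Card:
    title: str  # source document name plus page range, so the origin stays traceable
    text: str   # raw text of the section, to be summarized on its own


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a stand-in for the LLM's tokenizer.
    return len(text.split())


def split_into_cards(doc_title: str, pages: list[str], max_input_tokens: int) -> list[Card]:
    """Group consecutive pages into sections that fit the input limit, one card each."""
    cards: list[Card] = []
    start = 0  # index of the first page in the current section
    used = 0   # tokens accumulated in the current section

    for i, page in enumerate(pages):
        n = count_tokens(page)
        # Close the current section if this page would push it over the limit.
        # `used > 0` keeps a single oversized page from producing an empty card.
        if used > 0 and used + n > max_input_tokens:
            cards.append(Card(f"{doc_title} (pages {start + 1}-{i})", "\n".join(pages[start:i])))
            start, used = i, 0
        used += n

    # Flush whatever is left as the final card.
    if used > 0:
        cards.append(Card(f"{doc_title} (pages {start + 1}-{len(pages)})", "\n".join(pages[start:])))
    return cards
```

Each card would then be summarized separately, so no cross-section compression is needed.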
An example pseudo-function for option 2 is as follows: 
FUNCTION summarizeLargePDF(pdfFilePath, maxInputTokens, maxOutputTokens, llm):
    # Initialization
    totalPages = countPages(pdfFilePath)  # Assumed helper, like extractTextFromPage
    currentPage = 1
    totalTokens = 0
    allSummaries = []  # Array to hold section summaries
    sectionStartPage = 1

    # Recursive Page Processing Function (Nested)
    FUNCTION processPage(pageNumber):
        # Base case: no pages left, so summarize the last open section and stop
        IF pageNumber > totalPages:
            IF totalTokens > 0:
                sectionText = extractTextFromPages(pdfFilePath, sectionStartPage, totalPages)
                allSummaries.append(llm.summarize(sectionText, maxOutputTokens))
            RETURN

        pageText = extractTextFromPage(pdfFilePath, pageNumber)
        pageTokens = countTokens(pageText)

        # Check if adding this page exceeds input token limit
        # (totalTokens > 0 avoids closing an empty section when a single page is over the limit)
        IF totalTokens > 0 AND (totalTokens + pageTokens) > maxInputTokens:
            # Summarize the section
            sectionEndPage = pageNumber - 1
            sectionText = extractTextFromPages(pdfFilePath, sectionStartPage, sectionEndPage)
            sectionSummary = llm.summarize(sectionText, maxOutputTokens)
            # Add summary to the list and reset section
            allSummaries.append(sectionSummary)
            sectionStartPage = pageNumber
            totalTokens = 0

        totalTokens += pageTokens
        currentPage += 1
        # Recursive call for the next page
        processPage(currentPage)

    # Main Function Logic
    processPage(currentPage)  # Start recursive processing

    # Nothing to summarize (empty document)
    IF allSummaries.length == 0:
        RETURN ""

    # Final Summarization of all Section Summaries
    WHILE allSummaries.length > 1:
        # Combine and re-summarize in pairs
        newSummaries = []
        FOR i = 0 TO allSummaries.length - 1 STEP 2:
            IF i + 1 < allSummaries.length:
                combinedText = allSummaries[i] + allSummaries[i+1]
                newSummary = llm.summarize(combinedText, maxOutputTokens)
            ELSE:
                newSummary = allSummaries[i]  # Odd one out carries over unchanged
            newSummaries.append(newSummary)
        allSummaries = newSummaries

    # The single remaining summary is the final result
    finalSummary = allSummaries[0]
    RETURN finalSummary
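
A rough Python take on the same idea, offered as a sketch and not a definitive implementation. Assumptions: the PDF text has already been extracted into a list of page strings, `summarize` is a hypothetical callable wrapping whatever LLM is in use, and `count_tokens` uses a whitespace word count in place of a real tokenizer. It loops instead of recursing per page, and it packs as many summaries as fit into each re-summarization pass instead of always pairing them, which may reduce the number of compression rounds.

```python
from typing import Callable

# Hypothetical LLM wrapper: (text, max_output_tokens) -> summary
Summarizer = Callable[[str, int], str]


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count as a stand-in for the LLM's tokenizer.
    return len(text.split())


def pack(texts: list[str], max_input_tokens: int) -> list[str]:
    """Greedily join consecutive texts into groups that fit the input limit."""
    groups: list[str] = []
    current: list[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Start a new group when the next text would overflow the current one.
        if current and used + n > max_input_tokens:
            groups.append("\n".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append("\n".join(current))
    return groups


def summarize_large_document(
    pages: list[str],
    max_input_tokens: int,
    max_output_tokens: int,
    summarize: Summarizer,
) -> str:
    if not pages:
        return ""

    # First pass: summarize each section of consecutive pages.
    summaries = [summarize(g, max_output_tokens) for g in pack(pages, max_input_tokens)]

    # Later passes: merge summaries and re-summarize until one remains.
    while len(summaries) > 1:
        groups = pack(summaries, max_input_tokens)
        if len(groups) == len(summaries):
            # No two summaries fit together; fall back to pairs so the loop
            # still shrinks (these pairs may exceed the input limit).
            groups = ["\n".join(summaries[i:i + 2]) for i in range(0, len(summaries), 2)]
        summaries = [summarize(g, max_output_tokens) for g in groups]

    return summaries[0]
```

The loop matters mostly for very long documents: CPython's default recursion limit is 1000 frames, so one recursive call per page would put a ceiling on page count. The context-loss concern from 2a still applies, since every pass after the first works only from summaries.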

July 26, 2024

Sankari Nair

An even larger limit: you can now summarize PDFs up to 100MB!

Sankari Nair marked this post as complete

You can now summarize PDFs of 200-300 pages!