AI Framework for Non-Native PDF Comparison

Authors

  • Maragathamani Subramaniam, Data Science Senior Analyst, Accenture, Coimbatore, Tamilnadu, India
  • Diwakar Ojha, Data Scientist, Accenture, Coimbatore, Tamilnadu, India
  • Anshuman Mahapatra, Data Science Senior Manager, Accenture, Coimbatore, Tamilnadu, India
  • Subhashini Lakshminarayan, Data Science Senior Manager, Accenture, Coimbatore, Tamilnadu, India

DOI:

https://doi.org/10.47392/IRJASH.2023.064

Keywords:

Layout Detection, PDF comparison, PDF Page labelling, Image change tracking

Abstract

Determining how a document has changed from one version to another is not always simple. We encountered this problem when comparing two PDF documents that had been edited with different tools: existing comparison tools gave unsatisfactory results. Our analysis showed that when a document had been edited with any tool other than Acrobat (a non-native PDF), these tools could not detect the document's layout (paragraphs, headers, footers, columns, tables, etc.) and therefore could not sequence its contents in the correct order, which produced false comparison output. To overcome this, we applied recent developments in computer vision to detect the document's layout. Using the layout information, we arranged the contents in the correct order and then compared them, which gave better comparison output. Using AI for layout detection also made the approach independent of how the document was created and edited. We built a complete framework that covers reading the information, detecting the layout, arranging the information, comparing it, and visualizing the differences. This framework can be applied to build a document comparison tool irrespective of document type.
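
The arrange-then-compare steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes layout detection has already produced bounding boxes with text (the `Block` structure is hypothetical), that the page has two columns split at the midline with every block lying inside one column, and it uses Python's standard `difflib` as a stand-in for the comparison step.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Block:
    """One detected layout region (hypothetical structure, not from the paper)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, page_width):
    """Sort blocks left column first, then top to bottom within each column.

    Assumption: a two-column page split at the midline, with each block
    lying entirely inside one column.
    """
    mid = page_width / 2
    return sorted(
        blocks,
        key=lambda b: (0 if (b.x0 + b.x1) / 2 < mid else 1, b.y0),
    )


def compare(old_blocks, new_blocks, page_width):
    """Return the non-equal diff operations between two ordered block lists."""
    old = [b.text for b in reading_order(old_blocks, page_width)]
    new = [b.text for b in reading_order(new_blocks, page_width)]
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two versions of the same page; the second is stored in a scrambled order,
# as can happen when a PDF has been edited with a different tool.
old_page = [
    Block(0, 0, 280, 90, "Intro"),
    Block(0, 100, 280, 190, "Method"),
    Block(320, 0, 600, 90, "Results"),
]
new_page = [
    Block(320, 0, 600, 90, "Results"),
    Block(0, 100, 280, 190, "Method v2"),
    Block(0, 0, 280, 90, "Intro"),
]
changes = compare(old_page, new_page, page_width=600)
```

Because both pages are put into reading order before the diff, only the edited block is reported as changed; comparing the blocks in their stored order would instead flag the reordering itself as differences.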

Published

2023-09-30