In this talk I describe how to build a PDF processor. The processor takes PDF files as input, uses Optical Character Recognition (OCR) to extract the text of each PDF, classifies the document into one of twenty article categories and summarises the text in an arbitrary number of sentences (three by default, selected using Latent Semantic Analysis (LSA)). It then composes the document title, classification, summary and original text into a text file, which is emailed to the user.
The dashboard shows a view of the input PDFs and output files. Documents can be uploaded by dragging and dropping them onto the left-hand side of the dashboard. The scenario that processes the input PDFs can be triggered from here, and the text file containing the classification, summary and full text will appear in the folder on the right-hand side. Please see below for further information.
Arbitrary PDF files can be placed into the input folder for classification and summarisation.
The document classification model was trained on the 20 newsgroups dataset (documentation), which is publicly available from scikit-learn.
The PDFs are first converted to image files using pdf2image (documentation). OCR is then performed using Google’s open-source Tesseract library (documentation).
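The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual recipe: it assumes the pytesseract wrapper is used to call Tesseract, and the helper names `extract_text` and `join_pages` and the `dpi` default are hypothetical.

```python
def join_pages(page_texts):
    """Join per-page OCR output into one document, dropping blank pages."""
    cleaned = [text.strip() for text in page_texts]
    return "\n\n".join(text for text in cleaned if text)


def extract_text(pdf_path, dpi=300):
    """Render each PDF page to an image, then OCR it (hypothetical helper)."""
    # Imported lazily so this module loads without the OCR dependencies.
    # pdf2image needs Poppler and pytesseract needs the Tesseract binary.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
    return join_pages(pytesseract.image_to_string(page) for page in pages)
```

Rendering at a higher resolution generally makes OCR slower but more accurate, so the `dpi` value is a trade-off to tune for the documents at hand.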
As described above, the model was trained on the 20 newsgroups dataset available from scikit-learn. The model uses a multiclass logistic regression classifier built on top of a tf-idf transformer and achieves an AUC of 0.995. Full details of the model are available here.
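For readers who want to try a comparable model outside DSS, the following is a rough scikit-learn sketch of the same idea (tf-idf features feeding a multiclass logistic regression). The hyperparameters and the function names `build_classifier` and `train_on_newsgroups` are assumptions for illustration, so the result should not be expected to reproduce the AUC quoted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_classifier():
    """tf-idf features feeding a logistic regression (assumed settings)."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )


def train_on_newsgroups():
    """Fit on the 20 newsgroups training split (downloaded on first use)."""
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train")
    model = build_classifier().fit(train.data, train.target)
    # target_names maps the integer predictions back to category names
    return model, train.target_names
```

Calling `model.predict([extracted_text])` on the OCR output then yields the index of the predicted category, and `model.predict_proba` gives the per-category probabilities.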
The text summarisation was performed using Dataiku’s Text Summarisation plugin (documentation). This plugin provides an interface to three popular text summarisation algorithms: TextRank, KL-Sum and LSA. By default this project uses LSA to select 3 sentences to summarise the document, but this can be changed using the plugin UI.
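To give a feel for what LSA-based sentence selection does, here is a simplified sketch using NumPy. It is not the plugin's implementation: the naive sentence splitter, the raw term counts and the scoring rule (singular-value-weighted length over the leading topics) are assumptions chosen to keep the example short.

```python
import re

import numpy as np


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lsa_summary(text, n_sentences=3):
    """Pick the n_sentences that load most heavily on the leading LSA topics."""
    sentences = split_sentences(text)
    if len(sentences) <= n_sentences:
        return sentences

    tokenised = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({word for words in tokenised for word in words})
    index = {word: i for i, word in enumerate(vocab)}

    # Term-by-sentence count matrix: one row per word, one column per sentence.
    matrix = np.zeros((len(vocab), len(sentences)))
    for j, words in enumerate(tokenised):
        for word in words:
            matrix[index[word], j] += 1

    # Rows of vt are latent topics, columns are sentences.
    _, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(n_sentences, len(sigma))  # keep only the strongest topics
    scores = np.sqrt(((sigma[:k, None] * vt[:k]) ** 2).sum(axis=0))

    # Take the top-scoring sentences, then restore document order.
    top = sorted(np.argsort(scores)[-n_sentences:])
    return [sentences[i] for i in top]
```

The selected sentences are returned in their original order so that the summary reads naturally.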
The top part of the flow converts the PDFs to images, extracts the text using OCR and summarises the document.
The bottom part of the flow gets the documents and labels, joins them together to create the training data, trains the model on this training data and uses the model to classify the document.
The document summary is then joined to the document classification; the joined records are cleaned up and written to an output file for each document.
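As an illustration of this final step, the sketch below composes the title, classification, summary and original text into a single text file per document. The layout, field labels and function names are hypothetical; the project's actual output format may differ.

```python
from pathlib import Path


def compose_output(title, category, summary_sentences, full_text):
    """Lay out title, classification, summary and original text as one report."""
    return "\n".join([
        f"Title: {title}",
        f"Classification: {category}",
        "",
        "Summary:",
        *summary_sentences,
        "",
        "Full text:",
        full_text,
    ])


def write_output(output_dir, title, category, summary_sentences, full_text):
    """Write one text file per document, named after the document title."""
    path = Path(output_dir) / f"{title}.txt"
    report = compose_output(title, category, summary_sentences, full_text)
    path.write_text(report, encoding="utf-8")
    return path
```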
The flow can be run using the scenario ProcessDocument. This scenario will extract the text from all input PDFs, classify and summarise each document, create the output file and email this to the user. By default, text extraction will not be re-performed if the PDF has been processed previously (so previously processed PDFs will not be cleared even if they are removed from the input), but this behaviour can be modified by changing the project variable ‘reprocess_PDFs’ to ‘True’ in the first step of the scenario.
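The skip-if-already-processed behaviour can be pictured as a simple set difference. The sketch below assumes the ‘reprocess_PDFs’ variable is stored as a string, and the function name `pdfs_to_extract` is hypothetical; it only illustrates the logic, not how the scenario step is actually written.

```python
def pdfs_to_extract(input_pdfs, already_extracted, reprocess_pdfs="False"):
    """Return the PDFs that need OCR on this run.

    `reprocess_pdfs` mirrors the project variable, assumed to be a string.
    """
    if str(reprocess_pdfs).strip().lower() == "true":
        # Reprocess everything currently in the input folder.
        return sorted(input_pdfs)
    # Otherwise only extract text for PDFs not seen before.
    return sorted(set(input_pdfs) - set(already_extracted))
```

For example, with `a.pdf` and `b.pdf` in the input folder and `a.pdf` already extracted, only `b.pdf` is returned unless the variable is set to ‘True’.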
Alternatively, when a new PDF is dropped into the input folder, DSS will automatically recognise that the input has changed, run the flow and email the user the end result. DSS is currently set to check the input folder every 2 minutes and will run the flow if it detects that the input has changed and that no further changes have been made over the following minute.
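The trigger condition amounts to a "changed, then quiet for a minute" check. The sketch below expresses that rule as a plain function with timestamps in seconds; it is an illustration of the logic only, and the names are hypothetical rather than part of DSS.

```python
CHECK_INTERVAL_S = 120  # how often the input folder is checked
GRACE_PERIOD_S = 60     # quiet time required after the last change


def should_run(last_change, last_run, now, grace=GRACE_PERIOD_S):
    """True when the input changed since the last run and has since settled."""
    changed_since_run = last_run is None or last_change > last_run
    settled = now - last_change >= grace
    return changed_since_run and settled
```

A caller would evaluate `should_run` once every `CHECK_INTERVAL_S` seconds. Waiting for the grace period avoids starting a run while several PDFs are still being uploaded.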