How to extract text from a PDF file with Apache PDFBox

Question

I would like to extract the text from a given PDF file with Apache PDFBox.

I wrote this code:

    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File(filepath);

    PDFParser parser = new PDFParser(new FileInputStream(file));
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(5);
    String parsedText = pdfStripper.getText(pdDoc);
    System.out.println(parsedText);

However, I got the following error:

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.

Edit

I added System.out.println("program starts"); at the beginning of the program.

I ran it, got the same error as above, and "program starts" did not appear in the console.

So I think I have a problem with the class path or something similar.

Thank you.

Emad · Accepted Answer

I ran your code and it worked fine. Maybe your problem is related to the file path you passed to File. I put my PDF on the C: drive and hard-coded the file path. Here is my code:

    // Imports for PDFBox 1.8.x
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;
    // PDFBox 2.0.8 requires org.apache.pdfbox.io.RandomAccessRead
    // import org.apache.pdfbox.io.RandomAccessFile;

    public class PDFReader {
        public static void main(String[] args) {
            PDFTextStripper pdfStripper = null;
            PDDocument pdDoc = null;
            COSDocument cosDoc = null;
            File file = new File("C:/my.pdf");
            try {
                // PDFBox 2.0.8 requires org.apache.pdfbox.io.RandomAccessRead
                // RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
                // PDFParser parser = new PDFParser(randomAccessFile);
                PDFParser parser = new PDFParser(new FileInputStream(file));
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                // Extract pages 1 to 5 (1-based, inclusive)
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
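To test the class path hypothesis from the question, it can help to print what the JVM actually sees. Below is a minimal sketch using only the JDK; the class name ClasspathCheck and the method locationOf are made up for illustration and are not part of PDFBox. It prints the effective class path and the jar or directory a given class was loaded from.

```java
import java.security.CodeSource;

// Hypothetical helper, not part of PDFBox: shows where a class is loaded from.
class ClasspathCheck {

    /** Returns the jar or directory a class came from, or a note if none is recorded. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typically the case for core JDK classes such as java.lang.String
            return "(no code source recorded)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // Pass a fully qualified class name, e.g. org.apache.pdfbox.pdmodel.PDDocument
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

If nothing from this class is printed either, the run configuration is probably launching a different main class. The stack trace in the question contains only AFMParser.main, which points in that direction.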

Matthias Braun · Answer

With PDFBox 2.0.7, this is how you get the text of a PDF:

    static String getText(File pdfFile) throws IOException {
        // try-with-resources closes the document once the text has been read
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }

Call it like this:

    try {
        String text = getText(new File("/home/me/test.pdf"));
        System.out.println("Text in PDF: " + text);
    } catch (IOException e) {
        e.printStackTrace();
    }

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

    dependencies {
        // On Gradle 7 and later, use 'implementation' instead of 'compile'
        compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
    }

Here's more on dependency management using Gradle.

If you want to preserve the layout of the PDF in the extracted text, give PDFLayoutTextStripper a try.

sonus21 · Answer

PDFBox 2.0.3 also has a command-line tool.

Download the jar file, then run:

    java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

Options:

    -password <password>        : Password to decrypt document
    -encoding <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
    -console                    : Send text to console instead of file
    -html                       : Output in HTML format instead of raw text
    -sort                       : Sort the text before writing
    -ignoreBeads                : Disables the separation by beads
    -debug                      : Enables debug output about the time consumption of every stage
    -startPage <number>         : The first page to start extraction (1 based)
    -endPage <number>           : The last page to extract (inclusive)
    <inputfile>                 : The PDF document to use
    [output-text-file]          : The file to write the text to
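If you want to drive the command-line tool from a Java program, the argument list can be assembled as in the sketch below. It uses only the JDK and only the -startPage and -endPage options listed above; the class ExtractTextCommand, its build method, and the file names are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles the ExtractText command line described above.
class ExtractTextCommand {

    static List<String> build(String appJar, String inputPdf, String outputTxt,
                              int startPage, int endPage) {
        // Pages are 1-based and the end page is inclusive
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException("invalid page range");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));
        cmd.add(inputPdf);
        cmd.add(outputTxt);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "input.pdf", "output.txt", 1, 5);
        System.out.println(String.join(" ", cmd));
        // To run it for real (the jar and the input file must exist):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list to ProcessBuilder avoids quoting problems with paths that contain spaces.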

Farkas Csanád · Answer

Maven dependency:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.9</version>
    </dependency>

Then, the function that returns the PDF text as a string:

    private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);

                // Split the text into lines on line breaks
                String[] lines = pdfFileInText.split("\\r?\\n");
                StringBuilder sb = new StringBuilder();
                for (String line : lines) {
                    System.out.println(line);
                    sb.append(line).append("\n");
                }
                return sb.toString();
            }
        }
        // Encrypted documents are not read
        return null;
    }

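The line-splitting step in the function above can also be written with the \R regex, which matches any line break (\r\n, \n, \r and a few others) in Java 8 and later. A minimal sketch using only the JDK; LineSplitter and splitLines are hypothetical names, not PDFBox API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits extracted text into lines regardless of line-ending style.
class LineSplitter {

    static List<String> splitLines(String text) {
        // \R treats "\r\n" as a single line break, so no empty lines are produced for it
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        for (String line : splitLines("first\r\nsecond\nthird")) {
            System.out.println(line);
        }
    }
}
```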
Sunil K Chaudhary · Answer

This works well for extracting data from a PDF file that contains text content, using PDFBox 2.0.6.

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFTextExtractor {

        public static void main(String[] args) throws IOException {
            // Arguments: file path, page number, start text, end text
            System.out.println(readParaFromPDF("C:\\sample1.pdf", 3,
                    "Enter Start Text Here", "Enter Ending Text Here"));
        }

        public static String readParaFromPDF(String pdfPath, int pageNo,
                String strStartIdentifier, String strEndIdentifier) {
            String returnString = "";
            // try-with-resources closes the document when done
            try (PDDocument document = PDDocument.load(new File(pdfPath))) {
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    // Restrict extraction to the requested page
                    tStripper.setStartPage(pageNo);
                    tStripper.setEndPage(pageNo);
                    String pdfFileInText = tStripper.getText(document);
                    int startIndex = pdfFileInText.indexOf(strStartIdentifier);
                    int endIndex = pdfFileInText.indexOf(strEndIdentifier);
                    // substring throws if a marker is missing; the catch below handles that
                    returnString = pdfFileInText.substring(startIndex, endIndex) + strEndIdentifier;
                }
            } catch (Exception e) {
                returnString = "No ParaGraph Found";
            }
            return returnString;
        }
    }
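The marker search above looks for the end text from the beginning of the page, so it fails when the end text also occurs before the start text. Here is a sketch of a slightly more defensive variant that works on any already extracted string and needs only the JDK. MarkerExtractor and between are hypothetical names, and returning an Optional instead of a fixed error string is a design choice of this sketch, not something from the original answer.

```java
import java.util.Optional;

// Hypothetical helper: returns the text from the start marker through the end marker.
class MarkerExtractor {

    static Optional<String> between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return Optional.empty(); // start marker not found
        }
        // Search for the end marker only after the start marker
        int to = text.indexOf(end, from + start.length());
        if (to < 0) {
            return Optional.empty(); // no end marker after the start marker
        }
        return Optional.of(text.substring(from, to + end.length()));
    }

    public static void main(String[] args) {
        String page = "intro BEGIN body END outro";
        System.out.println(between(page, "BEGIN", "END").orElse("No paragraph found"));
    }
}
```

Like the original, the result includes both markers.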