org.apache.pdfbox.util (pdfbox-1.1.0-src)
public class PDFTextStripper
java.lang.Object
   org.apache.pdfbox.util.PDFStreamEngine
      org.apache.pdfbox.util.PDFTextStripper

Direct Known Subclasses:
    PDFText2HTML, PDFHighlighter, PDFTextStripperByArea, PrintTextLocations

This class takes a PDF document and strips out all of the text, ignoring the formatting. Please note: it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, we fully process each page and then print it.
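The flow described above (a document is broken into pages, and subclasses hook into the start and end of each unit) can be pictured with a small stand-in. This sketch does not use PDFBox: MiniStripper, TaggingStripper and the string-based pages are hypothetical, and only the hook names mirror the startDocument, startPage, endPage and endDocument methods documented for this class.

```java
import java.util.List;

// Hypothetical stand-in for the stripper's hook-based flow; not PDFBox code.
class MiniStripper {
    protected final StringBuilder output = new StringBuilder();

    // Hooks: the default implementation does nothing; subclasses may add output.
    protected void startDocument() {}
    protected void endDocument() {}
    protected void startPage(int pageNo) {}
    protected void endPage(int pageNo) {}

    // Work on progressively smaller chunks: document, then each page.
    public String getText(List<String> pages) {
        output.setLength(0);
        startDocument();
        for (int i = 0; i < pages.size(); i++) {
            int pageNo = i + 1; // pages are numbered from 1
            startPage(pageNo);
            output.append(pages.get(i));
            endPage(pageNo);
        }
        endDocument();
        return output.toString();
    }
}

// A subclass that marks page boundaries by overriding two hooks.
class TaggingStripper extends MiniStripper {
    @Override
    protected void startPage(int pageNo) {
        output.append("[page ").append(pageNo).append("]");
    }

    @Override
    protected void endPage(int pageNo) {
        output.append("[/page]");
    }
}
```

Subclasses such as the ones listed under Direct Known Subclasses presumably follow a similar pattern, overriding hooks to emit extra output around each unit.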
Field Summary
protected  Vector charactersByArticle    Used to extract text by article divisions. For example, in a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of charactersByArticle would be 5, because not all text on the page will fall into one of the articles. The five divisions are:
    1. text before the first article
    2. first article text
    3. text between the first article and the second article
    4. second article text
    5. text after the second article
    Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
protected  String lineSeparator    The platform's line separator.
protected  String outputEncoding    The encoding that text will be written in (or null).
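The arithmetic in the charactersByArticle description (2 beads give 5 divisions, no beads give 1) can be written down as a small helper. This is only a sketch: ArticleDivisions is a hypothetical class, and the general formula 2n + 1 is an extrapolation from the two cases given here, not something the documentation states.

```java
// Hypothetical helper illustrating how bead count relates to division count.
final class ArticleDivisions {
    private ArticleDivisions() {}

    // One division per bead, plus the text before, between, and after the beads.
    // Assumption: the pattern generalises as 2 * beadCount + 1.
    static int divisionCount(int beadCount) {
        if (beadCount < 0) {
            throw new IllegalArgumentException("beadCount must be >= 0");
        }
        return 2 * beadCount + 1;
    }

    // 0-based index of the division that holds a bead's own text,
    // under the same assumed layout (beadIndex is 0-based).
    static int divisionOfBead(int beadIndex) {
        return 2 * beadIndex + 1;
    }
}
```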
protected  PDDocument document    The document to read. 
protected  Writer output    The stream to write the output to. 
Constructor:
 public PDFTextStripper() throws IOException 
    Instantiate a new PDFTextStripper object. This object will load properties from Resources/PDFTextStripper.properties and will not do anything special to convert the text to a more encoding-specific output.
    Throws:
    IOException - If there is an error loading the properties.
 public PDFTextStripper(Properties props) throws IOException 
    Instantiate a new PDFTextStripper object, loading all of the operator mappings from the properties object that is passed in. Does not convert the text to more encoding-specific output.
    Parameters:
    props - The properties containing the mapping of operators to PDFOperator classes.
    Throws:
    IOException - If there is an error reading the properties.
 public PDFTextStripper(String encoding) throws IOException 
    Instantiate a new PDFTextStripper object. This object will load properties from Resources/PDFTextStripper.properties and will apply encoding-specific conversions to the output text.
    Parameters:
    encoding - The encoding that the output will be written in.
    Throws:
    IOException - If there is an error reading the properties.
Method from org.apache.pdfbox.util.PDFTextStripper Summary:
endArticle,   endDocument,   endPage,   getAverageCharTolerance,   getCharactersByArticle,   getCurrentPageNo,   getEndBookmark,   getEndPage,   getLineSeparator,   getOutput,   getPageSeparator,   getSpacingTolerance,   getStartBookmark,   getStartPage,   getText,   getText,   getWordSeparator,   processPage,   processPages,   processTextPosition,   setAverageCharTolerance,   setEndBookmark,   setEndPage,   setLineSeparator,   setPageSeparator,   setShouldSeparateByBeads,   setSortByPosition,   setSpacingTolerance,   setStartBookmark,   setStartPage,   setSuppressDuplicateOverlappingText,   setWordSeparator,   shouldSeparateByBeads,   shouldSortByPosition,   shouldSuppressDuplicateOverlappingText,   startArticle,   startArticle,   startDocument,   startPage,   writeCharacters,   writeLineSeparator,   writePage,   writePageSeperator,   writeString,   writeText,   writeText,   writeWordSeparator
Methods from org.apache.pdfbox.util.PDFStreamEngine:
getColorSpaces,   getCurrentPage,   getFonts,   getGraphicsStack,   getGraphicsState,   getGraphicsStates,   getResources,   getTextLineMatrix,   getTextMatrix,   getTotalCharCnt,   getValidCharCnt,   getXObjects,   processEncodedText,   processOperator,   processOperator,   processStream,   processSubStream,   processTextPosition,   registerOperatorProcessor,   resetEngine,   setColorSpaces,   setFonts,   setGraphicsStack,   setGraphicsState,   setGraphicsStates,   setTextLineMatrix,   setTextMatrix
Methods from java.lang.Object:
clone,   equals,   finalize,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.pdfbox.util.PDFTextStripper Detail:
 protected  void endArticle() throws IOException 
    End an article. Default implementation is to do nothing. Subclasses may provide additional information.
 protected  void endDocument(PDDocument pdf) throws IOException 
    This method is available for subclasses of this class. It will be called after processing of the document finishes.
 protected  void endPage(PDPage page) throws IOException 
    End a page. Default implementation is to do nothing. Subclasses may provide additional information.
 public float getAverageCharTolerance() 
    Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
 protected List getCharactersByArticle() 
    Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects; the inner lists contain TextPosition objects.
 protected int getCurrentPageNo() 
    Get the current page number that is being processed.
 public PDOutlineItem getEndBookmark() 
    Get the bookmark where text extraction should end, inclusive. Default is null.
 public int getEndPage() 
    This will get the last page that will be extracted. This is inclusive; for example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
 public String getLineSeparator() 
    This will get the line separator.
 protected Writer getOutput() 
    The output stream that is being written to.
 public String getPageSeparator() 
    This will get the page separator.
 public float getSpacingTolerance() 
    Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
 public PDOutlineItem getStartBookmark() 
    Get the bookmark where text extraction should start, inclusive. Default is null.
 public int getStartPage() 
    This is the page that the text extraction will start on. Page numbering starts at 1. For example, in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
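The start and end page values are both 1-based and inclusive. A minimal sketch of that range rule, using a hypothetical PageRange helper rather than the stripper itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper modelling the inclusive, 1-based page range.
final class PageRange {
    private PageRange() {}

    // Returns the page numbers that fall inside [startPage, endPage].
    static List<Integer> pagesToExtract(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (page >= startPage && page <= endPage) {
                pages.add(page);
            }
        }
        return pages;
    }
}
```

With the documented defaults (start page 1, end page Integer.MAX_VALUE) every page of a 5 page document is selected; a start page of 4 selects pages 4 and 5, and an end page of 2 selects pages 1 and 2.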
 public String getText(PDDocument doc) throws IOException 
    This will return the text of a document. See writeText.
    NOTE: The document must not be encrypted when coming into this method.
 public String getText(COSDocument doc) throws IOException 
Deprecated!
 public String getWordSeparator() 
    This will get the word separator.
 protected  void processPage(PDPage page,
    COSStream content) throws IOException 
    This will process the contents of a page.
 protected  void processPages(List pages) throws IOException 
    This will process all of the pages and the text that is in them.
 protected  void processTextPosition(TextPosition text) 
    This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
 public  void setAverageCharTolerance(float averageCharToleranceValue) 
    Set the character width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
 public  void setEndBookmark(PDOutlineItem aEndBookmark) 
    Set the bookmark where the text extraction should stop.
 public  void setEndPage(int endPageValue) 
    This will set the last page to be extracted by this class.
 public  void setLineSeparator(String separator) 
    Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.
 public  void setPageSeparator(String separator) 
    Set the desired page separator for output text. The line.separator system property is used if the page separator preference is not set explicitly using this method.
 public  void setShouldSeparateByBeads(boolean aShouldSeparateByBeads) 
    Set whether the text stripper should group the text output by a list of beads. The default value is true.
 public  void setSortByPosition(boolean newSortByPosition) 
    The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text first, then make a second pass and write out the normal text. A PDF writer could even choose to write each character in a different order.
    The default is to not sort by position: for performance reasons, PDFBox does not sort the text tokens before processing them.
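To see why sorting by position matters, consider a writer that emits the bold headings first and the body text afterwards. The sketch below is illustrative only: PlacedToken and PositionSort are hypothetical, the y coordinate is assumed to grow downward from the top of the page, and the ordering rule (top to bottom, then left to right) is a simplification rather than the stripper's own comparator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical token: a piece of text with the position where it is drawn.
final class PlacedToken {
    final String text;
    final float x;
    final float y; // assumption: measured downward from the top of the page

    PlacedToken(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class PositionSort {
    private PositionSort() {}

    // Orders tokens top to bottom, then left to right, and joins them with spaces.
    static String joinInReadingOrder(List<PlacedToken> tokensInFileOrder) {
        List<PlacedToken> sorted = new ArrayList<>(tokensInFileOrder);
        Collections.sort(sorted, new Comparator<PlacedToken>() {
            @Override
            public int compare(PlacedToken a, PlacedToken b) {
                int byRow = Float.compare(a.y, b.y);
                return byRow != 0 ? byRow : Float.compare(a.x, b.x);
            }
        });
        StringBuilder joined = new StringBuilder();
        for (PlacedToken token : sorted) {
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(token.text);
        }
        return joined.toString();
    }
}
```

The copy and sort on every page is extra work, which is consistent with sorting being off by default.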
 public  void setSpacingTolerance(float spacingToleranceValue) 
    Set the space width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
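The two tolerance settings both scale a reference width into a gap threshold, which is why larger values produce fewer spaces. The following is a simplified model, not the stripper's actual code: SpaceHeuristic and its parameters are hypothetical, and combining the two thresholds with Math.min is an assumption made for illustration.

```java
// Hypothetical model of a gap-based "insert a space here?" decision.
final class SpaceHeuristic {
    private SpaceHeuristic() {}

    // gap: horizontal distance between the end of one glyph and the start of the next.
    // A space is inserted when the gap exceeds the smaller of the two scaled widths.
    static boolean insertsSpace(float gap,
                                float spaceWidth, float spacingTolerance,
                                float averageCharWidth, float averageCharTolerance) {
        float bySpaceWidth = spaceWidth * spacingTolerance;
        float byCharWidth = averageCharWidth * averageCharTolerance;
        return gap > Math.min(bySpaceWidth, byCharWidth);
    }
}
```

In this model a gap of 2 units counts as a space when the thresholds work out to about 1.5, but not when the tolerances are raised so that the smaller threshold becomes 3.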
 public  void setStartBookmark(PDOutlineItem aStartBookmark) 
    Set the bookmark where text extraction should start, inclusive.
 public  void setStartPage(int startPageValue) 
    This will set the first page to be extracted by this class.
 public  void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) 
    By default the text stripper will attempt to remove text that overlaps. For example, Word paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but performance will be better.
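A rough picture of duplicate suppression: a character is dropped when the same character has already been seen at nearly the same position. The sketch below is a hypothetical model (Glyph, OverlapFilter and the tolerance parameter are invented for illustration), not the stripper's actual overlap test.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical glyph: one character and the position where it is painted.
final class Glyph {
    final char c;
    final float x;
    final float y;

    Glyph(char c, float x, float y) {
        this.c = c;
        this.x = x;
        this.y = y;
    }
}

final class OverlapFilter {
    private OverlapFilter() {}

    // Builds the extracted string, optionally skipping repeated paints of a glyph.
    static String strip(List<Glyph> glyphs, float tolerance, boolean suppressDuplicates) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            if (suppressDuplicates) {
                // Compare against every glyph kept so far; this is the extra cost.
                for (Glyph k : kept) {
                    if (k.c == g.c
                            && Math.abs(k.x - g.x) < tolerance
                            && Math.abs(k.y - g.y) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.c);
            }
        }
        return out.toString();
    }
}
```

The inner comparison loop only runs when suppression is on, which illustrates the trade-off between duplicated output and speed.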
 public  void setWordSeparator(String separator) 
    Set the desired word separator for output text. The PDFBox text extraction algorithm will output the word separator if there is enough space between two words. By default a space character is used. If you need an accurate count of the characters that are found in a PDF document, you might want to set the word separator to the empty string.
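The effect of the word separator on a character count can be shown with plain strings. WordJoin below is a hypothetical helper that stands in for the stripper's output step; the words are assumed to be already detected.

```java
import java.util.List;

// Hypothetical helper: joins detected words with a configurable separator.
final class WordJoin {
    private WordJoin() {}

    static String join(List<String> words, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                out.append(wordSeparator); // inserted by the extractor, not present in the PDF
            }
            out.append(words.get(i));
        }
        return out.toString();
    }
}
```

Joining "Hello" and "world" with a space yields 11 characters, while the empty separator yields 10, which matches the number of characters actually drawn.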
 public boolean shouldSeparateByBeads() 
    This will tell if the text stripper should separate by beads.
 public boolean shouldSortByPosition() 
    This will tell if the text stripper should sort the text tokens before writing to the stream.
 public boolean shouldSuppressDuplicateOverlappingText() 
    This will tell if the text stripper should suppress duplicate overlapping text.
 protected  void startArticle() throws IOException 
    Start a new article, which is typically defined as a column on a single page (also referred to as a bead). This assumes that the primary direction of text is left to right. Default implementation is to do nothing. Subclasses may provide additional information.
 protected  void startArticle(boolean isltr) throws IOException 
    Start a new article, which is typically defined as a column on a single page (also referred to as a bead). Default implementation is to do nothing. Subclasses may provide additional information.
 protected  void startDocument(PDDocument pdf) throws IOException 
    This method is available for subclasses of this class. It will be called before processing of the document starts.
 protected  void startPage(PDPage page) throws IOException 
    Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.
 protected  void writeCharacters(TextPosition text) throws IOException 
    Write the string in TextPosition to the output stream.
 protected  void writeLineSeparator() throws IOException 
    Write the line separator value to the output stream.
 protected  void writePage() throws IOException 
    This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
 protected  void writePageSeperator() throws IOException 
    Write the page separator value to the output stream.
 protected  void writeString(String text) throws IOException 
    Write a Java string to the output stream.
 public  void writeText(COSDocument doc,
    Writer outputStream) throws IOException 
Deprecated!
 public  void writeText(PDDocument doc,
    Writer outputStream) throws IOException 
    This will take a PDDocument and write the text of that document to the given writer.
 protected  void writeWordSeparator() throws IOException 
    Write the word separator value to the output stream.