Get all PDF strings

Not all strings in a PDF document are easy to extract. Some of them are part of an embedded image, in which case we have to use external OCR tools.

If the string is selectable in Acrobat Reader, you can use the methods below to extract it for indexing, searching, etc.

We use the popular iTextSharp library.

------------------------------------------------

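// Requires: using System.IO; using iTextSharp.text.pdf;
// src is the path of the input PDF, dest the path of the output text file.
PdfReader reader = new PdfReader(src);

// Raw content stream of page 1 (page numbers are 1-based).
byte[] streamBytes = reader.GetPageContent(1);
// Note: some newer iTextSharp 5.x releases expect a RandomAccessFileOrArray here instead of a byte[].
PRTokeniser tokenizer = new PRTokeniser(streamBytes);
// Append to dest; the using block flushes and closes the writer, even on errors.
using (StreamWriter streamWriter = new StreamWriter(dest, true))
{
    while (tokenizer.NextToken())
    {
        // Keep only string tokens; these are raw operands and may not be decoded text for every font.
        if (tokenizer.TokenType == PRTokeniser.TokType.STRING)
        {
            streamWriter.Write(tokenizer.StringValue);
        }
    }
}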

---------------------------------------------------------------------
Another way is to use the content parser together with a text extraction strategy:

---------------------------------------------------------------------

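// Requires: using iTextSharp.text.pdf.parser;
// Reuses the PdfReader opened above and processes page 1.
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
LocationTextExtractionStrategy strategy = parser.ProcessContent(1, new LocationTextExtractionStrategy());
// The strategy orders the text chunks by their position on the page.
string s = strategy.GetResultantText();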
---------------------------------------------------------------------
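Both snippets above only read page 1. To get the strings of the whole document, the same extraction can be repeated for every page. The following is a minimal sketch, assuming iTextSharp 5.x; the class name PdfTextDump and the method ExtractAllText are made up for this example and are not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextDump
{
    // Hypothetical helper: returns the text of every page, one page after another.
    public static string ExtractAllText(string src)
    {
        StringBuilder text = new StringBuilder();
        PdfReader reader = new PdfReader(src);
        try
        {
            // Page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so chunks from earlier pages are not mixed in.
                string pageText = PdfTextExtractor.GetTextFromPage(
                    reader, page, new LocationTextExtractionStrategy());
                text.AppendLine(pageText);
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```

Calling PdfTextDump.ExtractAllText(src) returns one string that can then be written to a file or passed to an indexer.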
The high-level API will not always suffice. In that case you will have to adapt the LocationTextExtractionStrategy to fit your needs.
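One way to tailor the extraction without rewriting the strategy is to wrap it in a filter. The sketch below, assuming iTextSharp 5.x, keeps only the text inside a rectangle on the page. The class name RegionTextDump, the method ExtractRegion and its parameters are invented for this example; if you need different ordering or spacing rules, subclassing the strategy is probably still required.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class RegionTextDump
{
    // Hypothetical helper: extracts only the text inside the given rectangle
    // (x, y, width, height in PDF user-space units) of one page.
    public static string ExtractRegion(string src, int page,
                                       float x, float y, float width, float height)
    {
        PdfReader reader = new PdfReader(src);
        try
        {
            System.util.RectangleJ region = new System.util.RectangleJ(x, y, width, height);
            // Text outside the region is dropped before it reaches the wrapped strategy.
            RenderFilter[] filters = new RenderFilter[] { new RegionTextRenderFilter(region) };
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filters);
            return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The same pattern works with other filters, so the choice of a region filter here is only an illustration.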
