PDF extract text


Posted

Hi all,

I want to be able to extract the text from PDFs from within LV. I thought this might be a common requirement, but searching for "PDF" here returns nil, and Googling fares little better. I've discovered that extracting text from a PDF is probably not easy in any language: the text seems to be contained in a very heavily encoded data stream. I doubt I could write an LV algorithm that could do the extraction well (or even badly, for that matter). My thoughts turn to interfacing with an existing DLL. Numerous PDF-to-text DLLs exist on the Web, but I don't know C and its variants so don't really understand DLLs, to be honest.

A company called Softinterface seem the most likely key to success. They've got some good stuff. I've had a play with some of their DLLs and the LV Import Library wizard, which created some VIs, but I couldn't get them to do anything (oddly, the header file for the DLL defined only a few of the functions that the DLL appeared to support).

I eventually got something going via the aforementioned company's product ConvertDoc. This is a PDF-to-text GUI application that also has a command-line interface. So, I send the command-line parameters to it via the System Exec VI calling the cmd console. It extracts the text from the PDF and writes it to a text file (not surprisingly). LV then reads the text file and does what I want with the text.

So, I can do it but only with the clunkiest of methods. Can anyone point me in the direction of a slicker method of extracting the text from a PDF? Many thanks in anticipation.

Regards, Graeme.
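
For anyone reading later who wants the shape of that workaround outside LV, here is a minimal Python sketch of the same pattern: run an external converter, wait for it to finish, then read the text file it wrote. The function name `convert_and_read` and its parameters are made up for illustration, and the converter's actual command-line arguments are not shown in this thread, so the command is left to the caller.

```python
import subprocess
from pathlib import Path


def convert_and_read(command, output_file, timeout_s=60):
    """Run an external converter, then read the text file it wrote.

    `command` is the full argument list for the converter (hypothetical here;
    the real ConvertDoc arguments are not given in the thread).
    """
    output_path = Path(output_file)
    # Remove any stale output so a previous run's result is never read by mistake.
    output_path.unlink(missing_ok=True)
    # check=True raises CalledProcessError if the converter exits with an error code.
    subprocess.run(command, check=True, timeout=timeout_s)
    if not output_path.exists():
        raise FileNotFoundError(f"converter produced no output at {output_path}")
    # errors="replace" keeps going if the converter emits bytes that are not valid UTF-8.
    return output_path.read_text(encoding="utf-8", errors="replace")
```

Deleting the old output file first and checking the exit code are the two things most worth copying into the LV version, since otherwise a failed conversion can silently hand back the text from an earlier PDF.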

Posted

How about pdftotext, from the xpdf tools?

http://en.wikipedia.org/wiki/Pdftotext

I haven't done this, but a command line invocation doesn't seem unreasonable... though I suspect it depends on the complexity of the pdf document you're looking at.

Joe Z.
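
A rough sketch of what that command-line call could look like, written in Python rather than LV. It assumes `pdftotext` is installed and on the PATH; the function names are invented for illustration. Passing `-` as the output file asks pdftotext to write to stdout, which avoids the intermediate text file, and `-layout` and `-enc UTF-8` are optional flags for keeping the page layout and fixing the output encoding.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    # -enc UTF-8 fixes the output encoding; "-" sends the text to stdout.
    cmd += ["-enc", "UTF-8", str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path, keep_layout=True):
    """Return the text of a PDF, assuming pdftotext is available on PATH."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

From LV the equivalent would be System Exec with the same arguments, reading the text from its standard output terminal. How clean the result is will still depend on the PDF itself.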

Posted (edited)

Thanks Joe Z, I'll check this out. You're right that a command-line approach is not unreasonable at all, and if it works consistently, does it really matter how the job is done? I'd just envisaged that a neat solution might be a wrapper VI interfacing with a DLL, taking the target PDF filename as input and outputting a pink wire, namely the extracted text. Maybe I'm expecting too much.

I always think calling third-party applications is a chore because you have to, er...well, install the third-party application. That said, if you call a DLL you have to put it somewhere, so maybe there's no difference. Perhaps the ultimate solution is an LV coding challenge: in native LV, produce a BD that extracts the text from a PDF, though you can count me out here! Ideas still welcome. Thanks guys (and girls).

Graeme

Edited by Graeme

Posted

I found this post a few months back and wrote this VI to extract text from a PDF file. It uses PDFBox, which "is a Java PDF Library [which] will allow access to all of the components in a PDF document."

pdf_play.vi

  • 7 years later...
Posted

Hi,

I am trying to extract some text from a PDF file. Attached are a file and a library. In LabVIEW 2012, it works for a while until I get error 1172 (below). How do I solve this?

In LabVIEW 2014, I get error 1386 (right at the beginning). How do I solve both cases?

Thank you,

Error 1172

Error calling method org.pdfbox.util.PDFTextStripper.getText, (System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.

Inner Exception: System.NullReferenceException: Object reference not set to an instance of an object.) <append><b>System.NullReferenceException</b> in pdfExtract.vi

pdf extract.zip

Posted

Tried loading the code in LV2012 and 2015; in both there was an error attempting to load the .NET control from PDFBox-0.7.3.dll. I expect some other DLL is needed, which is a rabbit hole to start going down.

With no documentation and no knowledge of the contents of the .NET control, it is very difficult to provide suggestions. With LV2012, the .NET control attempts to use an object which is NULL, so it throws an unhandled exception that is reported back up to LabVIEW. There is no information as to which object is the issue. Without being able to see the properties or methods, it's impossible to work out what is missing.

The error 1386, which I had to look up as "The specified .NET class is not available in LabVIEW.", implies that something is missing or broken in the .NET control. I'm expecting it's missing a file.
