Graeme Posted April 13, 2010

Hi all,

I want to be able to extract the text from PDFs from within LV. I thought this might be a common requirement, but searching for "PDF" here returns nil, and Googling fares little better. I've discovered that extracting text from a PDF is probably not easy in any language: the text seems to be contained in a very heavily encoded data stream. I doubt I could write an LV algorithm that could do the extraction well (or even badly, for that matter).

My thoughts turn to interfacing with an existing DLL. Numerous PDF-to-text DLLs exist on the Web, but I don't know C and its variants, so I don't really understand DLLs, to be honest. A company called Softinterface seems the most likely key to success. They've got some good stuff. I've had a play with some of their DLLs and the LV Import Library wizard, which created some VIs, but I couldn't get them to do anything (oddly, the header file for the DLL defined only a few of the functions that the DLL appeared to support).

I eventually got something going via the aforementioned company's product ConvertDoc. This is a PDF-to-text GUI application that also has a command line. So, I send the command line parameters to it via the System Exec VI calling the cmd console. It extracts the text from the PDF and writes it to a text file (not surprisingly). LV then reads the text file and does what I want with the text.

So, I can do it, but only with the clunkiest of methods. Can anyone point me in the direction of a slicker method of extracting the text from a PDF?

Many thanks in anticipation.

Regards, Graeme.
jzoller Posted April 13, 2010

How about pdftotext, from the xpdf tools? http://en.wikipedia.org/wiki/Pdftotext

I haven't done this, but a command line invocation doesn't seem unreasonable... though I suspect it depends on the complexity of the PDF document you're looking at.

Joe Z.
Graeme Posted April 13, 2010 (edited)

Thanks Joe Z, I'll check this out.

You're right in that a command line approach is not unreasonable at all, and if it works consistently, does it really matter how the job is done? I'd just envisaged that a neat solution might be a wrapper VI just interfacing with a DLL, inputting the target PDF filename and outputting a pink wire, namely the extracted text. Maybe I'm expecting too much.

I always think calling third party applications is a chore because you have to, er...well, install the third party application. That said, if you call a DLL you have to put it somewhere, so maybe there's no difference.

Perhaps the ultimate solution is an LV coding challenge! In native LV, produce a BD that extracts the text from a PDF, though you can count me out here!

Ideas still welcome. Thanks guys (and girls).

Graeme

Edited April 13, 2010 by Graeme
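For anyone wanting to try the command line route before wiring it into a System Exec VI, here is a minimal Python sketch of the "filename in, text out" wrapper described above. It assumes pdftotext from the xpdf tools is installed and on the PATH. The function names `build_pdftotext_command` and `extract_pdf_text` are made up for illustration, and the UTF-8 decoding is an assumption, since the output encoding depends on how pdftotext is built and configured. Passing `-` as the output file asks pdftotext to write to stdout, which avoids the intermediate text file.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def extract_pdf_text(pdf_path, keep_layout=False):
    """Return the text of a PDF as a string by running pdftotext."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"no such PDF: {pdf_path}")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_pdf_text("report.pdf")` (a hypothetical filename) should return the extracted text as one string. The same argument list is what you would pass to System Exec from LV, reading standard output instead of a temporary file.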
jcarmody Posted April 14, 2010

I found this post a few months back and wrote this VI to extract text from a PDF file. It uses PDFBox, which "is a Java PDF Library [which] will allow access to all of the components in a PDF document."

Attachment: pdf_play.vi
Ava Posted January 8, 2018

Hi,

I am trying to extract some text from a PDF file. Attached is a file and a library.

In LabVIEW 2012, it works for a bit until I get error 1172 (below). How do I solve this?

In LabVIEW 2014, I get error 1386 (right at the beginning).

How do I solve both cases?

Thank you,

Error 1172: Error calling method org.pdfbox.util.PDFTextStripper.getText, (System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. Inner Exception: System.NullReferenceException: Object reference not set to an instance of an object.) <append><b>System.NullReferenceException</b> in pdfExtract.vi

Attachment: pdf extract.zip
Tim_S Posted January 8, 2018

I tried loading the code in LV2012 and LV2015; in both there was an error attempting to load the .NET control from PDFBox-0.7.3.dll. I expect some other DLL is needed, which is a rabbit hole to start going down. With no documentation or knowledge of the contents of the .NET control, it is very difficult to provide suggestions.

With LV2012, the .NET control attempts to use an object which is NULL, so it throws an unhandled exception that is reported back up to LabVIEW. There is no information as to which object is the issue. Without being able to see the properties or methods, it's impossible to work out what is missing.

Error 1386, which I had to look up as "The specified .NET class is not available in LabVIEW.", implies that something is missing or broken in the .NET control. I expect it's missing a file.