I check whether a PDF document is searchable by trying to extract text from every single page in it.
But checking every page seems to take forever when the PDF contains 500 to 2,000 pages or more.
Is it possible for a PDF to contain text for one page but not in the rest?
What I am trying to do is this: if the first page of the PDF contains text, treat it as a searchable PDF, otherwise not.
Yes, it is entirely possible for a PDF to contain text on one page but not on the rest. You could well have a 500-page PDF that contains only images on the first 499 pages but contains text on the last page.
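Since text can sit on any page, a first-page check can miss it, but a full scan is only needed when no page has text at all: the check can stop at the first page that yields text. A minimal Python sketch of that early exit, assuming some PDF library supplies the text of each page lazily (the `page_texts` iterable and both function names are hypothetical and not part of any particular library):

```python
from typing import Iterable, Optional


def first_page_with_text(page_texts: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first page with non-whitespace text.

    `page_texts` is assumed to yield one extracted string per page, ideally
    lazily (e.g. a generator), so later pages are never extracted once a
    page with text has been found. Returns None if no page has text.
    """
    for index, text in enumerate(page_texts):
        if text and text.strip():
            return index
    return None


def is_searchable(page_texts: Iterable[str]) -> bool:
    """Treat the PDF as searchable if at least one page has text."""
    return first_page_with_text(page_texts) is not None
```

With a generator as input, an image-only document still costs a full pass, but a document with text on an early page returns almost immediately.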
Unless you want to open the PDF file yourself and scan it for text/text operations, you will need to use an existing third-party PDF library that allows you to extract text from a PDF.
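For a rough idea of what scanning the file yourself involves, here is a heuristic Python sketch using only the standard library. It looks for text-showing operators (`Tj`, `TJ`) inside `BT ... ET` text objects in the file's content streams. The function name `has_text_operators` is hypothetical, and the sketch assumes streams are either uncompressed or Flate-compressed; it ignores encryption, other filters and object streams, and it can misfire on binary data, so treat it as an illustration rather than a reliable detector:

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) containing a text-showing operator.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristically report whether any stream contains text operators."""
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # Try Flate (zlib) decompression; trailing bytes are tolerated.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            pass  # Not Flate data: fall back to the raw stream bytes.
        if TEXT_OP_RE.search(data):
            return True
    return False
```

A real PDF library handles the cases this sketch skips, which is why using one is usually the better choice.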
Also, see Ferruccio's response to a related question, which is to use the IFilter interface, specifically made for search indexing and text extraction.
Try this version of Searcharoo, which lets you search Word and PDF documents.
Related
With regard to JMeter document upload and download,
I would like to know whether we can validate and do an apples-to-apples content comparison (e.g. same text, spacing, lines, images) of a PDF document that was converted using LibreOffice/PDFBox in document upload scenarios, where the source documents are of different types such as DOC, DOCX, TXT, JPG, PNG, RTF, etc.
Scenario:
- Upload a DOCX document (the document should be converted to PDF format so the user can view it as a PDF)
- View the DOCX document in PDF format after the upload
- Check whether the DOCX contents (e.g. text, spacing, lines, images) are the same in the PDF document
Take a look at the Apache Tika project. If you add tika-app.jar to the JMeter classpath, you will be able to use the Document (text) "Field to Test" option of the Response Assertion,
so you can check the document content against reference text.
If that is not sufficient for your needs, take a look at JSR223 Test Elements, the Groovy language, and the Apache POI and Apache PDFBox APIs.
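Once text has been extracted from both the source document and the converted PDF, the comparison itself is small. The sketch below is in Python purely for illustration (inside JMeter this logic would more likely go in a JSR223 element written in Groovy), and the function names are hypothetical. It assumes both texts are already available as strings, and it compares text only: whitespace is collapsed, so differences in spacing, line breaks and images are not detected.

```python
import difflib
import re
from typing import Tuple


def normalize(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def compare_text(reference: str, extracted: str) -> Tuple[bool, float]:
    """Compare two texts after whitespace normalization.

    Returns (exact_match, similarity), where similarity is the
    difflib.SequenceMatcher ratio between 0.0 and 1.0.
    """
    ref, ext = normalize(reference), normalize(extracted)
    similarity = difflib.SequenceMatcher(None, ref, ext).ratio()
    return ref == ext, similarity
```

The similarity value is useful when conversion introduces small, expected differences and an exact match would be too strict.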
I have a Laravel project where I need to create a doc/docx document based on user input in CKEditor. I have previously worked with PHPWord, with which I can convert simple text input to a docx document. The problem with CKEditor is that it gives you HTML with inline CSS (which I need), and PHPWord cannot convert this to a docx.
I also tried to convert the HTML to Word via XML, but had no luck. I know there is a paid tool called phpdocx, but I am looking for a free solution.
Just a note: I can actually convert the HTML to PDF. But again, I have found no solution for going from PDF to doc.
So, can anyone help with converting HTML to Word or PDF to Word?
Thanks
I am using the ActivePDF tool to convert a few different file formats to PDF. Before this conversion, I need to find out how many PDF pages I will end up with. So, if my Word document would be converted to a 4-page PDF, I need to get that page count before the actual conversion.
How can I best achieve this?
Is it possible to add two HTML strings to one page?
When the PDF is rendered, I can only see the second HTML. It seems that the first one is overwritten.
The only way I have found to do this is to add each HTML to its own document and then merge the two documents together:
// Render each source into its own document object.
theDoc.AddImageUrl(Url1);
theDoc2.AddImageUrl(Url2);
// Append the second document to the first.
theDoc.Append(theDoc2);
I would like to have a two-page InDesign document. The first page has text and an image, and the second page has two images. The images should come from a CSV file that is data merged with the InDesign document. Is this achievable? I have only been able to do a data merge when I have one page, but then all pages have the same layout. Is this possible, and how do I do it? Thanks.
The solution is to work with XML files (File -> Import XML) instead of CSV. You can make any type of document and put XML objects in text fields, images, and so on. XML is much more flexible than CSV.
A CSV field whose name starts with @, such as @photo1, is treated as an image field and links to a file location. For instance, column F could contain:
@photo1
c:\foldername\filename.jpg
c:\folder2name\subfolder\otherfile.png
f:\file3.tiff