Apache Tika performance impact due to Tesseract

We are using Tika 2.4.0 to scan hundreds of files and extract their content. The files are a mix of PDFs, documents (.docx), and plain text (.txt) files.
PDFs and .docx files may contain only text, only images, or a mix of both.
Since we use OCR to extract content from images, we see higher memory and CPU consumption (although this is expected behavior). But because of this, processing of the other files is also impacted by the resource contention.
Is there a way to run Tesseract on a different machine and call it only when needed? I'm a little skeptical on this point, as we provide the Tesseract path in the Tika config and it is not a service call.
Or is there any other recommended solution to overcome this performance impact, so that while one file is being OCR'ed, other files are still processed?
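
One option (a sketch of a possible setup, not something stated in the question) is to offload OCR to a separate machine by running tika-server there with Tesseract installed, and calling it over HTTP only for the documents that actually contain images, while the local Tika instance keeps OCR disabled. A minimal Java 11+ sketch, assuming a tika-server 2.x instance on a placeholder host and its default port 9998:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class RemoteTikaClient {

        // Placeholder host running "java -jar tika-server-standard-2.4.0.jar"
        private static final String TIKA_SERVER = "http://ocr-box.example.com:9998";

        // PUT the file to the /tika endpoint and get plain-text content back.
        // Only route image-heavy PDFs/DOCX here; parse everything else locally.
        public static String extractRemotely(Path file) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(TIKA_SERVER + "/tika"))
                    .header("Accept", "text/plain")
                    .PUT(HttpRequest.BodyPublishers.ofFile(file))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }

This keeps the CPU and memory cost of Tesseract off the machine that handles the bulk of the text-only files; whether it is worth the extra network hop depends on how many files actually need OCR.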


Is there a way to tag video files independent of any tool?

We are in the 21st century and there is still no good way to tag photos and videos? There is always a dependency on some tool... Is there no way to make the file autonomous with respect to its tags?
Video files, for example, are not friendly to tags. Some video formats do not allow tagging at all. Some tools keep the metadata in their own external representation, and when you copy the original file to a new destination, the metadata of the file at the destination is lost. Also, this metadata may only be seen by that proprietary tool and is not visible to other tools (e.g. tags added by Adobe products are not visible/searchable in Windows Explorer).
Is there a universal way to tag any file including video files so that
searching for files having a certain tag is possible in any tool
when a file is copied, the tags are transferred with it
when the file is edited in any tool and re-saved, the tags are not lost...?
There is no universal way at this point, if there ever will be one.
Probably the closest we have is the file tagging provided by popular OSes, based on a file-system feature called 'forking'. By this means Windows and Mac provide the ability to easily add metadata (including keywords) to any file on the file system, without changing the file's content. One serious drawback of this feature is that it does not cross file-system boundaries, i.e. if you simply upload a file to the web, or copy it to a different type of filesystem, the metadata will be lost. There are ways to copy such metadata, but that requires consideration and use of appropriate tools.
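
To make the 'forking' idea concrete: from code, this kind of out-of-band metadata is what Java exposes as user-defined file attributes (alternate data streams on NTFS, extended attributes on Linux). A rough sketch, assuming a filesystem that supports the feature; as noted above, the attribute does not survive leaving that filesystem:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.UserDefinedFileAttributeView;

    public class FileTagger {

        // Attach a "tags" attribute to any file without touching its content.
        public static void writeTags(Path file, String tags) throws Exception {
            UserDefinedFileAttributeView view =
                    Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
            if (view == null) {
                throw new UnsupportedOperationException("filesystem has no user attribute support");
            }
            view.write("tags", ByteBuffer.wrap(tags.getBytes(StandardCharsets.UTF_8)));
        }

        // Read the "tags" attribute back.
        public static String readTags(Path file) throws Exception {
            UserDefinedFileAttributeView view =
                    Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
            ByteBuffer buf = ByteBuffer.allocate(view.size("tags"));
            view.read("tags", buf);
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }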

Creating a variable zip archive on the fly, estimating file size for content-length

I'm maintaining a site where users can place pictures and other files in a kind of shopping cart. After selecting the various contents he wishes to download, the user can check out. Until now, an archive was generated beforehand and the user got an email with a link to the file once generation had finished.
I've changed this now by using Web API and a push stream to generate the archive directly on the fly. My code offers either a zip, a zip64 or a .tar.gz dynamically, depending on the estimated file size and the operating system. For performance reasons compression is set to best speed ('none' would make the zip archives incompatible with Mac OS, and the gzip library I'm using doesn't offer none).
This is working great so far; however, the user no longer gets a progress bar while downloading the file because I'm not setting the content-length. What are the best ways to get around this? I've tried to guess the resulting file size, but either the browsers cancel the downloads too early, or they stop at 99.x% and wait for the missing bytes resulting from the difference between the estimated and actual file size.
My first thought was to always guess the resulting file size a little too big and fill the rest with zeros?
I've seen many file hosters offering the possibility to select files from a folder and put them into a zip file, and all of them show the correct (?) file size and a progress bar. Any best practices? Thanks!
These are just some thoughts, maybe you can use them :)
With Web API/HTTP, the normal approach is for the response to contain the length of the file. Since the response is only received after the call has finished, the time spent generating the file will not show as a progress bar in any browser, beyond a Windows wait cursor.
What you could do is use a two-step approach.
Generating the zip file
Create a duplex-like channel using SignalR to give feedback on the file generation.
Downloading the zip file
After the file is generated you should know the file size, and the browser will show a progress bar while downloading.
It looks like this problem should have been addressed by chunk extensions, but that apparently never got further than a draft.
So I guess you are stuck with either no progress or sending the file size up front.
It seems that generating zip archives with an exact, predictable size is trickier than the zero-padding approach.
Another option might be to pre-generate the zip file without storing it just to determine the size.
But I am just wondering: why not just use tar? It has no compression, so it is easy to determine its final size up front from the sizes of the individual files, and it should also be supported by both OS X and Linux. Windows should be able to handle uncompressed zip archives, so a similar trick might work there as well.
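
To illustrate why tar makes the content-length easy, here is a rough sketch of the size calculation, assuming a plain ustar archive with the usual defaults (512-byte blocks, two zero blocks at the end, output padded to a 10240-byte record); entries with very long names need extra header blocks and are not covered:

    import java.util.List;

    public class TarSizeEstimator {

        private static final long BLOCK = 512;     // tar block size
        private static final long RECORD = 10240;  // default record size (20 blocks)

        // Size of an uncompressed tar holding files with the given sizes.
        public static long estimateTarSize(List<Long> fileSizes) {
            long total = 0;
            for (long size : fileSizes) {
                total += BLOCK;                               // one header block per entry
                total += (size + BLOCK - 1) / BLOCK * BLOCK;  // data rounded up to a block
            }
            total += 2 * BLOCK;                               // end-of-archive marker
            return (total + RECORD - 1) / RECORD * RECORD;    // pad to record size
        }
    }

Whether the padding to the 10240-byte record is applied depends on the tar writer you use, so it is worth checking your library's output once against this formula before trusting it as a content-length.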

Get website image info/metadata

So I have scoured Google for mention of anybody trying to use PowerShell to get information about files from a URL/URI, but with no luck. I have found ways to get metadata of files from a local source, but nothing for an image hosted on a website.
What I want to do:
I have a list of image URLs, e.g. www.website/images/img.jpg, and want to grab the metadata without having to download the entire image. I would then store this info and export it to a CSV to look over later.
So far my code has been resigned to System.Net.WebClient.DownloadFile() and then operating on the files locally. Is it possible to do this remotely?
I suppose you're referring to EXIF metadata. Those are embedded in the file, so unless the remote host provides an API that exposes this information, you must download the file to be able to read it.
Judging from what I gleaned from the standard, the information is stored at the beginning of the file, so you could try to download just the first couple hundred bytes. However, the size of the EXIF header doesn't seem to be fixed, so you'll want to retrieve a large enough chunk. Also, standard EXIF parsers might not work on incomplete images, so you might need to write your own parser.
All in all I'd say downloading the entire file and extracting the information with standard tools is your best option.
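
If you do want to experiment with the partial-download idea despite that, a rough sketch (in Java rather than PowerShell) of what it could look like is below; the 64 KB chunk size and the naive marker scan are assumptions, and servers that ignore Range requests will simply send the whole file:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PartialExifProbe {

        // Fetch only the first 64 KB of a remote image via an HTTP Range request
        // and check for a JPEG APP1/Exif segment (marker FF E1 followed by "Exif\0\0").
        public static boolean looksLikeItHasExif(String imageUrl) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
            conn.setRequestProperty("Range", "bytes=0-65535");
            byte[] head;
            try (InputStream in = conn.getInputStream()) {
                head = in.readNBytes(65536);
            } finally {
                conn.disconnect();
            }
            // A JPEG starts with the SOI marker FF D8.
            if (head.length < 4 || (head[0] & 0xFF) != 0xFF || (head[1] & 0xFF) != 0xD8) {
                return false;
            }
            for (int i = 2; i + 8 < head.length; i++) {
                if ((head[i] & 0xFF) == 0xFF && (head[i + 1] & 0xFF) == 0xE1
                        && head[i + 4] == 'E' && head[i + 5] == 'x'
                        && head[i + 6] == 'i' && head[i + 7] == 'f') {
                    return true;
                }
            }
            return false;
        }
    }

Extracting the actual tag values from that chunk still needs a proper EXIF parser, which is why the full download plus a standard tool is usually the simpler route.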

Preparing PDFs for use as Prawn templates

We've got a system that takes in a large variety of PDFs from unknown sources, and then uses them as templates for new PDFs generated by Prawn.
Occasionally some PDFs don't work as templates for Prawn- they either trigger a generic Prawn error ("Prawn::Errors::TemplateError => Error reading template file. If you are sure it's a valid PDF, it may be a bug.") or the resulting PDF comes out malformed.
(It's a known issue that some PDFs don't work as templates in Prawn, so I'm not trying to address that here: [1], [2].)
If I take any of the problematic PDFs, and manually re-save them on my Mac using Preview > Save As [new PDF], I can then always use them as Prawn templates without any problem.
My question is, is there some (open source) server-side utility I can use that might be able to do the same thing- i.e. process problematic PDFs into something Prawn can use?
Yarin, it at least partially depends on why the PDFs don't work in the first place. If you can use them after re-saving with Apple's (quite bad) preview PDF code, you should be able to get the same result using a number of different tactics:
-) Use an actual PDF library to open and save the PDF files (libraries from Adobe and Global Graphics come to mind). These are typically commercial products but (I know the Adobe library the best) they do allow you to open a file and save it, performing a number of optimisations in the process. The Adobe libraries are currently licensed through a company called DataLogics (http://www.datalogics.com)
-) Use a commercial product that embeds these libraries. callas pdfToolbox comes to mind (warning, I'm affiliated with this product). This basically gives you the same possibilities as the previous point, but in a somewhat easier to use package (command-line use for example).
-) Use an open source product. I'm not very well positioned to provide useful links for that.
There is another approach that may work depending on your workflow and files. In graphic arts, bad files are sometimes "made better" by a process called re-distilling: you basically convert the PDF file to PostScript and re-distill the PostScript into PDF again. Because this rewrites the whole file structure, it often fixes fundamental problems. However, it also comes with risks, as you're going through a different file format. Tools such as Ghostscript (watch the licensing conditions) may allow you to do this.
Given that your files seem to be fixed simply by using Preview, I would think a redistilling approach would be overly dangerous and overkill. I would look into finding a good PDF library that can automatically open and save your files.
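
For a self-hosted, open-source take on the "open and save" route, a common approach (my suggestion, not confirmed by the answer above) is to run each incoming PDF through Ghostscript's pdfwrite device, which rebuilds the file structure much like re-saving does. A minimal sketch, assuming the gs binary is on the server's PATH:

    import java.io.IOException;

    public class PdfNormalizer {

        // Rewrite a PDF with Ghostscript's pdfwrite device before handing it
        // to Prawn as a template. Returns true if Ghostscript exited cleanly.
        public static boolean normalize(String inputPdf, String outputPdf)
                throws IOException, InterruptedException {
            Process gs = new ProcessBuilder(
                    "gs", "-dQUIET",
                    "-sDEVICE=pdfwrite",
                    "-o", outputPdf,
                    inputPdf)
                    .inheritIO()
                    .start();
            return gs.waitFor() == 0;
        }
    }

Note that Ghostscript is AGPL-licensed, so the licensing conditions mentioned above are worth checking before adopting this server-side.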

Merging PDFs skipping corrupted PDFs

Currently I am using Ghostscript to merge a list of PDFs which are downloaded. The issue is that if any one of the PDFs is corrupted, it stops the merging of the rest of the PDFs.
Is there any command I can use so that it will skip the corrupted PDFs and merge the others?
I have also tested with pdftk but face the same issue.
Or is there any other command line based pdf merging utility that I can use for this?
You could try MuPDF; you could also try using MuPDF's 'clean' to repair files before you try merging them. However, if a PDF file is so badly corrupted that Ghostscript can't even repair it, that probably won't work either.
There is no facility to ignore PDF files which are so badly corrupted they can't even be repaired. It's hard to see how this could work in the current scheme, since Ghostscript doesn't 'merge' files anyway; it interprets them, creating a brand new PDF file from the sequence of graphic operations. When a file is badly enough corrupted to provoke an error, we abort, because we may have already written whatever parts of the file we could, and if we tried to ignore the error and continue, both the interpreter and the output PDF file would be in an indeterminate state.
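
Given that, a practical workaround (my suggestion, not something from the answer) is to test each PDF on its own first and only pass the ones Ghostscript can interpret to the merge step. A rough sketch, assuming gs is on the PATH:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SafePdfMerge {

        // Run Ghostscript with the given arguments; true if it exited cleanly.
        private static boolean runGs(List<String> args) throws IOException, InterruptedException {
            List<String> cmd = new ArrayList<>(List.of("gs", "-dBATCH", "-dNOPAUSE", "-dQUIET"));
            cmd.addAll(args);
            return new ProcessBuilder(cmd).inheritIO().start().waitFor() == 0;
        }

        // Cheap corruption check: interpret the file alone, discarding output.
        static boolean isReadable(String pdf) throws IOException, InterruptedException {
            return runGs(List.of("-sDEVICE=nullpage", pdf));
        }

        // Merge only the PDFs that pass the individual check.
        public static void mergeGoodPdfs(List<String> inputs, String output)
                throws IOException, InterruptedException {
            List<String> good = new ArrayList<>();
            for (String pdf : inputs) {
                if (isReadable(pdf)) {
                    good.add(pdf);
                } else {
                    System.err.println("Skipping unreadable PDF: " + pdf);
                }
            }
            List<String> args = new ArrayList<>(List.of("-sDEVICE=pdfwrite", "-o", output));
            args.addAll(good);
            runGs(args);
        }
    }

This doubles the interpretation work, and a file that fails the check may still have been partially usable, so it trades speed and completeness for the ability to skip the bad inputs automatically.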
