PDFBox 2.0.3/Java 7 - OOM Error when importing a page from one PDF to another

I have some code that reviews every page in a large PDF (20,000+ pages) and if that page contains a certain String, then it imports that page to another PDF.
Due to the number of occurrences, the PDF it's being imported into grows almost as large as the source PDF. When it gets too large, it bombs out with the exception below:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.io.ByteArrayOutputStream.toByteArray(Unknown Source)
    at org.apache.pdfbox.cos.COSOutputStream.close(COSOutputStream.java:87)
    at java.io.FilterOutputStream.close(Unknown Source)
    at org.apache.pdfbox.cos.COSStream$1.close(COSStream.java:223)
    at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:138)
    at org.apache.pdfbox.pdmodel.common.PDStream.<init>(PDStream.java:104)
    at org.apache.pdfbox.pdmodel.PDDocument.importPage(PDDocument.java:562)
    at ExtractPage.extractString(ExtractPage.java:57)
    at RunApp.run(RunApp.java:15)
I have researched the issue and it looks like the use of a temp file for streaming could resolve my problem. However, I just can't work out how to implement it in my code.
I do have a workaround where I would batch the pages into separate files and then merge them afterwards, using the solution mentioned here - however, it would certainly be much more efficient and cleaner to avoid this.
Please see a summary of my code below:
File sourceFile = new File("C:\\Temp\\extractFROM.pdf");
PDDocument sourceDocument = PDDocument.load(sourceFile, MemoryUsageSetting.setupTempFileOnly());
PDPageTree sourcePageTree = sourceDocument.getDocumentCatalog().getPages();
PDDocument tempDocument = new PDDocument(MemoryUsageSetting.setupTempFileOnly());
for (PDPage page : sourcePageTree) {
    // code to extract the page text and check whether it contains the String
    if (pageContainsString) { // pseudocode: true when the page text contains the String
        tempDocument.importPage(page);
    }
}
tempDocument.save(sourceFile);
Once it has exported around 7,000 or so pages, it bombs out at the tempDocument.importPage(page) line. It works perfectly for PDFs below that number.
Can anyone assist?

A program running into an OutOfMemoryError might have a memory leak, or it might simply require more memory to run properly.
Thus, one change to try in such a situation is to simply increase the memory assigned to the program. If the program then runs without an issue, you can consider this a fix. As long as the memory assigned does not become completely unreasonable, that is...
This appears to be the case here, as the OP confirmed:
I have increased the heap as a run configuration to 670 MB (the maximum I can secure with my client's equipment) and this has successfully resolved the issue - in fact, I tried it on a PDF twice the size of the original failing PDF, and it easily managed this as well.
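For reference, the heap limit is raised with the JVM's -Xmx option, either in the IDE run configuration or on the command line. A minimal sketch, assuming the driver class RunApp from the stack trace is launched directly:
java -Xmx670m RunApp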

Related

Visual Studio embed large resource file (almost 4gb)

I am trying to embed a large resource file (almost 4 GB); it's a .dat file. However, I am running into an issue where it throws an error:
"Error reading resource 'Sx64.x-none.dat' -- 'Specified argument was out of the range of valid values.'"
It appears there is a limitation to the size of an embedded resource in Visual Studio. Is there a way to increase the maximum size, or some other workaround for this? I am trying not to use a linked resource or have another file copied around with the exe.
While in the PE format specification the SizeOfImage value is a 32-bit unsigned integer and can theoretically handle up to 4 GiB, in practice the limit for an executable file is lower. A user here on Stack Overflow has tested this behavior. It is still possible to make a bigger executable work (on 64-bit Windows only), but the data must be kept outside of the image sections at the end of the file, so the loader won't attempt to allocate it. This is bad practice, and I suggest, as others have in the comments, shipping the data in a separate file alongside your executable.

Google Earth plugin - fetchKML() - How to purge cache?

I have a very similar scenario to the one described in
how to add dynamic kml to google earth?
Note: My KML file is fetched every single second. The KML file size is ~1 MB.
When fetching the KML updates, the URL is changed as suggested in the aforementioned thread.
var url = 'test.kml?rnd='+Math.random();
This works perfectly. On the other hand, it causes the geplugin.exe process to consume more and more memory, which leads to a crash of the plugin.
Has anyone run into the same issue? Is there a way to force the GE Plugin to purge the cache?
Is there a way to force the GE Plugin to purge the cache?
AFAIK there isn't any way to clear the cache from JavaScript or the API.
My KML file is fetched every single second. The KML file size is ~1 MB.
Fetching a circa 1 MB KML file every second smells. How are you calling fetchKml every second and adding the data to the plugin?
Without actually seeing your code it is impossible to say what is actually happening, but this sounds like the root of the problem.
On the other hand, it causes the geplugin.exe process to consume more and more memory, which leads to a crash of the plugin.
It sounds as if you are creating objects inside a tight, never-ending loop. Running out of memory would be expected in this case.
You should probably be using NetworkLinks to load the KML data rather than fetchKml, but again, without seeing your code it is impossible to say.
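For illustration, a rough sketch of the NetworkLink approach, assuming ge is the plugin instance from your init callback and the KML URL is a placeholder; the link refreshes itself on an interval instead of repeated fetchKml calls:
var networkLink = ge.createNetworkLink('');
networkLink.setFlyToView(false);
var link = ge.createLink('');
link.setHref('http://example.com/test.kml'); // placeholder URL
link.setRefreshMode(ge.REFRESH_ON_INTERVAL);
link.setRefreshInterval(1); // seconds
networkLink.setLink(link);
ge.getFeatures().appendChild(networkLink);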

Reading line by line from blob Storage in Windows Azure

Is there any way to read line by line from a text file in Blob Storage in Windows Azure?
Thanks
Yes, you can do this with streams, and it doesn't necessarily require that you pull the entire file, though please read to the end (of the answer... not the file in question) because you may want to pull the whole file anyway.
Here is the code:
StorageCredentialsAccountAndKey credentials = new StorageCredentialsAccountAndKey(
    "YourStorageAccountName",
    "YourStorageAccountKey"
);
CloudStorageAccount account = new CloudStorageAccount(credentials, true);
CloudBlobClient client = new CloudBlobClient(account.BlobEndpoint.AbsoluteUri, account.Credentials);
CloudBlobContainer container = client.GetContainerReference("test");
CloudBlob blob = container.GetBlobReference("CloudBlob.txt");
using (var stream = blob.OpenRead())
{
    using (StreamReader reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}
I uploaded a text file called CloudBlob.txt to a container called test. The file was about 1.37 MB in size (I actually used the CloudBlob.cs file from GitHub copied into the same file six or seven times). I tried this out with a BlockBlob which is likely what you'll be dealing with since you are talking about a text file.
This gets a reference to the blob as usual, then I call the OpenRead() method on the CloudBlob object, which returns a BlobStream that you can wrap in a StreamReader to get the ReadLine method. I ran Fiddler with this and noticed that it ended up making three additional calls to fetch more blocks to complete the file. It looks like the BlobStream has a few properties you can use to tweak the amount of read-ahead it does, but I didn't try adjusting them. According to one reference I found, the retry policy also works at the last-read level, so it won't attempt to re-read the whole thing, just the last request that failed. Quoted here:
Lastly, the DownloadToFile/ByteArray/Stream/Text() methods performs it’s entire download in a single streaming get. If you use CloudBlob.OpenRead() method it will utilize the BlobReadStream abstraction which will download the blob one block at a time as it is consumed. If a connection error occurs, then only that one block will need to be re-downloaded(according to the configured RetryPolicy). Also, this will potentially help improve performance as the client may not need cache a large amount of data locally. For large blobs this can help significantly, however be aware that you will be performing a higher number of overall transactions against the service. -- Joe Giardino
I think it is important to note the caution that Joe points out in that this will lead to an overall larger number of transactions against your storage account. However, depending on your requirements this may still be the option you are looking for.
If these are massive files and you are doing a lot of this, then it could mean many, many transactions (though you could see if you can tweak the properties on the BlobStream to increase the number of blocks retrieved at a time, etc.). It may still make sense to do a DownloadToStream on the CloudBlob (which pulls the entire contents down), then read from that stream the same way I did above.
The only real difference is that one pulls smaller chunks at a time and the other pulls the full file immediately. There are pros and cons to each, and it will depend heavily on how large these files are and whether you plan on stopping at some point in the middle of reading the file (such as "yeah, I found the string I was searching for!") or whether you plan on reading the entire file anyway. If you plan on pulling the whole file no matter what (because you are processing the entire file, for example), then just use DownloadToStream and wrap that in a StreamReader.
Note: I tried this with the 1.7 SDK. I'm not sure in which SDK version these options were first introduced.
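For completeness, a rough sketch of that whole-file alternative, using the same 1.7-era client library and the same test container/blob as above (a sketch only, not benchmarked here):
CloudBlob wholeBlob = container.GetBlobReference("CloudBlob.txt");
using (var memoryStream = new MemoryStream())
{
    // Single streaming GET that pulls the entire blob down at once
    wholeBlob.DownloadToStream(memoryStream);
    memoryStream.Position = 0;
    using (var reader = new StreamReader(memoryStream))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}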
In case anyone finds themselves here, the Python SDK for Azure Blob Storage (v12) now has the simple download_blob() method, which accepts two parameters - offset and length.
Using Python, my goal was to extract the header row from (many) files in blob storage. I knew the locations of all of the files, so I created a list of the blob clients - one for each file. Then, I iterated through the list and ran the download_blob method.
Once you have created a blob client (either directly via a connection string or using the BlobServiceClient.get_blob_client() method), just download the first (say) 4 KB to cover any long header rows, then split the text on the end-of-line character ('\n'). The first element of the resulting list will be the header row. My working code (just for a single file) looked like:
from azure.storage.blob import BlobServiceClient

MAX_LINE_SIZE = 4096  # You can change this..
my_blob_service_client = BlobServiceClient(account_url=my_url, credential=my_shared_access_key)
my_blob_client = my_blob_service_client.get_blob_client('my-container', 'my_file.csv')
file_size = my_blob_client.get_blob_properties().size
offset = 0
You can then write a loop to download the text line by line, by finding the byte offset of the first end-of-line and requesting the next MAX_LINE_SIZE bytes from there. For optimum efficiency it'd be nice to know the maximum length of a line, but if you don't, guess a sufficiently large value.
while offset < file_size - 1:
    downloader = my_blob_client.download_blob(offset=offset, length=MAX_LINE_SIZE)
    next_text_block = downloader.readall().decode('utf-8')
    line = next_text_block.split('\n')[0]
    offset += len(line) + 1  # advance past this line and its '\n' (assumes single-byte characters)
    # Do something with your line..
Hope that helps. The obvious trade-off here is network overhead: each call for a line of text is not fast, but it achieves your requirement of reading line by line.
To directly answer your question, you will have to write code to download the blob locally first and then read its content. This is mainly because you cannot just peek into a blob and read its content in the middle. If you had used Windows Azure Table Storage, you could certainly read specific content from the table.
As your text file is a blob located in Azure Blob storage, what you really need is to download the blob locally (as a local file or memory stream) and then read its content. You will have to download the blob in full or in part depending on the type of blob you uploaded. With page blobs you can download a specific range of content locally and process it. It would be worth knowing the difference between block and page blobs in this regard.
This is the code I used to fetch a file line by line. The file was stored in Azure Storage. The File service was used, not the Blob service.
//https://learn.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-files
//https://<storage account>.file.core.windows.net/<share>/<directory/directories>/<file>
public void ReadAzureFile() {
    CloudStorageAccount account = CloudStorageAccount.Parse(
        CloudConfigurationManager.GetSetting("StorageConnectionString"));
    CloudFileClient fileClient = account.CreateCloudFileClient();
    CloudFileShare share = fileClient.GetShareReference("jiosongdetails");
    if (share.Exists()) {
        CloudFileDirectory rootDir = share.GetRootDirectoryReference();
        CloudFile file = rootDir.GetFileReference("songdetails(1).csv");
        if (file.Exists()) {
            using (var stream = file.OpenRead()) {
                using (StreamReader reader = new StreamReader(stream)) {
                    while (!reader.EndOfStream) {
                        Console.WriteLine(reader.ReadLine());
                    }
                }
            }
        }
    }
}

What can lead to failures in appending data to a file?

I maintain a program that is responsible for collecting data from a data acquisition system and appending that data to a very large (size > 4GB) binary file. Before appending data, the program must validate the header of this file in order to ensure that the meta-data in the file matches that which has been collected. In order to do this, I open the file as follows:
data_file = fopen(file_name, "rb+");
I then seek to the beginning of the file in order to validate the header. When this is done, I seek to the end of the file as follows:
_fseeki64(data_file, _filelengthi64(data_file), SEEK_SET);
At this point, I write the data that has been collected using fwrite(). I am careful to check the return values from all I/O functions.
One of the computers (Windows 7, 64-bit) on which we have been testing this program intermittently shows a condition where the data appears to have been written to the file, yet neither the file's last-changed time nor its size changes. If any of the calls to fopen(), fseek(), or fwrite() fail, my program will throw an exception, which will result in aborting the data collection process and logging the error. On this machine, none of these failures seem to be occurring. Something that makes the matter even more mysterious is that, if a restore point is set on the host file system, the problem goes away only to re-appear intermittently at some future time.
We have tried to reproduce this problem on other machines (a Vista 32-bit operating system) but have had no success in replicating the issue (this doesn't necessarily mean anything since the problem is so intermittent in the first place).
Has anyone else encountered anything similar to this? Is there a potential remedy?
Further Information
I have now found that the failure occurs when fflush() is called on the file, and that the Win32 error returned by GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). Searching Google for this error leads to a bunch of reports related to "extents" for SQL Server files. I suspect that the file system is reporting exhaustion of some sort of journaling resource, and that this is because we are growing a large file by repeatedly opening it, appending a chunk of data, and closing it. I am now looking for an understanding of this particular error in the hope of coming up with a valid remedy.
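For reference, a minimal sketch of the kind of check that surfaces this error; the exact detection code is not shown above, and the error value is the one reported by GetLastError() as described:
#include <stdio.h>
#include <windows.h>

/* Flush the data file and report the underlying Win32 error on failure. */
int flush_data_file(FILE *data_file)
{
    if (fflush(data_file) != 0) {
        DWORD last_error = GetLastError();  /* 665 == ERROR_FILE_SYSTEM_LIMITATION */
        fprintf(stderr, "fflush failed, Win32 error %lu\n", (unsigned long)last_error);
        return 0;
    }
    return 1;
}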
The file append is failing because of a file system fragmentation limit. The question was answered in What factors can lead to Win32 error 665 (file system limitation)?

Page error 0xc0000006 with VC++

I have a VS 2005 application using C++. It basically imports a large XML file of around 9 GB into the application. After running for more than 18 hours it gave an exception: 0xc0000006 (in-page error). The virtual memory consumed is 2.6 GB (I have set the 3GB flag).
Does anyone have a clue as to what caused this error and what the solution could be?
Instead of loading the whole file into memory, you can use a SAX parser to load only part of the file into memory at a time.
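For illustration, a rough sketch of this streaming approach using Expat as the SAX-style parser; the element handlers are placeholders for whatever per-record processing you need, and error handling is kept minimal:
#include <stdio.h>
#include <expat.h>

static void XMLCALL start_element(void *userData, const XML_Char *name, const XML_Char **atts)
{
    /* inspect the element and keep only the fields you actually need */
}

static void XMLCALL end_element(void *userData, const XML_Char *name)
{
    /* a record is complete here: process it, then discard the per-record state */
}

static int parse_file(const char *path)
{
    char buf[64 * 1024];
    size_t len;
    int ok = 1;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    XML_Parser parser = XML_ParserCreate(NULL);
    XML_SetElementHandler(parser, start_element, end_element);

    while (ok && (len = fread(buf, 1, sizeof buf, f)) > 0)
        ok = (XML_Parse(parser, buf, (int)len, 0) != XML_STATUS_ERROR);
    if (ok)
        ok = (XML_Parse(parser, NULL, 0, 1) != XML_STATUS_ERROR); /* signal end of document */

    XML_ParserFree(parser);
    fclose(f);
    return ok;
}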
9 GB seems overly large to read in. I would say that even 3 GB is too large in one go.
Is your OS 64-bit?
What is the maximum pagefile size set to?
How much RAM do you have?
Were you running this in debug or release mode?
I would suggest that you try reading the XML in smaller chunks.
Why are you trying to read in such a large file in one go?
I would imagine that your application took so long to run before failing because it started to copy the file into virtual memory, which is basically a large file on the hard disk. Thus the OS is reading the XML from disk and writing it back onto a different area of the disk.
Edit - added text below:
Having had a quick peek at the Expat XML parser, it does look as if you're running into problems with stack or event handling; most likely you are adding too much to the stack.
Do you really need 3 GB of data on the stack? At a guess, I would say that you are trying to process an XML database file, but I can't imagine that you have a table row that large.
I think you should really use it to search for the key areas and discard what is not wanted.
I know nothing about the Expat XML parser other than what I have just read, but I would suggest that you are not using it in the most efficient manner.
