Handling large files with Azure search blob extractor

Handling large files with Azure search blob extractor - azure-blob-storage

Receiving errors from the Blob extractor that files are too large for the current tier, which is basic. I will be upgrading to a higher tier, but I notice that the max size is currently 256MB.
When I have PPTX files that are mostly video and audio, but have text I'm interested in, is there a way to index those? What does the blob extractor max file size actually mean?
Can I tell the extractor to only take the first X MB or chars and just stop?

There are two related limits in the blob indexer:
Max file size limit that you are hitting. If file size exceeds that limit, indexer doesn't attempt to download it and produces an error to make sure you are aware of the issue. The reason we don't just take first N bytes is because for parsing many formats correctly, the entire file is needed. You can mark blobs as skipable or configure indexer to ignore a number of errors if you want it to make forward progress when encountering blobs that are too large.
The max size of extracted text. In case file contains more text than that, indexer takes N characters up to the limit and includes a warning so you can be aware of the issue. Content that doesn't get extracted (such as video, at least today) doesn't contribute to this limit, of course.
How large are the PPTX you need indexed? I'll add my contact info in a comment.

Related

Error while extracting CSV from Excel file (Apache NiFi)

I'm using Apache NiFi version 1.16.3 and trying to extract .csv from Excel file (.xlsx) with ConvertExcelToCSVProcessor. Size of the .xlsx is 17 MB, but I can't share it here.
I receive an error:
org.apache.nifi.processors.poi.ConvertExcelToCSVProcessor.error
Tried to allocate an array of length 101,695,141, but the maximum length for this record type is 100,000,000. If the file is not corrupt or large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type. As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
What can I do with it? It says
consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
as temprorary workaround, but where can I find this option? Seems like I should write custom processor with this option or what?

UIPath truncating strings at 10k (10,000) characters

We are running into an issue with UIPath that recently started. It's truncating strings, in our case a base 64 encoded image, at 10k characters. Does anyone know why this might be happening, and how we can address it?
The truncation appears to be happening when loading the text variable base64Contents. Seen in the photo below.
base64Contents = Convert.ToBase64String(byteArray);

As per the UiPath documentation there is a limit of 10,000 characters. This is due to 'the default communication channel between the Robot Executor and the Robot Service has changed from WCF to IPC'
https://docs.uipath.com/activities/docs/log-message
Potential Solution
A way round this could be to write your string to a txt file rather than output it as a log. that way you are using a different activity and the 10,000 character limit may not apply.

Force S3 multipart uploads

This is a follow-up to knt's question about PUT vs POST, with more details. The answer may be independently more useful to future answer-seekers.
can I use PUT instead of POST for uploading using fineuploader?
We have a mostly S3-compatible back-end that supports multipart upload, but not form POST, specifically policy signing. I see in the v5 migration notes that "even if chunking is enabled, a chunked upload request is only sent to traditional endpoints if the associated file must be broken into more than 1 chunk". How is the threshold determined for whether a file needs to be chunked? How can the threshold be adjusted? (or ideally, set to zero)
Thanks,

Fine Uploader will chunk a file if its size is less than the number of bytes specified in the chunking.partSize option (default value is: 2000000 bytes). If your file is smaller than the size specified in that value, then it will not be chunked.
To effectively set it to "zero", you could just increase the partSize to an extremely large value. I also did some experimenting, and it seems like a partSize of -1 will make Fine Uploader NOT chunk files at all. AFAIK, that is not supported behavior, and I have not looked at why that is even possible.
Note that S3 requires chunks to be a minimum of 5MB.
Also, note that you may run into limitations on request size as imposed by certain browsers if you make partSize extremely large.

Sending file in chunk always crashes at 10th chunk

I have a strange problem with my ultra-simple method. It sends a file in 4MB chunks to foreign API. The thing is, always at 10th chunk, foreign API crashes.
It's impossible to debug the API error but it says: The specified blob or block content is invalid (That API is Azure Storage API but it's not important right now, the problems lays clearly on my side).
Because it crashes at 10th element (which is 40th megabite) it's a pain to test it and debugging it "by hand" takes a lot of time (partly in cause of my bad internet connection speed) i decided to share my method
def upload_chunk()
file_to_send = File.open('file.mp4', 'rb')
until file_to_send.eof?
#content = file_to_send.read 4194304 # Get 4MB chunk
upload_to_api(#content) # Line that produces the error
end
end
Can you see anything, that can be wrong with this code? Please have in mind that it ALWAYS crashes at 10th time and works perfectly for files of size lesser than 40 MB.

I did a search for ruby "The specified blob or block content is invalid" and found this as the second link (first was this page):
http://cloud.dzone.com/articles/azure-blob-storage-specified
This contains:
If you’re uploading blobs by splitting blobs into blocks and you get the above mentioned error, ensure that your block ids of your blocks are of same length. If the block ids of your blocks are of different length, you’ll get this error.
So my first guess is that the call to upload_to_api is assigning ids from 1-9, then when it goes to 10 the id length increases causing the problem.
If you don't have control over how the ids are generated, then perhaps you can set the amount of bytes read on each iteration to be no more than 1/9 of the total file size.

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4500 files and 500mb data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the 4,500ish files
After the initial creation the images files are read only with the exception of deletion.
The small db is used regularly when the container is accessed.
Tar, Zip and the likes are all too slow (even with 0 compression). Slow is subjective I know, but to untar a container of this size is over 20 seconds.
Any thoughts?

As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.

Three things.
1) What Timothy Walters said is right on, I'll go in to more detail.
2) 4500 files and 500Mb of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files in to a single blob. While you are concatenating them, you track their filename, the file length, and the offset that the file starts within the blob. You write that information out in to a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, then, you concatenate the two files together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the begining of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, append at the very end, the offset to the start of the TOC. Then you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1K byte in size), then you can easily perform a binary search on the TOC. Simply fill each block with the File information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC, start in the middle, read the first file name, and go from there. Soon, you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well, disks like work with regular sized blocks of data, this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look in to some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.

Working on the assumption that you're only going to need read-only access to the files why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length. All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language but it's pretty straight forward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!

An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.

First, thank-you for expanding your question, it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive of the capabilities of SQLite but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend to put files inside the database, but given that the total size of all images is around 500MB for 4500 images you're looking at a little over 100K per image right? If you're using a dynamic path to store the images then in a slightly more normalized database you could have a "ImagePaths" table that maps each path to an ID, then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OSX without issue. You can simply rely on your SQLite engine to provide the performance and compatability you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
I am a little concerned though that this sounds like premature optimization, as you really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps if you have both Mac and PC versions) use a simple repository or similar and then you can change the storage/retrieval method at will without any implication to your application.

Check Solid File System - it seems to be what you need.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio