Is it true WebHDFS does not support SequenceFiles?
I can't find anything that says it does. I have the usual small file problem and believe SequenceFiles would work well enough, but I need to use WebHDFS. I need to create and then append to a SequenceFile via WebHDFS.
I think it's true. There is no web API to append to a sequence file.
However, you can append binary data, and as long as your sequence file is not block-compressed, you should be able to format your data on the client with relatively little effort: run your input through a SequenceFile writer on the client, then upload the output (either the whole file, or a slice representing the delta since the last append).
You can read more about the SequenceFile format here.
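To make the "run your input through a SequenceFile writer on the client, then upload" idea concrete, here is a rough sketch using the Hadoop Java client to format the records locally and plain HTTP for the WebHDFS call. The host, port, and paths are placeholders, authentication parameters (such as user.name) are omitted, and a later append would use op=APPEND with POST following the same two-step redirect pattern.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WebHdfsSeqFileUpload {

    public static void main(String[] args) throws IOException {
        // 1. Format the records locally with a SequenceFile writer.
        //    No block compression, so the layout is a header followed by self-contained records.
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("file:///tmp/batch.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE))) {
            writer.append(new Text("small-file-001"), new BytesWritable("payload 1".getBytes()));
            writer.append(new Text("small-file-002"), new BytesWritable("payload 2".getBytes()));
        }

        // 2. Upload the bytes through WebHDFS. CREATE for the first upload;
        //    a later APPEND uses op=APPEND with POST in the same way.
        byte[] bytes = Files.readAllBytes(Paths.get("/tmp/batch.seq"));
        String base = "http://namenode.example.com:9870/webhdfs/v1";   // placeholder cluster
        send("PUT", base + "/data/batch.seq?op=CREATE&overwrite=true", bytes);
    }

    // WebHDFS answers the first request with a 307 redirect to a datanode;
    // the actual payload is sent to the redirected location.
    static void send(String method, String url, byte[] body) throws IOException {
        HttpURLConnection nn = (HttpURLConnection) new URL(url).openConnection();
        nn.setRequestMethod(method);
        nn.setInstanceFollowRedirects(false);
        String location = nn.getHeaderField("Location");
        nn.disconnect();

        HttpURLConnection dn = (HttpURLConnection) new URL(location).openConnection();
        dn.setRequestMethod(method);
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream()) {
            out.write(body);
        }
        System.out.println("WebHDFS response: " + dn.getResponseCode());
        dn.disconnect();
    }
}
```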
Related
I'm creating my own custom binary file format.
I use the RIFF standard for encoding the data, and it seems to work pretty well.
But there are some additional requirements:
Binary files can be large, up to 500 MB.
Data must be saved into the binary file in real time, at intervals, whenever the data in the application changes.
The application may run in the browser.
The problem I face is that when I want to save, I have to read everything from memory and rewrite the whole binary file.
That isn't a problem while the data is small, but as it grows, the real-time saving feature becomes unscalable.
So the main requirements for this binary file are:
Being able to partially read the binary file (because the file is huge)
Being able to partially write changed data into the file without rewriting the whole file
A streaming protocol like .m3u8 is not an option; we can't split the file into chunks and point to them with separate URLs.
Any guidance on how to design a binary file format that scales in this scenario?
There was an answer here from another user that has since been deleted.
It seemed great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to support appending, then we are able to add whatever data we want without needing to rewrite the whole file.
This idea gave me a great starting point.
So I can append more and more changes at the end of the file,
then mark old chunks of data in the middle of the file as obsolete.
I can reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slots when I get a chance to rewrite the whole file.
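To illustrate the append-and-obsolete idea, here is a minimal sketch of such a chunk layout (shown in Java for illustration; the header fields and flag values are made up, and a RIFF-based format would use its own chunk IDs and sizes instead). Each chunk is a length, a flag byte, and a payload; an update appends a new chunk and flips the old chunk's flag in place, so nothing else in the file has to be rewritten until compaction.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Append-only chunk layout: each chunk is [int payloadLength][byte flags][payload].
// Updating a record means appending a new chunk and flipping the old chunk's
// flag to OBSOLETE in place -- no full rewrite needed until compaction.
public class AppendOnlyChunkFile {

    static final byte LIVE = 0;
    static final byte OBSOLETE = 1;

    // Appends a chunk at the end of the file and returns its offset.
    static long appendChunk(RandomAccessFile file, byte[] payload) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.writeInt(payload.length);
        file.writeByte(LIVE);
        file.write(payload);
        return offset;
    }

    // Marks an existing chunk obsolete by rewriting only its flag byte.
    static void markObsolete(RandomAccessFile file, long chunkOffset) throws IOException {
        file.seek(chunkOffset + Integer.BYTES); // skip the length field
        file.writeByte(OBSOLETE);
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw")) {
            long first = appendChunk(file, "version 1 of some record".getBytes());
            long second = appendChunk(file, "version 2 of some record".getBytes());
            markObsolete(file, first); // the old version stays on disk until compaction
        }
    }
}
```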
I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2 GB, from what I can see) I am not able to write the file to disk anymore and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably an IO limitation, but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. The definitive solution would probably be to switch to a proper database backend, but because of some limitations of the specific Ruby environment I am using (SketchUp), this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 characters of the name, create a file named with those characters, like rojj-datafile.pstore, and add the entry there. Any record whose name starts with 'rojj' goes in that file.
A more structured version is to take the first character as a directory and put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the PStores.
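The bucketing rule itself is independent of PStore and Ruby; here is a minimal sketch of the path derivation (shown in Java for illustration), where the prefix length and file-name pattern are arbitrary choices.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Prefix-bucketing rule: first character becomes a directory, first four
// characters become the file name, e.g. "Rojjanapoom" -> r/rojj-datafile.pstore.
public class BucketPath {

    static Path bucketFor(String name) {
        String normalized = name.toLowerCase(Locale.ROOT);
        String prefix = normalized.substring(0, Math.min(4, normalized.length()));
        String dir = normalized.substring(0, 1);
        return Paths.get(dir, prefix + "-datafile.pstore");
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("Rojjanapoom")); // prints r/rojj-datafile.pstore
    }
}
```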
I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split).
Options that I'm aware of:
[1] Force single file loads, extend RecordReader, and use the DistributedCache to run the decoder via the RecordReader
[2] Force single file loads, have the RecordReader return the single file, and use Hadoop streaming to decode each file
However, it looks like [2] will not work, since Pig will concatenate records before sending them to the STREAM operator (i.e. it will send multiple records at once).
[1] seems doable, just a little more work.
Is there a better way?
It seems like option [1] that you mentioned is the most viable. In addition to extending RecordReader, the appropriate InputFormat should also be extended, overriding isSplitable() to return false.
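As a rough sketch of what that could look like with the newer mapreduce API (class names are placeholders): the InputFormat refuses to split the file, and its RecordReader hands the entire file to the mapper as a single BytesWritable record.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One input file = one record: the framework never splits the file,
// and the record reader hands the whole file to the mapper as a single value.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each proprietary binary intact
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into a single value.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```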
I'm streaming data in a Pig script through an executable that returns an XML fragment for each line of input I stream to it. That XML fragment happens to span multiple lines, and I have no control whatsoever over the output of the executable I stream to.
In relation to Use Hadoop Pig to load data from text file w/ each record on multiple lines?, the answer suggested writing a custom record reader. The problem is that this works fine if you want to implement a LoadFunc that reads from a file, but to be able to use streaming it has to implement StreamToPig, and StreamToPig only lets you read one line at a time, as far as I understand.
Does anyone know how to handle such a situation?
If you are absolutely sure of that, then one option is to manage it internally in the streaming implementation. That is to say, you build up the tuple yourself, and when you hit whatever your desired size is, you do the processing and return a value. In general, eval funcs in Pig have this same issue.
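As a sketch of that buffering idea (the closing-tag name is an assumption, and how you hand the completed fragment back to Pig depends on whether you wrap this in a StreamToPig or an eval func):

```java
import java.util.ArrayList;
import java.util.List;

// Accumulate incoming lines until the closing tag of the XML fragment shows up,
// then hand the complete fragment to whatever does the real processing.
// A StreamToPig / eval func implementation would keep one instance of this
// and feed it each line it receives.
public class XmlFragmentAccumulator {

    private final String closingTag;                 // e.g. "</record>", an assumed tag name
    private final List<String> buffer = new ArrayList<>();

    public XmlFragmentAccumulator(String closingTag) {
        this.closingTag = closingTag;
    }

    // Returns the complete fragment once the closing tag arrives, null otherwise.
    public String accept(String line) {
        buffer.add(line);
        if (line.contains(closingTag)) {
            String fragment = String.join("\n", buffer);
            buffer.clear();
            return fragment;
        }
        return null;
    }
}
```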
Typically, the input file is capable of being partially read and processed by a Mapper function (as with text files). Is there anything that can be done to handle binaries (say images, or serialized objects) that would require all the blocks to be on the same host before the processing can start?
Stick your images into a SequenceFile; then you will be able to process them iteratively, using map-reduce.
To be a bit less cryptic: Hadoop does not natively know anything about text versus not-text. It just has a class that knows how to open an input stream (HDFS handles stitching together blocks on different nodes to make them appear as one large file). On top of that, you have a RecordReader and an InputFormat that know how to determine where in that stream records start, where they end, and how to find the beginning of the next record if you are dropped somewhere in the middle of the file. TextInputFormat is just one implementation, which treats newlines as the record delimiter. There is also a special format called a SequenceFile that you can write arbitrary binary records into and then get them back out. Use that.
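As a minimal sketch of the "stick your images into a SequenceFile" step, run once on a client machine (paths are placeholders): the file name becomes the key and the raw bytes become the value, and a map-reduce job can then iterate over the records with SequenceFileInputFormat.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of images into one SequenceFile:
// key = original file name, value = raw image bytes.
public class ImagesToSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("images.seq");                       // placeholder output path
        File[] images = new File("/local/images").listFiles();      // placeholder input directory
        if (images == null) {
            return; // directory does not exist or is not readable
        }
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File image : images) {
                byte[] bytes = Files.readAllBytes(image.toPath());
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```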