Reasons to use SequenceFile instead of text file in Hadoop

What are the reasons to use SequenceFile instead of a text file?
I'm guessing that they are good because input/output comes down to serialization instead of parsing, which matters if an object needs to be read multiple times.
Also, I read that SequenceFiles support compression, so they take less space, and that they are a good way to aggregate many small files into one large one.
Are these arguments valid, and are there any other reasons?

Binary data (as in SequenceFiles) is usually more compact than text data (as in TextFiles), even without explicit compression, so less data needs to be read from and written to disk. The actual space savings depend on the data being written.
Reading binary data is also more CPU-efficient than parsing strings.
However, SequenceFiles cannot be read by humans, and they are bound to a specific object type/class, whereas text data can be interpreted in different ways as needed.
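For illustration, a minimal sketch of the binary read/write path with the Hadoop SequenceFile API (the path and record contents are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.seq");   // example path

        // Write key/value pairs; each record is serialized, not formatted as text.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }

        // Read the records back; values are decoded directly, no string parsing.
        IntWritable key = new IntWritable();
        Text value = new Text();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) {
                // process(key, value)
            }
        }
    }
}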

Related

Are files data structures?

I'm new to programming and I'd like to know if files such as BMPs, MP3s, EXEs are considered to be data structures as well.
No. They are data, possibly compressed, meant to be read by any program that understands their format.
But they are structured data, which means there is a specific way your program should read them. For example, to read a BMP you need to know how to read the image's width and height from the header, then start reading its pixels, looping until the data is exhausted (see the sketch below).
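A minimal sketch of that idea in Java, assuming the common BITMAPINFOHEADER layout (width stored as a little-endian int at byte offset 18, height at offset 22):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BmpHeader {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "r")) {
            byte[] header = new byte[26];            // file header + start of info header
            f.readFully(header);
            ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            int width  = buf.getInt(18);             // biWidth
            int height = buf.getInt(22);             // biHeight
            System.out.println(width + " x " + height);
            // ...then continue reading pixel rows until the data is exhausted...
        }
    }
}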
There are more complex structured formats, such as EXEs, which your operating system reads, or MP3s, where you have to run some algorithms to make the data understandable.
Data structures, by contrast, are standard ways of organizing how you store and read data for specific situations, such as a command history.
The well-known commands Ctrl+Z and Ctrl+Shift+Z (undo and redo) are typically implemented with stacks: each command is pushed on top of the previous one, and undoing pops the topmost command off the stack and executes its undo function.
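A minimal sketch of that stack-based undo/redo in Java (the Command interface here is hypothetical):

import java.util.ArrayDeque;
import java.util.Deque;

interface Command { void execute(); void undo(); }

class History {
    private final Deque<Command> undoStack = new ArrayDeque<>();
    private final Deque<Command> redoStack = new ArrayDeque<>();

    void run(Command c) {                   // executing a new command clears the redo history
        c.execute();
        undoStack.push(c);
        redoStack.clear();
    }

    void undo() {                           // Ctrl+Z: pop the topmost command and revert it
        if (!undoStack.isEmpty()) {
            Command c = undoStack.pop();
            c.undo();
            redoStack.push(c);
        }
    }

    void redo() {                           // Ctrl+Shift+Z: re-apply the last undone command
        if (!redoStack.isEmpty()) {
            Command c = redoStack.pop();
            c.execute();
            undoStack.push(c);
        }
    }
}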
Not really. From Wikipedia: "In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data."
You normally read or write such files as a whole and do not perform local modifications. That said, for some formats (TIFF images, for instance), the individual data fields can be accessed directly rather than sequentially.

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS. This is a kind of archive for historical data. The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists, and a Map.
The idea is that the data will be scanned from time to time using a MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big, I have to add new servers every few weeks, and the costs are constantly growing. The other is to make it easier to add new fields; in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results: it allows new fields to be added easily, and the size of the data was reduced by 20% (the reason is that many fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to put Lists and Maps into the EnumMap as well. It's fine for data of a single type, but trying to combine all of the fields is a nightmare.
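(For context, a rough sketch of the kind of EnumMap-backed Writable described above; the field names and the restriction to long values are made up, and the real class also has to handle Lists and Maps:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.EnumMap;
import java.util.Map;
import org.apache.hadoop.io.Writable;

public class SparseRecord implements Writable {
    enum Field { USER_ID, TIMESTAMP, SCORE }          // hypothetical field names

    private final EnumMap<Field, Long> fields = new EnumMap<>(Field.class);

    public void set(Field f, long v) { fields.put(f, v); }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(fields.size());                  // only present fields are written
        for (Map.Entry<Field, Long> e : fields.entrySet()) {
            out.writeByte(e.getKey().ordinal());       // the enum ordinal identifies the field
            out.writeLong(e.getValue());
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        fields.clear();
        int n = in.readUnsignedByte();
        for (int i = 0; i < n; i++) {
            Field f = Field.values()[in.readUnsignedByte()];
            fields.put(f, in.readLong());
        }
    }
}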
So I have thought about other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as with SequenceFiles and the custom class before the EnumMap experiment. They resolve the problem of adding new fields without the need to rewrite old data, but I feel there is more potential to optimize the size of the data.
One more thing I am still going to check is the time it takes to load the data (this will also tell me whether it's okay to use bzip2 compression or whether I have to go back to gzip because of performance), but before I proceed I was wondering if someone could suggest some other solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row-oriented and has tens of different kinds
of fields. Some of them are Strings, some are Integers, and there are
also a few Floats, Shorts, ArrayLists, and a Map.
None of the types you have mentioned here are any more complex than the data types supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals - one is to optimize the size of the data, as it is
getting really big, I have to add new servers every few weeks, and
the costs are constantly growing.
By adding servers, are you also adding compute? Storage should be relatively cheap, and I'm wondering if you are adding compute with your servers, which you don't really need. You should only be paying to store and retrieve data. Consider a simple object store like S3 that only charges you for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1000 requests are free and it costs only ~$10 for a terabyte of storage per month.
The other thing is to make it easier to add new fields - in the current
state, if I wanted to add a new field I would have to rewrite
all of the old data.
If you have a use case where you will be writing to the files more often than reading, I'd recommend not storing the files on HDFS; it is better suited for write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution; it's a better fit if you have complex structures where you need to specify the schema and want easy serialization/deserialization with Java objects.
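A hedged sketch of what Avro's schema evolution buys you: old files can be read with a newer reader schema that adds a field with a default value, so nothing has to be rewritten (the schema, field names, and file path below are made up):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadWithNewSchema {
    public static void main(String[] args) throws Exception {
        // Newer schema: adds an optional "source" field with a default.
        String newSchemaJson =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"source\",\"type\":[\"null\",\"string\"],\"default\":null}"
          + "]}";
        Schema readerSchema = new Schema.Parser().parse(newSchemaJson);

        // The writer schema is taken from the file header; the reader schema resolves against it.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(null, readerSchema);
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("old-events.avro"), datumReader)) {
            for (GenericRecord rec : reader) {
                // rec.get("source") is null (the default) for records written with the old schema
            }
        }
    }
}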
One more thing I am still going to check is the time it takes
to load the data (this will also tell me whether it's okay to use bzip2
compression or whether I have to go back to gzip because of performance)
Bzip2 has the highest compression but is also the slowest, so I'd recommend it if the data isn't used or queried frequently. Gzip has compression comparable to bzip2 but is faster. Also consider Snappy compression, which balances performance and storage and supports splittable files for certain file formats (Parquet or Avro), which is useful for MapReduce jobs.
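If it helps, a small sketch of how such a codec comparison could be run when writing Parquet from Spark (the paths are placeholders; gzip and snappy are used here because both are supported Parquet codecs):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CompressionComparison {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("codec-test").getOrCreate();
        Dataset<Row> df = spark.read().parquet("hdfs:///archive/sample");   // placeholder input

        // Write the same data with different codecs, then compare output sizes and scan times.
        for (String codec : new String[] {"gzip", "snappy"}) {
            df.write()
              .option("compression", codec)      // per-write Parquet compression codec
              .mode("overwrite")
              .parquet("hdfs:///archive/codec-test/" + codec);
        }
        spark.stop();
    }
}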

Avro size too big?

I am doing some research on the best data exchange format for my company. For the moment I am comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time, and my impression is that Avro is much bigger than Protocol Buffers when transporting requests one by one. In the Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I missed something. It's written everywhere that Avro is smaller, but for us it seems we would have to put a thousand requests in one file before Protocol Buffers' and Avro's sizes become equal.
Did I miss something, or are my thoughts correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
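As part of such a test, one way to quantify the overhead the question describes is to write the same record once as a bare Avro binary encoding and once as an Avro data file, which embeds the full schema. A hedged sketch (the schema and values are made up):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSizeCheck {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Req\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 42L);
        rec.put("name", "example");

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);

        // 1) Bare binary encoding: just the field values, no schema.
        ByteArrayOutputStream bare = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(bare, null);
        writer.write(rec, enc);
        enc.flush();

        // 2) Data file: container header + full schema + the same record.
        ByteArrayOutputStream container = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> dfw = new DataFileWriter<>(writer)) {
            dfw.create(schema, container);
            dfw.append(rec);
        }

        System.out.println("bare encoding: " + bare.size() + " bytes");
        System.out.println("data file:     " + container.size() + " bytes");
    }
}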

Preserving Mathematica expressions in a textual form

What is the proper way to convert Mathematica expressions losslessly to a string (a string kept in memory, not exported to a file)?
I am looking for a textual representation that
will preserve all information, including keeping special (and possibly atomic) objects, such as SparseArray, Graph, Dispatch, CompiledFunction, etc. intact. E.g. cycling a SparseArray through this representation should keep it sparse (and not convert it to a normal list).
is relatively fast to cycle through (convert back and forth).
Is ToString[expr, FullForm] sufficient for this? What about ToString[expr, InputForm]?
Note 1: This came up while trying to work around some bugs in Graph where the internal representation gets corrupted occasionally. But I'm interested in an answer to the general question above.
Note 2: Save will surely do this, but it writes to files (it is probably possible to work around that using streams), and it only writes definitions associated with symbols.
If you are not going to perform some string manipulations on the resulting string, you may consider Compress and Uncompress as an alternative to ToString. While I don't know about cases where ToString[expr,InputForm] - ToExpression cycles would break, I can easily imagine that they exist. The Compress solution seems more robust, as Uncompress invoked on Compress-ed string is guaranteed to reconstruct the original expression. An added advantage of Compress is that it is pretty memory-efficient - I used it a few times to save large amounts of numerical data in the notebook, without saving them to disk.
Should Compress exhibit round-tripping problems, ExportString and ImportString might present a useful alternative -- particularly, if they are used in conjunction with the Mathematica-native MX format:
string = ExportString[originalExpr, "MX"]
recoveredExpr = ImportString[string, "MX"]
Note that the MX format is not generally transferable between Mathematica instances, but that might not matter for the described in-memory application.
ExpressionML is another Mathematica-related export format, but it is distinctly not a compact format.

Ways of Efficiently Seeking in Custom File Formats

I've been wondering how seeking is implemented across different file formats, and what would be a good way to construct a file that holds a lot of data so that it can be seeked efficiently. Some approaches I've considered are equal-sized packets, which allow quick skipping since you know what each data chunk looks like, and pre-indexing whenever a file is loaded.
This entirely depends on the kind of data, and what you're trying to seek to.
If you're trying to seek by record index, then sure: fixed size fields makes life easier, but wastes space. If you're trying to seek by anything else, keeping an index of key:location works well. If you want to be able to build the file up sequentially, you can put the index at the end but keep the first four bytes of the file (after the magic number or whatever) to represent the location of the index itself (assuming you can rewrite those first four bytes).
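A minimal sketch of that layout in Java, with a made-up writeUTF-based record format: reserve a 4-byte slot at the start, append records while remembering their offsets, write the key:offset index at the end, then rewrite the slot to point at it.

import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexedFileWriter {
    public static void main(String[] args) throws Exception {
        Map<String, Long> index = new LinkedHashMap<>();
        try (RandomAccessFile f = new RandomAccessFile("data.idx", "rw")) {
            f.writeInt(0);                               // placeholder for the index offset

            String[] keys = {"alpha", "beta", "gamma"};  // example records
            for (String key : keys) {
                index.put(key, f.getFilePointer());      // remember where this record starts
                f.writeUTF(key);
                f.writeUTF("payload for " + key);
            }

            long indexOffset = f.getFilePointer();
            f.writeInt(index.size());                    // index: count, then key/offset pairs
            for (Map.Entry<String, Long> e : index.entrySet()) {
                f.writeUTF(e.getKey());
                f.writeLong(e.getValue());
            }

            f.seek(0);
            f.writeInt((int) indexOffset);               // rewrite the first four bytes
        }
    }
}

A reader then reads the first int, seeks to the index, and seeks straight to the record for a given key.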
If you want to be able to perform a sort of binary chop on variable length blocks, then having a reasonably efficient way of detecting the start of a block helps - as does having next/previous pointers, as mentioned by Alexander.
Basically it's all about metadata, really - but the right kind of metadata will depend on the kind of data, and the use cases for seeking in the first place.
Well, giving each chunk a size field (in effect, the offset to the next chunk) is common and allows fast skipping of unknown data. Another way is an index chunk at the beginning of the file, storing a table of all chunks in the file along with their offsets; programs would simply read the index chunk into memory.
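A minimal sketch of the size-prefixed chunk idea, assuming a made-up format of a 4-byte type tag followed by a 4-byte payload length:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;

public class ChunkSkipper {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            while (true) {
                int type, length;
                try {
                    type = in.readInt();
                    length = in.readInt();
                } catch (EOFException end) {
                    break;                                // clean end of file
                }
                if (type == 0x44415441) {                 // "DATA": a chunk type we understand
                    byte[] payload = new byte[length];
                    in.readFully(payload);
                    // ... decode payload ...
                } else {
                    in.skipBytes(length);                 // fast skip over unknown chunks
                    // note: skipBytes may skip fewer bytes; a robust reader would loop
                }
            }
        }
    }
}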
