Why are videos unstructured data in the context of Big Data? - hadoop

I am trying to delve into Big Data, and a few of the terms I came across are structured and unstructured data. I understand what it means for data to be structured or unstructured.
I am having difficulty understanding why videos and photos fall under the category of unstructured data.
Can anyone please help me understand this?

Most definitions of 'structured' data refer to data with a high degree of organization, usually meaning a predefined data schema. A schema generally consists of a number of fields in a specific order, each containing just one type of data, much like a classic DB table:
userId,username,age,location,joinedOn
12,"Polly",20,"Washington DC","2016-02-23 13:34:01"
14,"Dan",19,"San Diego CA","2016-11-10 18:32:21"
15,"Shania",36,"","2017-01-04 10:46:39"
In this case, you have two String fields, two Integer fields, and a Date/Time-type field. In a Big Data context, a schema like this allows for convenient querying and processing, vastly improved compression, and efficient storage, all of which become hard problems as data volumes grow.
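For illustration, here is a minimal sketch (assuming the pyarrow package; the file name is just a placeholder) of what a predefined schema buys you in practice: each field gets a fixed type, so a columnar engine can store, compress, and read the columns independently.

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# The same five typed fields as the CSV above.
schema = pa.schema([
    ("userId",   pa.int64()),
    ("username", pa.string()),
    ("age",      pa.int64()),
    ("location", pa.string()),
    ("joinedOn", pa.timestamp("s")),
])

table = pa.table({
    "userId":   [12, 14, 15],
    "username": ["Polly", "Dan", "Shania"],
    "age":      [20, 19, 36],
    "location": ["Washington DC", "San Diego CA", ""],
    "joinedOn": [datetime(2016, 2, 23, 13, 34, 1),
                 datetime(2016, 11, 10, 18, 32, 21),
                 datetime(2017, 1, 4, 10, 46, 39)],
}, schema=schema)

pq.write_table(table, "users.parquet")

# Column pruning: only the columns you ask for are read from disk.
print(pq.read_table("users.parquet", columns=["userId", "age"]))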
Now consider images, which can be represented in many different ways: Simple bitmaps, vectors, progressive JPEGs, formats with built-in variable compression, fractals, containers of animation frames, etc. Not only this, but images have different sizes, color palettes, and metadata, and all of this variation means you can't really treat two images with different properties as one data schema (meaning you don't get the benefits of column-oriented storage, compression, or querying).
As for videos, all of the above is still true, except you have container formats which can contain multiple different video (and audio) codecs and compressions inside, adding further complexity.

Related

Can Parquet be used to store images? Are there any benefits?

I understand how Parquet works for tabular data and JSON data.
I'm struggling to understand if/how Parquet manages binary images like PNG files.
Are there any benefits?
Open to moving this question elsewhere; I just couldn't see another Stack community that made sense.
Parquet can store arbitrary byte strings, so it can support storing images, but there are no particular benefits to doing so, and most bindings aren't necessarily geared to handle very large row sizes, so an image per row could run into unexpected performance or scalability issues.
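For completeness, here is a minimal sketch (assuming the pyarrow package; the file names are placeholders) of what storing an image in Parquet looks like: the raw bytes simply go into a binary column next to whatever metadata you keep.

import pyarrow as pa
import pyarrow.parquet as pq

with open("photo.png", "rb") as f:      # hypothetical input file
    png_bytes = f.read()

table = pa.table({
    "image_id": pa.array([1], type=pa.int64()),
    "filename": pa.array(["photo.png"], type=pa.string()),
    "data":     pa.array([png_bytes], type=pa.binary()),
})
pq.write_table(table, "images.parquet")

Parquet will happily write this, but the binary column gets none of the columnar benefits (the bytes are opaque and typically already compressed), and very large rows are exactly where the performance and scalability issues mentioned above tend to show up.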

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS. This is some kind of archive for historic data. The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and a custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big and I have to add new servers every few weeks, so the costs are constantly growing. The other is to make it easier to add new fields; in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results, as it allows adding new fields easily, and the size of the data has been reduced by 20% (the reason is that a lot of fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to add Lists and Maps to this EnumMap. It's fine for data of the same type, but trying to combine all of the fields is a nightmare.
So I thought of some other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class before I tried Enums. So they solve the problem of adding new fields without rewriting old data, but I feel like there is more potential to optimize the size of the data.
One more thing I still have to check is the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or I have to go back to gzip because of performance), but before I proceed with this I was wondering if someone could suggest some other solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, there are also a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here are any more complex than the datatypes supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals. One is to optimize the size of the data, as it is getting really big and I have to add new servers every few weeks, so the costs are constantly growing.
By adding servers, are you also adding compute? Storage should be relatively cheap, and I'm wondering if you are adding compute along with your servers that you don't really need. You should only be paying to store and retrieve data. Consider a simple object store like S3, which only charges for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1000 requests are free, and it costs only ~$10 per terabyte of storage per month.
The other is to make it easier to add new fields; in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
If you have a use case where you will be writing to the files more often than reading them, I'd recommend not storing the files on HDFS. It is better suited for write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution; it's the better option if you have complex structures where you need to specify the schema and make it easy to serialize/deserialize with Java objects.
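To make the schema-evolution point concrete, here is a rough sketch (assuming the fastavro package; the record and field names are hypothetical): data written with the old schema can be read with a newer reader schema as long as the added field declares a default, so no old files need to be rewritten.

import io
from fastavro import writer, reader

old_schema = {
    "type": "record", "name": "Event",
    "fields": [
        {"name": "id",    "type": "long"},
        {"name": "value", "type": ["null", "string"], "default": None},
    ],
}

# The newer schema adds an optional field with a default value.
new_schema = {
    "type": "record", "name": "Event",
    "fields": [
        {"name": "id",     "type": "long"},
        {"name": "value",  "type": ["null", "string"], "default": None},
        {"name": "source", "type": ["null", "string"], "default": None},
    ],
}

buf = io.BytesIO()
writer(buf, old_schema, [{"id": 1, "value": "a"}])   # data written before the change
buf.seek(0)

for rec in reader(buf, reader_schema=new_schema):    # old data, new reader schema
    print(rec)   # {'id': 1, 'value': 'a', 'source': None}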
One more thing I still have to check is the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or I have to go back to gzip because of performance)
Bzip2 has the highest compression, but it is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip has comparable compression to bzip2 but is slightly faster. Also consider Snappy compression, which balances performance and storage and supports splittable files for certain file types (Parquet or Avro), which is useful for MapReduce jobs.
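If you do end up on Parquet, a quick way to test the codec trade-off on your own data is something like the sketch below (assuming pyarrow; note that Parquet's codec set is snappy/gzip/zstd rather than the bzip2/gzip choice you have with SequenceFiles, but the size-versus-CPU trade-off is the same idea).

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Replace this toy table with a representative sample of your records.
table = pa.table({"field": ["some", "sample", "values"] * 100_000})

for codec in ["snappy", "gzip", "zstd"]:
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")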

Avro size too big?

I am doing some research on the best data exchange format for my company. At the moment I am comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro output is much bigger than Protocol Buffers when transporting requests one by one. In an Avro file the schema is always present, and our requests have a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I missed something. It's written everywhere that Avro is smaller, but for us it seems we would have to put a thousand requests in one file for Protocol Buffers' and Avro's sizes to be equal.
Did I miss something, or is my impression correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
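One way to run that test for Avro specifically is sketched below (assuming the fastavro package; the record shape is hypothetical). The container file embeds the schema and header once per file, while the schemaless encoding emits only the record bytes and expects you to ship the schema out of band, which is usually what you want for single-request exchange; on the Protobuf side you would compare len(message.SerializeToString()) for the equivalent message.

import io
from fastavro import writer, schemaless_writer

schema = {
    "type": "record", "name": "Request",
    "fields": [{"name": "id", "type": "long"},
               {"name": "payload", "type": "string"}],
}
record = {"id": 1, "payload": "hello"}

container = io.BytesIO()
writer(container, schema, [record])          # header + schema + one record
single = io.BytesIO()
schemaless_writer(single, schema, record)    # just the encoded record

print("container file:", len(container.getvalue()), "bytes")
print("schemaless record:", len(single.getvalue()), "bytes")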

Sorting a file to optimize for compression efficiency

We have some large data files that are being concatenated, compressed, and then sent to another server. The compression reduces the transmission time to the destination server, so the smaller we can get the file in a short period of time, the better. This is a highly time-sensitive process.
The data files contain many rows of tab-delimited text, and the order of the rows does not matter.
We noticed that when we sorted the file by the first field, the compressed file was much smaller, presumably because duplicates in that column end up next to each other. However, sorting a large file is slow, and there's no real reason it needs to be sorted other than that it happens to improve compression. There's also no relationship between what's in the first column and what's in subsequent columns. There could be some ordering of rows that compresses even smaller, or alternatively an algorithm that similarly improves compression but requires less time to run.
What approach could I use to reorder rows to optimize the similarity between neighboring rows and improve compression performance?
Here are a few suggestions:
Split the file into smaller batches and sort those. Sorting multiple small sets of data is faster than sorting a single big chunk. You can also easily parallelize the work this way.
Experiment with different compression algorithms. Different algorithms have different throughput and compression ratios. You are interested in algorithms that are on the Pareto frontier of those two dimensions.
Use bigger dictionary sizes. This allows the compressor to reference data that is further in the past.
Note that sorting is important no matter what algorithm and dictionary size you choose, because references to older data tend to use more bits. Also, sorting by a time dimension tends to group together rows that come from a similar data distribution. For example, Stack Overflow has more bot traffic at night than during the day, so the UserAgent field value distribution in their HTTP logs probably varies greatly with the time of day.
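A small sketch of the first suggestion (standard library only; the file name and batch size are placeholders, and it reads the whole file into memory for brevity): sort manageable batches by the first tab-delimited field and measure what that does to the compressed size.

import gzip

BATCH_LINES = 1_000_000   # tune to whatever sorts comfortably in memory

def compressed_size(lines):
    return len(gzip.compress("".join(lines).encode("utf-8"), compresslevel=6))

with open("data.tsv", "r", encoding="utf-8") as f:   # hypothetical input file
    lines = f.readlines()

unsorted_total = compressed_size(lines)

sorted_total = 0
for start in range(0, len(lines), BATCH_LINES):
    batch = sorted(lines[start:start + BATCH_LINES],
                   key=lambda line: line.split("\t", 1)[0])
    sorted_total += compressed_size(batch)

print("unsorted:", unsorted_total, "bytes")
print("batch-sorted:", sorted_total, "bytes")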
If the columns contain different types of data, e.g.
Name, Favourite drink, Favourite language, Favourite algorithm
then you may find that transposing the data (i.e. turning rows into columns) improves compression, because for each new item the compressor only needs to encode which item is the favourite, rather than both the item and the category.
On the other hand, if a word is equally likely to appear in any column, then this approach is unlikely to be of any use.
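A tiny demonstration of the transposition idea (standard library only, with made-up rows): compress the same tab-delimited data row-wise and column-wise and compare.

import gzip

rows = [
    ["Alice", "coffee", "Python", "quicksort"],
    ["Bob",   "coffee", "Python", "mergesort"],
    ["Carol", "tea",    "Python", "quicksort"],
] * 10_000   # repeat so the difference is measurable

row_wise = "\n".join("\t".join(r) for r in rows).encode("utf-8")
col_wise = "\n".join("\t".join(c) for c in zip(*rows)).encode("utf-8")

print("row-wise:   ", len(gzip.compress(row_wise)), "bytes")
print("column-wise:", len(gzip.compress(col_wise)), "bytes")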
Just chiming in: simply try a different compression format. We found for our application (a compressed SQLite db) that LZMA / 7z compresses about 4 times better than zip. Just saying, before you implement anything.

What are some information data structures?

Most data structures are designed to hold data.
Data is something that means something to a computer.
Information is something that means something to a human.
What data structures are designed more for information rather than data?
Examples might include things like XML, .jpg, and Gray codes, which all have an information feel to me.
This looks like too broad a question. Information is stored as data in many different ways, but ultimately the way you interpret it gives it meaning. For instance, a Word document written in Chinese will be stored as data and interpreted by someone who knows how to read Mandarin.
If you are talking about information retrieval using AI techniques, that's another story, also very broad. So be more specific to help yourself.
Finally, the way you store data is sometimes related to the way it is represented in real life: an image, a matrix, or a tree, for example. Some more complex information, like a huge DNA sequence, is stored in a way that is more suitable for computers (to speed up pattern analysis, for instance). So there is also a translation back and forth between information (suitable for humans) and data (suitable for computers).
That's why there's job for us!
Books, newspapers, videos.
Media.
Information is data with a context. The context has to be a part of the data structure to be considered information.
XML is a good example, as is pretty much any office document format, many of which have at least an XML representation. Charts and graphs. Plain text files.
I'm not sure I would include .jpg, since it's not really a human-readable format. You need a computer to display a .jpg for you or it's just data.
It's worth mentioning that just about any information is just the same... data arranged in a way that a person or machine can understand.
