Are files data structures?

I'm new to programming and I'd like to know if files such as BMPs, MP3s, EXEs are considered to be data structures as well.

No, they are some form of data, possibly compressed, that can be read by any program that understands them.
But they are structured data. That means there is a specific way your program must read them. For example, with a BMP you need to know how to read the image's width and height from the header, then start reading its pixels, looping until you reach the end.
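As a rough illustration, here is a minimal Python sketch, assuming the common Windows BMP layout with a 40-byte BITMAPINFOHEADER (width and height stored as little-endian 32-bit integers at byte offsets 18 and 22):

```python
import struct

def read_bmp_dimensions(path):
    """Read width/height from a BMP, assuming a BITMAPINFOHEADER."""
    with open(path, "rb") as f:
        header = f.read(26)   # 14-byte file header plus the start of the info header
    if header[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Width at offset 18, height at offset 22, both little-endian int32.
    width, height = struct.unpack_from("<ii", header, 18)
    return width, height
```

The pixel data that follows is then read row by row, in the same "know the layout, then loop" fashion.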
There are more complex structured formats, such as EXEs, which your operating system reads, or MP3s, which require decoding algorithms before the data becomes usable.
Data structures, on the other hand, are standard ways of organizing data so you can store, read, and use it in specific situations, such as a command history.
The well-known commands Ctrl+Z and Ctrl+Shift+Z (undo and redo) are typically implemented with stacks: each command is piled on top of the previous one, and to undo you take the topmost command in the stack, pop it, and execute its undo function.
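A minimal sketch in Python (the command objects and their do()/undo() methods are placeholders for whatever your application defines):

```python
class CommandHistory:
    """Undo/redo built on two stacks (plain lists used as stacks)."""

    def __init__(self):
        self._undo = []   # commands already executed
        self._redo = []   # commands that have been undone

    def execute(self, command):
        command.do()               # hypothetical command interface
        self._undo.append(command)
        self._redo.clear()         # a fresh action invalidates the redo history

    def undo(self):
        if self._undo:
            command = self._undo.pop()   # topmost = most recent
            command.undo()
            self._redo.append(command)

    def redo(self):
        if self._redo:
            command = self._redo.pop()
            command.do()
            self._undo.append(command)
```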

Not really. From Wikipedia, "In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data."
You normally read or write such files as a whole and do not perform local modifications. That said, for some formats (TIFF images, for instance), the individual data fields can be accessed directly rather than sequentially.

Related

Working with a Set that does not fit in memory

Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would feed it first with the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementations of this data structure store data in memory, and I already know that the huge list of strings won't fit in memory and that I'll somehow need to store this list on disk.
I could rely on something like SQLite to store the strings in an indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you faced this kind of problem before? Do you know of any library that might be helpful? (I'm quite language-agnostic, feel free to throw whatever you have.)
There are multiple ways to efficiently check whether a string belongs to a huge set of strings.
A first solution is to use a trie to make the set much more compact. Many strings are likely to share the same prefix, and storing that prefix over and over is not space efficient; sharing it may be enough to keep the full set in memory. If not, the root part of the trie can be kept in memory, referencing leaf-like nodes stored on disk. This lets the application quickly determine which leaf-like nodes need to be loaded, at a relatively small cost. If the number of strings is not too huge, the leaf parts of the trie hanging off a given node of the in-memory root can be loaded in one big sequential read from the storage device.
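A minimal in-memory sketch of the idea in Python (a disk-backed version would store the deeper subtrees in files, as described above):

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}     # next character -> TrieNode
        self.terminal = False  # True if a stored string ends here

class Trie:
    """Shared prefixes are stored once, which is where the space saving comes from."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def __contains__(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal
```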
Another solution is a hash table, which can tell you whether a given string exists in the set with low latency (e.g. only two fetches). The idea is to hash the searched string and look up a specific slot of a big array stored on the storage device. Open addressing can be used to make the structure more compact, at the expense of possibly higher latency, while only two fetches are needed with closed addressing (the first gets the location of the bucket associated with the hash, the second gets the actual items).
One simple way to implement such data structures on a storage device is to use memory-mapped files. Memory mapping lets you access data on a storage device transparently, as if it were in memory (whatever the language used). However, each access still costs what the storage device costs rather than what RAM costs, so the data structure implementation should be adapted to memory mapping for better performance.
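As one concrete (and deliberately simple) layout, the fixed-length strings from the question could be kept sorted in a single file and binary-searched through a memory mapping. A Python sketch, where the file path and record length are assumptions:

```python
import mmap

def contains(path, key, record_len):
    """Binary-search a file of sorted, fixed-length byte-string records.

    Assumes `path` holds the set as concatenated records of exactly
    `record_len` bytes each, sorted lexicographically; `key` must be
    a bytes object of the same length.
    """
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi = 0, len(mm) // record_len
        while lo < hi:
            mid = (lo + hi) // 2
            record = mm[mid * record_len:(mid + 1) * record_len]
            if record == key:
                return True
            if record < key:
                lo = mid + 1
            else:
                hi = mid
    return False
```

Each probe touches roughly one page of the file, so the operating system's page cache does most of the caching work for you.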
Finally, you can cache data so that some lookups become much faster. One way to do that is with a Bloom filter, a very compact probabilistic hash-based data structure. It can be kept in memory without actually storing any of the string items. False positives are possible, but false negatives are not, so it is good at rejecting searched strings that are not in the set without any (slow) fetch from the storage device. A big Bloom filter can be very accurate. It needs to be combined with one of the structures above if deterministic results are required. Depending on the distribution of the searched items, LRU/LFU caches might also help.
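A minimal Bloom-filter sketch in Python (the sizes are illustrative; in practice the bit count m and hash count k are derived from the expected item count and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=8 * 1024 * 1024, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)   # 1 MiB of bits with the defaults

    def _positions(self, item):
        # Derive k positions from k salted SHA-256 digests of the item (bytes).
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The filter is checked first; only when it answers "possibly present" does the lookup fall through to the on-disk structure.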

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS, as a kind of archive for historical data. The data being stored is row oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing. The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results, as it allows adding new fields easily, and the size of the data was reduced by 20% (the reason being that many fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to put Lists and Maps into the EnumMap as well. It's fine for data of the same type, but trying to combine all of the fields is a nightmare.
So I thought about other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class before trying Enums. So they solve the problem of adding new fields without having to rewrite old data, but I feel like there is more potential to optimize the size of the data.
The one more thing I am going to check is of course the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or I have to go back to gzip because of performance), but before I proceed I was wondering whether someone could suggest another solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row oriented and has tens of different kinds
of fields. Some of them are Strings, some are Integers, and there are
also a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here are any more complex than the data types supported by Spark, so I wouldn't bother changing the data types in any way.
I would like to achieve two goals. One is to optimize the size of the
data, as it is getting really big; I have to add new servers every few
weeks and the costs are constantly growing.
By adding servers, are you also adding compute that you don't really need? Storage should be relatively cheap, and you should only be paying to store and retrieve data. Consider a simple object store like S3, which only charges for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1,000 requests are free, and it costs only ~$10 per terabyte of storage per month.
The other is to make it easier to add new fields - in the current
state, if I wanted to add a new field I would have to rewrite all of
the old data.
If you have a use case where you will be writing to the files more often than reading them, I'd recommend not storing them on HDFS; it is better suited to write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution; it is a better fit if you have complex structures where you need to specify the schema and want easy serialization/deserialization with Java objects.
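To illustrate the schema-evolution point with a small sketch (using Python and the fastavro package rather than the Java Avro API; the record and field names are made up): a field added with a default can be filled in when reading files written with the old schema, so the old data never has to be rewritten.

```python
from fastavro import writer, reader, parse_schema

v1 = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "payload", "type": "string"}],
})

# v2 adds a field with a default, so files written with v1 stay readable.
v2 = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "long"},
               {"name": "payload", "type": "string"},
               {"name": "source", "type": ["null", "string"], "default": None}],
})

with open("events_v1.avro", "wb") as out:
    writer(out, v1, [{"id": 1, "payload": "hello"}])

with open("events_v1.avro", "rb") as src:
    for record in reader(src, reader_schema=v2):
        print(record)   # {'id': 1, 'payload': 'hello', 'source': None}
```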
The one more thing I am going to check is of course the time it takes
to load the data (this will also tell me whether it's OK to use bzip2
compression or I have to go back to gzip because of performance)
Bzip2 has the highest compression but is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip has compression comparable to bzip2 but is somewhat faster. Also consider Snappy compression, which balances performance and storage and supports splittable files for certain formats (Parquet or Avro), which is useful for MapReduce jobs.
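One cheap way to settle the codec question before committing is to write the same sample of rows with each codec and compare sizes and timings. A hedged sketch using Python and pyarrow rather than the Java tooling, with made-up column and file names:

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# A small synthetic sample standing in for the real rows.
table = pa.table({"id": list(range(100_000)),
                  "label": ["row-%d" % i for i in range(100_000)]})

for codec in ("none", "snappy", "gzip"):
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```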

Reasons to use SequenceFile instead of text file in Hadoop

What are the reasons to use SequenceFile instead of a text file?
I'm guessing that they are good because input/output comes down to serialization and deserialization instead of parsing text into an object, which matters if that object needs to be used multiple times.
Also, I read that SequenceFiles support compression, so they take less space, and that they are good for aggregating many small files into one large one.
Are these arguments valid, and what else is there?
Binary data (as in SequenceFiles) is usually more compact than text data (TextFiles) even without explicit compression. So less data needs to be read from/written to the hard disks. The space savings depend on the data that is written.
Reading binary data is more CPU efficient than String parsing.
However, SequenceFiles cannot be read by humans, and they are bound to a specific object type/class, whereas text data can be interpreted in different ways as needed.
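A rough sense of the size difference, without any Hadoop involved (a Python sketch; SequenceFile framing and sync markers are ignored):

```python
from array import array

values = list(range(1_000_000))

# Text: one decimal number per line, as a TextFile would hold it.
with open("values.txt", "w") as f:
    f.write("\n".join(map(str, values)))

# Binary: fixed-width integers (typically 4 bytes each with type code "i").
with open("values.bin", "wb") as f:
    f.write(array("i", values).tobytes())
```

On this data the binary file costs 4 bytes per value, while the text file averages close to 7 bytes per value, before any compression is applied.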

Data structures for audio editor

I have been writing an audio editor for the last couple of months, and have been recently thinking about how to implement fast and efficient editing (cut, copy, paste, trim, mute, etc.). There doesn't really seem to be very much information available on this topic, however... I know that Audacity, for example, uses a block file strategy, in which the sample data (and summaries of that data, used for efficient waveform drawing) is stored on disk in fixed-sized chunks. What other strategies might be possible, however? There is quite a lot of info on data-structures for text editing - many text (and hex) editors appear to use the piece-chain method, nicely described here - but could that, or something similar, work for an audio editor?
Many thanks in advance for any thoughts, suggestions, etc.
Chris
The classical problem for editors handling relatively large files is how to cope with deletion and insertion. Text editors obviously face this, since the user typically enters characters one at a time. Audio editors don't usually do sample-by-sample inserts - the user doesn't interactively enter one sample at a time - but they do have cut-and-paste operations.
I would start with a representation where an audio file is represented by chunks of data stored in a (binary) search tree. Insert works by splitting the chunk you are inserting into in two, adding the inserted chunk as a third one, and updating the tree. To keep this efficient and responsive to the user, you should then have a background process that defragments the representation on disk (or in memory) and then makes an atomic update to the tree holding the chunks. This keeps inserts and deletes as fast as possible.
Many other audio operations (effects, normalize, mix) operate in place and do not require changes to the data structure, but doing e.g. a normalize over the whole sample is a good opportunity to defragment it at the same time. If the audio samples are large, you can keep the chunks on disk as well, as is standard. I don't believe the chunks need to be fixed size; they can be variable size, preferably 1024 x (a power of two) bytes to make file operations efficient, though a fixed-size strategy can be easier to implement.
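A minimal piece-chain sketch for audio in Python (a plain list stands in for the search tree described above, and the source objects are whatever holds the original samples, e.g. a memory-mapped file or a paste buffer):

```python
from dataclasses import dataclass

@dataclass
class Piece:
    source: object   # backing store holding the samples (never modified)
    offset: int      # first sample of this piece within its source
    length: int      # number of samples in this piece

class PieceChain:
    def __init__(self, source, length):
        self.pieces = [Piece(source, 0, length)]

    def insert(self, position, source, offset, length):
        """Insert `length` samples from `source` at logical sample `position`."""
        new_piece = Piece(source, offset, length)
        result, inserted = [], False
        for piece in self.pieces:
            if not inserted and position <= piece.length:
                # Split the current piece around the insertion point.
                if position > 0:
                    result.append(Piece(piece.source, piece.offset, position))
                result.append(new_piece)
                if position < piece.length:
                    result.append(Piece(piece.source, piece.offset + position,
                                        piece.length - position))
                inserted = True
            else:
                result.append(piece)
                if not inserted:
                    position -= piece.length
        if not inserted:
            result.append(new_piece)   # insertion point was at or past the end
        self.pieces = result
```

Deletion is the mirror image: trim or drop the pieces that fall inside the deleted range. The sample data itself is never rewritten, which is what makes the edits cheap.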

What are some information data structures?

Most data structures are designed to hold data.
Data is something that means something to a computer.
Information is something that means something to a human.
What data structures are designed more for information rather than data?
Examples might include things like XML, .jpg, and Gray codes, which all have an information feel to me.
This looks like too broad a question. Information is stored as data in many different ways, but ultimately the way you interpret it is what gives it meaning. For instance, a Word document written in Chinese will be stored as data and interpreted by someone who knows how to read Mandarin.
If you are talking about information retrieval using AI techniques, that's another story, also very broad. So be more specific to help yourself.
Finally, the way you store data is sometimes related to the way it is represented in real life - an image, a matrix, or a tree, for example. Some more complex information, like a huge DNA sequence, is stored in a way that is more suitable for computers (to speed up pattern analysis, for instance). So there is also a back-and-forth translation between information (suitable for humans) and data (suitable for computers).
That's why there's job for us!
Books, newspapers, videos.
Media.
Information is data with a context. The context has to be a part of the data structure to be considered information.
XML is a good example. Pretty much any office document format, many of which have at least an XML representation. Charts and graphs. Plain text files.
I'm not sure I would include .jpg, since it's not really a human-readable format. You need a computer to display a .jpg for you or it's just data.
It's worth mentioning that just about any information is just the same... data arranged in a way that a person or machine can understand.
