Data structures: Lists and Arrays

If, in a list, data is stored at scattered memory locations with a pointer to the next location, how can we read the list easily with an index?
An array is a contiguous chunk of memory, so indexing is possible.
I read that reading from an array is more efficient than reading from a list.
I'm confused.
I'm particularly concerned with Python.

Related

is not the benefit of a B-Tree lost when it is saved in a file?

I was reading about B-Trees and it was interesting to learn that they are specifically built for storage in secondary memory. But I am a little puzzled by a few points:
If we save the B-Tree in secondary memory (via serialization in Java), is not the advantage of the B-Tree lost? Because once a node is serialized we will not have access to references to its child nodes (as we do in primary memory). So that means we will have to read all the nodes one by one (as no reference to the children is available). And if we have to read all the nodes, then what is the advantage of the tree? I mean, in that case we are not using binary search on the tree. Any thoughts?
When a B-Tree is used on disk, it is not read from a file, deserialized, modified, re-serialized, and written back as a whole.
A B-Tree on disk is a disk-based data structure consisting of blocks of data, and those blocks are read and written one block at a time. Typically:
Each node in the B-Tree is a block of data (bytes). Blocks have fixed sizes.
Blocks are addressed by their position in the file, if a file is used, or by their sector address if B-Tree blocks are mapped directly to disk sectors.
A "pointer to a child node" is just a number that is the node's block address.
Blocks are large. Typically large enough to have 1000 children or more. That's because reading a block is expensive, but the cost doesn't depend much on the block size. By keeping blocks big enough so that there are only 3 or 4 levels in the whole tree, we minimize the number of reads or writes required to access any specific item.
Caching is usually used so that most accesses only need to touch the lowest level of the tree on disk.
So to find an item in a B-Tree, you would read the root block (it will probably come out of cache), look through it to find the appropriate child block and read that (again probably out of cache), maybe do that again, finally read the appropriate leaf block and extract the data.
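A minimal sketch of that block addressing, assuming fixed-size blocks stored back to back in a single file; the class name, BLOCK_SIZE, and method names are illustrative, but the idea that a "child pointer" is just a block number comes straight from the description above.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class BlockFile {
        static final int BLOCK_SIZE = 4096;              // one B-Tree node per block

        private final RandomAccessFile file;

        public BlockFile(RandomAccessFile file) {
            this.file = file;
        }

        // "Following a child pointer" is just seeking to blockNumber * BLOCK_SIZE
        // and reading one block's worth of bytes.
        public byte[] readBlock(long blockNumber) throws IOException {
            byte[] block = new byte[BLOCK_SIZE];
            file.seek(blockNumber * BLOCK_SIZE);
            file.readFully(block);
            return block;
        }

        public void writeBlock(long blockNumber, byte[] block) throws IOException {
            file.seek(blockNumber * BLOCK_SIZE);
            file.write(block, 0, BLOCK_SIZE);
        }
    }

A real implementation would layer a node (de)serializer and a block cache on top of this, so that the root and inner levels rarely touch the disk at all.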

how to create a de-duplication engine that acts as a file storage, retrieval, and handling system?

How can I create a de-duplication engine that acts as a file storage, retrieval, and handling system? It must take some files as input, take data from them in chunks of 8 bytes, and store them in some efficient data structure of our choice. The data structure should be robust and must not store duplicate chunks. Instead, it has to make a reference to the original chunk that is repeated.
It depends on the details.
How big are your chunks? Are the chunks 8 bytes in size, and why is the 8-byte size important?
How many different chunks do you expect? Would a copy of all possible chunk variations fit into memory?
If all possible chunks would not fit into memory, you could simply create hash codes for each chunk and store them in a hash map or tree, with a reference to the location on disk where the full chunk is found. Then, when you look at a new chunk, you calculate its hash code. If the hash code is not in the map, you have a new chunk which has not been seen before. If the hash code matches an entry in the hash map, you have to look up the referenced chunk and compare the two chunks to see whether they are identical or not.
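A rough sketch of that hash-code-plus-disk-lookup idea, assuming 8-byte chunks and a single append-only store file; the class name and layout are illustrative, not a finished design.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ChunkStore {
        private static final int CHUNK_SIZE = 8;

        private final RandomAccessFile store;                          // append-only chunk file
        private final Map<Integer, List<Long>> hashToOffsets = new HashMap<>();

        public ChunkStore(RandomAccessFile store) {
            this.store = store;
        }

        // Returns the store offset of the chunk, appending it only if it has not been seen.
        public long put(byte[] chunk) throws IOException {
            int hash = Arrays.hashCode(chunk);
            List<Long> candidates = hashToOffsets.computeIfAbsent(hash, h -> new ArrayList<>());
            for (long offset : candidates) {
                byte[] existing = new byte[CHUNK_SIZE];
                store.seek(offset);
                store.readFully(existing);
                if (Arrays.equals(existing, chunk)) {
                    return offset;                                     // duplicate: reference the original
                }
            }
            long offset = store.length();                              // new chunk: append it
            store.seek(offset);
            store.write(chunk);
            candidates.add(offset);
            return offset;
        }
    }

A file being de-duplicated would then be represented simply as the sequence of offsets returned by put.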
If the chunks do fit into memory, you can still use the hash map approach, simply buffering the unique chunks in memory instead of having to load them from disk. You could also use a critbit index. With a critbit index you could use the complete chunk as the key, thus creating an ordered collection of chunks. The benefit would be that the critbit tree can use prefix sharing, so it can store more chunks than would usually fit into memory; if two chunks differ only in the last few bytes, it would store the leading bytes only once for both chunks.

joins vs distributed cache in hadoop

What is the difference between joins and the distributed cache in Hadoop? I am really confused about map-side joins and reduce-side joins and how they work. How is the distributed cache different when processing data in a MapReduce job? Please share an example.
Regards,
Ravi
Let's say you have 2 files of data with the following records:
word -> frequency
Same words can be present in both files.
Your task is to merge these files, compute total frequency for each term, and produce the aggregated file.
Map side joins.
Useful when your data on both sides of the join is already pre-sorted by key. In that case the join is a simple merge of two streams with linear complexity (a merge sketch follows the pros and cons below). In our example, the word-frequency data would have to be pre-sorted alphabetically by word in both files.
Pros: works with virtually unlimited input data (does not have to fit in memory).
Does not require a reducer, thus it is very efficient.
Cons: requires your input data to be pre-sorted (for example, as a result of a previous map/reduce job)
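Here is a minimal sketch of the underlying two-stream merge, assuming both files hold "word<TAB>frequency" lines already sorted alphabetically by word. It only illustrates the linear merge; in Hadoop a map-side join would normally be configured through the join framework rather than hand-coded like this.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class SortedMerge {
        // Merge two sorted word-frequency streams, summing frequencies for words
        // that appear in both; each input line is read exactly once.
        public static void merge(BufferedReader a, BufferedReader b, PrintWriter out) throws IOException {
            String la = a.readLine(), lb = b.readLine();
            while (la != null && lb != null) {
                String[] pa = la.split("\t"), pb = lb.split("\t");
                int cmp = pa[0].compareTo(pb[0]);
                if (cmp < 0) {
                    out.println(la);
                    la = a.readLine();
                } else if (cmp > 0) {
                    out.println(lb);
                    lb = b.readLine();
                } else {                                   // same word in both files
                    out.println(pa[0] + "\t" + (Long.parseLong(pa[1]) + Long.parseLong(pb[1])));
                    la = a.readLine();
                    lb = b.readLine();
                }
            }
            for (; la != null; la = a.readLine()) out.println(la);   // drain whichever
            for (; lb != null; lb = b.readLine()) out.println(lb);   // stream remains
        }
    }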
Reduce joins.
Useful when our files are not sorted yet and they are too large to fit in memory, so you have to merge them using a distributed sort with reducer(s) (the reducer side is sketched after the pros and cons below).
Pros: works with virtually unlimited input data (does not have to fit in memory).
Cons: requires reduce phase
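A minimal sketch of the reduce side, assuming the mappers of both files simply emit (word, frequency) pairs; the shuffle brings all frequencies for a word to one reducer, which sums them.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FrequencySumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> frequencies, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable frequency : frequencies) {
                total += frequency.get();                  // counts may come from either file
            }
            context.write(word, new IntWritable(total));   // aggregated frequency for the word
        }
    }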
Distributed cache.
Useful when our input word-frequency files are NOT sorted, and one of the two files is small enough to fit in memory. In this case you can use it as a distributed cache and load it into memory as a hash table Map<String, Integer>. Each mapper will then stream the larger input file as key-value pairs and look up the values of the smaller file in the hash map (a sketch follows the pros and cons below).
Pros: Efficient, linear complexity based on largest input set size. Does not require reducer.
Cons: Requires one of the inputs to fit in memory.
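A hedged sketch of that approach, assuming the small word-frequency file has been shipped to each task via the distributed cache (for example with job.addCacheFile) and is readable under the local name "small_counts.txt", which is a hypothetical name; both files hold "word<TAB>frequency" lines.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheJoinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Map<String, Integer> smallFile = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the small input entirely into memory as a hash table.
            try (BufferedReader in = new BufferedReader(new FileReader("small_counts.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");
                    smallFile.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Stream the large file and look each word up in the in-memory table.
            // (Words that appear only in the small file would need separate handling.)
            String[] parts = line.toString().split("\t");
            int total = Integer.parseInt(parts[1]) + smallFile.getOrDefault(parts[0], 0);
            context.write(new Text(parts[0]), new IntWritable(total));
        }
    }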

Can you know how many input values has a reducer in Hadoop without iterating on them?

I am writing a Reducer in Hadoop and I am using its input values to build a byte array which encodes a list of elements. The size of the buffer in which I write my data depends on the number of values the reducer receives. It would be efficient to allocate its size in memory in advance, but I don't know how many values there are without iterating over them with a "foreach" statement.
Hadoop output is an HBase table.
UPDATE:
After processing my data with the mapper, the reducer keys have a power-law distribution. This means that only a few keys have a lot of values (at most 9000), but most of them have just a few. I noticed that by allocating a buffer of 4096 bytes, 97.73% of the values fit in it. For the rest I can try to reallocate a buffer with double the capacity until all the values fit. For my test case this can be accomplished by reallocating memory 6 times in the worst case, when there are 9000 values for a key.
I assume you're going to go through them with for-each anyway, after you've allocated your byte array, but you don't want to have to buffer all the records in memory (as you can only loop through the iterator you get back from your value collection once). Therefore, you could
Run a counting reducer that outputs every input record and also outputs the count to a record that is of the same value class as the map output, and then run a "reduce-only" job on that result using a custom sort that puts the count first (recommended)
Override the built-in sorting you get with Hadoop to count while sorting and inject that count record as the first record of its output (it's not totally clear to me how you would accomplish the override, but anything's possible)
If the values are unique, you might be able to have a stateful sort comparator that retains a hash of the values with which it gets called (this seems awfully hacky and error prone, but I bet you could get it to work if the mechanics of secondary sort are confined to one class loader in one JVM)
Design your reducer to use a more flexible data structure than a byte array, and convert the result to a byte array before outputting if necessary (highly recommended; see the sketch below)
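A sketch of that last option, assuming BytesWritable values and a plain BytesWritable output instead of the real HBase write; ByteArrayOutputStream grows on its own, so the value count is never needed up front.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EncodingReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
        @Override
        protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
                throws IOException, InterruptedException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream(4096);  // grows as needed
            for (BytesWritable value : values) {
                buffer.write(value.getBytes(), 0, value.getLength());        // append each element
            }
            byte[] encoded = buffer.toByteArray();                           // one final allocation
            context.write(key, new BytesWritable(encoded));
        }
    }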
You can use the following paradigm:
Map: Each mapper keeps a map from keys to integers, where M[k] is the number of values sent out with a certain key k. At the end of its input, the mapper will also send out the key-value pairs (k, M[k]) (a mapper sketch follows these steps).
Sort: Use secondary sort so that the pairs (k, M[k]) come before the pairs (k, your values).
Reduce: Say we're looking at key k. Then the reducer first aggregates the counts M[k] coming from the different mappers to obtain a number n. This is the number you're looking for. Now you can create your data structure and do your computation.
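A rough sketch of the counting mapper, assuming Text keys and values read as "key<TAB>value" lines; the "#COUNT#" prefix and the secondary-sort configuration that orders these count records before the ordinary value records are illustrative assumptions, not Hadoop built-ins.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, Integer> perKeyCount = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);        // key <TAB> value
            perKeyCount.merge(parts[0], 1, Integer::sum);            // M[k] += 1
            context.write(new Text(parts[0]), new Text(parts[1]));   // pass the value through
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit (k, M[k]) once per key; the secondary sort must place these
            // count records before the value records within each key group.
            for (Map.Entry<String, Integer> e : perKeyCount.entrySet()) {
                context.write(new Text(e.getKey()), new Text("#COUNT#" + e.getValue()));
            }
        }
    }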

Data structures and Algorithm analysis question

I'm looking for an answer to this question, which comes from a class on data structures and algorithms. I learned about merge sort but don't remember clusters and buffers. I'm not quite sure I understand the question. Can someone help explain or answer it?
A file of size 1 million clusters is to be sorted using 128 input buffers of one cluster size. There is an output buffer of one cluster size. How many disk I/Os will be needed if the balanced k-way merge sort (a multi-step merge) algorithm is used?
It is asking about the total number of disk operations; a cluster here can be of any size.
You need to know how many Disk IOs are needed per iteration of a balanced k-way merge sort.
(hint: every merge pass requires reading and writing every value in the array from and to disk once)
Then you work out how many passes must be performed to sort your data.
The total number of Disk IOs can then be calculated.
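For a rough sense of how the arithmetic might go (assuming each pass reads and writes every cluster exactly once, sorted runs of 128 clusters are first built in memory, and each merge step is a 128-way merge; this illustrates the method rather than claiming the exercise's official answer):

    Run formation: read + write 1,000,000 clusters             = 2,000,000 I/Os  (about 7,813 runs of 128 clusters)
    Merge pass 1:  7,813 runs -> ceil(7,813 / 128) = 62 runs   = 2,000,000 I/Os
    Merge pass 2:  62 runs -> 1 sorted file (62 <= 128)        = 2,000,000 I/Os
    Total:         3 passes x 2,000,000                        = 6,000,000 disk I/Os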
