I'm looking for an answer to this question, which comes from a class on data structures and algorithms. I learned about merge sort, but I don't remember anything about clusters and buffers, so I'm not quite sure I understand the question. Can someone help explain or answer it?
A file of size 1 million clusters is to be sorted using 128 input buffers of one cluster size each. There is an output buffer of one cluster size. How many disk I/Os will be needed if the balanced k-way merge sort (a multi-step merge) algorithm is used?
It is asking about the total number of disk operations; a cluster here can be of any size.
You need to know how many disk I/Os are needed per pass of a balanced k-way merge sort.
(Hint: every merge pass requires reading every value in the file from disk and writing it back to disk once.)
Then you work out how many passes must be performed to fully sort your data.
The total number of disk I/Os can then be calculated.
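To make that arithmetic concrete, here is a small Python sketch that follows the hint: it assumes the initial runs are formed by sorting 128 clusters in memory at a time, and that the run-formation pass and every merge pass each read and write every cluster exactly once. The figures come from the question, but the run-formation strategy is an assumption, so treat the result as an estimate under those assumptions rather than a definitive answer.

```python
import math

def total_disk_ios(num_clusters, num_input_buffers):
    """Estimate disk I/Os for a balanced k-way external merge sort.

    Assumes initial runs are formed by sorting num_input_buffers clusters
    in memory at a time, and that every pass (run formation and each merge
    pass) reads and writes every cluster exactly once.
    """
    k = num_input_buffers                        # merge fan-in
    ios_per_pass = 2 * num_clusters              # read + write every cluster

    runs = math.ceil(num_clusters / k)           # sorted runs after run formation
    merge_passes = math.ceil(math.log(runs, k))  # passes needed to reach one run

    return (1 + merge_passes) * ios_per_pass     # run formation + merge passes

print(total_disk_ios(1_000_000, 128))            # 6,000,000 under these assumptions
```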
Related
What happens in offline sorting when the dataset size exceeds the RAM size?
I have to sort a large text file and want to know what happens if the size of my text file exceeds the RAM size.
Offline sorting is precisely designed to work in the case where you can't fit your data set into RAM. The general idea is to split the data set apart into smaller pieces, each of which can fit into memory, and to sort each piece independently of the others. Afterwards, you can combine them all together.
The most common external sorting algorithm is an external mergesort. You begin by splitting your input apart into blocks of some fixed size - usually, as much as you can fit into RAM at one time - then sort those blocks independently and write the sorted versions back to disk. You then do a k-way merge operation to combine all of the sorted sequences back together; the specific algorithm used is usually a generalization of the normal 2-way merge algorithm combined with some buffering to minimize disk reads.
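Here is a minimal Python sketch of that two-phase idea; the file names and chunk size are placeholders, and heapq.merge does the k-way merge over the sorted runs. A real implementation would add more buffering and error handling.

```python
import heapq
import tempfile
from itertools import islice

def external_sort(input_path, output_path, max_lines_in_memory=100_000):
    """Sort a large text file line by line without holding it all in RAM."""
    run_files = []

    # Phase 1: read fixed-size chunks, sort each in memory, write sorted runs.
    with open(input_path) as infile:
        while True:
            chunk = list(islice(infile, max_lines_in_memory))
            if not chunk:
                break
            chunk = [line if line.endswith("\n") else line + "\n" for line in chunk]
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            run_files.append(run)

    # Phase 2: k-way merge of all the sorted runs into the output file.
    with open(output_path, "w") as outfile:
        outfile.writelines(heapq.merge(*run_files))

    for run in run_files:
        run.close()

# external_sort("big_input.txt", "sorted_output.txt")
```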
A less common approach is to use quicksort and a double-ended priority queue. You can read more about this here.
Can anyone explain to me how jobs are mapped and reduced in Hadoop, and why group-by operations are considered expensive?
I wouldn't really say expensive, but it does affect performance: for an order by or a sort, far more processing is needed to order the records. The work done by the comparator and the partitioner becomes huge when millions or billions of records are being sorted.
I hope this answers your question.
Performance in Hadoop is affected by two main factors:
1- Processing: the execution time spent running the map and reduce tasks on the cluster nodes.
2- Communication: shuffling data. Some operations need to send data from one node to another for processing (like a global sort).
A group by affects both factors; during the shuffle, something like half of the data may have to be moved between nodes.
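As a toy illustration of where that cost comes from, here is a single-process Python simulation of a group by expressed as map, shuffle, and reduce. Real Hadoop spreads these steps over many nodes; the shuffle step is where the (key, value) pairs cross the network, which is the main cost.

```python
from collections import defaultdict

records = [("us", 3), ("de", 5), ("us", 1), ("fr", 2), ("de", 4)]

# Map: emit a (key, value) pair per record (cheap and local to each mapper).
mapped = [(country, amount) for country, amount in records]

# Shuffle: route every pair with the same key to the same reducer.
# On a cluster this means moving a large share of the data between nodes.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: aggregate each group locally on its reducer.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'us': 4, 'de': 9, 'fr': 2}
```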
B-trees are said to be particularly useful in the case of huge amounts of data that cannot fit in main memory.
My question, then, is how do we decide the order of a B-tree, or how many keys to store in a node, or how many children a node should have?
Everywhere I look, people seem to be using 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
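As a back-of-the-envelope example, here is the kind of calculation you would do for an on-disk node; the page, key, pointer, and header sizes below are assumptions, not values from the question, so plug in your own.

```python
# Rough calculation of a B-tree order for an on-disk index, assuming a
# 4 KiB disk page, 8-byte keys, 8-byte child pointers, and a small node
# header. All four numbers are placeholders; check your actual sizes.

PAGE_SIZE = 4096      # bytes per disk page (assumption)
KEY_SIZE = 8          # bytes per key (assumption)
POINTER_SIZE = 8      # bytes per child pointer (assumption)
HEADER_SIZE = 32      # per-node bookkeeping (assumption)

# A node of order m holds up to m child pointers and m - 1 keys.
# Solve m * POINTER_SIZE + (m - 1) * KEY_SIZE + HEADER_SIZE <= PAGE_SIZE.
order = (PAGE_SIZE - HEADER_SIZE + KEY_SIZE) // (KEY_SIZE + POINTER_SIZE)
print(order)  # 254 children (and 253 keys) per node with these sizes
```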
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!
I have a job that splits data into groups, and I need to keep only the large partitions (those above a certain threshold).
Is there a method for this?
One solution is to iterate over all the items and store them in memory, flushing them only once they reach a certain size.
However, this solution may require a very large amount of memory.
I don't think there is a general, straightforward solution (apart from storing items until the size is reached). Maybe if you provide more details, that will give us some more inspiration?
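For what it's worth, here is a minimal Python sketch of the buffering idea from the question, assuming the input is a plain iterable of (key, value) pairs. It flushes a group as soon as it reaches the threshold, but groups that never get there stay buffered, which is exactly the memory cost mentioned above.

```python
from collections import defaultdict

def keep_large_groups(records, threshold):
    """Yield items only for groups that reach the threshold.

    records: an iterable of (key, value) pairs (assumed shape).
    Each group is buffered in memory until it either crosses the threshold
    or the input ends, so memory grows with the number of small groups.
    """
    buffers = defaultdict(list)
    emitted = set()

    for key, value in records:
        if key in emitted:
            yield key, [value]           # group already known to be large
            continue
        buffers[key].append(value)
        if len(buffers[key]) >= threshold:
            yield key, buffers.pop(key)  # flush the buffered items once
            emitted.add(key)

data = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("c", 5)]
for key, items in keep_large_groups(data, threshold=3):
    print(key, items)                    # only group "a" reaches the threshold
```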
I'm interested in performing k-NN search on a large dataset.
There are some libraries, such as ANN and FLANN, but I'm interested in this question: how do you organize the search if you have a database that does not fit entirely into memory (RAM)?
I suppose it depends on how much bigger your index is compared to the available memory. Here are my first spontaneous ideas:
Supposing it was tens of times the size of the RAM, I would try to cluster my data using, for instance, hierarchical clustering trees (implemented in FLANN). I would modify the implementation of the trees so that they keep the branches in memory and save the leaves (the clusters) on the disk. Therefore, the appropriate cluster would have to be loaded each time. You could then try to optimize this in different ways.
If it were not that much bigger (let's say twice the size of the RAM), I would separate the dataset into two parts and create one index for each. I would then find the nearest neighbor in each part and choose between the two results.
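A rough Python sketch of that second idea, using chunked brute-force search over memory-mapped parts instead of a real FLANN index; the file names, array shapes, and the use of numpy.memmap are all assumptions.

```python
import numpy as np

def knn_over_parts(part_paths, shape, query, k=5, chunk_rows=100_000):
    """Find the k nearest neighbours of `query` across several on-disk parts.

    Each part is assumed to be a float32 array of the given shape written
    with numpy.tofile; rows are streamed in chunks so that only a small
    slice of each part is resident in RAM at any time.
    """
    candidates = []  # (distance, global_row_index), k kept per chunk

    for part_no, path in enumerate(part_paths):
        part = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
        for start in range(0, shape[0], chunk_rows):
            chunk = np.asarray(part[start:start + chunk_rows])
            dists = np.linalg.norm(chunk - query, axis=1)
            top = np.argsort(dists)[:k]
            base = part_no * shape[0] + start
            candidates.extend(zip(dists[top].tolist(), (base + top).tolist()))

    # Keep the overall k best across every chunk of every part.
    candidates.sort()
    return candidates[:k]

# nearest = knn_over_parts(["part0.bin", "part1.bin"],
#                          shape=(500_000, 128),
#                          query=np.zeros(128, dtype=np.float32))
```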
It depends on whether your data is very high-dimensional or not. If it is relatively low-dimensional, you can use an existing on-disk R-Tree implementation, such as Spatialite.
If it is higher-dimensional data, you can use X-Trees, but I don't know of any on-disk implementations off the top of my head.
Alternatively, you can implement locality-sensitive hashing with on-disk persistence, for example using mmap.
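A small sketch of that last idea, with one table of random hyperplanes and one plain file per bucket standing in for the mmap-based persistence. The dimension, number of planes, and single-table design are all assumptions; real setups use several tables to get decent recall.

```python
import os
import numpy as np

DIM, NUM_PLANES = 128, 16                        # assumed dimensionality / code length
planes = np.random.default_rng(0).standard_normal((NUM_PLANES, DIM))

def bucket_of(vector):
    """Hash a vector to a short bit string from the signs of its projections."""
    return "".join("1" if b else "0" for b in (planes @ vector > 0))

def add(vec_id, vector, directory="lsh_index"):
    """Append (id, vector) to the file named after the vector's bucket."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, bucket_of(vector)), "ab") as f:
        np.hstack([vec_id, vector]).astype(np.float64).tofile(f)

def query(vector, k=5, directory="lsh_index"):
    """Read back only the query's bucket and search it exhaustively."""
    path = os.path.join(directory, bucket_of(vector))
    if not os.path.exists(path):
        return []                                # empty bucket: no candidates
    rows = np.fromfile(path).reshape(-1, DIM + 1)
    ids, points = rows[:, 0].astype(int), rows[:, 1:]
    dists = np.linalg.norm(points - vector, axis=1)
    nearest = np.argsort(dists)[:k]
    return list(zip(ids[nearest].tolist(), dists[nearest].tolist()))
```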