Parallel Programming Vector Addition - parallel-processing

Is vector addition faster when processed sequentially than in parallel because of MPI overhead? I have used MPI by scattering the two arrays, having each slave process a certain number of vector pairs locally, and then performing a gather to send all values back to the master.

Yes, this is totally expected. Vector addition is dominated by the cost of reading and writing the values from memory. A single addition is orders of magnitude faster than reading/writing one element from memory, so attempting to scatter/add/gather will not improve performance. To gain performance from scatter/gather you must either perform a very expensive operation on each data element or use each data element multiple times.
In an idiomatic MPI program, the vectors should be distributed in the first place.
Edit: The same holds true for vector/matrix multiplication, given that each element of the matrix is accessed only once.
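For reference, here is a minimal sketch of the scatter/add/gather pattern described in the question, written with mpi4py and NumPy (an assumption; the original post shows no code). Even with a clean implementation like this, the communication cost dwarfs the single add per element.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1_000_000                          # total vector length, assumed divisible by size

if rank == 0:
    a = np.random.rand(n)              # full vectors exist only on the master
    b = np.random.rand(n)
else:
    a = b = None

local_a = np.empty(n // size)          # each rank holds only its chunk
local_b = np.empty(n // size)
comm.Scatter(a, local_a, root=0)       # communication: O(n) data moved
comm.Scatter(b, local_b, root=0)

local_c = local_a + local_b            # computation: one add per element

c = np.empty(n) if rank == 0 else None
comm.Gather(local_c, c, root=0)        # more O(n) communication

Run with something like mpiexec -n 4 python vecadd.py. The two scatters and the gather each move the entire data set across process boundaries, which is exactly the memory/communication cost that dominates here.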

Related

How to determine the optimal capacity for Quadtree subdivision?

I've created a flocking simulation using the Boids algorithm and have integrated a quadtree as an optimization. Boids are inserted into the quadtree if it has not yet reached its boid capacity. If the quadtree has reached its capacity, it subdivides into smaller quadtrees and the remaining boids are reinserted into those, recursively.
Performance seems to improve if I increase the capacity from its default of 4 to something that can hold more boids, such as 20, and I was wondering whether there is any rule or methodology for picking the optimal capacity formulaically.
You can view the site live here or the source code here if relevant.
I'd assume it very much depends on your implementation, hardware, and the data characteristics.
Implementation:
An extreme case would be using GPU processing to compare entries. If you support that, having very large nodes, potentially just a single node containing all entries, may be faster than any other solution.
Hardware:
Cache size and bus speed will play a big role, also depending on how much memory every node and every entry consumes. Accessing a sub-node that is not cached is obviously expensive, so you may want to increase the size of nodes in order to reduce sub-node traversal.
Coming back to implementation: storing the whole quadtree in a contiguous segment of memory can be very beneficial.
Data characteristics:
Clustered data: Strongly clustered data can have an adverse effect on performance because it may cause the tree to become very deep. In this case, increasing node size may help.
Large amounts of data may push you past the threshold where everything still fits into the cache. In this case, making nodes larger will save memory because you will have fewer nodes, and everything may fit into the cache again.
In my experience I found that 10-50 entries per node gives the best performance across different datasets.
If you update your tree a lot, you may want to define separate thresholds to avoid 'flickering' and frequent merging/splitting of nodes, e.g. split nodes with more than 25 entries but merge them only when they drop below 15 entries; see the sketch below.
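As an illustration only (not code from the linked project), here is a minimal Python sketch of capacity-based subdivision; the names Quadtree, SPLIT_AT and MERGE_AT are hypothetical, and the merge-on-removal side of the hysteresis is only indicated by a comment:

SPLIT_AT = 25   # subdivide a leaf once it holds more than this many boids
MERGE_AT = 15   # on removal, collapse children once their total drops below this

class Quadtree:
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h
        self.points = []        # entries stored in this leaf
        self.children = None    # four sub-quadrants once subdivided

    def insert(self, px, py):
        if self.children is None:
            self.points.append((px, py))
            if len(self.points) > SPLIT_AT:
                self._subdivide()
        else:
            self._child_for(px, py).insert(px, py)

    def _subdivide(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [Quadtree(self.x,      self.y,      hw, hh),
                         Quadtree(self.x + hw, self.y,      hw, hh),
                         Quadtree(self.x,      self.y + hh, hw, hh),
                         Quadtree(self.x + hw, self.y + hh, hw, hh)]
        for px, py in self.points:      # push existing entries down one level
            self._child_for(px, py).insert(px, py)
        self.points = []

    def _child_for(self, px, py):
        i = (1 if px >= self.x + self.w / 2 else 0) + (2 if py >= self.y + self.h / 2 else 0)
        return self.children[i]

The 10-50 entries per node mentioned above is a heuristic; benchmarking SPLIT_AT against your own boid counts and distribution is really the only reliable way to pick it.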
If you are interested in a quadtree-like structure that avoids degenerate 'deep' quadtrees, have a look at my PH-Tree. It is structured like a quadtree but operates on bit level, so the maximum depth is strictly limited to 64 or 32, depending on how many bits your data has. In practice the depth will rarely exceed 10 levels or so, even for very dense data. Note: A plain PH-Tree is a key-value 'map' in the sense that every coordinate (=key) can only have one entry (=value). That means you need to store lists or sets of entries in case you expect more than one entry for any given coordinate.

Check for duplicate input items in a data-intensive application

I have to build a server-side application that will receive a stream of data as input; it will actually receive a stream of integers of up to nine decimal digits and has to write each of them to a log file. Input data is totally random, and one of the requirements is that the application should not write duplicate items to the log file and should periodically report the number of duplicate items found.
Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, as checking the whole log (text) file every time it writes is certainly not a suitable solution. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input can be really large, I don't think that is the best way to do it either...
Any idea?
Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap with one bit per possible value: about one billion bits for nine-decimal-digit values, roughly 120 MiB (twice that if negative values must be handled). Since this data structure is fairly big, memory accesses may be slow (limited by the latency of the memory hierarchy).
If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using logical atomic operations.
To check if a value is already seen before, you can check the value of a bit in the bitmap then set it (atomically if done in parallel).
If you know that your stream contains fewer than, say, a million integers, or that the stream is not uniformly distributed, you can use a hash-set data structure instead, as it stores the data in a more compact way (at least when used sequentially).
Bloom filters can help speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with another approach if you want deterministic results).
Here is an example using hash-sets in Python:
seen = set()                         # values seen so far (not just the duplicates)
for value in inputStream:            # iterate over the stream
    if value not in seen:            # O(1) average-case lookup
        log.write(str(value) + "\n") # first occurrence: write it to the log
        seen.add(value)              # O(1) average-case insertion
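And here is a sketch of the bitmap approach described above, using a plain bytearray (the names inputStream and log are reused from the hash-set example; the code assumes non-negative values below 10**9, so adjust NBITS if negative values are possible):

NBITS = 10**9                        # one bit per possible nine-digit value
bitmap = bytearray(NBITS // 8 + 1)

duplicates = 0
for value in inputStream:
    byte, bit = value >> 3, 1 << (value & 7)
    if bitmap[byte] & bit:           # bit already set: duplicate
        duplicates += 1
    else:
        bitmap[byte] |= bit          # mark as seen, then log the first occurrence
        log.write(str(value) + "\n")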

Balanced trees and space and time trade-offs

I was trying to solve problem 3-1 for large input sizes given in the following link http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/assignments/MIT6_006F11_ps3_sol.pdf. The solution uses an AVL tree for range queries and that got me thinking.
I was wondering about scalability issues when the input size increases from a million to a billion and beyond. For instance, for a stream of integers (4 bytes each) and an input of size 1 billion, the space required to store the integers in memory would be roughly 4 GB! The problem gets worse when you consider other data types such as floats and strings with input sizes of that order of magnitude.
Thus, I reached the conclusion that I would require the assistance of secondary storage to store all those numbers and pointers to child nodes of the AVL tree. I was considering storing the left and right child nodes as separate files but then I realized that that would be too many files and opening and closing the files would require expensive system calls and time consuming disk access and thus at this point I realized that AVL trees would not work.
I next thought about B-Trees and the advantage they provide as each node can have 'n' children, thereby reducing the number of files on disk and at the same time packing in more keys at every level. I am considering creating separate files for the nodes and inserting the keys in the files as and when they are generated.
1) I wanted to ask whether my approach and thought process are correct, and
2) Whether I am using the right data structure, and if B-Trees are the right data structure, what should the order be to make the application efficient? What flavour of B-Tree would yield maximum efficiency? Sorry for the long post! Thanks in advance for your replies!
Yes, your reasoning is correct, although there are probably smarter schemes than storing one node per file. In fact, a B(+)-Tree often outperforms a binary search tree in practice (especially for very large collections) for numerous reasons, which is why just about every major database system uses it as its main index structure. Some reasons why binary search trees don't perform too well are:
Relatively large tree height (1 billion elements ~ height of 30 (if perfectly balanced)).
Every comparison is completely unpredictable (a 50/50 choice), so the hardware can't prefetch memory and keep the CPU pipeline filled with instructions.
After the upper few levels, you jump far away and to unpredictable locations in memory, each possibly requiring accessing the hard drive.
A B(+)-Tree with a high order will always be relatively shallow (height of 3-5) which reduces number of disk accesses. For range queries, you can read consecutively from memory while in binary trees you jump around a lot. Searching in a node may take a bit longer, but practically speaking you are limited by memory accesses not CPU time anyway.
So the question remains: what order should you use? Usually, the node size is chosen to be equal to the page size (4-64 KB), as optimizing for disk accesses is paramount. The page size is the minimal consecutive chunk of memory your computer loads from disk into main memory. Depending on the size of your keys, this results in a different number of elements per node; a rough back-of-the-envelope calculation is sketched below.
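For example, a minimal back-of-the-envelope calculation (with assumed, illustrative sizes: 4 KB pages, 8-byte keys and 8-byte child pointers) looks like this:

import math

page_size = 4096      # bytes per node (one disk page)
key_size = 8          # e.g. a 64-bit integer key
pointer_size = 8      # child pointer / page id

# An internal node with b children stores b pointers and b - 1 keys:
#   b * pointer_size + (b - 1) * key_size <= page_size
order = (page_size + key_size) // (pointer_size + key_size)

n = 10**9                                   # one billion keys
height = math.ceil(math.log(n, order))

print(order, height)                        # -> 256 and 4 with these numbers

With these numbers you get a branching factor of roughly 256 and a height of about 4 for a billion keys, which matches the 3-5 levels mentioned above.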
For some help for the implementation, just look at how B+-Trees are implemented in database systems.

OpenMP with MPI - accessing array values which are available only to the Master process

Say I have an array which is initialized in the Master process (rank=0) and contains random integers.
I want a Slave process (rank=1) to sum all of the array's elements, while the full array is only available to the Master process (meaning I can't just MPI_Send the full array to the slave).
I know I can use schedule in order to divide the work between multiple threads, but I'm not sure how to do it without sending the whole array to the Slave process.
Also, I've been checking different clauses while trying to solve the problem and came across REDUCTION, but I'm not sure exactly how it works.
Thanks!
What you want to do is indeed a reduction with sum as the operation. Here is how a reduction works: You have a collection of items and an operation you wish to perform that reduces them to a single item. For example, you want to sum every element in an array and end with a single number that is their sum.
To do this efficiently you divide your collection into equal sized chunks and distribute them to each participating process. Each process applies the operation to the elements in the collection until the process has a single value. In our running example, each process adds together its chunk of the array. Then half the processes send their results to another node which then applies the operation to the value it computed and the value it received. At this point only half the original processes are participating. We repeat this until one process has the final result.
Here is a link to a graphic that should make this a lot easier to understand: http://3.bp.blogspot.com/-ybPe3bJrpgc/UzCoG9BUFuI/AAAAAAAAB2U/Jz6UcwV_Urk/s1600/TreeStructure.JPG
Here is some MPI code for a reduction: https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_array.c
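As a concrete (hedged) illustration, here is roughly what that looks like with mpi4py; the array contents, the chunking with np.array_split, and the use of the lowercase scatter/reduce convenience methods are assumptions, not code from the question:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    data = np.random.randint(0, 100, size=1_000_000)   # full array exists only on the master
    chunks = np.array_split(data, size)                 # one chunk per rank
else:
    chunks = None

local = comm.scatter(chunks, root=0)    # each rank (master included) receives only its chunk
local_sum = int(local.sum())            # partial sum, computed locally

total = comm.reduce(local_sum, op=MPI.SUM, root=0)      # tree reduction; result lands on rank 0
if rank == 0:
    print("sum =", total)

The point is that the full array is never sent to any single slave: each process only ever sees its own chunk, and the reduce combines the partial sums in the tree pattern shown in the linked graphic.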

External Sorting with a heap?

I have a file with a large amount of data, and I want to sort it holding only a fraction of the data in memory at any given time.
I've noticed that merge sort is popular for external sorting, but I'm wondering if it can be done with a heap (min or max). Basically my goal is to get the top (using arbitrary numbers) 10 items in a 100 item list while never holding more than 10 items in memory.
I mostly understand heaps, and understand that heapifying the data would put it in the appropriate order, from which I could just take the last fraction of it as my solution, but I can't figure out how to do this without an I/O for every freakin' item.
Ideas?
Thanks! :D
Using a heapsort requires lots of seek operations in the file for creating the heap initially and also when removing the top element. For that reason, it's not a good idea.
However, you can use a variation of mergesort where every heap element is a sorted list. The size of the lists is determined by how much you want to keep in memory. You create these lists from the input file by loading chunks of data, sorting them, and then writing them to temporary files. Then you treat every file as one list, read the first element of each, and create a heap from them. When removing the top element, you remove it from its list and restore the heap condition if necessary. A sketch of this is shown below.
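Here is a minimal Python sketch of that scheme under the assumption of one integer per line; the function name external_sort and the run size are made up for illustration, and heapq.merge performs exactly the 'heap of sorted lists' step described above:

import heapq
import itertools
import tempfile

def external_sort(infile, outfile, run_size=100_000):
    # Phase 1: cut the input into runs that fit in memory, sort each run,
    # and write it to its own temporary file.
    runs = []
    while True:
        chunk = [int(line) for line in itertools.islice(infile, run_size)]
        if not chunk:
            break
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{x}\n" for x in chunk)
        run.seek(0)
        runs.append(run)

    # Phase 2: k-way merge of the sorted runs; only one buffered line per run is in memory.
    outfile.writelines(heapq.merge(*runs, key=int))

Called as external_sort(open("input.txt"), open("sorted.txt", "w")), it never holds more than run_size integers plus one buffered line per run in memory.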
There is one aspect though that makes these facts about sorting irrelevant: You say you want to determine the top 10 elements. For that, you could indeed use an in-memory heap. Just take an element from the file, push it onto the heap and if the size of the heap exceeds 10, remove the lowest element. To make it more efficient, only push it onto the heap if the size is below 10 or it is above the lowest element, which you then replace and re-heapify. Keeping the top ten in a heap allows you to only scan through the file once, everything else will be done in-memory. Using a binary tree instead of a heap would also work and probably be similarly fast, for a small number like 10, you could even use an array and bubblesort the elements in place.
Note: I'm assuming that 10 and 100 were just examples. If your numbers are really that low, any discussion about efficiency is probably moot, unless you're doing this operation several times per second.
Yes, you can use a heap to find the top-k or bottom-k items in a large file, holding only the heap plus an I/O buffer in memory.
The following obtains the k smallest items by making use of a max-heap of size k (finding the k largest works symmetrically with a min-heap). You could read the file sequentially, doing an I/O for every item, but it will generally be much faster to load the data in blocks into an auxiliary buffer of length b. The method runs in O(n*log(k)) operations using O(k + b) space.
import heapq

k = 10                                   # how many of the smallest items to keep
heap = []                                # max-heap of the k smallest items (values negated for heapq)

with open("data.txt") as f:              # placeholder file name; assumes one integer per line
    while True:
        block = f.readlines(64 * 1024)   # read the file in blocks of roughly 64 KiB
        if not block:
            break
        for line in block:
            item = int(line)
            if len(heap) < k:
                heapq.heappush(heap, -item)
            elif item < -heap[0]:        # smaller than the largest item currently kept
                heapq.heapreplace(heap, -item)

smallest_k = sorted(-x for x in heap)    # the k smallest items, in ascending order
Heaps require lots of nonsequential access. Mergesort is great for external sorting because it does a whole lot of sequential access.
Sequential access is a hell of a lot faster on the kinds of disks that spin because the head doesn't need to move. Sequential access will probably also be a hell of a lot faster on solid-state disks than heapsort's access because they do accesses in blocks that are probably considerably larger than a single thing in your file.
By using merge sort and passing the two values by reference, you only have to hold the two comparison values in a buffer, and move through the array until it is sorted in place.
