MPI in parallel programming - parallel-processing

I have a project that I haven't been able to solve.
Computing the sum of the elements of an array
The objective of the problem is to write a serial, then a parallel program that takes as input an array of integers, stored in an ASCII file with one integer per line, and prints out the sum of the elements in the file. For instance:
% cat input
1
24
9
% my_program input
34
Write a serial program to solve this problem and name the source code sum-serial.c. Then write an MPI implementation of the above program in which a master process reads in the entire input file and then dispatches pieces of it to workers, with these pieces being as equal in size as possible. The master must also perform computation. Each process computes a local sum, and the results are then collected and aggregated by the master. This implementation should not use any collective communications, only point-to-point.
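For illustration, here is a minimal sketch of what the point-to-point MPI version could look like; the serial program is essentially the master's read-and-sum loop on its own. The file name sum-mpi.c, the use of long for the running total, and the message tags are illustrative choices, not part of the assignment.

/* sum-mpi.c (name assumed): the master reads the whole file, dispatches
   nearly equal pieces to the workers using only point-to-point calls,
   keeps one piece for itself, and aggregates the partial sums. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long total = 0;
    if (rank == 0) {
        /* Master: read all integers from the ASCII file (one per line). */
        FILE *f = fopen(argv[1], "r");
        int cap = 1024, n = 0;
        int *data = malloc(cap * sizeof(int));
        while (fscanf(f, "%d", &data[n]) == 1)
            if (++n == cap) data = realloc(data, (cap *= 2) * sizeof(int));
        fclose(f);

        /* Pieces differ in size by at most one element. */
        int base = n / size, rem = n % size;
        int my_count = base + (rem > 0);
        int offset = my_count;
        for (int p = 1; p < size; p++) {
            int count = base + (p < rem);
            MPI_Send(&count, 1, MPI_INT, p, 0, MPI_COMM_WORLD);
            MPI_Send(data + offset, count, MPI_INT, p, 1, MPI_COMM_WORLD);
            offset += count;
        }

        /* The master also computes a local sum, then collects the rest. */
        for (int i = 0; i < my_count; i++) total += data[i];
        for (int p = 1; p < size; p++) {
            long partial;
            MPI_Recv(&partial, 1, MPI_LONG, p, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("%ld\n", total);
        free(data);
    } else {
        /* Worker: receive a piece, sum it, send the partial sum back. */
        int count;
        MPI_Recv(&count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int *chunk = malloc(count * sizeof(int));
        MPI_Recv(chunk, count, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        long partial = 0;
        for (int i = 0; i < count; i++) partial += chunk[i];
        MPI_Send(&partial, 1, MPI_LONG, 0, 2, MPI_COMM_WORLD);
        free(chunk);
    }

    MPI_Finalize();
    return 0;
}

Compile with mpicc sum-mpi.c -o sum-mpi and run with, for example, mpirun -np 4 ./sum-mpi input.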

Related

Parallel Programming Vector Addition

Is vector addition processed sequentially faster than in parallel because of overhead in MPI? I have used MPI to scatter two arrays, process a certain number of vector pairs locally on each slave, and then perform a gather to send all values back to the master.
Yes, this is totally expected. Vector addition is dominated by the cost of reading and writing the values from memory. An addition is orders of magnitude faster than reading or writing one element from memory, so attempting to scatter/add/gather is futile as a way to improve performance. To gain performance from scatter/gather you must either perform a very expensive operation on each data element or use each data element multiple times.
In an idiomatic MPI program, the vectors should exist distributed in the first place.
Edit: The same holds true for vector/matrix multiplication, given that each element of the matrix is accessed only once.
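For reference, a sketch of the scatter / local add / gather pattern being discussed. The vector length N, the use of double, and the assumption that N divides evenly by the number of processes are all illustrative.

/* vec-add.c (illustrative): the scatter / local add / gather pattern.
   Assumes N is divisible by the number of processes. */
#include <mpi.h>
#include <stdlib.h>

#define N 1048576

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *a = NULL, *b = NULL, *c = NULL;
    if (rank == 0) {              /* only the root holds the full vectors */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    /* The scatter and gather move every element across the network,
       while the local loop does only one addition per element, which
       is why this pattern cannot beat the serial version here. */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++) lc[i] = la[i] + lb[i];
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(la); free(lb); free(lc);
    if (rank == 0) { free(a); free(b); free(c); }
    MPI_Finalize();
    return 0;
}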

Achieving interactive large-dataset map-reduce on AWS/GCE in the least lines of code / script

I have 1 billion rows of data (about 400GB uncompressed; about 40GB compressed) that I would like to process in map-reduce style, and I have two executables (binaries, not scripts) that can handle the "map" and "reduce" steps. The "map" step can process about 10,000 rows per second, per core, and its output is approximately 1MB in size, regardless of the size of its input. The "reduce" step can process about 50MB / second (excluding IO latency).
Assume that I can pre-process the data once, to do whatever I'd like such as compress it, break it into pieces, etc. For simplicity, assume input is plain text and each row terminates with a newline and each newline is a row terminator.
Once that one-time pre-processing is complete, the goal is to be able to execute a request within 30 seconds. So, if my only bottleneck is the map job (which I don't know will really be true; it could very well be the IO), and assuming I can do all the reduce jobs in under 5 seconds, then I would need about 425 8-core computers, all processing different parts of the input data, to complete the run in time.
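For reference, the arithmetic behind that estimate under the stated throughput assumptions: 1 billion rows at 10,000 rows per second per core is 100,000 core-seconds of map work; fitting that into roughly 30 seconds takes about 3,300-3,400 cores, i.e. on the order of 425 8-core machines.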
Assuming you have the data, and the two map/reduce executables, and you have unlimited access to AWS or GCE, what is a solution to this problem that I can implement with the fewest lines of code and/or script (and not ignoring potential IO or other non-CPU bottlenecks)?
(As an aside, it would also be interesting to know what would execute with the fewest nodes, if that differs from the solution with the fewest SLOC.)

OpenMP with MPI - accessing array values which are available only to the Master process

Say I have an array which is initialized in the Master process (rank=0) and contains random integers.
I want to sum all of the array's elements using a Slave process (rank=1), even though the full array is only available to the Master process (meaning I can't just MPI_SEND the full array to the slave).
I know I can use schedule in order to divide the work between multiple threads, but I'm not sure how to do it without sending the whole array to the Slave process.
Also, I've been checking different clauses while trying to solve the problem and came across REDUCTION, but I'm not sure exactly how it works.
Thanks!
What you want to do is indeed a reduction with sum as the operation. Here is how a reduction works: You have a collection of items and an operation you wish to perform that reduces them to a single item. For example, you want to sum every element in an array and end with a single number that is their sum.
To do this efficiently you divide your collection into equal sized chunks and distribute them to each participating process. Each process applies the operation to the elements in the collection until the process has a single value. In our running example, each process adds together its chunk of the array. Then half the processes send their results to another node which then applies the operation to the value it computed and the value it received. At this point only half the original processes are participating. We repeat this until one process has the final result.
Here is a link to a graphic that should make this a lot easier to understand: http://3.bp.blogspot.com/-ybPe3bJrpgc/UzCoG9BUFuI/AAAAAAAAB2U/Jz6UcwV_Urk/s1600/TreeStructure.JPG
Here is some MPI code for a reduction: https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_array.c
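To make that concrete, here is a minimal sketch that uses MPI_Reduce, which performs the tree-shaped combination described above; the local chunk contents are placeholders.

/* Sketch: every rank holds its own chunk; partial sums are combined
   into a single total on rank 0. MPI_Reduce performs the tree-shaped
   combination described above. The chunk contents are placeholders. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* In a real program this would be the piece of the array that
       this rank is responsible for. */
    int chunk[4] = { rank, rank + 1, rank + 2, rank + 3 };
    long local = 0, total = 0;
    for (int i = 0; i < 4; i++) local += chunk[i];

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}

The OpenMP reduction(+:...) clause applies the same idea across the threads within a single process.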

Merge sorted files effectively?

I have n files, 50 <= n <= 100, that contain sorted integers; all of them are the same size, 250MB or 500MB.
e.g.
1st file: 3, 67, 123, 134, 200, ...
2nd file: 1, 12, 33, 37, 94, ...
3rd file: 11, 18, 21, 22, 1000, ...
I am running this on a 4-core machine and the goal is to merge the files as quickly as possible.
Since the total size can reach 50GB I can't read them into RAM.
So far I tried to do the following:
1) Read a number from every file, and store them in an array.
2) Find the lowest number.
3) Write that number to the output.
4) Read the next number from the file the lowest number came from (if that file is not empty).
Repeat steps 2-4 until there are no numbers left.
Reading and writing is done using buffers of 4MB.
My algorithm above works correctly, but it's not performing as fast as I want. The biggest issue is that it performs much worse with 100 files x 250MB than with 50 files x 500MB.
What is the most efficient merge algorithm in my case?
Well, you can first significantly improve efficiency by doing step (2) of your algorithm more smartly. Instead of doing a linear search over all the numbers, use a min-heap: insertion and deletion of the minimal value from the heap take logarithmic time, which improves the speed for a large number of files. This changes the time complexity to O(n log k) from the naive O(n*k), where n is the total number of elements and k is the number of files.
In addition, you need to minimize the number of "random" reads from the files, because a few big sequential reads are much faster than many small random reads. You can do that by increasing the buffer size, for example (the same goes for writing).
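A sketch of the heap-based merge in C, assuming for simplicity that the integers are stored as text and read with fscanf; buffer handling is left to stdio here, whereas real code would use larger explicit buffers (for example via setvbuf):

/* Heap-based k-way merge: one (value, file index) pair per input file
   lives in a min-heap; pop the minimum, write it, and refill from the
   same file. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { long value; int file; } Item;

static void sift_down(Item *h, int n, int i)
{
    for (;;) {
        int small = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].value < h[small].value) small = l;
        if (r < n && h[r].value < h[small].value) small = r;
        if (small == i) break;
        Item t = h[i]; h[i] = h[small]; h[small] = t;
        i = small;
    }
}

static void merge_files(FILE **in, int k, FILE *out)
{
    Item *heap = malloc(k * sizeof(Item));
    int n = 0;
    for (int i = 0; i < k; i++)               /* initial fill: one number per file */
        if (fscanf(in[i], "%ld", &heap[n].value) == 1) heap[n++].file = i;
    for (int i = n / 2 - 1; i >= 0; i--) sift_down(heap, n, i);

    while (n > 0) {
        fprintf(out, "%ld\n", heap[0].value); /* emit the current minimum */
        if (fscanf(in[heap[0].file], "%ld", &heap[0].value) != 1)
            heap[0] = heap[--n];              /* that file is exhausted */
        sift_down(heap, n, 0);                /* restore the heap: O(log k) */
    }
    free(heap);
}

int main(int argc, char *argv[])
{
    /* Usage: ./merge output.txt input1.txt input2.txt ... */
    int k = argc - 2;
    FILE **in = malloc(k * sizeof(FILE *));
    for (int i = 0; i < k; i++) in[i] = fopen(argv[i + 2], "r");
    FILE *out = fopen(argv[1], "w");
    merge_files(in, k, out);
    fclose(out);
    for (int i = 0; i < k; i++) fclose(in[i]);
    free(in);
    return 0;
}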
(Java) Use GZipInputStream and GZipOutputStream for the .gz compression. Maybe that will help with memory usage to some extent. Use fast instead of high compression.
Disk-head movement between the several files should then be reduced, for example by merging files two at a time into progressively larger sorted sequences.
For repetitions, maybe use run-length encoding: instead of repeating a value, add a repetition count, e.g. 11 12 13#7 15.
An effective way to utilize the multiple cores might be to perform input and output in distinct threads from the main comparison thread, in such a way that all the cores are kept busy and the main thread never unnecessarily blocks on input or output. One thread performing the core comparison, one writing the output, and NumCores-2 processing input (each from a subset of the input files) to keep the main thread fed.
The input and output threads could also perform stream-specific pre- and post-processing - for example, depending on the distribution of the input data, a run-length encoding scheme of the type alluded to by @Joop might provide a significant speedup of the main thread by allowing it to efficiently order entire ranges of input.
Naturally all of this increases complexity and the possibility of error.
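As an illustration of one piece of that design, here is a sketch (under assumed names and sizes) of a bounded queue that lets a dedicated reader thread pre-fetch numbers from one input file so that the consuming thread rarely blocks on disk; in practice there would be one such queue and reader per input file.

/* Bounded producer/consumer queue between a reader thread and the
   merge thread. Capacity and identifiers are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QCAP 4096

typedef struct {
    long buf[QCAP];
    int head, tail, count, done;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} Queue;

typedef struct { FILE *f; Queue *q; } ReaderCtx;

static void queue_init(Queue *q)
{
    q->head = q->tail = q->count = q->done = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

static void queue_push(Queue *q, long v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP) pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(Queue *q, long *v)       /* returns 0 when drained */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done) pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) { pthread_mutex_unlock(&q->lock); return 0; }
    *v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return 1;
}

static void *reader_thread(void *arg)         /* producer: file -> queue */
{
    ReaderCtx *ctx = arg;
    long v;
    while (fscanf(ctx->f, "%ld", &v) == 1) queue_push(ctx->q, v);
    pthread_mutex_lock(&ctx->q->lock);
    ctx->q->done = 1;                         /* signal end of this stream */
    pthread_cond_broadcast(&ctx->q->not_empty);
    pthread_mutex_unlock(&ctx->q->lock);
    return NULL;
}

int main(int argc, char *argv[])
{
    /* Demo: one reader thread feeding the main thread from one file. */
    Queue q;
    queue_init(&q);
    ReaderCtx ctx = { fopen(argv[1], "r"), &q };
    pthread_t tid;
    pthread_create(&tid, NULL, reader_thread, &ctx);

    long v;
    long long sum = 0;
    while (queue_pop(&q, &v)) sum += v;       /* stand-in for the merge loop */
    printf("checksum = %lld\n", sum);

    pthread_join(tid, NULL);
    fclose(ctx.f);
    return 0;
}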

Controlling number of lines to be written to the output file

I am new to Hadoop programming.
I have a situation in which I want to stop writing <k3,v3> to my output file after n lines.
In my program, I am sure that the output file will be sorted according to k3, but I don't want the entire list. I only want the first n.
Is there a mechanism in Hadoop to do this?
I couldn't find a class/API for this.
But you could increment a counter each time OutputCollector.collect() is called in the reduce function. When the counter reaches a certain value, stop calling OutputCollector.collect().
It's a waste of CPU cycles because the reduce task keeps on running even after n lines have been written to the output. There might be a better approach to the problem.
