How to parallelize a column loop in a 2D array in OpenMP

I am trying to parallelize the column loop in the following code with OpenMP, but each thread prints the whole array individually. I just want to split the column loop equally among the threads.
for (i = 1; i <= 5; i++)
{
    #pragma omp parallel for
    for (j = 1; j <= 5; j++)
    {
        int tid = omp_get_thread_num();  /* id of the thread handling this column */
        printf("arr[%d][%d]=thread %d\n", i, j, tid);
    }
}
My expected output is
arr[1][1]=thread 0
arr[1][2]=thread 1
arr[1][3]=thread 2
arr[1][4]=thread 3
arr[1][5]=thread 0
arr[2][1]=thread 0
......
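One way to get output in that layout (a minimal sketch, not part of the original question; it assumes the point is simply to print which thread handles each column, and it uses an ordered clause only so the lines come out in index order):
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int i, j;
    for (i = 1; i <= 5; i++) {
        /* the columns of each row are split among the threads;
           schedule(static, 1) hands out one column per thread, round-robin */
        #pragma omp parallel for ordered schedule(static, 1)
        for (j = 1; j <= 5; j++) {
            #pragma omp ordered
            printf("arr[%d][%d]=thread %d\n", i, j, omp_get_thread_num());
        }
    }
    return 0;
}
Run with 4 threads (for example OMP_NUM_THREADS=4), this prints arr[1][1]=thread 0, arr[1][2]=thread 1, and so on, matching the expected output above; without the ordered clause the columns are still split the same way, but the printed lines may interleave.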

Related

Ignoring non-integers in an unknown length of data input

I am new to the C language and seeking help in understanding my mistake.
I want to write a program that counts two-digit numbers in a line of mixed integers and characters; for example, in " 21c sdhhj 32 fhddhf234 45" there are 3 two-digit numbers. I set termination conditions for my loop (a failed scanf of %d, or EOF), yet I still get an infinite loop. I understand that a failed scanf of an integer should return 0, or -1 at EOF, so why do I get an infinite loop? Thank you in advance! :)
void read(int blue[], int red[], int couple[])
{
    int vote = 0, rcount = 0, bcount = 0;
    int ok = -2;
    while (ok != EOF)
    {
        ok = scanf("%d", &vote);
        if (ok == 0)
            continue;
        if (vote < TOTAL && vote > 0)
        {
            rcount = vote % 10;
            bcount = vote / 10;
            if (rcount == bcount)
                continue;
            couple[vote]++;
            red[rcount]++;
            blue[bcount]++;
        }
        ok = 0;
    }
}
I want to scan and store the numbers, as long as they are smaller than TOTAL (99), until the input is over.
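Two observations explain the infinite loop: when scanf("%d") meets characters that cannot start a number it returns 0 and leaves those characters in the input, so the next scanf fails on exactly the same characters; and the ok = 0; at the bottom of the loop overwrites any EOF result before the while condition is tested, so the loop can never terminate. A minimal sketch of a counting loop that discards the offending token before retrying (illustrative only, not the poster's read() function):
#include <stdio.h>
int main(void)
{
    int vote;
    int count = 0;
    for (;;) {
        int ok = scanf("%d", &vote);
        if (ok == EOF)
            break;                      /* input is over */
        if (ok == 0) {
            /* scanf could not convert: discard one whitespace-delimited token
               so we do not retry on the same characters forever */
            if (scanf("%*s") == EOF)
                break;
            continue;
        }
        if (vote >= 10 && vote <= 99)   /* a two-digit number */
            count++;
    }
    printf("two-digit numbers: %d\n", count);
    return 0;
}
For the example input " 21c sdhhj 32 fhddhf234 45" this counts 21, 32 and 45, i.e. the 3 the question expects.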

Cyclomatic Complexity number - do I have to count every statement separately as a node?

I came across different ways of calculating CCN (according to the formula CCN = E - N + 2P).
One way is to count every statement in the code separately; the other is to treat a few lines of code as one step. Let's take the following example:
1 public class SumAndAverage {
2
3 public static void main (String[] args) {
4 int sum = 0;
5 double average = 0.0;
6 String message = "";
7
8 int num = Integer.parseInt(args[0]);
9
10 if ((num < 1) || (num > 100)) {
11 message = "Invalid number entered.";
12 } else {
13 for (int i = 1; i <= num; i++) {
14 sum += i;
15 }
16 average = (double) sum / num;
17 message = "The sum is " + sum + " and the average is " + average;
18 }
19 System.out.println(message);
20 }
21}
Counting every statement separately we'd get 12 - 11 + 2*1 = 3.
I was wondering: if I "join" lines 4, 5, 6, and 8 and count them as one step, and do the same with lines 16 and 17, would that be correct too? The result would be the same, since the number of edges also decreases: 8 - 7 + 2*1 = 3.
The right way to calculate complexity is by considering blocks of code. A block of code is a maximal run of statements in which the execution path cannot branch.
McCabe's paper mentions the following:
The tool, FLOW, was written in APL to input the source code from Fortran files on disk. FLOW would then break a Fortran job into distinct subroutines and analyze the control structure of each subroutine. It does this by breaking the Fortran subroutines into blocks that are delimited by statements that affect control flow: IF, GOTO, referenced LABELS, DO, etc.
For other information on complexity, also read through Cyclomatic complexity as a Quality measure
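One way to see why the per-statement and the per-block counts must agree (a short derivation, not part of the original answer): merging a straight-line run of k statements into a single block removes k - 1 nodes and the k - 1 edges that connected them, and leaves the number of connected components P unchanged, so
CCN' = E' - N' + 2P'
     = (E - (k - 1)) - (N - (k - 1)) + 2P
     = E - N + 2P
     = CCN
which is exactly why 12 - 11 + 2*1 and 8 - 7 + 2*1 both come out to 3 for the example above.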

OpenMP program freezing before starting loop?

I have a program I am trying to parallelize using OpenMP; it runs a very large loop over some data. Since incrementing a shared variable (so I can report progress as it goes) is somewhat of an issue, I thought I'd break the loop up into smaller chunks, loop over those chunks multiple times, and just report the status at the end of, and outside, the OpenMP loop.
The problem is that before the OpenMP for loop starts for the third time, the program locks up: it just sits there and does nothing. I've stripped out all but the simplest code. Here it is:
/* some other variable declarations for removed code above here */
int dbl = 0;
int lasttime = 0;
int seedbase = 0;
const char *pl = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const double mm = 62.0 / 2147483647.0;

for (dbl = 0; dbl < 2048 && !abort; dbl++) {
    seedbase = dbl; //(dbl * 2097152) - 2147483648;
    printf("Loop %d %d\n", dbl, abort);

    #pragma omp parallel for private(seed) shared(dbl)
    for (seed = 0; seed < 20971; seed++) { //52
        if (dbl == 2)
            printf("oo\n");
    }

    if (abort)
        break;

    lasttime = time();
    hps = (double)((dbl * 2097152) * clk_tck) / (double)((times(&tms) - start_time));
    printf("So far: %0.2fsec (%0.2fhps) %0.2f sec left\n",
           (double)(times(&tms) - start_time) / (double)clk_tck, hps,
           (((long)1 << 32) - (dbl * 2097152)) / hps);
}
}
When compiled and run, I get:
Loop 0 0
So far: 0.02sec (0.00hps) inf sec left
Loop 1 0
So far: 0.02sec (104857600.00hps) 40.94 sec left
Loop 2 0
^C
Loop 0 starts, and the openmp runs (and does nothing) then exits, and the "So far:" is printed.
Loop 1 starts, same thing.
Loop 2 starts, and everything hangs. The printf("oo"); never happens. If I change the line to be if(dbl <= 2) my screen fills with looped "oo"'s as the loop runs.
But before the seed loop ever happens the third time - it's dead. Just sits there chewing up CPU time doing nothing.
Can you not quickly re-enter an OpenMP loop? Is that the issue? I find it odd that it ALWAYS stops before the third run, regardless of how complex the code inside the seed loop is (I removed 200 lines of code and it had no effect).
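For reference, here is a self-contained sketch of the pattern the question describes: a serial outer loop over chunks, a parallel inner loop, and a progress report printed once per chunk outside the parallel region. The names and the dummy workload are illustrative and not taken from the original program; entering a new parallel region on every pass of the outer loop is allowed by OpenMP, and this sketch only shows the structure rather than diagnosing the hang.
#include <stdio.h>
#include <omp.h>
int main(void)
{
    const long chunks = 8;           /* outer, serial chunks                 */
    const long chunk_size = 1000000; /* inner, parallel iterations per chunk */
    double total = 0.0;
    for (long c = 0; c < chunks; c++) {
        double partial = 0.0;
        /* only the inner loop is parallel; the reduction avoids updating a
           shared counter from every iteration */
        #pragma omp parallel for reduction(+:partial)
        for (long i = 0; i < chunk_size; i++) {
            partial += (double)(c * chunk_size + i);  /* stand-in for real work */
        }
        total += partial;
        printf("chunk %ld of %ld done\n", c + 1, chunks);  /* serial progress report */
    }
    printf("total = %g\n", total);
    return 0;
}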

Using Grand Central Dispatch in Swift to parallelize and speed up "for" loops?

I am trying to wrap my head around how to use GCD to parallelize and speed up Monte Carlo simulations. Most or all of the simple examples are presented for Objective-C, and I really need a simple example for Swift, since Swift is my first “real” programming language.
A minimal working version of a Monte Carlo simulation in Swift would be something like this:
import Foundation
import Cocoa

var winner = 0
var j = 0
var i = 0
var chance = 0
var points = 0

for j = 1; j < 1000001; ++j {
    var ability = 500
    var player1points = 0
    for i = 1; i < 1000; ++i {
        chance = Int(arc4random_uniform(1001))
        if chance < (ability - points) { ++points }
        else { points = points - 1 }
    }
    if points > 0 { ++winner }
}
println(winner)
The code works when pasted directly into a command-line program project in Xcode 6.1.
The innermost loop cannot be parallelized because the new value of the variable “points” is used in the next iteration. But the outermost loop just runs the inner simulation 1,000,000 times and tallies up the results, so it should be an ideal candidate for parallelization.
So my question is how to use GCD to parallelize the outermost for loop?
A "multi-threaded iteration" can be done with dispatch_apply():
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
    for innerIdx in 1 ... innerCount {
        // ...
    }
}
(You have to figure out the best relation between outer and inner counts.)
There are two things to notice:
arc4random() uses an internal mutex, which makes it extremely slow when called from several threads in parallel; see Performance of concurrent code using dispatch_group_async is MUCH slower than single-threaded version. From the answers given there, rand_r() (with separate seeds for each thread) seems to be a faster alternative.
The result variable winner must not be modified from multiple threads simultaneously.
You can use an array instead where each thread updates its own element, and the results
are added afterwards. A thread-safe method has been described in https://stackoverflow.com/a/26790019/1187415.
Then it would roughly look like this:
let outerCount = 100    // # of concurrent block iterations
let innerCount = 10000  // # of iterations within each block
let the_queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
var winners = [Int](count: outerCount, repeatedValue: 0)
winners.withUnsafeMutableBufferPointer { winnersPtr -> Void in
    dispatch_apply(UInt(outerCount), the_queue) { outerIdx -> Void in
        var seed = arc4random() // seed for rand_r() in this "thread"
        for innerIdx in 1 ... innerCount {
            var points = 0
            var ability = 500
            for i in 1 ... 1000 {
                let chance = Int(rand_r(&seed) % 1001)
                if chance < (ability - points) { ++points }
                else { points = points - 1 }
            }
            if points > 0 {
                winnersPtr[Int(outerIdx)] += 1
            }
        }
    }
}
// Add results:
let winner = reduce(winners, 0, +)
println(winner)
Just to update this for contemporary syntax, we now use concurrentPerform (which replaces dispatch_apply).
So we can replace
for j in 0 ..< 1_000_000 {
    for i in 0 ..< 1000 {
        ...
    }
}
with

DispatchQueue.concurrentPerform(iterations: 1_000_000) { j in
    for i in 0 ..< 1000 {
        ...
    }
}
Note that parallelizing introduces a little overhead, both in the basic GCD dispatch mechanism and in the synchronization of the results. If you had 32 iterations in your parallel loop this would be inconsequential, but with a million iterations it starts to add up.
We generally solve this by “striding”: Rather than parallelizing 1 million iterations, you might only do 100 parallel iterations, doing 10,000 iterations each. E.g. something like:
let totalIterations = 1_000_000
let stride = 10_000
let (quotient, remainder) = totalIterations.quotientAndRemainder(dividingBy: stride)
let iterations = quotient + (remainder == 0 ? 0 : 1)
DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
    for j in iteration * stride ..< min(totalIterations, (iteration + 1) * stride) {
        for i in 0 ..< 1000 {
            ...
        }
    }
}

Composing VTK file from multiple MPI outputs

For a Lattice Boltzmann simulation of a lid-driven cavity (CFD) I'm decomposing my cubic domain into 8 (also cubic) subdomains, which are computed independently by 8 ranks. Each MPI rank produces a VTK file for each timestep, and since I'm using ParaView I want to visualize the whole thing as one cube. To be more specific about what I am trying to achieve:
I have a cube with side length 8 (number of elements in each direction) => 8x8x8 = 512 elements.
Each dimension is split between 2 ranks, i.e. every rank handles 4x4x4 = 64 elements.
Every rank writes its result to a file called lbm_out_<rank>.<timestep>.vts in the VTK StructuredGrid format.
I want to produce a .pvts file that collects the *.vts files and combines the subdomains into a single dataset that ParaView can treat as the whole domain.
Unfortunately I'm facing many issues with that, since I feel ParaView and VTK are poorly documented and the error messages from ParaView are totally useless.
I have the following *.pvts file, which includes a ghost layer:
<?xml version="1.0"?>
<VTKFile type="PStructuredGrid" version="0.1" byte_order="LittleEndian">
  <PStructuredGrid WholeExtent="0 7 0 7 0 7" GhostLevel="1">
    <PPoints>
      <PDataArray NumberOfComponents="3" type="Float32" />
    </PPoints>
    <Piece Extent="0 4 0 4 0 4" Source="lbm_out_0.0.vts"/>
    <Piece Extent="3 7 0 4 0 4" Source="lbm_out_1.0.vts"/>
    <Piece Extent="0 4 3 7 0 4" Source="lbm_out_2.0.vts"/>
    <Piece Extent="3 7 3 7 0 4" Source="lbm_out_3.0.vts"/>
    <Piece Extent="0 4 0 4 3 7" Source="lbm_out_4.0.vts"/>
    <Piece Extent="3 7 0 4 3 7" Source="lbm_out_5.0.vts"/>
    <Piece Extent="0 4 3 7 3 7" Source="lbm_out_6.0.vts"/>
    <Piece Extent="3 7 3 7 3 7" Source="lbm_out_7.0.vts"/>
  </PStructuredGrid>
</VTKFile>
With that file, which I feel should work correctly (note that there are no parameters yet, just plain geometry information), my domain ranges are totally messed up, although each *.vts file works fine on its own. I have a screenshot attached to make things clearer.
What may be the problem? Is it possible to use legacy VTK files for this task? Could I be doing something totally wrong? I really don't know how to accomplish this and the resources I can find on Google are very limited. Thank you.
Unfortunately there is no example for the vtkXMLPStructuredGridWriter class (it is listed under VTK Classes not used in the Examples), so I decided to write the simplest code that generates *.vts and .pvts files for a structured grid, very similar to the case you are looking for.
The following code uses MPI and VTK to write parallel structured grid files. In this example, two processes each create their own .vts file, and the vtkXMLPStructuredGridWriter class writes the .pvts file:
// MPI Library
#include <mpi.h>

// VTK Library
#include <vtkXMLPStructuredGridWriter.h>
#include <vtkStructuredGrid.h>
#include <vtkSmartPointer.h>
#include <vtkFloatArray.h>
#include <vtkCellData.h>
#include <vtkMPIController.h>
#include <vtkProgrammableFilter.h>
#include <vtkInformation.h>

struct Args {
    vtkProgrammableFilter* pf;
    int local_extent[6];
};

// function to operate on the point attribute data
void execute (void* arg) {
    Args* args = reinterpret_cast<Args*>(arg);
    auto info = args->pf->GetOutputInformation(0);
    auto output_tmp = args->pf->GetOutput();
    auto input_tmp = args->pf->GetInput();
    vtkStructuredGrid* output = dynamic_cast<vtkStructuredGrid*>(output_tmp);
    vtkStructuredGrid* input = dynamic_cast<vtkStructuredGrid*>(input_tmp);
    output->ShallowCopy(input);
    output->SetExtent(args->local_extent);
}

int main (int argc, char *argv[]) {
    MPI_Init (&argc, &argv);
    int myrank;
    MPI_Comm_rank (MPI_COMM_WORLD, &myrank);

    int lx {10}, ly{10}, lz{10}; // local dimensions of the process's grid
    int dims[3] = {lx+1, ly+1, lz+1};
    int global_extent[6] = {0, 2*lx, 0, ly, 0, lz};
    int local_extent[6] = {myrank*lx, (myrank+1)*lx, 0, ly, 0, lz};

    // Create and Initialize vtkMPIController
    auto contr = vtkSmartPointer<vtkMPIController>::New();
    contr->Initialize(&argc, &argv, 1);
    int nranks = contr->GetNumberOfProcesses();
    int rank = contr->GetLocalProcessId();

    // Create grid points, allocate memory and Insert them
    auto points = vtkSmartPointer<vtkPoints>::New();
    points->Allocate(dims[0]*dims[1]*dims[2]);
    for (int k=0; k<dims[2]; ++k)
        for (int j=0; j<dims[1]; ++j)
            for (int i=0; i<dims[0]; ++i)
                points->InsertPoint(i + j*dims[0] + k*dims[0]*dims[1],
                                    i+rank*(dims[0]-1), j, k);

    // Create a density field. Note that the number of cells is always less than
    // number of grid points by an amount of one so we use dims[i]-1
    auto density = vtkSmartPointer<vtkFloatArray>::New();
    density->SetNumberOfComponents(1);
    density->SetNumberOfTuples((dims[0]-1)*(dims[1]-1)*(dims[2]-1));
    density->SetName ("density");
    int index;
    for (int k=0; k<lz; ++k)
        for (int j=0; j<ly; ++j)
            for (int i=0; i<lx; ++i) {
                index = i + j*(dims[0]-1) + k*(dims[0]-1)*(dims[1]-1);
                density->SetValue(index, i+j+k);
            }

    // Create a vtkProgrammableFilter
    auto pf = vtkSmartPointer<vtkProgrammableFilter>::New();

    // Initialize an instance of Args
    Args args;
    args.pf = pf;
    for(int i=0; i<6; ++i) args.local_extent[i] = local_extent[i];
    pf->SetExecuteMethod(execute, &args);

    // Create a structured grid and assign point data and cell data to it
    auto structuredGrid = vtkSmartPointer<vtkStructuredGrid>::New();
    structuredGrid->SetExtent(global_extent);
    pf->SetInputData(structuredGrid);
    structuredGrid->SetPoints(points);
    structuredGrid->GetCellData()->AddArray(density);

    // Create the parallel writer and call some functions
    auto parallel_writer = vtkSmartPointer<vtkXMLPStructuredGridWriter>::New();
    parallel_writer->SetInputConnection(pf->GetOutputPort());
    parallel_writer->SetController(contr);
    parallel_writer->SetFileName("test.pvts");
    parallel_writer->SetNumberOfPieces(nranks);
    parallel_writer->SetStartPiece(rank);
    parallel_writer->SetEndPiece(rank);
    parallel_writer->SetDataModeToBinary();
    parallel_writer->Update();
    parallel_writer->Write();

    contr->Finalize();

    // WARNING: it seems that MPI_Finalize is not necessary since we are using
    // the Finalize() function from the vtkMPIController class. Uncomment the
    // following line and see what happens.
    // MPI_Finalize ();

    return 0;
}
This code only writes some data (in this case density, which is a scalar) into a file and does not render it. For visualization you will need software like ParaView.
To run this code, you can use this CMake file:
cmake_minimum_required(VERSION 2.8)
PROJECT(PXMLStructuredGridWriter)
add_executable(PXMLStructuredGridWriter parallel_vtk.cpp)
find_package(VTK REQUIRED)
include(${VTK_USE_FILE})
target_link_libraries(PXMLStructuredGridWriter ${VTK_LIBRARIES})
find_package(MPI REQUIRED)
include_directories(${MPI_INCLUDE_PATH})
target_link_libraries(PXMLStructuredGridWriter ${MPI_LIBRARIES})
At the end you will find an XML file like this in the same directory as the executable:
<?xml version="1.0"?>
<VTKFile type="PStructuredGrid" version="0.1" byte_order="LittleEndian" header_type="UInt32" compressor="vtkZLibDataCompressor">
  <PStructuredGrid WholeExtent="0 20 0 10 0 10" GhostLevel="2">
    <PCellData>
      <PDataArray type="Float32" Name="density"/>
    </PCellData>
    <PPoints>
      <PDataArray type="Float32" Name="Points" NumberOfComponents="3"/>
    </PPoints>
    <Piece Extent="0 10 0 10 0 10" Source="test_0.vts"/>
    <Piece Extent="10 20 0 10 0 10" Source="test_1.vts"/>
  </PStructuredGrid>
</VTKFile>
