How to break a nested Python for-loop into many individual jobs and then run them on HTCondor?

I have a nested for loop with this kind of logic:
As, Bs, Cs = [...], [...], [...]
for a in As:
    for b in Bs:
        for c in Cs:
            result = function(a, b, c)
I want to break this into many HTCondor jobs, where each job takes one (a, b, c) combination, until every combination has been processed (which is what the nested for loop does). How can I do this with HTCondor?
I have tried writing different job scripts, but they all just complete immediately with no output. I have also tried using HTMap, but that gives a pile of Docker errors.
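One way to split this up (a sketch under assumptions: make_params.py, job.py, and params.txt are hypothetical names, and function stands for the real function from your loop) is to enumerate every (a, b, c) combination up front with itertools.product, write one combination per line to a parameter file, and let HTCondor queue one job per line.
# make_params.py -- hypothetical helper: writes one line per (a, b, c) combination
import itertools

As, Bs, Cs = [...], [...], [...]            # fill in your real values
with open("params.txt", "w") as f:
    for a, b, c in itertools.product(As, Bs, Cs):
        f.write(f"{a}, {b}, {c}\n")         # one combination per line, comma-separated

#!/usr/bin/env python3
# job.py -- hypothetical worker: each HTCondor job processes one combination
import sys

a, b, c = sys.argv[1:4]                     # note: these arrive as strings, convert as needed
result = function(a, b, c)                  # the function from the question
print(result)                               # ends up in the job's output file
A submit description along these lines (names illustrative; check the queue ... from syntax and any file-transfer settings against your HTCondor setup) then starts one job per combination:
executable = job.py
arguments  = $(a) $(b) $(c)
output     = out.$(Process)
error      = err.$(Process)
log        = jobs.log
queue a, b, c from params.txt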

Related

Pylint explanation of R1712

I'm getting this error when using pylint on my project
consider-swap-variables (R1712):
Consider using tuple unpacking for swapping variables. You do not have to use a temporary variable in order to swap variables. Using "tuple unpacking" to directly swap variables makes the intention more clear.
and my code is
init_acc_src = acc_src
Can someone explain how it should be done correctly according to pylint?
It looks like you are swapping variables here; we would probably need to see more than one line to be sure.
I've created a dummy example:
a = 5
b = 7
c = a
a = b
b = c
which also raises the warning on line 3 (c = a):
dummy_swap.py:3:0: R1712: Consider using tuple unpacking for swapping variables (consider-swap-variables)
The recommended way of swapping variables in Python is the much shorter
a = 5
b = 7
a, b = b, a

How can I pass multiple parameters to a parallel operation in Octave?

I wrote a function that acts on each combination of columns in an input matrix. It uses multiple for loops and is very slow, so I am trying to parallelize it to use the maximum number of threads on my computer.
I am having difficulty finding the correct syntax to set this up. I'm using the Parallel package in Octave, and have tried several ways to set up the calls. Here are two of them, in a simplified form, as well as a non-parallel version that I believe works:
function A = parallelExample(M)
pkg load parallel;
# Get total count of columns
ct = columns(M);
# Generate column pairs
I = nchoosek([1:ct],2);
ops = rows(I);
slice = ones(1, ops);
Ic = mat2cell(I, slice, 2);
## # Non-parallel
## A = zeros(1, ops);
## for i = 1:ops
## A(i) = cmbtest(Ic{i}, M);
## endfor
# Parallelized call v1
A = parcellfun(nproc, @cmbtest, Ic, {M});
## # Parallelized call v2
## afun = @(x) cmbtest(x, M);
## A = parcellfun(nproc, afun, Ic);
endfunction
# function to apply
function P = cmbtest(indices, matrix)
colset = matrix(:,indices);
product = colset(:,1) .* colset(:,2);
P = sum(product);
endfunction
For both of these examples I generate every combination of two columns and convert those pairs into a cell array that the parcellfun function should split up. In the first, I attempt to convert the input matrix M into a 1x1 cell array so it goes to each parallel instance in the same form. I get the error 'C must be a cell array' but this must be internal to the parcellfun function. In the second, I attempt to define an anonymous function that includes the matrix. The error I get here specifies that 'cmbtest' is undefined.
(Naturally, the actual function I'm trying to apply is far more complex than cmbtest here)
Other things I have tried:
Put M into a global variable so it doesn't need to be passed. It seemed to be impossible to put a global variable in a function file, though I may just be having syntax issues.
Make cmbtest a nested function so it can access M (parcellfun doesn't support that)
I'm out of ideas at this point and could use help figuring out how to get this to work.
Converting my comments above to an answer.
When performing parallel operations, it is useful to think of each parallel worker that will result as separate and independent octave instances, which need to have appropriate access to all functions and variables they will require in order to do their independent work.
Therefore, do not rely on subfunctions when calling parcellfun from a main function, since this might lead to errors if the worker is unable to access the subfunction directly under the hood.
In this case, separating the subfunction into its own file fixed the problem.
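A minimal sketch of that fix (file names illustrative): cmbtest goes into its own function file so every parallel worker can find it, and M is captured in an anonymous function, as in the second attempt above.
# cmbtest.m -- now its own function file, visible to every parallel worker
function P = cmbtest(indices, matrix)
  colset = matrix(:, indices);
  product = colset(:, 1) .* colset(:, 2);
  P = sum(product);
endfunction

# in parallelExample.m
afun = @(x) cmbtest(x, M);       # capture M once; each worker only receives a pair of indices
A = parcellfun(nproc, afun, Ic);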

Not getting full result when output is on one line

I'm new to Prolog, and I am trying some manipulations on graphs.
I have a problem in my implementation, and since it is very long and complicated to explain, I will give a simple and similar problem.
Let's say we have the following graph:
edge(a,e).
edge(e,d).
edge(d,c).
edge(c,b).
edge(b,a).
edge(d,a).
edge(e,c).
edge(f,b).
I want to make this graph bidirectional. I use the following code:
graph(Graph) :-
    findall(A-B, edge(A,B), L),
    findall(B-A, edge(A,B), L1),
    append(L, L1, Graph).
When executing the query I get this result:
?- graph(Graph).
Graph = [a-e, b-a, c-b, d-a, d-c, e-c, e-d, f-b, ... - ...|...].
My problem is not in the code but in the results: as you can see, I don't get the complete result; it always shows me only 8 edges and the rest are not displayed.
How can I solve this?
graph(Graph), writeln(Graph).
writeln/1 writes the entire term out to the output.
From @WillemVanOnsem:
If you write graph(G) ; true. then the toplevel will pause after the first answer. You can then hit w and it will write the answer again, but now in full.
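At the toplevel that looks like this (the truncated answer is shown first; pressing w then reprints it in full):
?- graph(G) ; true.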

Hadoop Pig UDF invocation issue

The following code works quite well, but when I already have two existing bags (with aliases, say S1 and S2, representing two sets), how do I call the setDifference UDF to generate the set difference? I think that if I have to manually construct an additional bag from my existing input bags (S1 and S2), that would be extra overhead.
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
F1 = foreach A generate B1;
F2 = foreach A generate B2;
differenced = FOREACH A {
    -- input bags must be sorted
    sorted_b1 = ORDER B1 by val;
    sorted_b2 = ORDER B2 by val;
    GENERATE setDifference(sorted_b1,sorted_b2);
}
-- produces: ({(2),(4),(6),(7)})
DUMP differenced;
Update:
The question is, supposing I already have two bags, how do I call the setDifference UDF to get the set difference? Do I need to build another super-bag which contains the two separate bags?
Thanks in advance,
Lin
I don't see any overhead issue with the UDF invocation.
Ref: http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html, where there is an example of using the SetDifference method.
As per the API (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html), the SetDifference method takes bags as input and emits the difference between them.
N.B. Do note that the input bags have to be sorted.
In the example snippet shared, I don't see the need for the code snippet below:
F1 = foreach A generate B1;
F2 = foreach A generate B2;
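In other words, the working script reduces to the example above minus the unused F1/F2 lines:
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
differenced = FOREACH A {
    -- input bags must be sorted
    sorted_b1 = ORDER B1 by val;
    sorted_b2 = ORDER B2 by val;
    GENERATE setDifference(sorted_b1,sorted_b2);
}
DUMP differenced;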

Hadoop Pig - Optimizing Word Count

In the canonical Pig word-count example, I'm curious how folks approach optimizing the case where grouping by word could result in a bag with many (many) elements.
For example:
A = load 'input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
In line C, if there is a word, let's say "the", that occurs 1 billion times in the input file, this can result in the reducer hanging for a very long time while processing. What can be done to optimize this?
In any case, Pig will assess whether a combiner can be used and will add one if so.
In the case of your example, it will introduce a combiner, which will reduce the number of key-value pairs per word to a few, or only one in the best case. So on the reducer side you will not end up with a huge number of values for a given word.
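If you want to check this on your own script, one quick sketch (the exact plan layout may differ between Pig versions) is to ask Pig for the execution plan of the final relation; when a combiner applies, the MapReduce plan includes a combine stage:
EXPLAIN D;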
