Spark Streaming Computation Jobs Division into Different Nodes - cluster-computing

Suppose we have 20 nodes in our cluster. Operation1 counts the words in a time window of 1s, and operation2 sums up the results of operation1 over a time window of 60s (the result is of course the word count over a time window of 60s). Is there any way to specify that nodes 1-10 perform operation1 and nodes 11-20 perform operation2? Thanks!

Related

Optimal job interval algorithm

Let's say you have different jobs that you need to run on a regular basis (for example, you want to make API calls to different endpoints).
Let's say you need to hit two different endpoints and you want your calls to be as far away in time from each other as possible.
Example: You have two jobs, one is run once a minute, another is run twice a minute.
Solution: Start job A with interval of 60 seconds, wait 15 seconds, start job B with interval of 30 seconds.
This way the jobs will run at seconds: 0(job A), 15(job B), 45(job B), 60(job A), 75(job B), 105(job B), 120(job A), ... making the minimum interval between API calls 15 seconds while maintaining the call frequency that we need.
Can you think of an algorithm for these cases that will give optimal start times for each job so that the minimum time difference between calls is maximized? Ideally this algorithm could handle more than two jobs.
Assume we don't need to wait for the job to be finished to run it once again.
Thanks
Here is my solution if we allow the intervals to be slightly unequal.
Suppose that our calls are A[0], A[1], ..., A[n] with frequencies of f[0], f[1], ..., f[n] where the frequencies are all in the same unit. For example 60/hour, 120/hour, etc.
The total frequency with which events happen will be f = f[0] + f[1] + ... + f[n], which means that some event will be scheduled every hour/f time apart. The question is which one will happen when.
Imagine a row of buckets filling with water. At each step we dump a unit of water from the fullest bucket in front of us.
Since we don't actually care where we start, let's initialize a vector of fill levels fill[0], fill[1], ..., fill[n] with random numbers. Our algorithm then looks like this pseudocode:
Every hour/f time apart:
    for each i in 0..n:
        fill[i] += f[i]/f
    i_choice = (select i from 0..n with the largest fill[i])
    fill[i_choice] -= 1
    Do event A[i_choice]
This leads to events spaced as far apart as possible, but with repeating events happening in a slightly uneven rhythm. In your example that will lead to every 20 seconds doing events following the pattern ...ABBABBABBABB....
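The bucket pseudocode above could be sketched in Python roughly as follows. The function name bucket_schedule, the integer frequencies, and the choice to start every bucket at zero instead of a random level are assumptions made to keep the example reproducible; they are not part of the original answer.

```python
from fractions import Fraction


def bucket_schedule(freqs, steps, fill=None):
    """Return the indices of the events fired over `steps` ticks.

    freqs: integer call frequencies in a common unit (assumption of this sketch).
    fill:  optional starting levels; defaults to all zeros (assumption).
    """
    total = sum(freqs)
    if fill is None:
        fill = [Fraction(0)] * len(freqs)
    else:
        fill = [Fraction(x) for x in fill]
    order = []
    for _ in range(steps):
        # Every bucket gains its share of one unit of water per tick.
        for i, f in enumerate(freqs):
            fill[i] += Fraction(f, total)
        # Dump one unit from the fullest bucket and fire that event.
        choice = max(range(len(freqs)), key=lambda i: fill[i])
        fill[choice] -= 1
        order.append(choice)
    return order


# Job A once a minute, job B twice a minute: one tick every 60 / 3 = 20 seconds.
pattern = "".join("AB"[i] for i in bucket_schedule([1, 2], 6))
print(pattern)
```

With these starting levels the printed pattern is BABBAB, which is the same repeating cycle as ABB, just entered at a different point. Exact fractions are used so that ties are not decided by floating-point rounding.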

Greedy Algorithm: Assigning jobs to minimize cost

What is the best approach to find the minimum total cost when assigning n jobs, each with a cost, to one person in sequence? For example, I have 2 jobs with costs 4 and 5 that take 6 and 10 minutes respectively. The finish time of the second job is the finish time of the first job plus the time taken by the second job. The total cost is the sum over all jobs of each job's finish time multiplied by its cost.
If you have to assign n jobs to 1 person (or 1 machine, in scheduling literature terminology), you are looking to minimize weighted flow time. The problem is polynomially solvable.
The weighted shortest processing time sequence is optimal.
Sort and reindex jobs such that p_1/w_1 <= p_2/w_2 <= ... <= p_n/w_n,
where, p_i is the processing time of the ith job and w_i is its weight or cost.
Then, assign job 1 first, followed by 2 and so on until n.
If you look at what happens if you swap two adjacent jobs you will end up comparing terms like (A+c)m + (A+c+d)l and (A+d)l + (A+c+d)m, where A is the time consumed by earlier jobs, c and d are times, and l and m are costs. With some algebra and rearrangement you can see that the first version is smaller if c/m < d/l. So you could work out for each job the time taken by that job divided by its cost, and do first the jobs with the smallest time per unit cost. Sanity check: if you have a job that takes 10 years and has a cost of 1 cent, you want to do that last so that the 10 year wait doesn't get multiplied by any other costs.
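As a small illustration of the ratio rule, here is a Python sketch that sorts jobs by time per unit cost and compares the result against every possible ordering. The (time, cost) tuple layout and the names total_cost and best_order are assumptions of this example, and costs are assumed to be positive.

```python
from itertools import permutations


def total_cost(jobs):
    """Sum of finish time * cost for jobs given as (time, cost) in run order."""
    elapsed = 0
    total = 0
    for time, cost in jobs:
        elapsed += time          # finish time of this job
        total += elapsed * cost
    return total


def best_order(jobs):
    """Sort by time per unit cost, smallest ratio first (costs assumed > 0)."""
    return sorted(jobs, key=lambda job: job[0] / job[1])


# The two jobs from the question: 6 minutes at cost 4, 10 minutes at cost 5.
jobs = [(10, 5), (6, 4)]
order = best_order(jobs)
print(order, total_cost(order))

# Brute-force check over all orderings (only feasible for small inputs).
assert total_cost(order) == min(total_cost(p) for p in permutations(jobs))
```

For the question's example the 6-minute job goes first (ratio 1.5 versus 2), giving 6*4 + 16*5 = 104 instead of 10*5 + 16*4 = 114.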

is ordering processes in ascending run time, an optimal way to create a set of non overlapping processes?

There are n jobs in a set, each with a start time si and a finish time fi, for i = 1, ..., n.
I'm trying to figure out whether ordering the jobs by ascending start time, finish time, or interval length (fi - si) is optimal.
I said that ordering by ascending start time is not optimal, because the job that starts first may span the time in which 3 other jobs could be started and finished.
Next I said that ordering by ascending finish time is optimal, because right after a job is added, the next earliest-ending job is added, maximizing the number of jobs added to the non-overlapping jobs list.
However, I'm not sure whether ordering by fi - si is optimal.
My logic is that it is optimal, because it would list the shortest jobs first, which I believe would consider the jobs that span the lengths of other jobs last.
EDIT: Optimize by maximizing the size of the non-overlapping processes list.
I think there is a surprisingly simple strategy for choosing the next job which gives you a subset with the maximal number of consecutive jobs: among the jobs left which have a valid start time (in the beginning all start times are valid; after the first job has been chosen, the start time of the next job must, of course, not precede the finish time of the previously chosen job), always choose the job with the earliest finish time.
A proof that this strategy is optimal can start like this: assume you have an optimal (i.e. maximal) subset of consecutive jobs and that the first job is not the job with the (overall) earliest finish time, then this job with the overall earliest finish time cannot be in the optimal subset, but you can replace the first job of the optimal subset with this job and you get another optimal subset which has the job with earliest finish time as first job. Now you can continue in the same way with the second job and thus it is clear that in the subset generated with the above strategy the n-th job has a finish time that does not exceed the finish time of the n-th job of any optimal subset, for any n, and hence the so created subset is also optimal.
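The earliest-finish-time strategy described above can be sketched in a few lines of Python. The (start, finish) tuple layout, the function name max_non_overlapping, and the sample intervals are assumptions of this example.

```python
def max_non_overlapping(jobs):
    """Greedy selection by earliest finish time; jobs are (start, finish) pairs."""
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(jobs, key=lambda job: job[1]):
        # A job is valid when it does not start before the previous one finishes.
        if start >= last_finish:
            chosen.append((start, finish))
            last_finish = finish
    return chosen


jobs = [(0, 5), (4, 7), (6, 11)]
print(max_non_overlapping(jobs))
```

On this input the greedy keeps (0, 5) and (6, 11). The shortest job, (4, 7), overlaps both of the others, so picking by ascending fi - si would leave only one job in the list here.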

How to solve this task using Topological sort?

There are N modules in the project. Each module i has a completion time denoted in number of hours (Hi) and may depend on other modules. If Module x depends on Module y then one needs to complete y before x. As project manager, you are asked to deliver the project as early as possible. Provide an estimate of the amount of time required to complete the project.
Input Format:
First line contains T, number of test cases.
For each test case: First line contains N, number of modules. Next N lines each contain: the module ID (i), the number of hours it takes to complete the module (Hi), and the set of module IDs that i depends on (D), as integers delimited by spaces.
Output Format:
Output the minimum number of hours required to deliver the project.
Input: 1
5
1 5
2 6 1
3 3 2
4 2 3
5 1 3
output: 16
I know the problem is related to topological sorting, but I can't work out how to find the total hours.
You are looking for the length of the critical path. This is the longest path through the network from start to finish in the digraph where the nodes are the tasks, arrows from a node A to node B represent prerequisite relationships (A must be done before B begins) and the weight of an arrow is the time it takes to complete the source node task. If there isn't any well-defined start and end node it is common to create dummy nodes for that purpose. Create a 0-cost arrow from the start node to all tasks with no prerequisites, and a 0-cost arrow from all nodes which aren't prerequisites to anything else to the end node. The start and end nodes themselves are just book-keeping devices; they shouldn't correspond to tasks which take any time to complete.
Topological sorting doesn't find it for you but is rather a form of pre-processing that allows you to find the critical path in a single pass. You use it to sort the nodes in such a way that the first node listed has no prerequisites and, when you come to a node in the sorted list, you are guaranteed that all prerequisite nodes have been processed. You process them by assigning a minimum start time for each task. The first node (the start node) in the sorted list has start time 0. When you get to a node for which all prerequisite nodes have been processed, the min start time of that node is
max({m_i + t_i })
where i ranges over all prerequisite nodes, m_i is the min start time for node i and t_i is the time it takes to do the task for node i. The point is that m_i + t_i is the minimum finish time for node i, and you take the max of such things because all prerequisite tasks must be finished before a given task can be begun. The minimum start time of the end node is the length of the critical path.
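A minimal Python sketch of this single pass, assuming Python 3.9+ for the standard-library graphlib module. The function name project_duration and the dictionary layout (hours per module, prerequisites per module) are assumptions of this example; instead of dummy start and end nodes it uses a start time of 0 for modules without prerequisites and takes the largest finish time at the end.

```python
from graphlib import TopologicalSorter


def project_duration(hours, deps):
    """Length of the critical path.

    hours: {module: hours to complete it}
    deps:  {module: iterable of modules it depends on} (missing key = none)
    """
    graph = {m: deps.get(m, ()) for m in hours}
    finish = {}
    # static_order() yields every module after all of its prerequisites.
    for m in TopologicalSorter(graph).static_order():
        # Minimum start time: max over prerequisites of (m_i + t_i), or 0.
        start = max((finish[d] for d in graph[m]), default=0)
        finish[m] = start + hours[m]
    return max(finish.values(), default=0)


# Sample input from the question.
hours = {1: 5, 2: 6, 3: 3, 4: 2, 5: 1}
deps = {2: [1], 3: [2], 4: [3], 5: [3]}
print(project_duration(hours, deps))
```

For the sample input the critical path is 1, 2, 3, 4 with 5 + 6 + 3 + 2 = 16 hours, matching the expected output.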
Create a directed graph G: if a depends on b, add a directed edge in G from b to a. Apply a topological sort on G; let's say we store the result in an array called TOPO[].
Now run a loop over the TOPO array from the first element, keeping a finish time for every module.
If TOPO[i] does not depend on anything, it can be performed right away, together with the other independent modules:
finish[TOPO[i]] = H(TOPO[i])
If TOPO[i] depends on other modules, it has to be performed after the slowest of them, so add its task time to the largest of their finish times:
finish[TOPO[i]] = H(TOPO[i]) + max(finish[d]) over all d that TOPO[i] depends on
After the end of the loop, the maximum of all finish times is your answer.
Separate components run in parallel, so taking the maximum over all modules covers every component.

weighted intervals shifting, finding optimal distribution

I'm looking for the name of an algorithm (if one already exists) or some hints to solve this problem.
I have a set of N jobs. Each job contains intervals, which may or may not overlap. All intervals within one particular job have the same weight, length and maximum shift value.
I want to find the best (or close to best) distribution by shifting all intervals of a job together so as to minimize peaks. Intervals can only be shifted forward (by a positive value). The output of this algorithm would be the shift value for each job.
Example:
We have jobs A, B and C.
job A: length=2, weight=1, max shift=0 (cannot be moved)
job B: length=1, weight=3, max shift=2
job C: length=3.5, weight=5, max shift=15
As you can see in the first picture, there were three peaks (first between 2 and 3, second just before 4 and last peak was around 6).
After optimization in the second picture, you can see that two of three peaks were removed by shifting intervals B and C by some value. The second peak cannot be optimized because intervals in job C overlap and we can only move all intervals at once.
The output for this example would be: job A: 0, job B: 1.5, job C: 9.5
Thank you.
You can try a share-aware algorithm for machine colocation: http://people.cs.umass.edu/~ramesh/Site/PUBLICATIONS_files/SindelarSS11.pdf
