How to let Kedro execute nodes in sequence

I am trying to use Kedro to run a workflow. The figure below shows my pipeline: nodes 1-3 are sequential, and nodes 31, 32 and 33 are three branches that come off node 3. Kedro runs nodes 1 to 3 in order because of the clear dependencies between them. However, when it reaches nodes 31, 32 and 33, the order looks random: a run can start with 31, 32 or 33. Does anyone have an idea how I can make Kedro run 31 first, then 32 and then 33? Thanks!
I have tried tagging all the nodes, but the running order is still random when Kedro reaches nodes 31 to 33.

So, Kedro topologically sorts the nodes at runtime, and for nodes that have no dependencies between them you're not guaranteed to get the same run order.
The way that people often try to fix this is to break the pipeline up into sub-pipelines and run them in order via the CLI:
kedro run --pipeline a && kedro run --pipeline b
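A minimal sketch of what the pipeline registry could look like for this approach, assuming a recent Kedro version where pipeline_registry.py exposes register_pipelines(); the node functions, dataset names and pipeline names below are placeholders:

# src/<your_package>/pipeline_registry.py -- sketch only; functions, dataset and pipeline names are placeholders
from typing import Dict

from kedro.pipeline import Pipeline, node

def step_31(data):
    # placeholder for the real work of node 31
    return data

def step_32(data):
    # placeholder for the real work of node 32
    return data

def register_pipelines() -> Dict[str, Pipeline]:
    a = Pipeline([node(step_31, inputs="node_3_output", outputs="output_31", name="node_31")])
    b = Pipeline([node(step_32, inputs="node_3_output", outputs="output_32", name="node_32")])
    # "kedro run --pipeline a && kedro run --pipeline b" then runs them strictly in that order
    return {"a": a, "b": b, "__default__": a + b}

With that in place, the command above guarantees that everything in pipeline a finishes before pipeline b starts.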
The other option is to create a dummy dataset dependency that forces the nodes to run in the order you want.
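A rough sketch of that idea, assuming all three branch nodes read the output of node 3: each branch emits a small extra "flag" dataset that the next branch takes as an additional (unused) input. All function, node and dataset names here are made up for illustration:

from kedro.pipeline import Pipeline, node

def step_31(data_3):
    result_31 = data_3          # placeholder for the real work of node 31
    return result_31, "done"    # second value is a tiny "flag" dataset used only for ordering

def step_32(data_3, _after_31): # the extra flag input forces node 31 to finish first
    result_32 = data_3          # placeholder for the real work of node 32
    return result_32, "done"

def step_33(data_3, _after_32): # and this one forces node 32 to finish first
    return data_3               # placeholder for the real work of node 33

ordered_branches = Pipeline([
    node(step_31, inputs="node_3_output", outputs=["output_31", "flag_31"], name="node_31"),
    node(step_32, inputs=["node_3_output", "flag_31"], outputs=["output_32", "flag_32"], name="node_32"),
    node(step_33, inputs=["node_3_output", "flag_32"], outputs="output_33", name="node_33"),
])

The flag_* datasets need no catalog entries (Kedro keeps them in memory), and the extra edges make 31 → 32 → 33 part of the topological order.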

Related

sbatch script with a number of CPUs different from the total number of CPUs in the nodes?

I'm used to starting sbatch scripts on a cluster where the nodes have 32 CPUs and where my code needs a power-of-2 number of processors.
For example, I do this:
#SBATCH -N 1
#SBATCH -n 16
#SBATCH --ntasks-per-node=16
or
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --ntasks-per-node=32
However, I now need to use a different cluster where each node has 40 CPUs. For the moment I'm using only one node and 32 processes for testing:
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=32
(I got this latter script from the cluster's documentation. They don't use the #SBATCH -N line in this example; I don't know why, maybe just because it is an example.)
However, I now need to run larger simulations with 512 processors. The closest number of nodes I would need is 13 (i.e. 40*13=520 processors). The problem is that the number of tasks per node would then not be an integer.
I think a solution would be to ask for 13 nodes, fully using 12 of them and only partially using the last one.
My question is: how do I do this? Is there another way of doing this without changing the code? (Changing the code is not possible; it is a huge code base.)
A simulation with 512 processes will take 10 hours minimum, so running it with only 32 processes would take a week. And I don't need just one simulation but at least 20 for the moment.
Another solution would be to ask for 16 nodes (32*16=512) and only use 32 processes per node. However, this would waste processors and the compute hours I'm allowed on the cluster.
OK, the answer is simple but depends on the machine you are working on. Still, I think it should work every time.
In the case of the second cluster, I don't need to specify the --ntasks-per-node line at all. I just need to tell the scheduler how many tasks I need in total with --ntasks=512, and it will automatically allocate the number of nodes necessary to run those tasks.
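So, for the 512-task run, the relevant part of the batch header reduces to something like the following (the usual site-specific options such as partition and walltime are omitted here):

#SBATCH --ntasks=512

with neither -N nor --ntasks-per-node specified.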
Important: if your ntasks is not a multiple of the number of processors per node, then the last node will not be completely used. For example, in my case I need 512 tasks; this corresponds to 13 nodes = 520 processors. The first 12 nodes are fully used, but the last one is not and leaves 8 processors idle.
Note that this can cause some optimisation problems in some codes, because the processes on the last node will need to communicate with the majority of the processes on the other node(s). For me this is not a problem, but I know of another code where it is.

How to suggest a more balanced allocation of containers in Hadoop cluster?

How can I change/suggest a different allocation of containers to tasks in Hadoop? Regarding a native Hadoop (2.9.1) cluster on AWS.
I am running a native Hadoop cluster (2.9.1) on AWS (with EC2, not EMR) and I want the scheduling/allocation of the containers (mappers/reducers) to be more balanced than it currently is.
It seems like the RM is allocating the mappers in a bin-packing way (on the nodes where the data resides), while the reducer allocation looks more balanced.
My setup includes three machines with a replication factor of three (all the data is on every machine), and I run my jobs with mapreduce.job.reduce.slowstart.completedmaps=0 in order to start the shuffle as early as possible (it is important for me that all the containers run concurrently; it is a must condition).
In addition, according to the EC2 instances I have chosen and my YARN cluster settings, I can run at most 93 containers (31 per machine).
For example, if I want to have 9 reducers, then 83 containers (93-9-1=83) are left for the mappers and one for the ApplicationMaster (AM).
I have played with the input split size (mapreduce.input.fileinputformat.split.minsize, mapreduce.input.fileinputformat.split.maxsize) in order to find the right balance where all of the machines have the same amount of "work" for the map phase.
But it seems like the first 31 mappers are allocated on one machine, the next 31 on the second one and the remaining ones on the last machine. So I can try to use 87 mappers, with 31 of them on Machine #1, another 31 on Machine #2 and another 25 on Machine #3, and the rest is left for the reducers; and since Machine #1 and Machine #2 are fully occupied, the reducers have to be placed on Machine #3. This way I get an almost balanced allocation of mappers at the expense of an unbalanced allocation of reducers.
And this is not what I want...
# of mappers = input size / split size [bytes], where
split size = max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.blocksize))
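As a quick illustration of that formula (the numbers below are hypothetical, not the actual values from this cluster):

# hypothetical values, just to show how the split size and mapper count fall out
dfs_blocksize = 128 * 1024 ** 2   # 128 MB block size
min_size = 1                      # mapreduce.input.fileinputformat.split.minsize
max_size = 64 * 1024 ** 2         # mapreduce.input.fileinputformat.split.maxsize
split_size = max(min_size, min(max_size, dfs_blocksize))  # -> 64 MB
input_size = 10 * 1024 ** 3       # a 10 GB input
num_mappers = -(-input_size // split_size)                # ceiling division -> 160 mappers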
I was using the default scheduler (CapacityScheduler), and by default yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments was set to -1 (infinity), which explains why every node that answered the RM first (with its heartbeat) kept "packing" containers as much as it could.
To conclude, adding the above parameter to hadoop/etc/hadoop/capacity-scheduler.xml (using about a third of the number of mappers resulted in balanced scheduling of the mappers) and running yarn rmadmin -refreshQueues after restarting the RM gives you the option to balance the container allocation in YARN.
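For example, the entry in capacity-scheduler.xml could look roughly like this (the value 29 is just an example, about a third of the 87 mappers discussed above; pick whatever fits your own container counts):

<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>29</value>
</property>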
For more details, please search my discussion here.

SpatialHadoop: no scaling with multiple computing nodes

I am using SpatialHadoop to store and index a dataset with 87 million points. I then apply various range queries.
I tested on 3 different cluster configurations: 1, 2 and 4 nodes.
Unfortunately, I don't see a runtime decrease with growing node number.
Any ideas why there is no horizontal-scaling effect?
How big is your file in megabytes? While it has 87 million points, it can still be small enough that Hadoop decides to create only one or two splits out of it.
If this is the case, you can try reducing the block size in your HDFS configuration so that the file will be split into several blocks.
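For example, something along these lines in hdfs-site.xml (the 16 MB value is only illustrative; note that the block size is applied when a file is written, so the file has to be re-uploaded for the change to take effect):

<property>
  <name>dfs.blocksize</name>
  <value>16m</value>
</property>

Alternatively, setting it per file at upload time should also work: hdfs dfs -D dfs.blocksize=16m -put <local file> <hdfs path>.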
Another possibility is that you might be running virtual nodes on the same machine, which means that you do not get a truly distributed environment.

hadoop not creating enough containers when more nodes are used

So I'm trying to run some Hadoop jobs on AWS r3.4xlarge machines. They have 16 vcores and 122 GB of RAM available.
Each of my mappers requires about 8 GB of RAM and one thread, so these machines are very nearly perfect for the job.
I have mapreduce.map.memory.mb set to 8192,
and mapreduce.map.java.opts set to -Xmx6144m.
This should result in approximately 14 mappers (in practice nearer to 12) running on each machine.
This is in fact the case for a 2 slave setup, where the scheduler shows 90 percent utilization of the cluster.
When scaling to, say, 4 slaves, however, it seems that Hadoop simply doesn't create more mappers. In fact, it creates LESS.
On my 2-slave setup I had just under 30 mappers running at any one time; on four slaves I had about 20. The machines were sitting at just under 50 percent utilization.
The vcores are there, the physical memory is there. What the heck is missing? Why is Hadoop not creating more containers?
So it turns out that this is one of those Hadoop things that never makes sense, no matter how hard you try to figure it out.
There is a setting in yarn-default.xml called yarn.nodemanager.heartbeat.interval-ms.
It is set to 1000 by default. Apparently it controls the minimum period, in milliseconds, between container assignments.
This means only one new map task is created per second, so the number of containers is limited by (how many containers are already running) × (the time it takes for a container to finish).
By setting this value to 50, or better yet 1, I was able to get the kind of scaling that is expected from a Hadoop cluster. Honestly, this should be documented better.
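If your version exposes the property under that exact name (double-check the yarn-default.xml that ships with your release, since heartbeat-related keys differ between components and versions), the override in yarn-site.xml would look roughly like this:

<property>
  <name>yarn.nodemanager.heartbeat.interval-ms</name>
  <value>50</value>
</property>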

Confusion of how hadoop splits work

We are Hadoop newbies. We realize that Hadoop is for processing big data and that a Cartesian product is extremely expensive. However, we are running some experiments with a Cartesian product job similar to the one in the MapReduce Design Patterns book, except with a reducer calculating the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2).
Our setup: a 3-node cluster, block size = 64 MB. We tested different data set sizes ranging from 5,000 points (130 KB) to 10,000 points (260 KB).
Observations:
1- All map tasks run on one node, sometimes on the master machine, other times on one of the slaves, but never on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among machines? Based on what factors does Hadoop decide which machine is going to process the map tasks (in our case it once chose the master, in another case a slave)?
2- In all cases where we test the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we get 4 splits and not 1?
3- Is there a way to see more information about the exact splits for a running job?
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks among your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, then it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or not enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that is two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you are still getting four splits.
3) You can see the running job in the UI of your ResourceManager. http://<resourcemanager-host>:8088 will open the main page (jobtracker-host:50030 for MRv1), and you can navigate to your running job from there, which lets you see the individual tasks that are running. If you want more specifics on what the input format is doing, add some log statements to the CartesianInputFormat's getSplits method and re-run your code to see what is going on.
