I work on a Slurm-based cluster; normally the files needed to launch calculations are copied to the scratch space of the compute nodes (/state/partition1).
Now we have two different scratch disks on each compute node, one HDD (/state/partition1) and one SSD (/state/partition2). I would like users to declare the weight of the calculations they are going to run and, depending on that weight, to use the first or the second scratch disk.
How can I do this?
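One lightweight approach, sketched below purely as an illustration: have the job script itself pick the scratch mount based on a weight the user exports at submission time. The CALC_WEIGHT variable, the "heavy"/"light" values, and the mapping of weights to disks are all hypothetical; a production setup might use a Slurm prolog or GRES instead.

    #!/bin/bash
    #SBATCH --job-name=calc
    # Sketch only: submit with e.g.  sbatch --export=ALL,CALC_WEIGHT=heavy job.sh
    if [ "${CALC_WEIGHT:-light}" = "heavy" ]; then
        SCRATCH=/state/partition1   # HDD: large, heavy calculations
    else
        SCRATCH=/state/partition2   # SSD: small, latency-sensitive jobs
    fi
    WORKDIR="$SCRATCH/$SLURM_JOB_ID"
    mkdir -p "$WORKDIR"
    cp input.* "$WORKDIR/" && cd "$WORKDIR"
    # ... run the calculation from the fast local copy ...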
I have a requirement to write common data to the same HDFS datanodes, like how we repartition in PySpark on a column to bring similar data onto the same worker node; even the replicas should be on the same node.
For instance, we have a file, table1.csv:
Id, data
1, A
1, B
2, C
2, D
And another, table2.csv:
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And the replicas should stay within those same datanodes.
It can be separate files as well, based on keys, but each key should map to a particular node.
I tried writing to HDFS with PySpark, but it just randomly assigned the datanodes when I checked with hdfs fsck.
I read about rack IDs and setting a rack topology, but is there a way to select which rack data is stored on?
Any help is appreciated, I'm totally stuck.
KR
Alex
I maintain that without actually exposing the underlying problem this is not going to help you, but since you technically asked for a solution, here are a couple of ways to do what you want; they won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem, but is actually likely to help you immediately, before you implement anything else I list here. This should be your go-to move before trying anything else, as it's trivial to try (a short sketch follows below).
Pro: just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise to solve data locality; it just waits for a while in case the right nodes free up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
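For reference, a minimal sketch of trying this; the 30s value and the job script name are arbitrary placeholders (the default wait is 3s):

    # Hold tasks longer waiting for a data-local slot before falling back
    # to a less local one (spark.locality.wait defaults to 3s).
    spark-submit \
      --conf spark.locality.wait=30s \
      --conf spark.locality.wait.node=30s \
      your_job.py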
YARN node labels
Use node labels to allocate your workers to specific nodes (a sketch follows below).
Pro: This should ensure at least one copy of the data lands within that set of worker nodes/datanodes. If subsequent jobs also use this node label, you should get good data locality. Technically it doesn't specify where data is written, but as a side effect HDFS writes the first replica to the datanode local to the writer, so labeled workers end up writing locally first.
Con: You will create congestion on these nodes, or you may have to wait for other jobs to finish before you get allocated, or you may carve these nodes into a queue that no other jobs can access, reducing the functional capacity of YARN. (HDFS will still work fine.)
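A sketch of the setup, with a hypothetical label name and hostnames (the rmadmin subcommands and the Spark nodeLabelExpression properties are the standard ones):

    # Create a node label and attach it to the nodes that also host the
    # datanodes holding the data.
    yarn rmadmin -addToClusterNodeLabels "datajobs(exclusive=true)"
    yarn rmadmin -replaceLabelsOnNode "worker01.example.com=datajobs worker02.example.com=datajobs"

    # Ask Spark-on-YARN to place its containers on those labeled nodes.
    spark-submit \
      --conf spark.yarn.am.nodeLabelExpression=datajobs \
      --conf spark.yarn.executor.nodeLabelExpression=datajobs \
      your_job.py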
Use Cluster federation
Ensures data lands inside a certain set of machines.
Pro: A folder can be assigned to a set of datanodes.
Con: You have to allocate another namenode, and although this satisfies your requirement, it doesn't mean you'll get data locality. It's a great example of something that fits the requirement but might not solve the problem: it doesn't guarantee that when you run your job it will be placed on the nodes with the data. (A sketch of the mount command follows below.)
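On Hadoop 3, Router-Based Federation exposes this as a mount-table command; a sketch with hypothetical nameservice and path names:

    # Mount a folder onto a specific subcluster ("ns-data"), so files
    # written under it land on that subcluster's datanodes.
    hdfs dfsrouteradmin -add /projects/shared ns-data /projects/shared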
My-Data-is-Everywhere
hadoop dfs -setrep -w 10000 /path/to/the/file
Turn up replication for just the folder that contains this data equal to the number of nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. Not necessarily bad, but it can't really be done for large amounts of data without eating into your cluster's capacity.
Whack-a-mole:
Turn off datanodes until the data is replicated where you want it, then turn the other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster. It doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again it kinda points out how silly your requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack strategy in your cluster that would ensure data is always written to specific nodes that you could then "hope" you are allocated to... I haven't fully developed the strategy in my mind, but some math genius could likely work it out.
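For reference, HDFS resolves racks through whatever script you point net.topology.script.file.name at; a minimal sketch with hypothetical hostnames and rack names:

    #!/bin/bash
    # Topology script: HDFS passes datanode IPs/hostnames as arguments and
    # expects one rack path per argument on stdout.
    for host in "$@"; do
      case "$host" in
        worker01*|10.0.1.*) echo "/rack-a" ;;
        worker02*|10.0.2.*) echo "/rack-b" ;;
        *)                  echo "/default-rack" ;;
      esac
    done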
You could also implement HBase and allocate the region servers such that the data lands on the three servers; this would technically fulfill your requirement, since the data would be on three servers and in HDFS.
Data locality as defined by many Hadoop tutorial sites (e.g., https://techvidvan.com/tutorials/data-locality-in-hadoop-mapreduce/) states: "Data locality in Hadoop is the process of moving the computation close to where the actual data resides instead of moving large data to computation. This minimizes overall network congestion."
I can understand having the node where the data resides process the computation for those data, instead of moving data around, would be efficient. However, what does it mean by "moving the computation close to where the actual data resides"? Does this mean that if the data sits in a server in Germany, it is better to use the server in France to do the computation on those data instead of using the server in Singapore to do the computation since France is closer to Germany than Singapore?
Typically people talk about this on a quite different scale, especially within a Hadoop context.
Suppose you have a cluster of 5 nodes; you store a file there and need to do a calculation on it.
With data locality you try to make the calculation happen on the node(s) where the data is stored (rather than for example the first node that has compute resources available).
This reduces network load.
It is good to realize that in many new infrastructures the network is not the bottleneck, so you will keep hearing more about the decoupling of compute and storage.
I +1 Dennis Jaheruddin's answer, and just wanted to add: you can actually see the different locality levels in MR when you check the job counters, in the Job History UI for example.
HDFS and YARN are rack-aware, so it's not just a binary same-or-other-node distinction. Among those counters, Data-local means the task ran on the machine that held the actual data; Rack-local means the data wasn't local to the node running the task and needed to be copied, but was still on the same rack; and finally there's the Other-local case, where the data was available neither locally nor on the same rack, so it had to be copied over two switches to the node that ran the computation.
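You can also pull those counters from the command line; a sketch, with a hypothetical job id:

    # The JobCounter group holds the locality breakdown for map tasks.
    JOB=job_1700000000000_0001   # hypothetical job id
    for C in DATA_LOCAL_MAPS RACK_LOCAL_MAPS OTHER_LOCAL_MAPS; do
      mapred job -counter "$JOB" org.apache.hadoop.mapreduce.JobCounter "$C"
    done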
I am using SpatialHadoop to store and index a dataset with 87 million points. I then apply various range queries.
I tested on 3 different cluster configurations: 1, 2, and 4 nodes.
Unfortunately, I don't see a runtime decrease with growing node number.
Any ideas why there is no horizontal-scaling effect?
How big is your file in megabytes? While it has 87 million points, it can still be small enough that Hadoop decides to create only one or two splits out of it.
If this is the case, you can try reducing the block size in your HDFS configuration so that the file will be split into several blocks.
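You can also do this per file rather than cluster-wide; a sketch with hypothetical paths:

    # Write the file with a 16 MB block size so it yields more splits.
    hdfs dfs -D dfs.blocksize=16m -put points.csv /data/points.csv
    # Confirm how many blocks it was actually split into.
    hdfs fsck /data/points.csv -files -blocks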
Another possibility is that you might be running virtual nodes on the same machine, which means that you do not get a real distributed environment.
I know Cassandra is good in a multi-node setup; the more nodes, the better the performance. If I have two dedicated servers with the same hardware, would it be good to create some virtual machines on both of them to have more nodes, or not?
For example, I have two dedicated servers with these specifications:
1TB hard drive
64 GB RAM
8 core CPU
Then I create 8 virtual machines (nodes) on each of them. Each VM has:
~150GB hard drive
8 GB RAM
a share of the 8-core CPU
So I have 16 nodes. Would these 16 nodes give better performance than 2 nodes on the two dedicated servers?
In other words, which side of this trade-off is better: more nodes with weaker hardware, or two stronger nodes?
I know it should be tested, but I want to know basically is it reasonable or not?
Adding new nodes always adds some overhead: they need to communicate with each other and sync their data, so you'd expect the overhead to grow with each node you add. You'd add more nodes only in a situation where the existing number of nodes can't handle the input/output demands. Since in the situation you are describing you'd actually be writing to the same disk, you'd effectively be slowing down your cluster by adding more nodes.
Imagine the situation: you have a server; it receives some data and then writes it to disk. Now imagine the same situation where the disk is shared between two servers, and they both write the same information at almost the same time to the same disk. The two servers also spend CPU cycles communicating with each other that the data has been written so they can sync up. I think this is enough information to explain why what you are considering is not a good idea if you can avoid it.
EDIT:
Of course, this is the picture only in layman's terms. C* has a very nice architecture in which data is actually spread, according to an algorithm, to a certain range of nodes (not all of them), and when you query for a specific key, the algorithm can tell you where to find the data. That said, when you add or remove nodes, the new nodes have to tell the cluster that they want to share the burden, and as a result a recalculation of what is known as the 'token ring' takes place, at the end of which data may be shuffled around so it stays accessible in a predictable way.
You can take a look at this:
http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2
But in general, there is indeed some overhead when nodes communicate with each other; even so, the number of nodes will almost never dramatically impact your query speed, negatively or positively, if you are querying for a single key.
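If you want to see that token distribution on a live cluster, nodetool will show it, for example:

    nodetool status   # per-node load and token ownership summary
    nodetool ring     # the full token ring, token by token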
"I know it should be tested, but I want to know basically is it reasonable or not?"
Testing it is exactly what will answer most of your assumptions.
The basic advantage of using Cassandra is availability. If you are planning to have just two dedicated servers, then there is a question mark over the availability of your data: in the worst case, you always have just two replicas of the data at any point in time.
My take is to go for a nicely split dedicated setup in small chunks. Everything boils down to your use case.
1. If you have a lot of data flowing in and you consider data king (in which case you need more replicas to cope with failures), I would prefer a high-end distributed setup.
2. If it's the other way around (data is not your forte and is just another part of your setup), just go for the setup you have described.
3. If you have a cost constraint and you are a startup with a minimal amount of data that is important to you, set up the two nodes you have with a replication factor of 2 (SimpleStrategy) or a replication factor of 1 (NetworkTopologyStrategy), as sketched below.
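For reference, the replication factor in option 3 is set per keyspace; a minimal sketch via cqlsh, with a hypothetical keyspace name:

    # SimpleStrategy with two replicas across the two nodes.
    cqlsh -e "CREATE KEYSPACE myapp
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"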
We are Hadoop newbies. We realize that Hadoop is for processing big data, and that a Cartesian product is extremely expensive. However, we are running some experiments with a Cartesian product job similar to the one in the MapReduce Design Patterns book, except with a reducer calculating the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2).
Our setting: a 3-node cluster, block size = 64M. We tested different data set sizes ranging from 5000 points (130KB) to 10000 points (260KB).
Observations:
1- All map tasks run on one node, sometimes on the master machine and other times on one of the slaves, but never on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among machines? Based on what factors does Hadoop decide which machine will process the map tasks (in our case it once picked the master, and in another case a slave)?
2- In all cases where we test the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we have 4 splits and not 1?
3- Is there a way to see more information about the exact splits for a running job?
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks among your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, then it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or not enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that is two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you are still getting four splits.
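A quick way to check this, with a hypothetical input path: print the file length and block size, since CartesianInputFormat crossing a 2-block file with itself yields 2 x 2 = 4 splits:

    # %b = file length in bytes, %o = block size; blocks = ceil(%b / %o).
    hdfs dfs -stat "length %b bytes, block size %o bytes" /data/points.csv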
3) You can see the running job in the UI of your ResourceManager. http://<resourcemanager-host>:8088 will open the main page (jobtracker-host:50030 for MRv1), and you can navigate to your running job from there, which will let you see the individual tasks that are running. If you want more specifics on what the input format is doing, add some log statements to the CartesianInputFormat's getSplits method and re-run your code to see what is going on.