How to write to specific datanode in hdfs using pyspark - hadoop

I have a requirement to write common data to the same hdfs data nodes, like how we repartition in pyspark on a column to bring similar data into the same worker node, even replicas should be in the same node.
For instance, we have a file, table1.csv
Id, data
1, A
1, B
2, C
2, D
And another tablet.csv
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And replication within datanodes.
It can be separate files as well based on keys. But each key should map it to a particular node.
I tried with pyspark writing to hdfs, but it just randomly assigned the datanodes when I checked with hdfs DFS fsck.
Read about rackid by setting rack topology but is there away to select which rack to store data on?
Any help is appreciated, I'm totally stuck.
KR
Alex

I maitain that without actually exposing the problem this is not going to help you but as you technically asked for a solution here's a couple ways to do what you want, but won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem but is actually likely to help you immediately before you implement anything else I list here. This is should be your goto move before trying anything else as it's trivial to try.
Pro: just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise to solve data locality, just will wait for a while incase the right nodes come up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
** yarn labels**
to allocate your worker nodes to specific nodes.
Pro: This should ensure at least 1 copy of the data lands within a set of worker nodes/data nodes. If subsequent jobs also use this node label you should get good data locality. Technically it doesn't specify where data is written but by caveat yarn will write to the hdfs node it's on first.
Con: You will create congestion on these nodes, or may have to wait for other jobs to finish so you can get allocated or you may carve these into a queue that no other jobs can access reducing the functional capacity of yarn. (HDFS will still work fine)
Use Cluster federation
Ensures data lands inside a certain set of machines.
pro: A folder can be assigned to a set of data nodes
Con: You have to allocated another name node, and although this satisfies your requirement it doesn't mean you'll get data locality. Great example of something that will fit the requirement but might not solve the problem. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
My-Data-is-Everywhere
hadoop dfs -setrep -w 10000 /path of the file
Turn up replication for just the folder that contains this data equal to the number of nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. Not necessarily bad, but can't really be done for large amounts of data without impeding your cluster's space.
Whack-a-mole:
Turn off datanodes, until the data is replicated where you want it. Turn all other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster. It doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again it kinda points out how silly your requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack strategy in your cluster that would ensure data is always written to specific nodes that you could then "hope" you are allocated to... haven't fully developed the strategy in my mind but likely some math genius could work it out.

You could also implement HBASE and allocate region servers such that the data landed on the 3 servers. (As this would technically fulfill your requirement). (As it would be on 3 servers and in HDFS)

Related

Strange replication in Cassandra

I have configured locally 3 nodes in on 'Test Cluster' of Cassandra. When I run them and create some keyspace or table also on all three nodes the keyspace or the table appears.
The problem I'm dealing with is, when I'm importing from CSV millions of rows in the table I already built the whole data suddenly appears on all three nodes. I have the same data replicated over the three nodes.
As I'm familiar with, the data I'm importing should be replicated/distributed over the nodes but partially. One partition on the first node, second on third, third on second node, fourth again on first node and ...
Am I right or I'm missing something big?
Also, my write speed locally is about 10k rows / second for the multi-node cluster. Isn't that a little bit too low?
I want to create discussion so I can maybe learn something more from your experience and see where I'm messing things.
Thank you!
The number of nodes that data is written to in your cluster is determined by the Replication Factor for that keyspace. If you have 3 nodes and the data is being written to all the nodes, then this setting must be set to 3. If you only want the data the be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.

How to shard on user id with hdfs?

I'd like to use a hadoop/hdfs-based system, but I'm a bit concerned as I think I will want to have all data for one user on the same physical machine. Is there a way of accomplishing this in the hadoop-based universe?
During hdfs data write process, the datablock is written first in to node from which the client is accessing the cluster if the node is a datanode.
In order to solve your problem. The edge nodes will also be datanodes. Edge nodes are from where the user starts interacting to the cluster.
But using datanodes as edgenodes has some disadvantages. One of them include Data distribution. The data distribution will not be even and if the node fails, cluster re-balancing will be very costly.

Remotely retrieve a file from hdfs and store it locally in a node

I want to write a job in which each mapper checks if a file from hdfs is stored in the node that is being executed.If this doesn't happen I want to retrieve it from hdfs and store it locally in this node.Is this possible?
EDIT: I am trying to do this (3) Preprocessing for Repartition Join, as described here: link
DistributedCache feature in Hadoop can be used to distribute the side data or auxiliary data required for the completion of the job. Here (1, 2) are some interesting articles for the same.
Why would you want to do this? The Data Locality principle used by Hadoop does this for you. Well, it does not move the data, it does move the program.
This comes from the Wikipedia page about Hadoop:
The jobtracker schedules map/reduce jobs to tasktrackers with an
awareness of the data location. An example of this would be if node A
contained data (x,y,z) and node B contained data (a,b,c). The
jobtracker will schedule node B to perform map/reduce tasks on (a,b,c)
and node A would be scheduled to perform map/reduce tasks on (x,y,z)
And the reason the computation is moved to the data and not the other way around is explained in the Hadoop documentation itself:
“Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed
near the data it operates on. This is especially true when the size of
the data set is huge. This minimizes network congestion and increases
the overall throughput of the system. The assumption is that it is
often better to migrate the computation closer to where the data is
located rather than moving the data to where the application is
running. HDFS provides interfaces for applications to move themselves
closer to where the data is located.

How to explicilty define datanodes to store a particular given file in HDFS?

I want to write a script or something like .xml file which explicitly defines the datanodes in Hadoop cluster to store a particular file blocks.
for example:
Suppose there are 4 slave nodes and 1 Master node (total 5 nodes in hadoop cluster ).
there are two files file01(size=120 MB) and file02(size=160 MB).Default block size =64MB
Now I want to store one of two blocks of file01 at slave node1 and other one at slave node2.
Similarly one of three blocks of file02 at slave node1, second one at slave node3 and third one at slave node4.
So,my question is how can I do this ?
actually there is one method :Make changes in conf/slaves file every time to store a file.
but I don't want to do this
So, there is another solution to do this ??
I hope I made my point clear.
Waiting for your kind response..!!!
There is no method to achieve what you are asking here - the Name Node will replicate blocks to data nodes based upon rack configuration, replication factor and node availability, so even if you do managed to get a block on two particular data nodes, if one of those nodes goes down, the name node will replicate the block to another node.
Your requirement is also assuming a replication factor of 1, which doesn't give you any data redundancy (which is a bad thing if you lose a data node).
Let the namenode manage block assignments and use the balancer periodically if you want to keep your cluster evenly distibuted
NameNode is an ultimate authority to decide on the block placement.
There is Jira about the requirements to make this algorithm pluggable:
https://issues.apache.org/jira/browse/HDFS-385
but unfortunetely it is in the 0.21 version, which is not production (alhough working not bad at all).
I would suggest to plug you algorithm to 0.21 if you are on the research state and then wait for 0.23 to became production, or, to downgrade the code to 0.20 if you do need it now.

How can I be sure that data is distributed evenly across the hadoop nodes?

If I copy data from local system to HDFS, сan I be sure that it is distributed evenly across the nodes?
PS HDFS guarantee that each block will be stored at 3 different nodes. But does this mean that all blocks of my files will be sorted on same 3 nodes? Or will HDFS select them by random for each new block?
If your replication is set to 3, it will be put on 3 separate nodes. The number of nodes it's placed on is controlled by your replication factor. If you want greater distribution then you can increase the replication number by editing the $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when hadoop is made aware of racks). There is an example (can't find link) that if you have replication at 3 and 2 racks, 2 blocks will be in one rack and the third block will be placed in the other rack. I would guess that there is no preference shown for what node gets the blocks in the rack.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
If you are looking for ways to force balancing data across nodes (with replication at whatever value) a simple option is $HADOOP_HOME/bin/start-balancer.sh which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in at the Hadoop FAQs
Hope that helps.
You can open HDFS Web UI on port 50070 of Your namenode. It will show you the information about data nodes. One thing you will see there - used space per node.
If you do not have UI - you can look on the space used in the HDFS directories of the data nodes.
If you have a data skew, you can run rebalancer which will solve it gradually.
Now with Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file in the same node (and similarly for replicated nodes). Read this blog about this topic - look at the comments section.
Yes, Hadoop distributes data per block, so each block would be distributed separately.

Resources