I'm very new to Hadoop and recently configured Hadoop inside VirtualBox with Ubuntu. The NameNode and ResourceManager are configured on independent machines, and there are also 3 separate datanodes and one client node.
After reading some more articles, I understand that MapReduce jobs run on multiple nodes in parallel.
To check my understanding, I wrote a MapReduce program that uses the hostname of the system as the key in the map function; I did this because I want to observe the parallelism.
I loaded 200 MB of data into HDFS with a 64 MB block size and confirmed that all 3 datanodes hold blocks.
After exporting the jar and running it from the client using yarn jar as well as hadoop jar, my expectation was that I would see the three datanode names inside the reducer, but it shows the client system's name.
Can you please explain how this execution (hadoop jar) works? Does it run my MapReduce jar on all three nodes, and if so, why does it show the client hostname instead of the three datanode names?
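For reference, this is roughly the kind of mapper described above, written as a minimal sketch (my reconstruction, not the asker's actual code; the class and field names are made up):

import java.io.IOException;
import java.net.InetAddress;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (hostname of the node running this map task, 1) for every input record.
public class HostnameMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text host = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Hostname of the JVM executing this map task.
        host.set(InetAddress.getLocalHost().getHostName());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(host, ONE);
    }
}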
Related
I have created a diagram that represents how the MapReduce framework works. Could somebody please validate that this is an accurate representation?
P.S. For the purpose of this example, we are also interested in the system components shown in this diagram.
The MapReduce architecture works in several phases to execute a job. Here are the stages of running a MapReduce application -
1. The user writes the input data into HDFS for further processing. This data is stored on different nodes in the form of blocks.
2. The client submits its MapReduce job (see the sketch after this list).
3. The ResourceManager launches a container to start the ApplicationMaster.
4. The ApplicationMaster sends resource requests to the ResourceManager.
5. The ResourceManager allocates containers on the slaves via the NodeManagers.
6. The ApplicationMaster starts the respective tasks in the containers.
7. The job is now executed in the containers.
8. When the processing is complete, the ResourceManager deallocates the resources.
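To make the submission step concrete, here is a minimal, hedged sketch of a client-side driver (my addition, not part of the original answer; class and job names are placeholders) that submits a job to the ResourceManager and then waits while the containers do the work:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster's *-site.xml from the client classpath
        Job job = Job.getInstance(conf, "example job");
        job.setJarByClass(ExampleDriver.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // plug in your own classes and adjust the key/value types
        job.setOutputKeyClass(LongWritable.class);   // matches the identity mapper over TextInputFormat
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job to the ResourceManager and blocks until it finishes;
        // the map and reduce tasks run in containers on the NodeManagers, not in this client JVM.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}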
Source: Cloudera
JobTracker, TaskTracker, and MasterNode aren't real things in Hadoop 2+ w/ YARN. Jobs are submitted to a ResourceManager, which creates an ApplicationMaster on one of the NodeManagers.
"Slave Nodes" are commonly also your DataNodes because that is the core tenant of Hadoop - move the processing to the data.
The "Recieve the data" arrow is bi-directional, and there is no arrow from the NameNode to the DataNode. 1) Get the file locations from the NameNode, then locations are sent back to clients. 2) The clients (i.e. NodeManager processes running on a DataNode, or "slave nodes"), will directly read from the DataNodes themselves - the datanodes don't know directly where the other slave nodes exist.
That being said, HDFS and YARN are typically all part of the same "bubble", so the "HDFS" labelled circle you have should really be around everything.
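A small sketch of those two steps from the client's side (my addition, assuming the standard FileSystem API): the client asks the NameNode only for metadata - which DataNodes hold each block - and afterwards reads the block data straight from those DataNodes.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);                 // an HDFS file path
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this metadata call with the DataNodes holding each block;
        // YARN uses the same information to schedule map tasks on (or near) those hosts.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
    }
}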
I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from and saved into HDFS.
Everything seems to work fine when I run the application in the yarn-client mode.
But when I try to run it as a yarn-cluster application, the process doesn't seem to run the final saveAsNewAPIHadoopFile action on my transformed and ready-to-save RDD!
Here is a snapshot of my Spark UI, where you can see that all the other Jobs are processed:
And the corresponding Stages:
Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

try {
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}
I'm running the application via spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar
When I look into the output directory in HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.
So I have two questions here: Where does this behavior come from (maybe memory problems)? And what is the difference between the yarn-client and yarn-cluster modes (I don't understand it yet, and the documentation isn't clear to me)? Thanks for your help!
It seems that the job doesn't start. Before starting the job, Spark checks the available resources, and I think the available resources are not enough. So try to reduce the driver and executor memory and the driver and executor cores in your configuration.
Here you can read how to calculate appropriate resource values for the executors and driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Your job runs in client mode because in client mode the driver can use all available resources on the node, but in cluster mode resources are limited.
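As a rough worked example (my numbers, assuming Spark 1.6 defaults, so treat this as an estimate rather than a diagnosis): on YARN, Spark adds a per-container memory overhead of max(384 MB, 10% of the requested heap). With --driver-memory 11g the driver container alone requests roughly 11 GB + 1.1 GB ≈ 12 GB, and each executor with --executor-memory 9g requests about 10 GB. If the workers' NodeManagers cannot offer containers that large (or yarn.scheduler.maximum-allocation-mb caps requests below that), the containers are rejected or never granted, and the application can end up stuck waiting for resources.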
Difference between cluster and client mode:
Client:
The driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
The driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program.
If the driver process dies, you need an external monitoring system to restart it.
Cluster:
The driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
The driver runs as a dedicated, standalone process inside the Worker.
The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The driver program can be monitored from the Master node using the --supervise flag and restarted if it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared place or in a folder for each of the workers.
I found out that this problem is related to a Kerberos issue! When I run the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server runs. Therefore, the keytab file /etc/security/keytabs/user.keytab for the used userpricipal is present on this machine.
When I run the app in yarn-cluster mode, the driver process is started on an arbitrary one of my Hadoop nodes. As I forgot to copy the keytab files to the other nodes after creating them, the driver process of course couldn't find the keytab file at that local location!
So, to be able to work with Spark in a Kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the needed keytab files of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster:
scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab
So you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on each node of the cluster.
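For completeness, here is a hedged sketch (my addition, not part of the original answer) of the same login done from Java with the standard Hadoop security API; the principal and keytab path are simply the ones quoted above and must exist on whichever node runs this code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Succeeds only if the keytab exists at this local path, which is why it
        // has to be copied to every node that may host the driver or an executor.
        UserGroupInformation.loginUserFromKeytab("userpricipal", "/etc/security/keytabs/user.keytab");
        System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
    }
}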
I understand the general concept behind it, but I would like more clarification and a clear-cut definition of what a "client" is.
For example, if I just write an hdfs command on the Terminal, is it still a "client" ?
Client in Hadoop refers to the interface used to communicate with the Hadoop filesystem. There are different types of clients available with Hadoop to perform different tasks.
The basic filesystem client hdfs dfs is used to connect to a Hadoop Filesystem and perform basic file related tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data.
To perform administrative tasks on HDFS, there is hdfs dfsadmin. For HA related tasks, hdfs haadmin.
There are similar clients available for performing YARN related tasks.
These clients can be invoked using their respective CLI commands from a node where Hadoop is installed and which has the necessary configurations and libraries required to connect to a Hadoop filesystem. Such nodes are often referred to as Hadoop Clients.
For example, if I just write an hdfs command on the Terminal, is it still a "client"?
Technically, Yes. If you are able to access the FS using the hdfs command, then the node has the configurations and libraries required to be a Hadoop Client.
PS: APIs are also available to create these Clients programmatically.
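As a minimal sketch of such a programmatic client (an assumed example, not from the original answer; the NameNode address is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // reads core-site.xml/hdfs-site.xml if they are on the classpath
        conf.set("fs.defaultFS", "hdfs://namenode:8020");    // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);                // this object is the "client"
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}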
Edge nodes are the interface between the Hadoop cluster and the outside network. Such a node/host will have all the libraries and client components present, as well as the current cluster configuration needed to connect to HDFS.
This thread discusses the same topic.
I am working on a solution for adding resources to a Hadoop YARN cluster at run time. The purpose is to handle heavy peaks in our application.
I am not an expert and I need help in order to confirm / contest what I understand.
Hadoop YARN
This application can run in cluster mode. It provides resource management (CPU & RAM).
A Spark application, for example, asks for a job to be done. YARN handles the request and provides executors computing on the YARN cluster.
HDFS - Data & Executors
The data is not shared between executors, so it has to be stored in a file system, in my case HDFS. That means I will have to run a copy of my Spark Streaming application on the new server (Hadoop node).
I am not sure about this:
The YARN cluster and HDFS are different; writing to HDFS won't write to the new Hadoop node's local storage (because it is not an HDFS node).
As I will only write new data to HDFS from a Spark Streaming application, creating a new application should not be a problem.
Submit the job to YARN
--- peak, resources needed
Provision a new server
Install / configure Hadoop & YARN, making it a slave
Modify hadoop/conf/slaves, adding its IP address (or DNS name from the hosts file)
Modify dfs.include and mapred.include
On the host machine:
yarn -refreshNodes
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
Should this work? refreshQueues doesn't sound very useful here, as it seems to only take care of the process queues.
I am not sure whether the running job will increase its capacity. Another idea is to wait for the new resources to be available and submit a new job.
Thanks for your help.
I've set up a Hadoop 2.5 cluster with 1 master node (namenode, secondary namenode, and datanode) and 2 slave nodes (datanodes). All of the machines run Linux CentOS 7 64-bit. When I run my MapReduce program (wordcount), I can only see the master node using extra CPU and RAM; the slave nodes are not doing a thing.
I've checked the logs on all of the nodes and there is nothing wrong on the slave nodes. The ResourceManager is running and all of the slave nodes can see it.
The datanodes are working in terms of distributed data storage, but I can't see any indication of distributed data processing. Do I have to configure the XML configuration files in some other way so that all of the machines will process data while I'm running my MapReduce job?
Thank you
Make sure you are mentioning the IP addresses of the datanodes in the master node's networking files. Each node in the cluster is also supposed to contain the IP addresses of the other machines.
Besides that, check whether the includes file contains the relevant datanode entries.
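Purely as a hedged illustration (the hostnames, addresses, and paths below are made up, not taken from the question), the relevant files usually end up looking something like this:

# /etc/hosts on every node
192.168.56.10   master
192.168.56.11   slave1
192.168.56.12   slave2

# slaves file on the master ($HADOOP_HOME/etc/hadoop/slaves in Hadoop 2.x), one worker per line
slave1
slave2

If an includes file is configured via dfs.hosts, it must list the slave hostnames as well; otherwise the NameNode will not accept those datanodes.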