In Hadoop 2 there is a Timeline Service used for YARN and also the MRv2 History Server, so what is the difference between them? What is the main purpose of a stand-alone Timeline Service?
As Hadoop has shifted from MapReduce to other execution engines (Spark, Tez, Flink), the Job History Server, which was designed with MapReduce in mind, no longer fit. The Job History Server also only provides stats on completed jobs and had scalability issues. The Application Timeline Server (ATS) addresses these problems and introduces the concept of a flow: a series of jobs treated as one entity rather than as a set of separate (MapReduce) jobs.
Anyway, that was the original intent. You can read more in the official docs, and the JIRAs have several design documents attached.
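For a quick sense of what ATS serves, it exposes timeline data over REST. A minimal sketch against the v1 API, assuming the default web port 8188 and the entity type Tez publishes (the hostname is a placeholder):

```python
import requests

# A minimal sketch against the Timeline Service v1 REST API.
# Assumptions: ATS runs on its default web port 8188; "TEZ_DAG_ID" is
# the entity type Tez publishes, other frameworks register their own.
ATS = "http://timeline-host:8188/ws/v1/timeline"  # placeholder hostname

resp = requests.get("%s/TEZ_DAG_ID" % ATS, params={"limit": 10})
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity["entity"], entity["entitytype"], entity["starttime"])
```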
References
The YARN Timeline Server
The YARN Timeline Service v.2
YARN-5355: YARN Timeline Service v.2: alpha 2
YARN-2928: YARN Timeline Service v.2: alpha 1
YARN-1530: [Umbrella] Store, manage and serve per-framework application-timeline data (includes the original application timeline design proposal)
Related
I need to track what is happening when I run a job or upload a file to HDFS. I do this using SQL Profiler in SQL Server. However, I miss such a tool for Hadoop, so I am assuming that I can get some information from logs. I think all logs are stored at /var/logs/hadoop/ but I am confused about which file I need to look at and how to configure it to capture detailed information.
I am using HDP 2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MR ApplicationMaster (the M/R component), which is a YARN application with its own log as well.
Jobs consist of tasks, and the tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
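Once an application finishes, its aggregated container logs can be pulled with the `yarn logs` CLI. A minimal sketch, assuming log aggregation is enabled and using a hypothetical application id:

```python
import subprocess

# A sketch: pull a finished application's aggregated container logs via
# the `yarn logs` CLI. Assumes log aggregation is enabled
# (yarn.log-aggregation-enable=true); the application id is hypothetical.
app_id = "application_1400000000000_0001"
logs = subprocess.check_output(["yarn", "logs", "-applicationId", app_id])
print(logs.decode())
```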
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs resulting from the jobs they submit (which log as any job does).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide a UI to expose the most common logs for easy access. Usually they expose the logs collected by the Job History service in the UI that shows job status and job history.
I cannot point you to a SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions, and vendor-specific distributions involved. I recommend starting by reading about and learning how the Job History Server runs and how it can be accessed.
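As a starting point, the Job History Server exposes completed MapReduce jobs over REST on its default web port 19888. A minimal sketch (the hostname is a placeholder for your cluster):

```python
import requests

# A sketch against the JobHistory Server REST API (default web port
# 19888); the hostname is a placeholder for your cluster.
JHS = "http://historyserver-host:19888/ws/v1/history/mapreduce/jobs"

resp = requests.get(JHS, params={"state": "SUCCEEDED", "limit": 20})
resp.raise_for_status()
for job in (resp.json().get("jobs") or {}).get("job", []):
    print(job["id"], job["name"], job["startTime"], job["finishTime"])
```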
Currently I am running my Spark cluster in standalone mode. I am reading data from flat files or Cassandra (depending on the job) and writing the processed data back to Cassandra.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, like better execution time and better resource management?
Currently, when I am processing a huge chunk of data, there is sometimes a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Spark's standalone cluster manager can also give you cluster-mode capabilities. A standalone cluster provides almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied to one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
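A sketch of the two deploy modes; the master URL, main class, and jar path below are placeholders (note that standalone cluster mode deploys jars, not Python applications):

```python
import subprocess

# A sketch of the two spark-submit deploy modes; the master URL, main
# class and jar path are placeholders for your cluster and application.
master = "spark://master-host:7077"

# Client mode (the default): the driver runs on this machine, which
# must stay online for the lifetime of the job.
subprocess.check_call(["spark-submit", "--master", master,
                       "--deploy-mode", "client",
                       "--class", "com.example.MyApp", "my-app.jar"])

# Cluster mode: the driver is launched on a worker inside the cluster,
# so this machine can go offline once the job is submitted.
subprocess.check_call(["spark-submit", "--master", master,
                       "--deploy-mode", "cluster",
                       "--class", "com.example.MyApp", "my-app.jar"])
```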
Having a Cassandra cluster would also not change any of these behaviors, except that it can save you network traffic if you can get the nearest contact point for the Spark executor (just like data locality).
Failed stages get rescheduled if you use either of the cluster managers.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, does it give me an additional performance advantage, like better execution time and better resource management?
In the standalone cluster model, each application uses all the available nodes in the cluster by default.
From the Spark standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
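For example, to let several applications share a standalone cluster under its FIFO scheduler, you can cap each one with spark.cores.max. A minimal PySpark sketch (the master URL is a placeholder):

```python
from pyspark import SparkConf, SparkContext

# A minimal sketch: cap this application at 8 cores so several apps can
# coexist under the standalone FIFO scheduler; the master URL is a
# placeholder for your cluster.
conf = (SparkConf()
        .setMaster("spark://master-host:7077")
        .setAppName("capped-app")
        .set("spark.cores.max", "8"))
sc = SparkContext(conf=conf)
```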
Currently, when I am processing a huge chunk of data, there is sometimes a possibility of stage failure during shuffling. If I migrate to YARN, can the resource manager address this issue?
Not sure, since your application logic is not known. But you can give YARN a try.
Have a look at this related SE question for the benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?
My goal is to provide a Hadoop job statistics web UI for administrative users.
I use a Hortonworks Hadoop 2 cluster and jobs run on YARN.
From the architecture perspective, I am planning to collect job-related information (such as start time, end time, mappers, etc.) from the YARN ResourceManager REST API as a scheduled cron job, index it into Elasticsearch, and show it in Kibana.
I wonder if there is a better way to do this.
Have you looked into Ambari? It provides metrics, dashboards, and alerting without having to create the framework from scratch.
Apache Ambari
Ambari provides statistics at the infrastructure level, not at the job level. So you need to write custom code against the YARN REST API, which gives you a JSON response that you can parse to get the exact details. I have written one in Python; you can refer to this link: https://dzone.com/articles/customized-alerts-for-hadoop-jobs-using-yarn-rest
http://thelearnguru.com/customized-alerts-for-hadoop-jobs-using-yarn-rest-api
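A minimal sketch of the proposed pipeline, assuming the default ResourceManager web port 8088 and Elasticsearch on 9200 (hostnames and the index name are placeholders; per-job mapper counts would come from the JobHistory Server API instead):

```python
import requests

# A sketch of the cron job: pull finished applications from the
# ResourceManager REST API and index them into Elasticsearch.
# Assumptions: default ports (8088, 9200); the hostnames and the
# "yarn-jobs" index are placeholders.
RM = "http://resourcemanager-host:8088/ws/v1/cluster/apps"
ES = "http://elasticsearch-host:9200/yarn-jobs/job"

resp = requests.get(RM, params={"states": "FINISHED"})
resp.raise_for_status()
for app in (resp.json().get("apps") or {}).get("app", []):
    doc = {
        "id": app["id"],
        "name": app["name"],
        "user": app["user"],
        "startedTime": app["startedTime"],
        "finishedTime": app["finishedTime"],
        "finalStatus": app["finalStatus"],
    }
    requests.post(ES, json=doc)  # one document per application
```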
I am supposed to work on cluster mirroring, where I have to set up a similar HDFS cluster (same master and slaves) as an existing one, copy the data to the new cluster, and then run the same jobs as-is.
I have read about Falcon as a feed-processing and workflow-coordinating tool that is also used for mirroring of HDFS clusters. Can someone enlighten me on what Falcon's role in the Hadoop ecosystem is, and how it helps with mirroring in particular? I am looking to understand what Falcon offers when it is part of my Hadoop ecosystem (HDP).
Apache Falcon simplifies the configuration of data motion with replication, lifecycle management, and lineage and traceability. This provides data governance consistency across Hadoop components.
Falcon replication is asynchronous and based on delta changes. Recovery is done by running a process with the source and target swapped.
Data loss: delta data may be lost if the primary cluster is completely shut down.
Backups can be scheduled as needed, depending on bandwidth and network availability.
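As a rough sketch of how mirroring is wired up with the Falcon CLI, assuming you have already written the cluster definitions and a feed definition that lists both clusters (source and target); all names and file paths below are placeholders:

```python
import subprocess

# A rough sketch of submitting Falcon entities for HDFS mirroring.
# Assumes the cluster and feed XML definitions already exist; entity
# names and file paths are placeholders.
def falcon_entity(*args):
    subprocess.check_call(["falcon", "entity"] + list(args))

falcon_entity("-submit", "-type", "cluster", "-file", "primary-cluster.xml")
falcon_entity("-submit", "-type", "cluster", "-file", "backup-cluster.xml")
# A feed defined on both clusters drives the source-to-target replication.
falcon_entity("-submit", "-type", "feed", "-file", "mirror-feed.xml")
falcon_entity("-schedule", "-type", "feed", "-name", "mirror-feed")
```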
I have been thinking about this question for a while now. I have been trying to compare the performance of Hadoop 1 vs. YARN by running the basic word count example. I am still unsure how the same .jar file can be used to execute on both frameworks. As far as I understand, YARN has a different set of APIs which it uses to set up a connection with the ResourceManager, create an ApplicationMaster, etc.
So if I develop an application (.jar), can it be run on both frameworks without any change in code?
Also, what would be meaningful parameters to differentiate Hadoop 1 vs. YARN for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say Hadoop 1 is MapReduce v1 (MRv1).
MRv1 is a component of Hadoop that includes the JobTracker and TaskTrackers. It relies only on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the MapReduce application rewritten to run on top of YARN.
So when you're asking if Hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is, generally, yes. The Hadoop system knows how to run the same MapReduce application on both MapReduce platforms.
Adding to climbage's answer:
Hadoop v1
The JobTracker is responsible for resource management (managing the slave nodes); its major functions involve:
tracking resource consumption/availability
job life-cycle management: scheduling individual tasks of the job, tracking progress, and providing fault tolerance for tasks.
Issues with Hadoop v1
Since the JobTracker is responsible for all spawned MR applications, it is a single point of failure: if the JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, the JobTracker becomes a performance bottleneck. Hadoop v2 was released to address these issues of scalability and job management.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the Job-Tracker—that is, resource management and job scheduling/monitoring—into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, operating system for managing applications in a distributed manner.
To interact with the new resource management and scheduling layer, a Hadoop YARN MapReduce application (MRv2) was developed. MRv2 changes nothing about the MapReduce programming API.
Application programmers will see no difference between MRv1 and MRv2; MRv2 is fully backward compatible. So yes, an application (.jar) can be run on both frameworks without any change in code.
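One way to see this compatibility in practice: a word count submitted via Hadoop Streaming (sketched here in Python; the file name, paths, and streaming jar location are illustrative) contains nothing tied to either resource-management layer, so it runs unchanged under MRv1 or YARN. The same holds for a compiled MapReduce .jar.

```python
#!/usr/bin/env python
"""A minimal Hadoop Streaming word count, usable as both mapper and
reducer (selected via argv[1]). Nothing here references the JobTracker
(MRv1) or the ResourceManager (YARN); the cluster decides where it runs.

Submit with (paths are placeholders):
  hadoop jar /path/to/hadoop-streaming.jar \
      -files wc.py \
      -mapper "python wc.py map" -reducer "python wc.py reduce" \
      -input /input -output /output
"""
import sys


def do_map():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)


def do_reduce():
    # The framework sorts mapper output by key, so equal words arrive
    # as consecutive lines.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current and current is not None:
            print("%s\t%d" % (current, count))
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()
```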
MapReduce was previously integrated into Hadoop Core as the only processing API for data in HDFS. In Hadoop v2 it runs as a separate application, and Hadoop v2 allows other application programming frameworks (e.g. MPI) to process HDFS data.