Understanding Job Tracker Metrics - hadoop

In my job tracker logs I see values like:
2015-01-27 10:04:04,013 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- detailed locations:
M: CDR_IP[1,9],CDR_IP[-1,-1],CDR[17,6],cdrSMS[38,15],
grpdCdrSMS[47,13],1-11[47,22],cdrMMS[39,14],grpdCdrMMS[53,13],1-12[53,22],
cdrCALL[40,14],grpdCdrCALL[59,14],1-13[59,23],cdrSMSD[41,14],
grpdCdrSMSD[64,14],1-14[64,23],cdrMMSD[42,14],grpdCdrMMSD[69,14],1-15[69,23],
cdrCALLD[43,14],grpdCdrCALLD[74,15],1-16[74,24]
C: grpdCdrSMS[47,13],
1-11[47,22],grpdCdrMMS[53,13],1-12[53,22],grpdCdrCALL[59,14],1-13[59,23],
grpdCdrSMSD[64,14],1-14[64,23],grpdCdrMMSD[69,14],1-15[69,23],
grpdCdrCALLD[74,15],1-16[74,24]
R: grpdCdrSMS[47,13],UNIONALL[-1,-1],
grpdCdrMMS[53,13],UNIONALL[-1,-1],grpdCdrCALL[59,14],UNIONALL[-1,-1],
grpdCdrSMSD[64,14],UNIONALL[-1,-1],grpdCdrMMSD[69,14],UNIONALL[-1,-1],
grpdCdrCALLD[74,15],UNIONALL[-1,-1]
What do values like
R: grpdCdrSMS[47,13]
M: CDR_IP[1,9]
UNIONALL[-1,-1]
signify?

These appear to be the aliases (relations) from your Pig script, grouped by the phase they run in: M for the map phase, C for the combine phase, and R for the reduce phase. The bracketed pair looks like the [line, offset] position in the script where each alias is defined, so CDR_IP[1,9] would mean the alias CDR_IP defined at line 1, offset 9. [-1,-1] marks operators with no source location, typically ones Pig generated itself, such as UNIONALL. Without seeing the script it is difficult to say more.

Related

Hadoop Streaming MapReduce slow finding files

I have a Hadoop job that is taking a very long time to initialize when fed a large number of input files, and I'm not sure why. The job will find all of the nodes and files within a few seconds, regardless of how many files are used, but takes significant time (minutes) to determine the number of splits if given 10,000 files. When I run the job as a different user, the job will determine the number of splits almost immediately.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D 'mapreduce.job.name=...'
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar] /tmp/streamjob4556879591438635545.jar tmpDir=null
17/08/07 22:01:40 INFO client.RMProxy: Connecting to ResourceManager at jobtracker-dev.la.prod.factual.com/10.20.103.16:8032
...
17/08/07 22:01:41 INFO security.TokenCache: Got dt for hdfs://dev; Kind: HDFS_DELEGATION_TOKEN....
17/08/07 22:01:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/08/07 22:01:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 674c65bbf0f779edc3e00a00c953b121f1988fe1]
17/08/07 22:01:41 INFO mapred.FileInputFormat: Total input paths to process : 10000
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/211/...
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/210/...
...
<LONG PAUSE>
...
17/08/07 22:31:39 INFO mapreduce.JobSubmitter: number of splits:10000
This is not a lot of information, obviously, but does anyone have an idea what might be going on?
The time taken depends on many parameters; to start, check your cluster capacity and your YARN configuration specifically.
If you have 10,000 splits, coordination between the ApplicationMaster and the tasks will take a significant amount of time; remember that Hadoop is built for processing a small number of large files, not a large number of small ones.
Also check your HDFS block size, and how much data you are loading.
If you are running in distributed mode, make sure passwordless SSH is established with your data nodes.
As for "When I run the job as a different user, the job will determine the number of splits almost immediately": that is not a Hadoop capacity issue, so check your configuration carefully (per-user settings in particular). If your budget allows, a management tool such as Ambari can make the cluster configuration easier to audit.

remove hadoop info streaming.PipeMapRed and others

When I run Hadoop, I get many INFO messages. I want to suppress the less useful ones, such as INFO streaming.PipeMapRed and INFO mapred.MapTask, and keep only the most important, INFO mapreduce.Job.
How can I do that?
There is an interface in org.apache.hadoop.mapred called Reporter which provides progress reporting, counter updates, status information, and so on. If you have access to the code, you can skip logging the noise and report the important information with the setStatus method.
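In a streaming job (which the PipeMapRed messages suggest this is), the counterpart of Reporter is writing specially formatted lines to stderr, which the framework picks up as status and counter updates instead of log noise. A minimal sketch; the helper and counter names are my own, only the `reporter:` line format comes from Hadoop Streaming:

```python
import io
import sys

def report_status(message):
    # Hadoop Streaming interprets this stderr line as a task status update
    sys.stderr.write("reporter:status:%s\n" % message)

def report_counter(group, counter, amount=1):
    # ...and this one as a counter increment
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def run_mapper(lines):
    """Identity mapper: passes records through unchanged, reporting
    progress via stderr instead of chatty per-record logging."""
    out = []
    for n, line in enumerate(lines, 1):
        report_counter("MyJob", "LINES_READ")
        if n % 10000 == 0:
            report_status("processed %d lines" % n)
        out.append(line.rstrip("\n"))
    return out
```

In a real streaming mapper you would feed `sys.stdin` to `run_mapper` and print each record to stdout; only stderr carries the `reporter:` updates.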

What is Apache Spark doing before a job start

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
This is the CPU use:
And this is the network:
Notice the gap between each column, it is almost the same size as the activity column!
At first I thought these two columns were shifted (when it was pulling from S3, it wasn't using a lot of CPU and vice-versa) but then I noticed that these two graphs actually follow each other. This makes sense since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question, what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that time. It is as if the cluster decided to take a break before each job.
Thanks.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice at 01:13:13, it just hangs there until 02:30:45.
I found the issue. The problem was in the way I was pulling from S3.
We have our data in S3 separated by a date pattern as in s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
BUT that is bad. In retrospect, this makes sense: for each star (*), Spark needs to list all of the entries in that "directory", and then list all of the entries in each directory under it. A year has 12 month directories, a month has about 30 day directories, each day has 24 hour directories, and each of those has 60 minute directories. So the pattern above issues a "list files" call for each star AND then for everything returned, all the way down to the minutes, just so it can eventually find all of the **/actualData.txt files and union their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings for all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed to textFile.
If in your case you can't determine all of the possible paths, consider either restructuring your data, or building as much of the path as possible and only using * toward the end, or using the AmazonS3Client to fetch all of the keys with the list-objects API (which lets you get ALL keys under a prefix very quickly) and passing them as a comma-separated string to textFile. It will still make a listStatus call for each file, and the calls will still be serial, but there will be far fewer of them.
However, all of these solutions just delay the inevitable: as more and more data accumulates, more and more list-status calls will be made serially. The root of the issue seems to be that sc.textFile("s3n://...") pretends that S3 is a file system, which it is not; it is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that doesn't assume a file system.
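The path-building approach described above can be sketched roughly like this; the bucket name and date range are placeholders, and the layout assumed is the year/month/day/hour/minute structure from the question:

```python
from datetime import date, timedelta

def build_paths(bucket, start, end):
    """Enumerate one wildcarded path per day instead of globbing every
    level, so only the hour/minute/file levels are left to the glob."""
    paths = []
    d = start
    while d <= end:
        paths.append("s3n://%s/%04d/%02d/%02d/*/*/*"
                     % (bucket, d.year, d.month, d.day))
        d += timedelta(days=1)
    # Spark accepts a comma-separated list of paths in a single call
    return ",".join(paths)

paths = build_paths("bucket", date(2015, 1, 1), date(2015, 1, 3))
```

The resulting string can be passed straight to sc.textFile(paths), turning thousands of recursive list calls into one shallow glob per day.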

Lease mismatch LeaseExpiredException

I have seen some posts on this topic, but I could not figure out the fix to my problem. I am using Hadoop 2.0.0-cdh4.2.0 and java version "1.7.0_09-icedtea". I am running a program that uses counters to control iterations in a simple MapReduce example. I also use sequence files to communicate data between iterations. The code is simple: it starts with a number, say 3. The mapper doesn't modify the number, but simply passes the value on; the reducer decreases the number by 1 each time it runs. The counter is incremented if the number is greater than zero. Eventually, the number must decrease to 0, and the program should stop at that point. However, I always get the following error after the first iteration (during the second iteration):
" Running job: job_201304151408_0181
13/05/10 18:55:54 INFO mapred.JobClient: map 0% reduce 0%
13/05/10 18:56:03 INFO mapred.JobClient: map 100% reduce 0%
13/05/10 18:56:10 INFO mapred.JobClient: map 100% reduce 33%
13/05/10 18:56:11 INFO mapred.JobClient: Task Id : attempt_201304151408_0181_r_000002_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /user/harsha/iterone/import/data owned by DFSClient_NONMAPREDUCE_-592566041_1 but is accessed by DFSClient_NONMAPREDUCE_-965911637_1"
Can anyone please help? Thank you.
Regards...
Usually a lease mismatch happens when more than one client writes to the same HDFS file, for example when a path is deleted and recreated while another writer still holds the lease, or when two task attempts write to the same file.
Please check whether /user/harsha/iterone/import/data exists in HDFS, and whether it is a file or a directory.
Also make sure each iteration writes to a fresh path rather than reusing the previous one.

Hadoop Mapreduce detailed task status queries

I want to write a 3rd party frontend to hadoop mapreduce which needs to query mapreduce on some information and statistics.
Right now I'm able to use hadoop job to query jobs and the map and reduce completion percentages, along with counters, e.g.:
# hadoop job -status job_201212170023_0127
Job: job_201212170023_0127
map() completion: 0.6342382
reduce() completion: 0.0
Counters: 28
Job Counters
SLOTS_MILLIS_MAPS=4537
...
What I would also like are the numbers of each task, as used by the visualisation within the job tracker, i.e.:
I am able to list all the mappers...
# hadoop job -list-attempt-ids job_201212170023_0127 map running
attempt_201212170023_0127_m_000000_0
attempt_201212170023_0127_m_000001_0
attempt_201212170023_0127_m_000002_0
...
..but how would I get the percentage of each of these tasks? Ideally I would want something like this:
# hadoop job -task-status attempt_201212170023_0127_m_000000_0
completion: 0.6342382
start: 2012-12-18T12:23:34Z
... etc.
The current solution would be to scrape the web interface, but I'm not a fan of this if it is at all possible to use the command line output.
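There is no built-in per-attempt status command, but the job-level output shown above is easy to consume programmatically, which is a reasonable starting point for a third-party frontend. A small sketch; the function name, field names, and the assumption that counter lines are indented `NAME=value` pairs are mine:

```python
import re

def parse_job_status(text):
    """Parse completion percentages and counters out of the text
    printed by `hadoop job -status <job_id>`."""
    status = {"counters": {}}
    m = re.search(r"map\(\) completion:\s*([\d.]+)", text)
    if m:
        status["map_completion"] = float(m.group(1))
    m = re.search(r"reduce\(\) completion:\s*([\d.]+)", text)
    if m:
        status["reduce_completion"] = float(m.group(1))
    # Counter lines are assumed to look like "  SLOTS_MILLIS_MAPS=4537"
    for name, value in re.findall(r"^\s+([A-Z_]+)=(\d+)", text, re.MULTILINE):
        status["counters"][name] = int(value)
    return status

sample = """Job: job_201212170023_0127
map() completion: 0.6342382
reduce() completion: 0.0
Counters: 28
Job Counters
    SLOTS_MILLIS_MAPS=4537
"""
status = parse_job_status(sample)
```

This still only gets job-level numbers; per-attempt progress would have to come from the web interface until a CLI or API exposes it.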
