When I run Hadoop, I get many INFO messages. I want to suppress the unhelpful ones, such as INFO streaming.PipeMapRed, INFO mapred.MapTask, etc., and keep the most important one: INFO mapreduce.Job.
So how do I do that?
There is an interface in org.apache.hadoop.mapred called Reporter which provides methods to report progress, update counters, set status information, and so on. If you have access to the code, you can drop the unimportant INFO messages and report the important ones with the setStatus method.
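If the goal is only to quiet the output rather than change the code, another option (a sketch, assuming these messages are controlled by the client's log4j.properties; task-side messages may need the same settings in the task log configuration) is to raise the level of the noisy loggers while leaving mapreduce.Job at INFO:

```
# Silence the noisy loggers (names match the prefixes in the output)
log4j.logger.org.apache.hadoop.streaming.PipeMapRed=WARN
log4j.logger.org.apache.hadoop.mapred.MapTask=WARN

# Keep the important job-level progress messages
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
```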
I have a Hadoop job that is taking a very long time to initialize when fed a large number of input files, and I'm not sure why. The job will find all of the nodes and files within a few seconds, regardless of how many files are used, but takes significant time (minutes) to determine the number of splits if given 10,000 files. When I run the job as a different user, the job will determine the number of splits almost immediately.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D 'mapreduce.job.name=...'
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar] /tmp/streamjob4556879591438635545.jar tmpDir=null
17/08/07 22:01:40 INFO client.RMProxy: Connecting to ResourceManager at jobtracker-dev.la.prod.factual.com/10.20.103.16:8032
...
17/08/07 22:01:41 INFO security.TokenCache: Got dt for hdfs://dev; Kind: HDFS_DELEGATION_TOKEN....
17/08/07 22:01:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/08/07 22:01:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 674c65bbf0f779edc3e00a00c953b121f1988fe1]
17/08/07 22:01:41 INFO mapred.FileInputFormat: Total input paths to process : 10000
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/211/...
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/210/...
...
<LONG PAUSE>
...
17/08/07 22:31:39 INFO mapreduce.JobSubmitter: number of splits:10000
This is not a lot of information, obviously, but does anyone have an idea what might be going on?
The time taken depends on many parameters; to start, check your cluster capacity and, specifically, your YARN configuration.
If you have 10k splits, that means the ApplicationMaster's coordination with the tasks will take a significant amount of time; remember that Hadoop is built for processing a small number of big files, not a large number of small files.
Check your HDFS block size as well, and how much data you are loading.
If you are running in distributed mode, make sure you have set up passwordless connections to your data nodes.
Regarding "When I run the job as a different user, the job will determine the number of splits almost immediately" specifically: this is not a Hadoop capacity issue, so check your configuration carefully. If you have the budget, consider using Ambari to manage your cluster.
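For the block-size and input-volume checks above, a couple of commands can help (a sketch; /path/to/input stands in for your real input directory):

```
# Default HDFS block size, in bytes
hdfs getconf -confKey dfs.blocksize

# Number of directories/files, and total size, under the input path
hdfs dfs -count -h /path/to/input
hdfs dfs -du -s -h /path/to/input
```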
I am using Confluent's HDFS connector to write streamed data to HDFS. I followed the user manual and quickstart and set up my connector.
It works properly when I consume only one topic.
My property file looks like this:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_topic1
hdfs.url=hdfs://localhost:9000
flush.size=30
When I add more than one topic, I see it continuously committing offsets, and I do not see it writing the committed messages.
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=test_topic1,test_topic2
hdfs.url=hdfs://localhost:9000
flush.size=30
I tried tasks.max with both 1 and 2.
I continuously get "Committing offsets" logged, as below:
[2016-10-26 15:21:30,990] INFO Started recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,222] INFO Finished recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:31,230] INFO Started recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,236] INFO Finished recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:35,155] INFO Reflections took 6962 ms to scan 249 urls, producing 11712 keys and 77746 values (org.reflections.Reflections:229)
[2016-10-26 15:22:29,226] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:23:29,227] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:24:29,225] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:25:29,224] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
When I gracefully stop the service (Ctrl+C), I see it removing the tmp files.
What am I doing wrong? What is the proper way to do it?
I would appreciate any suggestions.
I kept stumbling over the same problem you mention here for the past month or so, and I couldn't get to the bottom of it until today, when I upgraded to Confluent 3.1.1 and things started working as expected.
This is how I roll:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=5
topics=accounts,contacts,users
hdfs.url=hdfs://localhost:9000
flush.size=1
hive.metastore.uris=thrift://localhost:9083
hive.integration=true
schema.compatibility=BACKWARD
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
locale=en-us
timezone=UTC
I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
This is the CPU use:
And this is the network:
Notice the gap between each column, it is almost the same size as the activity column!
At first I thought these two columns were shifted (when it was pulling from S3, it wasn't using a lot of CPU and vice-versa) but then I noticed that these two graphs actually follow each other. This makes sense since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question, what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that time. It is as if the cluster decided to take a break before each job.
Thanks.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice that at 01:13:13 it just hangs there until 02:30:45.
I found the issue. The problem was in the way I was pulling from S3.
We have our data in S3 separated by a date pattern as in s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
BUT that is bad. In retrospect, this makes sense: for each star (*), Spark needs to list all of the entries in that "directory", and then list all of the entries in each directory under that. A single month has about 30 day directories, each day has 24 hour directories, and each hour has 60 minute directories. So the above pattern issues a "list files" call for each star AND then list calls on everything returned, all the way down to the minutes, just so that it can eventually find all of the **/actualData.txt files and union all of their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings for all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed to textFile.
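A minimal sketch of that path building (the bucket name and date range here are made up, and sc is assumed to be an existing SparkContext):

```python
from datetime import datetime, timedelta

def build_paths(bucket, start, end):
    """One explicit s3n:// path per minute between start and end (inclusive)."""
    paths = []
    t = start
    while t <= end:
        paths.append("s3n://" + bucket + t.strftime("/%Y/%m/%d/%H/%M/actualData.txt"))
        t += timedelta(minutes=1)
    # textFile accepts a comma-separated list of paths, so no globbing
    # (and therefore no recursive listing) is needed
    return ",".join(paths)

paths = build_paths("bucket", datetime(2015, 1, 3, 10, 40), datetime(2015, 1, 3, 10, 42))
# rdd = sc.textFile(paths)
```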
If in your case you can't determine all of the possible paths, consider restructuring your data, or build as much of the path as possible and only use * toward the end of it, or use the AmazonS3Client to get all of the keys using the list-objects API (which allows you to fetch ALL keys in a bucket under a prefix very quickly) and then pass them as a comma-separated string into textFile. It will still make a listStatus call for each file, and it will still be serial, but there will be far fewer calls.
However, all of these solutions just slow down the inevitable: as more and more data gets built up, more and more list status calls will be made serially. The root of the issue seems to be that sc.textFile("s3n://...") pretends that S3 is a file system, which it is not; it is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that doesn't assume a file system.
In my JobTracker logs I see values like:
2015-01-27 10:04:04,013 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- detailed locations:
M: CDR_IP[1,9],CDR_IP[-1,-1],CDR[17,6],cdrSMS[38,15],
grpdCdrSMS[47,13],1-11[47,22],cdrMMS[39,14],grpdCdrMMS[53,13],1-12[53,22],
cdrCALL[40,14],grpdCdrCALL[59,14],1-13[59,23],cdrSMSD[41,14],
grpdCdrSMSD[64,14],1-14[64,23],cdrMMSD[42,14],grpdCdrMMSD[69,14],1-15[69,23],
cdrCALLD[43,14],grpdCdrCALLD[74,15],1-16[74,24]
C: grpdCdrSMS[47,13],
1-11[47,22],grpdCdrMMS[53,13],1-12[53,22],grpdCdrCALL[59,14],1-13[59,23],
grpdCdrSMSD[64,14],1-14[64,23],grpdCdrMMSD[69,14],1-15[69,23],
grpdCdrCALLD[74,15],1-16[74,24]
R: grpdCdrSMS[47,13],UNIONALL[-1,-1],
grpdCdrMMS[53,13],UNIONALL[-1,-1],grpdCdrCALL[59,14],UNIONALL[-1,-1],
grpdCdrSMSD[64,14],UNIONALL[-1,-1],grpdCdrMMSD[69,14],UNIONALL[-1,-1],
grpdCdrCALLD[74,15],UNIONALL[-1,-1]
what do values like
R: grpdCdrSMS[47,13]
M: CDR_IP[1,9]
UNIONALL[-1,-1]
signify?
These are the aliases (relations) from your Pig script, listed per phase of the generated MapReduce job: M is the map phase, C is the combiner, and R is the reduce phase. The pair of numbers after each alias is the [line, column] position in the Pig script where that alias is defined; [-1,-1] marks operators, such as UNIONALL, that Pig generated itself and that therefore have no position in the script.
I am getting a race condition warning while running multiple imports into OpenTSDB simultaneously. The following is one of the log sequences showing the race condition:
2013-08-21 14:34:24,745 INFO [main] UniqueId: Creating an ID for
kind='tagv' name='25447'
2013-08-21 14:34:24,747 INFO [main] UniqueId: Got ID=307 for
kind='tagv' name='25447'
2013-08-21 14:34:24,752 WARN [main] UniqueId: Race condition: tried
to assign ID 307 to tagv:25447, but CAS failed on
PutRequest(table="tsdb-uid", key="25447", family="id",
qualifiers=["tagv"], values=["\x00\x013"],
timestamp=9223372036854775807, lockid=-1, durable=true,
bufferable=true, attempt=0, region=null), which indicates this UID
must have been allocated concurrently by another TSD. So ID 307 was
leaked.
I have the following questions:
Since it is a warning, is the record actually written rather than skipped?
At the end it says "ID 307 was leaked"; is some other ID assigned to the record?
How can I verify that the record has been written to HBase's 'tsdb-uid' table? (I tried a few HBase shell commands, but in vain.)
This just means that a UID was allocated for nothing, but otherwise everything is fine. If you are worried about the state of your tsdb-uid table, you could run the tsdb uid fsck command, and it will probably report that some UIDs are allocated but unused.
If you see the message only occasionally, you can ignore it. If you see it a lot, then the only undesirable consequence is that you're burning through the UID space faster than you should be, so you may run out of UIDs sooner (there are 16777215 possible UIDs for each of: metric names, tag names, tag values).
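As for verifying the mapping, it can be inspected directly from the HBase shell (a sketch; the row key and qualifier below are taken from the warning in the question, where the forward name-to-UID mapping is stored under the id column family):

```
get 'tsdb-uid', '25447', 'id:tagv'
```

If the row exists, the value is the 3-byte UID; for the log above that would be \x00\x013, i.e. 0x000133 = 307.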