Re-enable a blacklisted Hadoop node without stopping the running job

Is there any way to un-blacklist a Hadoop node while a job is running?
I tried restarting the DataNode, but it didn't work.
This happened after four failures on slave1 with this error:
Error initializing attempt_201311231755_0030_m_000000_0:
java.io.IOException: Expecting a line not the end of stream
at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:306)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:108)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:776)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)
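
A possible diagnostic, offered as a sketch rather than a confirmed fix: the trace above fails in DF.parseExecResult while the TaskTracker is looking for a local directory to write to. Assuming DF shells out to `df -k` for each local directory (an assumption about this Hadoop version), the check below runs the same command and reports directories for which the output has no data line after the header. The directory path in the usage comment is a placeholder, not taken from the question.

```python
import subprocess


def df_output_is_parseable(output):
    """Return True if df output has a header line plus at least one data line."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    return len(lines) >= 2


def check_local_dir(path):
    """Run `df -k` on one directory and report whether the output looks complete."""
    try:
        result = subprocess.run(
            ["df", "-k", path], capture_output=True, text=True
        )
    except OSError:
        # df itself could not be started (for example, it is not on PATH).
        return False
    return result.returncode == 0 and df_output_is_parseable(result.stdout)


# Usage (placeholder path; substitute the node's mapred.local.dir entries):
#   for d in ["/path/to/mapred/local"]:
#       print(d, "ok" if check_local_dir(d) else "df output missing or incomplete")
```

If a directory fails this check on slave1 only, the local disk or mount on that node is a reasonable place to look before trying to clear the blacklist.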

Related

Exception : No alive nodes found in your cluster

I have an issue with Elasticsearch: when I run the command php artisan index:ambassadors inside Docker, it throws this exception.
**Exception : No alive nodes found in your cluster**
Here is my output.
Exception : No alive nodes found in your cluster
412/4119 [▓▓░░░░░░░░░░░░░░░░░░░░░░░░░░] 10%Exception : No alive nodes found in your cluster
824/4119 [▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░░░░] 20%Exception : No alive nodes found in your cluster
1236/4119 [▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░░░░] 30%Exception : No alive nodes found in your cluster
1648/4119 [▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░░░░░░] 40%Exception : No alive nodes found in your cluster
2472/4119 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░░░] 60%Exception : No alive nodes found in your cluster
2884/4119 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░] 70%Exception : No alive nodes found in your cluster
3296/4119 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░] 80%Exception : No alive nodes found in your cluster
3997/4119 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░] 97%Exception : No alive nodes found in your cluster
4119/4119 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%Exception : No alive nodes found in your cluster
I also see this message in my Elasticsearch container logs:
Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace `%node_name` with `[%node_name]%marker ` in these locations:
/usr/share/elasticsearch/config/log4j2.properties.
Has anyone faced this issue before?

org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21

I have a YARN cluster with Spark (1.6.1), HDFS and Hive (2.1). My workflows worked fine for a few months until today (with no changes to the code or the environments). I started to get errors like this:
org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21
Serialization trace:
outputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:131)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
at org.apache.hadoop.hive.ql.exec.spark.KryoSerializer.deserialize(KryoSerializer.java:49)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:318)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Using Hive I can do simple selects, but every other operation that needs Spark ends with Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3) in the console, and the error above in the YARN logs.
Now every one of my Hive databases is paralyzed (I have a few). I tried to solve this problem for a whole day, but nothing helped (restarting Hive, restarting the YARN nodes, changing the YARN master).
What do you think causes the problem, and how can it be solved?

I figured it out.
After restarting hive-server2, for a short period of time, instead of the error org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 26, I got the error org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hadoop.hive.ql.io.RCFileOutputFormat. With the second form it was obvious that Spark, as executed on the nodes, was missing some jars on its classpath. I don't know why Spark suddenly became unable to load these jars, but after copying them manually into its lib folder on every node and restarting the nodes, everything went back to normal.
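
To confirm on each node which jar (if any) provides the missing class, a small scan of the lib folder can help. This is a minimal sketch, not part of the original fix: the function name `jars_containing` is hypothetical, and the lib folder path in the usage comment is a placeholder to be replaced with the actual Spark lib directory.

```python
import zipfile
from pathlib import Path
from typing import List


def jars_containing(lib_dir: str, class_name: str) -> List[str]:
    """Return the names of jars directly under lib_dir that contain class_name."""
    # A class a.b.C is stored in a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip truncated or corrupt jars instead of aborting the scan.
            continue
    return hits


# Usage (placeholder path):
#   jars_containing("/path/to/spark/lib",
#                   "org.apache.hadoop.hive.ql.io.RCFileOutputFormat")
```

An empty result on a node suggests that node is one where the jar still needs to be copied.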

Error in Accumulo's tablet server when scanning for data

I have an Accumulo instance with one master and 2 tablet servers, containing a bunch of tables that store millions of records. The problem is that whenever I scan the tables to get a few records out, the tablet server logs keep throwing this error:
2015-11-12 04:38:56,107 [hdfs.DFSClient] WARN : Failed to connect to /192.168.250.12:50010 for block, add to deadNodes and continue. java.io.IOException: Got error, status message opReadBlock BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 received exception java.io.IOException: Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
java.io.IOException: Got error, status message opReadBlock BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 received exception java.io.IOException: Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:818)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.accumulo.core.file.rfile.bcfile.Utils$Version.<init>(Utils.java:264)
at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.<init>(BCFile.java:823)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.init(CachableBlockFile.java:246)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:257)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$100(CachableBlockFile.java:137)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:209)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:368)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:137)
at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:843)
at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:69)
at org.apache.accumulo.tserver.tablet.Compactor.openMapDataFiles(Compactor.java:279)
at org.apache.accumulo.tserver.tablet.Compactor.compactLocalityGroup(Compactor.java:322)
at org.apache.accumulo.tserver.tablet.Compactor.call(Compactor.java:214)
at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:1976)
at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2093)
at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:44)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
I think it is more of an HDFS issue than an Accumulo one, so I checked the DataNode logs and found the same message:
Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
There, however, it is logged at INFO level. What I don't understand is why I am getting this error.
I can see that the pool name of the file I am trying to access (BP-1881591466-192.168.1.111-1438767154643) contains an IP address (192.168.1.111) that does not match the IP address of any of the servers (self and remote). Actually, 192.168.1.111 was the old IP address of the Hadoop master server, but I have since changed it. I use domain names instead of IP addresses, so the only place where I made the change was in the hosts files of the machines in the cluster. None of the Hadoop/Accumulo configurations use IP addresses. Does anyone know what the issue is here? I have spent days on it and still cannot figure it out.

The error you are receiving indicates that Accumulo cannot read part of one of its files from HDFS. The NameNode is reporting that a block is located on a particular DataNode (in your case, 192.168.250.12). However, when Accumulo attempts to read from that DataNode, it fails.
This likely indicates a corrupt block in HDFS, or a temporary network issue. You can try to run hadoop fsck / (the exact command may vary, depending on version) to perform a health check of HDFS.
Also, the IP address mismatch in the DataNode appears to indicate that the DataNode is confused about the HDFS pool it is a part of. You should restart that DataNode after double-checking its configuration, DNS, and /etc/hosts for any anomalies.
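
To narrow the health check suggested above to the affected file, the fields in the log message can be extracted and turned into a targeted fsck invocation. This is a sketch under assumptions: the helper names are hypothetical, the regular expression is written only against the message format quoted in the question, and the command uses the `hdfs fsck` form with the `-files -blocks -locations` options (older releases use `hadoop fsck` instead).

```python
import re

# Matches the "Offset ... don't match block ..." message quoted in the question.
LOG_PATTERN = re.compile(
    r"Offset (?P<offset>\d+) and length (?P<length>\d+) don't match block "
    r"(?P<pool>BP-[\d.\-]+):(?P<block>blk_\d+_\d+) \( blockLen (?P<block_len>\d+) \)"
    r".*?for file (?P<path>[^,\s]+)"
)


def parse_read_error(line):
    """Extract offset, length, block pool, block, block length and file path."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("offset", "length", "block_len"):
        info[key] = int(info[key])
    return info


def fsck_command(path):
    """Build an fsck command that reports the blocks and locations of one file."""
    return ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]
```

For the message in the question, the parsed block length is 0 while the read starts at offset 16320, and the command is built for /accumulo/tables/1/default_tablet/F0000gne.rf; it can be run with `subprocess.run` on a machine that has the HDFS client configured.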

My Hadoop job died after 252 hours (tasks then killed)

I had 81,068 tasks complete, but then 11,799 failed and only 12 were killed. They SEEM to have all failed with:
2013-09-10 03:07:36,316 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201308301539_0002_m_083001_0: Error initializing attempt_201308301539_0002_m_083001_0:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201308301539_0002/work in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1817)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1933)
at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:830)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:824)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)
At this point, I am just looking for guidance on how to debug this before I re-run the job. For some reason, out on the cluster, it looks like all the files have been deleted, though I thought Hadoop M/R only deleted the logs of successful tasks?
Does anyone have advice or ideas on how to debug this further?
It looks like all the default directories for map/reduce are used: /tmp/hadoop-hduser for my hduser.
I have seen suggestions about /etc/hosts, but then I don't understand why 81,000 tasks succeeded before the failures started.
I am using the web interface to get some of this information, of course, along with some logs under hadoopinstalled/logs.
thanks,
Dean

Generating job and topology traces from history folder of multinode cluster using Rumen

I have a single-node cluster from which I collected logs and gave them as input to TraceBuilder, and it works.
I have also grouped a 5-node cluster under the default rack and collected logs. There, the job and topology traces are generated properly.
I have now set up a 5-node cluster with each node mapped to a different rack.
I have hadoop-0.20.2 set up in Eclipse Helios, so I ran TraceBuilder using
Main Class: org.apache.hadoop.tools.rumen.TraceBuilder
I ran some jobs on the cluster and used a copy of the master node's /usr/local/hadoop/logs/history folder as input to TraceBuilder.
Arguments: /home/arun/job.json /home/arun/topology.json /home/ubuntu/Documents/testlog
But I get
11/12/16 12:02:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/12/16 12:02:38 WARN rumen.TraceBuilder: TraceBuilder got an error while processing the [possibly virtual] file master_1324011575958_job_201112161029_0001_hduser_word+count within Path file:/home/ubuntu/Documents/testlog/master_1324011575958_job_201112161029_0001_hduser_word+count
java.lang.NullPointerException
at org.apache.hadoop.tools.rumen.JobBuilder.processTaskAttemptFinishedEvent(JobBuilder.java:492)
at org.apache.hadoop.tools.rumen.JobBuilder.process(JobBuilder.java:149)
at org.apache.hadoop.tools.rumen.TraceBuilder.processJobHistory(TraceBuilder.java:310)
at org.apache.hadoop.tools.rumen.TraceBuilder.run(TraceBuilder.java:264)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at org.apache.hadoop.tools.rumen.TraceBuilder.main(TraceBuilder.java:142)
.....................
It generates the job trace JSON file, but fields like hostname and location are "null" in it, and the topology trace JSON file doesn't contain the 5 nodes' info; it looks like this:
{
"name" : "<root>",
"children" : [ ]
}
Can anyone help me out?

This error occurs because none of the expected input files were found in the input directory.
The input directory must contain job files, for example: job_201205192032_0006_conf.xml. These files are stored inside the logs/history folder, but under subdirectories generated according to the job and its execution date.
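
Before running TraceBuilder, it may help to verify that the copied history folder actually contains such files, including in nested subdirectories. The sketch below assumes the file names end in `job_<id>_<seq>_conf.xml`, as in the example above; the function name `find_job_conf_files` is hypothetical.

```python
import os
import re

# Matches names ending like job_201205192032_0006_conf.xml,
# with or without a host/timestamp prefix in front.
JOB_CONF_PATTERN = re.compile(r"job_\d+_\d+_conf\.xml$")


def find_job_conf_files(history_dir):
    """Return sorted paths of job conf files found anywhere under history_dir."""
    matches = []
    for root, _dirs, files in os.walk(history_dir):
        for name in files:
            if JOB_CONF_PATTERN.search(name):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Usage (path taken from the question's TraceBuilder arguments):
#   find_job_conf_files("/home/ubuntu/Documents/testlog")
```

An empty list means the input directory passed to TraceBuilder does not contain the conf files described above, so the copy of logs/history is worth re-checking.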
