AWS Glue JOB: Command failed with error code 1 - parquet

We have python script for our glue job and the triggered runs for every one hour to convert the JSON S3 to parquet files and we are getting following issue..the following logs are taken from cloudwatch for the jobId
:
CoarseGrainedExecutorBackend: Driver commanded a shutdown
18/06/25 08:54:03 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-34-26.ec2.internal/172.31.34.26:36135 is closed
18/06/25 08:54:03 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-172-31-34-26.ec2.internal/172.31.34.26:36135 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/06/25 08:54:03 INFO CoarseGrainedExecutorBackend: Driver from 172.31.47.44:45951 disconnected during shutdown
18/06/25 08:54:03 INFO CoarseGrainedExecutorBackend: Driver from 172.31.47.44:45951 disconnected during shutdown
18/06/25 08:54:03 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
18/06/25 08:54:03 INFO MemoryStore: MemoryStore cleared
18/06/25 08:54:03 INFO BlockManager: BlockManager stopped
18/06/25 08:54:03 INFO ShutdownHookManager: Shutdown hook called

Open Glue> Jobs > Edit your Job> Script libraries and job parameters (optional) > Job parameters near the bottom
Set the following: key: --conf value: spark.yarn.executor.memoryOverhead=1024 spark.driver.memory=10g

There is no way to fix this issue,AWS Glue has so many enhancements that are to be done.
As of now we split our folder into multiple sub folders and split our glue job to two to handle this scenario,and also the memory overhead was not being considered when we give our own script option.

You need to reduce the number of files that you are storing into the S3 bucket by accumulating the data into a single big file,glue is efficient on bigger files

Related

Kafka fails on start due to topic not being loaded

I have setup Kafka server and a zookeeper in a windows machine with help from here. I was successfully able to setup a topic - MTETest as in below log, produce and consume messages to this topic.
On trying to stop and start Kafka and Zookeeper using the batch files that came with installation in a adminitrator command prompt, I am facing a problem that the kafka server is unable to start with below message -
[2017-11-30 21:26:24,601] ERROR There was an error in one of the
threads during logs loading: java.nio.file.FileSystemException:
C:SourceKafkakafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex:
The process cannot access the file because it is being used by another
process. (kafka.log.LogManager) [2017-11-30 21:26:24,603] FATAL
[Kafka Server 0], Fatal error during KafkaServer startup. Prepare to
shutdown (kafka.server.KafkaServer) java.nio.file.FileSystemException:
C:SourceKafkakafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex:
The process cannot access the file because it is being used by another
process.
at
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165) at
kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:318) at
kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:279) at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.Log.loadSegmentFiles(Log.scala:279) at
kafka.log.Log.loadSegments(Log.scala:383) at
kafka.log.Log.(Log.scala:186) at
kafka.log.Log$.apply(Log.scala:1609) at
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) [2017-11-30 21:26:24,606]
WARN Found a corrupted index file due to requirement failed: Corrupt
index found, index file
(C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index)
has non-zero size but the last offset is 0 which is no larger than the
base offset 0.}. deleting
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.timeindex,
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index,
and
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.txnindex
and rebuilding index... (kafka.log.Log) [2017-11-30 21:26:24,609] INFO
[Kafka Server 0], shutting down (kafka.server.KafkaServer) [2017-11-30
21:26:24,613] INFO Terminate ZkClient event thread.
(org.I0Itec.zkclient.ZkEventThread) [2017-11-30 21:26:24,615] WARN
Found a corrupted index file due to requirement failed: Corrupt index
found, index file
(C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index)
has non-zero size but the last offset is 0 which is no larger than the
base offset 0.}. deleting
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.timeindex,
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index,
and
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.txnindex
and rebuilding index... (kafka.log.Log) [2017-11-30 21:26:24,616] INFO
Session: 0x1600d98747a0001 closed (org.apache.zookeeper.ZooKeeper)
[2017-11-30 21:26:24,623] INFO EventThread shut down for session:
0x1600d98747a0001 (org.apache.zookeeper.ClientCnxn) [2017-11-30
21:26:24,625] INFO [Kafka Server 0], shut down completed
(kafka.server.KafkaServer) [2017-11-30 21:26:24,626] FATAL Exiting
Kafka. (kafka.server.KafkaServerStartable) [2017-11-30 21:26:24,628]
INFO [Kafka Server 0], shutting down (kafka.server.KafkaServer)
I have tried changing the setting - delete.topic.enable to true as per suggestion in a similar question here in kafka server.properties, but it did not help. Also, I did not open the topic or its related files manually. Anyone faced this issue, please help. Is this problem specific to windows?
It loads successfully when I delete the topic and its related physical folders that are created by Kafka, but it is not the right thing. Please suggest the correct solution.
Thanks.
According to the error message:
The process cannot access the file because it is being used by another process.
you have another process already using this file and it is preventing Kafka from starting. See https://serverfault.com/questions/1966/how-do-you-find-what-process-is-holding-a-file-open-in-windows for finding the process

If first attemp to reduce faills (network connection issues), the subsequent reduce attempts (retry) will fail because the output file already exists

I have mapreduce jobs failing big on Amazon EMR because if the first attempt fails to copy results to S3, the file (probably partial) will be created and subsequent reduce attempts will refuse write on a file that already exists.
The first attempt log:
014-11-30 06:56:19,774 INFO [main] com.amazonaws.latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: null; Request ID: removed), S3 Extended Request ID: removed=], ServiceName=[Amazon S3], AWSErrorCode=[null], AWSRequestID=[removed], ServiceEndpoint=[https://devel.rui.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=0, ClientExecuteTime=[130.087], HttpRequestTime=[118.72], HttpClientReceiveResponseTime=[32.585], RequestSigningTime=[0.646], HttpClientSendRequestTime=[0.835],
2014-11-30 06:56:19,803 INFO [main] com.amazonaws.latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: null; Request ID: removed), S3 Extended Request ID: 1removed=], ServiceName=[Amazon S3], AWSErrorCode=[null], AWSRequestID=[removed], ServiceEndpoint=[https://removed.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[27.899], HttpRequestTime=[26.898], HttpClientReceiveResponseTime=[9.405], RequestSigningTime=[0.559], HttpClientSendRequestTime=[1.016],
2014-11-30 06:56:19,939 INFO [main] com.amazonaws.latency: StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[removed], ServiceEndpoint=[https://removedi.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[127.219], HttpRequestTime=[20.791], HttpClientReceiveResponseTime=[15.467], RequestSigningTime=[0.391], ResponseProcessingTime=[82.617], HttpClientSendRequestTime=[0.955],
2014-11-30 06:56:19,999 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
A retry attempt log (the all look the same):
RequestSigningTime=[0.663], ResponseProcessingTime=[12.466], HttpClientSendRequestTime=[0.832],
2014-11-30 07:23:56,526 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child :
java.io.IOException: File already
exists:s3n://removed/removed/part-r-00005.gz
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:615)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:891)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:788)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:169)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
The funny thing is that if I open the partfiile0005.gz it has things inside and is the format that is supposed to be.
Any ideas, how to solve this (and how to do it):
a) increase the deal with the latency (eg. increase the timeout)
b) the retry to delete the existing file if already exists.
You can modify your job to write output to a temporary directory that is named with a jobId or timestamp for uniqueness, then when processing is complete move the contents to your desired output location. That way if something goes wrong while processing after having written partial output, your desired output directory isn't affected. This also means that you wont accidentally read that partial output from the failed job.

While running a topology in storm we are getting error like this

While running a topology in storm we are getting error like this,
8983 [Thread-6] INFO com.netflix.curator.framework.imps.CuratorFrameworkImpl -
Starting
9144 [main] INFO **backtype.storm.daemon.nimbus** - Shutting down master
9199 [Thread-6-EventThread] INFO backtype.storm.zookeeper - Zookeeper state upd
ate: :connected:none
9241 [main] INFO backtype.storm.daemon.nimbus - Shut down master
9273 [Thread-6] INFO com.netflix.curator.framework.imps.CuratorFrameworkImpl -
Starting
9306 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0003, likely client has closed socket
9354 [main] INFO backtype.storm.daemon.supervisor - Shutting down c094c3b1-a378
-4c4f-af35-9278647c217a:4beddc09-4675-4fb9-8bdc-9cf5013ce9ca
9358 [main] INFO backtype.storm.daemon.supervisor - Shut down c094c3b1-a378-4c4
f-af35-9278647c217a:4beddc09-4675-4fb9-8bdc-9cf5013ce9ca
9361 [main] INFO **backtype.storm.daemon.superviso**r - Shutting down supervisor c0
94c3b1-a378-4c4f-af35-9278647c217a
9364 [Thread-5] INFO **backtype.storm.event** - Event manager interrupted
9369 [Thread-6] INFO backtype.storm.event - Event manager interrupted
9425 [main] INFO **backtype.storm.daemon.supervisor** - Shutting down supervisor 38
6d8d71-c9b5-4b51-bd6e-f9f605034ea0
9428 [Thread-8] INFO backtype.storm.event - Event manager interrupted
9429 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0007, likely client has closed socket
9429 [Thread-9] INFO backtype.storm.event - Event manager interrupted
9473 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0009, likely client has closed socket
9476 [main] INFO backtype.storm.testing - Shutting down in process zookeeper
9503 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - Ignoring exception
**java.nio.channels.ClosedChannelException**: null
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.jav
a:211) ~[na:1.7.0_03]
at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.j
ava:242) ~[zookeeper-3.3.3.jar:3.3.3-1073969]
9510 [main] INFO **backtype.storm.testing** - Done shutting down in process zookeep
er
9513 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\c9b1bc1a-a950-4098-af77-f81a4d2b112f
9520 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\7e75c468-18ea-4787-a4ac-496fb108db71
9527 [main] INFO backtype.storm.testing - Unable to delete file: C:\Users\sowmi
ya\AppData\Local\Temp\7e75c468-18ea-4787-a4ac-496fb108db71\version-2\log.1
9529 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\fa7b3c9b-ac93-4090-b9e2-63f10019e61f
9543 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\55f1fd11-508e-43bb-b340-0d9b79f3af33
9579 [Thread-6-EventThread] INFO com.netflix.curator.framework.state.Connection
StateManager - State change: SUSPENDED
9580 [ConnectionStateManager-0] WARN com.netflix.curator.framework.state.Connec
tionStateManager - There are no ConnectionStateListeners registered.
9583 [Thread-6-EventThread] WARN backtype.storm.cluster - Received event :disco
nnected::none: with disconnected Zookeeper.
11232 [Thread-6-SendThread(localhost:2000)] WARN org.apache.zookeeper.ClientCnx
n - Session 0x143af55728d000b for server null, unexpected error, closing socket
connection and attempting reconnect
**java.net.ConnectException: Connection refused: no further information**
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.7.0_0
3]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701
) ~[na:1.7.0_03]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
~[zookeeper-3.3.3.jar:3.3.3-1073969]
13992 [Thread-6-SendThread(localhost:2000)] WARN org.apache.zookeeper.ClientCnx
n - Session 0x143af55728d000b for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.7.0_0
3]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701
) ~[na:1.7.0_03]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
Whwn we are trying to run the topology jar file all the operation like nimbus,zookeeper and supervisor process going to dead.please help us to know why this is happened.
Please help us to rectify this error and help to proceed further.
Thank you,
Sowmiya
Priya
This looks like a zookeeper issue. It looks like your processes are not being able to connect to zookeeper. Can't say more without more information.

HBase distributed log-splitting keeps failing because unable to get a lease

We used up all the free space on our test HDFS cluster so HBase crashed. After cleaning up some space, we were able to restart HBase, but after the startup a distributed log split job keeps failing.
The job looks like this:
Splitting log file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 into a temporary staging area.
The Regionserver are trying to get a lease on the file for some time:
2013-10-24 11:50:47,662 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker: tasks arrived or departed
2013-10-24 11:50:47,671 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker host-4,60020,1382614844870 acquired task /hbase/splitlog/hdfs%3A%2F%2F192.168.249.1%3A9000%2Fhdfs%2Fhbase%2F.logs%2Fhost-3%2C60020%2C1382113928374-splitting%2Fhost-3%252C60020%252C1382113928374.1382523937002
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002, length=41274332
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering lease on dfs file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002
2013-10-24 11:50:47,673 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=0 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 1ms
2013-10-24 11:50:50,674 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=1 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 3002ms
2013-10-24 11:50:51,674 DEBUG org.apache.hadoop.hbase.util.FSHDFSUtils: isFileClosed not available
2013-10-24 11:51:51,680 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=2 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 64008ms
Then the Master abort the job:
2013-10-24 11:55:48,685 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2013-10-24 11:55:48,687 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 interrupted, resigning
java.io.InterruptedIOException
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:136)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:54)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:780)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:414)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:112)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:211)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:179)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:118)
... 9 more
It seems to me that the problem is the Regionserver which are unable to get a lease on this file, because it's already open, so I checked with sudo -u hdfs hadoop fsck /hdfs/hbase/.logs/ -openforwrite, and it confirms:
OPENFORWRITE: /hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 41274332 bytes, 1 block(s), OPENFORWRITE:
/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002: Under replicated blk_1073337163743094520_3534698. Target Replicas is 3 but found 2 replica(s).
I tried to shut down HBase, but the file stays OPENFORWRITE. How could I remove this flag?
ps> Hadoop 1.0.1, HBase 0.94.12

FATAL master.HMaster: Unexpected state : .. Cannot transit it to OFFLINE

I've got a serious Hbase crash problem. I'm using HBase 0.94.7 with one master and two region servers. The HBase master crashed regularly, I can't even get it restarted. I've got the master logs as following:
DEBUG master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=master,60020,1374506461230, region=46c2333f401964bf877254be19c2cc8c
DEBUG handler.ClosedRegionHandler: Handling CLOSED event for 6423df864603aa6e8c45c726ab3ae62f
DEBUG master.AssignmentManager: Forcing OFFLINE; was=LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF8\xB3\x8F\x17\xCE\xE2g\x84,1374498065657.6423df864603aa6e8c45c726ab3ae62f. state=CLOSED, ts=1374508769672, server=slave,60020,1374506460892
DEBUG zookeeper.ZKAssign: master:60000-0x14006f52f3f000e Creating (or updating) unassigned node for 6423df864603aa6e8c45c726ab3ae62f with OFFLINE state
FATAL master.HMaster: Unexpected state : LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. state=PENDING_OPEN, ts=1374508769697, server=master,60020,1374506461230 .. Cannot transit it to OFFLINE.
java.lang.IllegalStateException: Unexpected state : LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. state=PENDING_OPEN, ts=1374508769697, server=master,60020,1374506461230 .. Cannot transit it to OFFLINE.
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
INFO master.HMaster: Aborting
DEBUG handler.ClosedRegionHandler: Handling CLOSED event for 0710b486dcb3d51465695b51db376255
....
DEBUG master.AssignmentManager: The znode of region LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. has been deleted.
INFO master.AssignmentManager: The master has opened the region LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. that was online on master,60020,1374506461230
DEBUG master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=master,60000,1374508461536, region=c9cfdd360c09b292412ba5ad88815e6f
DEBUG catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker#5c061cd2
INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x14006f52f3f000f
INFO zookeeper.ZooKeeper: Session: 0x14006f52f3f000f closed
INFO zookeeper.ClientCnxn: EventThread shut down
INFO master.AssignmentManager$TimerUpdater: master,60000,1374508461536.timerUpdater exiting
INFO master.SplitLogManager$TimeoutMonitor: master,60000,1374508461536.splitLogManagerTimeoutMonitor exiting
INFO master.AssignmentManager$TimeoutMonitor: master,60000,1374508461536.timeoutMonitor exiting
INFO zookeeper.ZooKeeper: Session: 0x14006f52f3f000e closed
INFO zookeeper.ClientCnxn: EventThread shut down
INFO master.HMaster: HMaster main thread exiting
ERROR master.HMasterCommandLine: Failed to start master
I also found something unusual in the ZK log:
INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /master:37856
INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /master:37856
INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x140100dda0300e1 with negotiated timeout 180000 for client /master:37856
WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x140100dda0300e1, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:662)
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /master:37856 which had sessionid 0x140100dda0300e1
Can anybody help to see what the problem is? Is it related to the unassigned region or something like this? I've tried the bin/hbase hbck -repair and bin/hbase hbck -fix, but it doesn't help.
Thanks
After checked the log of my region server very carefully, I got the answer.
Cause
It turns out that there is one library called 'SNAPPY' for the compression of the hbase table is not well installed on the region server. And all my tables are created using this compression algorithm. When the master tries to balance the region to the region server, it failed. Eventually the master aborted.
Solution
Install and configure the SNAPPY on EVERY NODE as following:
apt-get install libsnappy1
su hbase
mkdir /home/hbase/hbase-0.94.7/lib/native/Linux-amd64-64
ln -s /usr/lib/libsnappy.so.1.1.2 /home/hbase/hbase-0.94.7/lib/native/Linux-amd64-64/libsnappy.so
exit (-> root)
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so.1.1.2
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so.1
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so
ln -s /usr/lib/libsnappy.so.1 /usr/lib/libsnappy.so
Now everything is OK! The regions are well balanced over region servers.
Check the region server log, if it is caused by LZO compressor missing and you are using Cloudera Hadoop,you can install lzo easily according to the following instruction:
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v1/v1-0-1/Installing-and-Using-Impala/ciiu_lzo.html

Resources