HDFS Showing 0 Blocks after cluster reboot - hadoop

I've set up a small cluster for testing / academic purposes. I have 3 nodes, one of which acts as both namenode and datanode (and secondary namenode).
I've uploaded 60GB of files (about 6.5 million files) and uploads started to get really slow, so, following advice I read online, I stopped the secondary namenode service on the main machine. At the time, this had no visible effect.
After I rebooted all 3 computers, two of my datanodes show 0 blocks (despite showing disk usage in the web interface), even with both namenode services running.
One of the affected nodes is the one that also runs the namenode, so I am guessing it is not a network problem.
Any ideas on how I can get these blocks recognized again (without starting over, since the upload took about two weeks)?
Update
About half an hour after another reboot, this showed up in the logs:
2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s). This took 679 msec to generate and 94 msecs for RPC and NN processing. Got back no commands.
2018-03-01 08:22:50,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "Warpcore/192.168.15.200"; destination host is: "warpcore":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
The EOF stack trace followed. After searching the web I found this [http://community.cloudera.com/t5/CDH-Manual-Installation/CDH-5-5-0-datanode-failed-to-send-a-large-block-report/m-p/34420], but I still can't understand how to fix it.
The block report is too big and needs to be split, but I don't know how or where to configure this. I'm still googling...
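The log line above already carries the numbers that matter: a single storage report holding all 6656617 blocks, of which none were sent. A minimal Python sketch for pulling those fields out of a DataNode log follows. The regular expression is written against the one message quoted above and may not match other Hadoop versions, and `parse_block_report` is a hypothetical helper name. With only one storage, I would expect the per-storage splitting governed by `dfs.blockreport.split.threshold` to have nothing to split, though that is an assumption I have not verified on this cluster.

```python
import re

# The DataNode message quoted above (trailing timing details omitted).
LOG_LINE = (
    "2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
    "Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), "
    "of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s)."
)

# Pattern derived from that single message; other versions may word it differently.
BLOCK_REPORT = re.compile(
    r"(?P<status>Unsuccessfully|Successfully) sent block report \S+ "
    r"containing (?P<storages>\d+) storage report\(s\), "
    r"of which we sent (?P<sent>\d+)\. "
    r"The reports had (?P<blocks>\d+) total blocks "
    r"and used (?P<rpcs>\d+) RPC\(s\)"
)


def parse_block_report(line):
    """Return the block report counters from one log line, or None if it does not match."""
    match = BLOCK_REPORT.search(line)
    if match is None:
        return None
    storages = int(match["storages"])
    blocks = int(match["blocks"])
    return {
        "ok": match["status"] == "Successfully",
        "storages": storages,
        "sent": int(match["sent"]),
        "total_blocks": blocks,
        "rpcs": int(match["rpcs"]),
        # Average number of blocks carried by each storage report.
        "blocks_per_storage": blocks / storages if storages else 0.0,
    }


if __name__ == "__main__":
    print(parse_block_report(LOG_LINE))
```

Run over a whole DataNode log, this shows whether the number of blocks per storage report drops after a configuration change.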

The problem seems to be low RAM on my namenode. As a workaround, I added more directories to the namenode configuration, as if I had multiple disks, and rebalanced the files manually as instructed in the comments here.
Because Hadoop 3.0 reports each disk separately, the datanode was able to send its reports and I was able to retrieve the files. This is an ugly workaround and not for production, but it is good enough for my academic purposes.
An interesting side effect was that the datanode reported the available disk space multiple times, which could lead to serious problems in production.
A better solution seems to be using HAR to reduce the number of blocks, as described here and here.
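The arithmetic behind the small-files problem is easy to check. The sketch below uses the figures from the question (60GB in about 6.5 million files) and assumes a 128 MB block size, which is the usual HDFS default but may differ on this cluster. Since a file occupies at least one block, the namenode has to track at least one block per file, however small the file is.

```python
import math

TOTAL_BYTES = 60 * 1024**3   # 60GB, figure from the question
FILE_COUNT = 6_500_000       # about 6.5 million files, from the question
BLOCK_SIZE = 128 * 1024**2   # assumed dfs.blocksize of 128 MB


def average_file_size(total_bytes, file_count):
    """Mean file size in bytes."""
    return total_bytes / file_count


def min_blocks_if_packed(total_bytes, block_size):
    """Blocks needed if the same bytes were stored in full-size blocks."""
    return math.ceil(total_bytes / block_size)


if __name__ == "__main__":
    avg = average_file_size(TOTAL_BYTES, FILE_COUNT)
    packed = min_blocks_if_packed(TOTAL_BYTES, BLOCK_SIZE)
    print(f"average file size: {avg / 1024:.1f} KiB")
    print(f"blocks today (at least one per file): >= {FILE_COUNT}")
    print(f"blocks if packed into full blocks: {packed}")
```

Under those assumptions the same data would fit in a few hundred full blocks instead of millions, which is the kind of reduction that archiving many small files into HAR files is aiming for.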

Related

Hadoop Data Corrupted Following Power Failure

I'm new to Hadoop and learning to use it by working with a small cluster where each node is an Ubuntu Server VM. The cluster consists of 1 name node and 3 data nodes with a replication factor of 3. After a power loss on the machine hosting the VMs, all files stored in the cluster were corrupted, and the blocks storing those files were missing. No queries were running at the time power was lost and no files were being written to or read from the cluster.
If I shut down the VMs correctly (even without first stopping the Hadoop cluster), then the data is preserved and I don't run into any issues with missing or corrupted blocks.
The only information I've been able to find suggested setting dfs.datanode.sync.behind.writes to true, but this did not resolve the issue (killing the VMs from the host causes the same issue as a power failure). The information I found here seems to indicate this property will only have an effect when writing data to the disk.
I also tried running hdfs namenode -recover, but this did not resolve the issue. Ultimately I had to remove the data stored in the dfs.namenode.name.dir directory, reboot each VM in the cluster to remove any Hadoop files in /tmp, and reformat the name node before copying the data back into the cluster from local file storage.
I understand that having all nodes in the cluster running on the same hardware and only 3 data nodes to go with a replication factor of 3 is not an ideal configuration, but I'd like a way to ensure that any data that is already written to disk is not corrupted by a power loss. Is there a property or other configuration I need to implement to avoid this in the future (besides separate hardware, more nodes, power backup, etc.)?
EDIT: To clarify further, the issue I'm trying to resolve is data corruption, not cluster availability. I understand I need to make changes to the overall cluster architecture to improve reliability, but I'd like a way to ensure data is not lost even in the event of a cluster-wide power failure.

Hadoop - BLOCK* blk_XXXXX on 10.XX.XX.XX size XX does not belongs to any file

I deleted multiple old files (Hive logs / MR job intermediate files) from the HDFS location /temp/hive-user/hive_2015*.
After that, I noticed that my four-node cluster was responding very slowly and showing the issues below.
I restarted my cluster; it worked fine for 3-4 hours, and then the same issues came back:
The Hadoop DFS health page loads very slowly.
File browsing is very slow.
The Namenode logs are filling up with "Blocks does not belongs to any File".
All operations on my cluster are slow.
I found that this could be caused by the deletion itself. According to HDFS JIRA issues 7815 and 7480, because I deleted a huge number of files, the Namenode could not delete the blocks properly, as it was busy with multiple deletion tasks. This is an existing bug in older versions of Hadoop (older than 2.6.0).
Can anyone please suggest a quick fix that does not require upgrading my Hadoop cluster or installing a patch?
How can I identify those orphan blocks and delete them from HDFS?

When are files closed in HDFS

I'm running into a few issues when writing to HDFS (through Flume's HDFS Sink). I think these are mostly caused by IO timeouts, but I'm not sure.
I end up with files that stay open for write for a very long time and give the error "Cannot obtain block length for LocatedBlock{... }". This can be fixed if I explicitly recover the lease. I'm trying to understand what could cause this. I've been trying to reproduce it outside Flume but have had no luck yet. Could someone help me understand when such a situation could happen, where a file on HDFS ends up not getting closed and stays that way until someone manually recovers the lease?
I thought the lease is recovered automatically based on the soft and hard limits. I've tried killing my sample code (I've also tried disconnecting network to make sure no shutdown hooks are executed) that is writing to HDFS to leave a file open for write but couldn't reproduce it.
We have had recurring problems with Flume, but it's substantially better with Flume 1.6+. We have an agent running on servers external to our Hadoop cluster with HDFS as the sink. The agent is configured to roll to new files (close current, and start a new one on the next event) hourly.
Once an event is queued on the channel, the Flume agent operates in a transactional manner -- the file is sent, but not dequeued until the agent can confirm a successful write to HDFS.
In the case where HDFS is unavailable to the agent (restart, network issue, etc.) there are files left on HDFS that are still open. Once connectivity is restored, Flume agent will find these stranded files and either continue writing to them, or close them normally.
However, we have found several edge cases where files seem to get stranded and left open, even after the hourly rolling has successfully renamed the file. I am not sure if this is a bug, a configuration issue, or just the way it is. When it happens, it completely messes up subsequent processing that needs to read the file.
We can find these files with hdfs fsck /foo/bar -openforwrite and can successfully hdfs dfs -mv them then hdfs dfs -cp from their new location back to their original one -- a horrible hack. We think (but have not confirmed) that hdfs debug recoverLease -path /foo/bar/openfile.fubar will cause the file to be closed, which is far simpler.
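The find-then-recover step can be scripted. Below is a minimal Python sketch that scans `hdfs fsck ... -openforwrite` output for open files and prints the matching `hdfs debug recoverLease` command for each one. The sample fsck lines are illustrative, not captured from a real cluster, and the exact fsck line format varies between Hadoop versions, so the regular expression is an assumption to adjust. `open_for_write_paths` and `recover_lease_command` are hypothetical helper names.

```python
import re
import shlex

# Illustrative fsck lines (assumed format, not real cluster output).
SAMPLE_FSCK_OUTPUT = """\
/foo/bar/events.1.log 1024 bytes, 1 block(s):  OK
/foo/bar/openfile.fubar 0 bytes, 1 block(s), OPENFORWRITE:
"""

# A path, then "<n> bytes," somewhere before the OPENFORWRITE marker.
OPEN_LINE = re.compile(r"^(?P<path>/.*?) \d+ bytes,.*OPENFORWRITE")


def open_for_write_paths(fsck_output):
    """Return the paths of files that fsck reports as open for write."""
    paths = []
    for line in fsck_output.splitlines():
        match = OPEN_LINE.match(line)
        if match:
            paths.append(match["path"])
    return paths


def recover_lease_command(path):
    """Build the recoverLease command as an argument list (no shell quoting issues)."""
    return ["hdfs", "debug", "recoverLease", "-path", path]


if __name__ == "__main__":
    for path in open_for_write_paths(SAMPLE_FSCK_OUTPUT):
        print(shlex.join(recover_lease_command(path)))
```

In practice the input would come from running the fsck command itself, for example via `subprocess.run`, and the printed commands are worth reviewing before anything is executed.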
Recently we had a case where we stopped HDFS for a couple minutes. This broke the flume connections, and left a bunch of seemingly stranded open files in several different states. After HDFS was restarted, the recoverLease option would close the files, but moments later there would be more files open in some intermediate state. Within an hour or so, all the files had been successfully "handled" -- my assumption is that these files were reassociated with the agent channels. Not sure why it took so long -- not that many files. Another possibility is that it's pure HDFS cleaning up after expired leases.
I am not sure this is an answer to the question (which is also 1 year old now :-) ) but it might be helpful to others.

Continuously shows Capacity used 90%

I've two questions.
How do I mount the directory for Ambari disk usage?
I started to run the TeraGen program and it does not get beyond 10% of the map tasks. Ambari continuously shows me the message: Capacity Used: [90.69%, 27.7 GB], Capacity Total: [30.5 GB], path=/usr/hdp. I restarted the cluster and restarted Ambari, but to no avail.
What is the way around this?
Well, after some trial and error I found the solution:
Change the location of the log and local directories to a bigger partition.
Remove the old log files from the Ambari server.
Documented here.
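Before moving the log and local directories, it helps to see which mount point actually has room. A minimal Python sketch using the standard library's `shutil.disk_usage` follows. The candidate paths other than `/usr/hdp` (which comes from the Ambari alert above) are placeholders, and the 90% threshold simply mirrors the alert rather than any Ambari setting I have confirmed.

```python
import os
import shutil


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def report(paths, threshold=90.0):
    """Return (path, percent used, free GB, over threshold) for each existing path."""
    rows = []
    for path in paths:
        if not os.path.exists(path):
            continue  # skip placeholder paths that are absent on this host
        used = percent_used(path)
        free_gb = shutil.disk_usage(path).free / 1024**3
        rows.append((path, round(used, 2), round(free_gb, 1), used >= threshold))
    return rows


if __name__ == "__main__":
    for path, used, free_gb, over in report(["/usr/hdp", "/", "/tmp"]):
        flag = "OVER" if over else "ok"
        print(f"{path}: {used}% used, {free_gb} GB free [{flag}]")
```

The numbers may differ slightly from what Ambari or `df` shows, since tools account for reserved space differently.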

SparkException: Master removed our application

I know there are other very similar questions on Stackoverflow but those either didn't get answered or didn't help me out. In contrast to those questions I put much more stack trace and log file information into this question. I hope that helps, although it made the question to become sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successfully as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log I think the interesting parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker@172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker@172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executor's stderr is empty. However, I think it shows nothing useful for tackling the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions, but none of them got answered or helped me solve the issue. Anyway, the things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout, which makes the Spark/Shark app live longer, but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, the timeouts and the fact that long-running queries execute successfully in local mode both point to some misconfiguration. However, I cannot be sure, and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I do not intend to post this as an answer to the question, as it is still unclear what was wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node, and one or more workers.
Another thing to try: Verify that all workers are reachable by hostname from the master, either via dns or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, then adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, then communication from the master using the worker's hostname failed and the master killed the job.
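The hostname check can be automated from the master. Below is a minimal Python sketch that tries to resolve each worker hostname and lists the ones that fail. The hostnames in the example are placeholders, not the real node names, and `unresolved_hosts` is a hypothetical helper. It only tests name resolution (DNS or /etc/hosts), not whether the Spark ports are reachable.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that cannot be resolved to an address."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder names; substitute the hostnames of the actual workers.
    workers = ["localhost", "worker-1.example.internal", "worker-2.example.internal"]
    missing = unresolved_hosts(workers)
    if missing:
        print("Not resolvable from this host:", ", ".join(missing))
    else:
        print("All worker hostnames resolve.")
```

Running it on the master after each node is added would catch the case described above, where a newer worker is missing from the master's /etc/hosts file.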

Resources