Hadoop is sometimes too slow (stuck at 100%) - hadoop

I set up a cluster of ten machines in which I installed CDH4 (yarn).
I run the nameNode, the resourceManger and the historyServer in the same node, and the client in another node.
On the rest of machines, I turned on dataNode and NodeManager.
I launched my application on a 100GBytes file, it worked at first and it was relatively quick, but now it gets really really slow at the end of the map (around 90% 100% it takes 30 minutes).
I don't know if the problem comes from the way I coded the program or the way I configured cloudera CDH4.
The problem is that it works sometimes but does not work other times although I didn't change anything.

I found out why it took so much time at the end, in fact I thought that the command hadoop fs -expunge allows me to empty the trash but it doesn't, so when Hadoop tried to write in HDFS files the result it was very slow because there was a very little space left.

Related

Spark seems to scale vertically, without no configuration effort

I did an install of Hadoop 2.7.2 single node, almost manually following tutorial on internet, then almost manually I even installed Spark, starting from spark-1.6.0-bin-hadoop2.6.tgz, a version that claims to work with hadoop 2.6+.
I did no effort to configure Spark, just started to use it with interactive python: it worked immediately, that is a little surprising too, but anyway...
Then I decided to run an example in order to see if it scale properly vertically. My box as 4 CPU, so i decide to run an easy to parallelize job, ie the PI computation: http://spark.apache.org/examples.html.
Surprisingly, the result is this one ( the box is an old 4 core ):
and, from the PySpark console:
It seems to me that it scale perfectly on the 4 cpu cores I have.
Problem: I did not configure number of cores, and I don't know why, this is a typical behavior that will not work in production, and would be hard to explain what to configure. Is there some YARN/SPARK feature that automatically scale to the cores available?
Then, other question, is this using YARN? How can I understand if my Spark leverages YARN a s a cluster manager?

5.6 GB not enough for Cloudera?

I am running Cloudera Hadoop on my laptop and Oracle VirtualBox VM.
I have given 5.6 GB out of mine 8 and six from eight cores as well.
And still I am not able to keep it up and running.
Even without load services would not stay up and running and when I try a query at least Hive will be down within 20 minutes. And sometimes they go down like dominoes: one after another.
More memory seemed to help some: with 3GB and all services, Hue was blinking with red colors when the Hue itself managed to get up. And after rebooting it would takes 30 - 60 minutes before I manage to get the system up enough to even try running anything on it.
There has been two sensible notes (that I have managed to find):
- Warning of swapping.
- Crashing note when the system used 26 GB of virtual memory which was not enough.
My dataset is less than one megabyte, so it is hard to understand why the system would go up to dozens of gigabytes, but for whatever was reason for that has passed: now the system is running more steadily around the 5.6 GB that I have given to it after closing down a few services: see my answer to myself.
And still it is just more stable. Right after I got a warning of swapping and the Hive went down again. What could be reason for more-or-less all Hadoop services going down if the VM starts to swap?
I don't have enough reputation to post the picture to here, but when Hive went down again it was swapping 13 pages / second and utilizing 5.9 GB / 5.6 GB. So basically my system starts crashing more-or-less right after it start to swap. "428 pages were swapped to disk in the previous 15 minute(s)"
I have used default installation options as far as hard drive is concerned.
Only addition is a shared folder between Windows and VM. That works somewhat strangely locking files all the time, so I used it just like FTP and only for passing files from one system to another. Thus I can go days without using it, but systems still crash, so that is not the cause either.
Now that the system is mostly up, services crash still about twice a day: Service Monitor and Hive are quite even with their crashing frequency. After those come Activity Monitor and Event Server, which appear to crash always together. I believe Yarn crashes as well, but it gets up on its own. Last time Hive crashed first, and then it got followed by Service Monitor, Hive (second time), Activity Monitor and Event Server all.
As swap is disk, perhaps the problem is with disk:
# cat /etc/fstab
# swapoff -a
# badblocks -v /dev/VolGroup/lv_swap
Checking blocks 0 to 8388607
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
# badblocks -vw /dev/VolGroup/lv_swap
Checking for bad blocks in read-write mode
From block 0 to 8388607
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
So nothing wrong with swap disk and I have not noticed any disk error anywhere else either.
Note that you could check file system from Windows side also. But I expect that if you make Windows to fix your Linux file system, you have good chances of destroying your Linux with that, so I did my checks somewhat pessimistically, because AFAIK these commands are safe to execute.
About half of the services kept going down, so giving more specifics would be a long story.
I succeeded to get the system more stable by closing down flume, hbase, impala, ks_indexer, oozie, spark and sqoop. And by increasing more memory to some remaining services that complained they had not been given enough memory.
Also I fixed couple of thing on the Windows side, I am not sure which one of these helped:
- MsMpEng.exe kept my hard drive busy. I didn't have permissions to kill it, but I decreased its priority to lowest possible.
- CcmExec.exe got to loop on my DVD and kept reading it for forever. This I solved by taking the DVD out from the drive. Then later on I killed the process tree to keep it from bothering for a while.
I found these using Windows resource manager.
The VM requires 4GB: http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html You should use that.
I am not clear whether you are using the QuickStart VM though. It's set up to run just the essential services and tuned to conserve memory rather than exploit lots of memory.
It sounds like you are running your own installation, on one virtual machine, on your Windows machine. You may be running an entire cluster's worth of services on one desktop machine. Each of these services has master, worker processes, monitoring processes, etc. You don't need most of them.
You also probably have left memory settings at default suitable for a server-class machine of 16+ GB RAM. Remember these services usually run across many machines, not all on one.
Finally, you're clearly swapping, and that makes things incredibly slow. Remember this is all through a VM too!
Bottom line, use the QuickStart VM if you really want a 1-machine cluster tuned correctly. If you want a real cluster or more services, you need more hardware.
Also consider: cloudera.com/live contains a full CDH 5.1 cluster + sample data, running on demand on AWS. Of course, the advantage of the VM is that you can BYOD, but if you're simply looking for a hands-on Hadoop experience, Live is a great option.

Could HDFS orphan files in datanode?

During a routine log pruning job where logs older than 60 days were being removed, a system administrator upgraded CDH from 4.3 to 4.6, (I know, I know)...
Normally, the log pruning job frees about 40% of HDFS's available storage. However, during the upgrade, datanodes went down, were rebooted, and all sorts of madness.
What's known is that HDFS received the delete commands, since the HDFS files / folders no longer exist, but disk utilization is still unchanged.
My question is, could HDFS have removed the files from the NameNode's metadata without actually fulfilling the file block deletes among the DataNodes, effectively orphaning the file blocks?
I think the namenode tells the datanodes to delete orphaned blocks, once it gets their report on the blocks they hold and it notices that some of them don't belong to any file.
If you don't want these blocks to be deleted, you can put the system in safemode and try to manually look through the disks and copy the data. There is no automated way of doing this, but a tool to list orphaned blocks may be added in the future (as suggested in this JIRA).
Additionally, you can try to check the health of the namesystem using Hadoop's fsck.

DataFlow difference in Hadoop Standalone and Pseudodistributed mode?

Can someone please tell me what is the difference in dataflow of Hadoop Standalone and Pseudodistributed mode. Infact I am trying to run an example of matrix multiplication presented by John Norstad. It runs fine in hadoop standalone mode but does not work properly in pseudodistributed mode. I am unable to fix the problem so please tell me the principle difference between hadoop standalone and pseudodistributed mode which can be helpful for fixing the stated problem.Thanks
Reagrds,
WL
In standalone mode everything (namenode, datanode, tasktracker, jobtracker) is running in one JVM on one machine. In pseudo-distributed mode, everything is running each in it's own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are more strict in pseudo-distributed mode.
My reasoning for the above is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode, it isn't strictly necessary for everything to be serializable (since everything is in one JVM, you have shared memory), but I don't remember if the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
EDIT: Given that you are not seeing an error, I think it sounds like a problem in the way the MapReduce job is coded. Perhaps he relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or truly distributed mode, for that matter).

Hadoop Quirky Behavior

Is it possible that datanode doesn't start sometimes on running start-all.sh but when you restart the computer it starts fine. What could be a cause of such a quirky behavior?
Do other java processes running within the same namespace corrupt the hadoop processes?
Here are some of the things I observed.
hadoop namenode -format. This command needs to be executed before you start-all. Otherwise namenode doesn't start.
Also check that the application you are using on top of hadoop complies with the version of hadoop that the application expects. In my case I was using hadoop 20.203 and I switched to 20.2 my problem was solved.
Check the logs, they often give valuable insights into what has gone wrong.

Resources