Cloudera Hue running WordCount - hadoop

I have successfully installed and started the CDH5 manager and agent. However, whenever I try running the MR hello-world job, i.e. WordCount, it runs up to 33%, stays there for a long time and does not proceed.
Any clues as to where it might be going wrong?
FYI, when trying to run in the terminal it works fine.

It is recommended to switch Hue to use the CherryPy server instead of Spawning. In the hue.ini or the Hue Safety Valve in CM, enter:
[desktop]
use_cherrypy_server = true
These issues may be due to Beeswax crashing or being very slow and blocking all the requests, as the Spawning server is not perfectly greenified.

Hue can use Oozie to submit jobs, and that requires one more MR task. Usually the problem is that YARN apps ask for too much memory for your cluster (so decrease their default resources in the YARN config), or it is gotcha #5 in http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
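To make the "one more MR task" point concrete, here is a rough sketch of the memory arithmetic. It assumes the Oozie launcher and the real job each need one ApplicationMaster container plus at least one map container at the same time, and that YARN rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb. The function names are made up for illustration, and the check ignores per-node fragmentation, so treat it as a sanity check rather than a sizing tool.

```python
import math


def round_up(request_mb, min_alloc_mb):
    """Round a container request up to a multiple of the minimum allocation."""
    return max(1, math.ceil(request_mb / min_alloc_mb)) * min_alloc_mb


def launcher_plus_job_fit(node_memory_mb, nodes, am_mb, map_mb, min_alloc_mb):
    """Return (needed_mb, fits) for an Oozie launcher plus the job it starts.

    Assumption: 2 ApplicationMasters (launcher + real job) and 2 map
    containers (launcher mapper + at least one mapper of the real job)
    must be allocated at the same time.
    """
    am = round_up(am_mb, min_alloc_mb)
    task = round_up(map_mb, min_alloc_mb)
    needed_mb = 2 * am + 2 * task
    return needed_mb, needed_mb <= node_memory_mb * nodes


if __name__ == "__main__":
    # Hypothetical single-node cluster with 2 GB handed to YARN.
    print(launcher_plus_job_fit(2048, 1, am_mb=1536, map_mb=1024, min_alloc_mb=1024))
```

If the second value comes back False, the job can sit at a fixed percentage while the launcher waits for containers that never become free, which would match the symptom in the question.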

Related

Spark Launcher jobs not starting because token can't be found in cache after 24 hours

I have a Java application which runs continuously and checks a database table for new records. When a new record is added to the table, the Java application unzips a file and puts it into an HDFS location, and then a Spark job gets triggered (I am triggering the Spark job programmatically using the "SparkLauncher" class inside the Java application), which processes the newly added file in the HDFS location.
I have scheduled the Java Application in cluster using Oozie Java Action.
The cluster is HDP kerberized cluster.
The job works perfectly fine for 24 hours. All the unzips happen and the Spark job runs.
But after 24 hours the unzip still happens in the Java application, while the Spark job no longer gets triggered in the Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours Oozie is renewing the token, and that token is not getting updated for the Spark launcher job. The Spark launcher is still looking for the older token, which is no longer available in the cache.
Please help me: how can I make the Spark launcher look for the new token?
"As per my understanding, after 24 hours oozie is renewing the token"
Why? Can you point to any documentation, source code, blog?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h, therefore renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there are many, many drawbacks -- how do you inspect the logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?) but it is probably overkill for simply running your toy app
or a plain Linux service running on some Edge Node -- it's a Do-It-Yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs -- for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as a Workflow property, edit the Workflow to pass that time to the Java app, and edit the Java code so that the app exits at (arg + 11:58) and clears the way for the next exec.
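The "exit at (arg + 11:58)" arithmetic could look roughly like this. It is written in Python only to show the logic; the Java app would do the equivalent with its own date classes. It assumes the nominal time arrives as a UTC string shaped like 2017-06-30T00:00Z, which is an assumption about how the Workflow passes it, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Run slightly less than the 12h Coordinator frequency so runs do not overlap.
RUN_FOR = timedelta(hours=11, minutes=58)


def exit_deadline(nominal_time):
    """Parse the nominal time (assumed format: yyyy-MM-ddTHH:mmZ, UTC) and add 11:58."""
    start = datetime.strptime(nominal_time, "%Y-%m-%dT%H:%MZ")
    return start.replace(tzinfo=timezone.utc) + RUN_FOR


def should_exit(nominal_time, now=None):
    """True once the current time has reached the deadline for this run."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= exit_deadline(nominal_time)


if __name__ == "__main__":
    print(exit_deadline("2017-06-30T00:00Z"))
```

The polling loop would call should_exit(...) between database checks and return normally when it becomes True, so that each run starts with a fresh delegation token.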

IPython Notebook with Spark on EC2 : Initial job has not accepted any resources

I am trying to run the simple WordCount job in IPython notebook with Spark connected to an AWS EC2 cluster. The program works perfectly when I use Spark in the local standalone mode but throws the problem when I try to connect it to the EC2 cluster.
I have taken the following steps
I have followed instructions given in this Supergloo blogpost.
No errors are found until the last line, where I try to write the output to a file. [The lazy evaluation feature of Spark means that this is when the program really starts to execute]
This is where I get the error
[Stage 0:> (0 + 0) / 2]16/08/05 15:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Actually there is no error, we have this warning and the program goes into an indefinite wait state. Nothing happens until I kill the IPython notebook.
I have seen this Stack Overflow post and have reduced the number of cores to 1 and the memory to 512 MB by using these options after the main command
--total-executor-cores 1 --executor-memory 512m
The screen capture from the SparkUI is as follows
[screenshot: Spark UI]
This clearly shows that neither the cores nor the memory are being fully utilized.
Finally, I see from this StackOverflow post that
The spark-ec2 script configure the Spark Cluster in EC2 as standalone,
which mean it can not work with remote submits. I've been struggled
with this same error you described for days before figure out it's not
supported. The message error is unfortunately incorrect.
So you have to copy your stuff and log into the master to execute your
spark task.
If this is indeed the case, then there is nothing more to be done, but since this statement was made in 2014, I am hoping that in the last 2 years the script has been fixed or that there is a workaround. If there is a workaround, I would be grateful if someone could point it out to me.
Thank you for reading this far and for any suggestions offered.
You cannot submit jobs except on the Master - as you see - unless you set up a REST-based Spark job server.

Get status when running job without hadoop

When I run a hadoop job with the hadoop application it prints a lot of stuff. Among other things, it shows the relative progress of the job ("map: 30%, reduce: 0%" and stuff like that). But when running a job without the application it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from Application Master (assuming you use YARN and not MR1 where you would get it from Job Tracker). There is usually web UI where you can find this information. Details will depend on your Hadoop installation / distribution.
In case of Hadoop v1, check the JobTracker web UI; in case of Hadoop v2, check the Application Master web UI.
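On Hadoop v2, the same information shown in the web UI can also be read from the ResourceManager REST API. The sketch below assumes the ResourceManager web address is reachable on port 8088 (the usual default, but check your installation) and reads the state and progress fields of the application report. Note that progress is the overall application progress, not the separate map/reduce percentages printed by the hadoop command. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def app_url(rm_host, app_id, port=8088):
    """Build the ResourceManager REST URL for one application."""
    return f"http://{rm_host}:{port}/ws/v1/cluster/apps/{app_id}"


def parse_status(body):
    """Extract (state, progress) from the JSON application report."""
    app = json.loads(body)["app"]
    return app["state"], app["progress"]


def fetch_status(rm_host, app_id, port=8088):
    """Query the ResourceManager and return (state, progress)."""
    with urlopen(app_url(rm_host, app_id, port), timeout=10) as resp:
        return parse_status(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Trimmed-down, hypothetical report used only to show the parsing step.
    sample = '{"app": {"id": "application_1", "state": "RUNNING", "progress": 30.0}}'
    print(parse_status(sample))
```

Calling fetch_status("rm-host", "application_...") in a loop with a short sleep would give a crude progress monitor; both arguments here are placeholders for your own values.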

Spark Shell stuck in YARN Accepted state

Running Spark 1.3.1 on Yarn and EMR. When I run the spark-shell everything looks normal until I start seeing messages like INFO yarn.Client: Application report for application_1439330624449_1561 (state: ACCEPTED). These messages are generated endlessly, once per second. Meanwhile, I am unable to use the Spark shell.
I don't understand why this is happening.
Seeing (near) endless Accepted messages from YARN has always been a sure sign that there were not enough cluster resources to allocate for my Spark jobs / shell. YARN will continue trying to schedule your Spark application, but will eventually time-out if not enough resources become available in a certain amount of time.
Are you providing any command line options to spark-shell that override the defaults provided? When I ask for too many executors/cores/memory YARN will accept my request but never transition to a Running ApplicationMaster.
Try running a spark-shell with no options (other than perhaps --master yarn) and see if it gets past Accepted.
I realized there were a couple of streaming jobs I had killed in the terminal, but I guess they were somehow still running. I was able to find these in the UI showing all running applications on YARN (I wasn't able to execute Hive queries either). Once I killed the jobs using the command below, the spark-shell started as usual.
yarn application -kill application_1428487296152_25597
I guess that YARN does not have enough resources for running the jobs.
Please check
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html
for calculating how many resources you can provide to YARN.
Please check the number of cores and the amount of RAM, which are controlled by the following variables:
yarn.nodemanager.resource.cpu-vcores
yarn.nodemanager.resource.memory-mb
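As a rough illustration of how these two settings bound what Spark can get, the sketch below estimates how many executors fit on one NodeManager. The overhead_mb value is an assumption (384 MB is used here as a placeholder; the real per-executor overhead depends on your Spark version and settings), and the function name is made up for this example.

```python
def executors_per_node(node_vcores, node_memory_mb,
                       executor_cores, executor_memory_mb,
                       overhead_mb=384):
    """Estimate how many executors one NodeManager can host.

    node_vcores    -> yarn.nodemanager.resource.cpu-vcores
    node_memory_mb -> yarn.nodemanager.resource.memory-mb
    overhead_mb is an assumed off-heap overhead added to each executor.
    """
    by_cpu = node_vcores // executor_cores
    by_memory = node_memory_mb // (executor_memory_mb + overhead_mb)
    return min(by_cpu, by_memory)


if __name__ == "__main__":
    # Hypothetical node: 8 vcores and 8 GB given to YARN, 2-core / 4 GB executors.
    print(executors_per_node(8, 8192, 2, 4096))
```

A result of 0 means a single executor of that size can never be placed on such a node, which is one way an application ends up stuck in the Accepted state.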

hadoop web interface failed to show job history

I can access most of the functionality of the Hadoop admin site, like below:
But when I try to view the history of an application, I have no luck:
Does anybody know what is wrong with my environment? Where should I check?
BTW, when I run "netstat -a" on my VM, I find no records for port 8088 or 19888, which makes no sense to me, because 8088 leads to the Hadoop main page and works well.
In this web interface, you can see your jobs in real time while they are running, or their history:
Once an M/R job finishes, the ResourceManager no longer keeps track of it. That is the job of the History Server.
Your History Server (an optional part of Hadoop YARN) does not seem to be running.
It is this service that listens on 19888.
You can launch it with the command: /etc/init.d/hadoop-mapreduce-historyserver start
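To confirm whether something is actually listening on 19888 (or 8088) without reading netstat output, a small TCP connect test is enough. This is only a sketch; the host name in the usage example is a placeholder for your own History Server host.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the host where the services run.
    for port in (8088, 19888):
        print(port, port_open("localhost", port))
```

If 8088 reports True and 19888 reports False, that would be consistent with the ResourceManager running while the History Server is not.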
