Pentaho MapReduce Job throwing error in Hortonworks Environment - hadoop

I am stuck with a strange problem. Pentaho Data Integration provides a sample job, "Word Count Job", to help users understand MapReduce jobs.
I am learning MapReduce and I am really lost with one strange error.
Error is :
"Caused by: java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name
and the correspond server addresses."
I have tried everything in my repertoire to resolve it, from changing the "plugin-properties" file in Pentaho Data Integration to re-installing the Pentaho shim, but to no avail.
As per the job's flow, the file is correctly transferred to the HDFS server from my local machine (where Pentaho Data Integration is running), but the moment the MapReduce job starts it throws the error.

Finally cracked it. The error occurred because in the core-site.xml file I had specified the cluster's IP address, whereas the cluster recognized itself by hostname. This mismatch was causing the error.
Hurray!!
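
As a rough illustration of the fix (in Java, using the plain Hadoop client API): the client-side configuration should reference the cluster by the hostname it advertises, not a raw IP. The hostname and ports below are hypothetical placeholders, and the same property names would go into core-site.xml / mapred-site.xml / yarn-site.xml in the Pentaho shim.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterConfigCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Reference the cluster by the hostname it advertises, not a raw IP,
        // so the client and the cluster agree on the node identity.
        conf.set("fs.defaultFS", "hdfs://hdp-master.example.com:8020");        // hypothetical hostname
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "hdp-master.example.com:8050"); // port may differ per distro

        // Quick sanity check: listing the HDFS root confirms the client
        // can actually reach the cluster with this configuration.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.exists(new Path("/")));
    }
}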

Related

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from an event hub with PySpark structured streaming, calculates some values based on the data, and streams the results back to another event-hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried to restart the job many times, also with clean input data and a clean checkpoint location.
We are struggling to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which are sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
Issue resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.
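The notebook itself is PySpark, but as a minimal sketch of the same idea in Spark's Java API (the storage account, container, paths, and sink format below are hypothetical), the fix amounts to pointing checkpointLocation at the premium storage account:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointToPremium {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("eventhub-stream").getOrCreate();

        // Hypothetical source: in the real notebook this is the Event Hubs reader.
        Dataset<Row> events = spark.readStream().format("rate").load();

        StreamingQuery query = events.writeStream()
                .format("delta")  // assumption: the sink format used by the job
                // The fix described above: keep the checkpoint directory on a
                // premium storage account (path below is hypothetical).
                .option("checkpointLocation",
                        "abfss://checkpoints@premiumaccount.dfs.core.windows.net/eventhub-job")
                .start("abfss://output@premiumaccount.dfs.core.windows.net/eventhub-job");

        query.awaitTermination();
    }
}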

HBase components don't appear in Pentaho Kettle

I am trying to work with Pentaho in order to build some big data solutions, but the Hadoop HBase components aren't appearing in the dashboard. I don't understand why HBase doesn't appear, since HBase is up and running on my machine... I've been searching for a solution, but without success...
Check that the property 'hbase.client.scanner.timeout.period' is set to 10 minutes in hbase-default.xml to get rid of HBase exceptions.
Check that you have added the ZooKeeper host in the HBase Output step's host field in the Pentaho Data Integration tool.
Have you read this wiki on loading HBase data into Pentaho?
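
Pulling the suggestions above together, here is a minimal Java sketch of the client-side settings involved (the ZooKeeper hostname and table name are hypothetical). If this connects, the same values should work from the PDI HBase Input/Output steps.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseClientCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // ZooKeeper quorum the answers refer to (hostname is hypothetical).
        conf.set("hbase.zookeeper.quorum", "zk-host.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // Scanner timeout of 10 minutes (600000 ms), as suggested above.
        conf.set("hbase.client.scanner.timeout.period", "600000");

        // If a connection and a table handle can be opened, the configuration is sound.
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println(connection.getTable(TableName.valueOf("test")).getName());
        }
    }
}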

Spark EC2 deployment error: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up

I have a question regarding deploying a Spark application on a standalone EC2 cluster. I followed the Spark tutorial and was able to successfully deploy a standalone EC2 cluster. I verified that by connecting to the cluster UI and making sure that everything is as it is supposed to be. I developed a simple application and tested it locally; everything works fine. When I submit it to the cluster (just changing --master local[4] into --master spark://....) I get the following error:
ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
Does anyone know how to overcome this problem? My deploy-mode is client.
Make sure that you have provided the correct URL for the master.
The exact Spark master URL is displayed on the page when you connect to the web UI.
The URL on the page looks like: Spark Master at spark://IPAddress:port
Also note that the web UI port and the port Spark is running on may be different.
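
As a small Java sketch of the same point (host and port below are placeholders), setting the master to the exact URL shown on the web UI is what the application needs; when using spark-submit instead, the same URL goes to --master.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterUrlCheck {
    public static void main(String[] args) {
        // Use the exact URL shown on the master's web UI ("Spark Master at spark://...").
        // The default standalone master port is 7077, while the web UI usually listens on 8080.
        SparkConf conf = new SparkConf()
                .setAppName("master-url-check")
                .setMaster("spark://ec2-master-host:7077");  // hypothetical host

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A trivial job: if this count succeeds, the master URL is reachable.
            System.out.println(sc.parallelize(java.util.Arrays.asList(1, 2, 3)).count());
        }
    }
}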

How does Apache Spark handle system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply could no longer find a file that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary NameNode fails over.
Does Spark automatically use the failover NameNode?
What happens when the secondary NameNode fails as well?
For some reason, during a workflow the cluster is shut down entirely.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know, some questions might sound odd. Anyway, I hope you can answer some or all.
Thanks in advance. :)
Here are the answers given on the mailing list to the questions (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restarting is part of cluster administration, and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
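
To make that last answer concrete, here is a minimal Java sketch of RDD checkpointing to HDFS (the checkpoint directory is hypothetical); Spark Streaming uses an analogous checkpoint mechanism.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsCheckpointSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("checkpoint-sketch");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Checkpoint data is written to HDFS, so it survives executor loss
            // (the directory below is hypothetical).
            sc.setCheckpointDir("hdfs:///user/spark/checkpoints");

            JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            rdd.checkpoint();          // materialized on the next action
            System.out.println(rdd.count());
        }
    }
}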

"Unable to verify integrity of data" while running MR job

I'm running a relatively big MR job using Amazon Elastic Map Reduce.
I ran the job plenty of times on small data sets with no problem.
But when trying to run it on a large dataset I'm getting the following exception:
Error: com.amazonaws.AmazonClientException: Unable to verify integrity
of data download. Client calculated content length didn't match
content length received from Amazon S3. The data may be corrupt.
I googled it and the only recommendation I got was to set the following:
System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation","true");
That didn't help at all.
I'm using a replication factor of 3, 11 m1.large data nodes, and 1 m1.medium master node.
Any workaround or known fix for this issue?
Apparently, this is a known bug. Or so I've been told by an Amazon employee here.
It occurs when running on large datasets where an S3 object is bigger than 2GB.
I managed to work around it by moving to Hadoop 2.4.0 and AMI 3.1.0.
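For reference, here is a hedged sketch of launching an EMR cluster pinned to AMI 3.1.0 (which bundles Hadoop 2.4.0) with the AWS SDK for Java; the cluster name is hypothetical, and IAM roles, logging, and job steps are omitted.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class LaunchEmrCluster {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // AMI 3.1.0 bundles Hadoop 2.4.0; instance types mirror the question
        // (1 m1.medium master, 11 m1.large data nodes).
        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("mr-job-on-ami-3.1.0")  // hypothetical cluster name
                .withAmiVersion("3.1.0")
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m1.medium")
                        .withSlaveInstanceType("m1.large")
                        .withInstanceCount(12)
                        .withKeepJobFlowAliveWhenNoSteps(true));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println(result.getJobFlowId());
    }
}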
