Making a Cassandra Connection inside a Hadoop MapReduce Task

I am successfully using the DataStax Java Driver to access Cassandra inside my Java code just before I start a MapReduce Job.
cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
However, I need to check additional information to decide, on a per-record basis, how to reduce each record. If I attempt to use the same code inside a Hadoop Reducer class, it fails to connect with the error:
INFO mapred.JobClient: Task Id :
attempt_201310280851_0012_r_000000_1, Status : FAILED
com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /127.0.0.1 ([/127.0.0.1]
Unexpected error during transport initialization
(com.datastax.driver.core.TransportException: [/127.0.0.1] Error writing)))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:186)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:81)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:662)
at com.datastax.driver.core.Cluster$Manager.access$100(Cluster.java:604)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:69)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:96)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:585)
The MapReduce input and output successfully read from and write to Cassandra. As I mentioned, I can connect before I run the job, so I do not think it is an issue with the Cassandra server.
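For reference, here is a minimal sketch of what the reducer-side connection attempt looks like; the class name, key/value types, and the per-record lookup are illustrative assumptions, and only the Cluster.builder() line is taken from the question:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only: class name, key/value types and the lookup query are assumptions.
public class LookupReducer extends Reducer<Text, Text, Text, Text> {
    private Cluster cluster;
    private Session session;

    @Override
    protected void setup(Context context) {
        // Same contact point as in the question. On a multi-node Hadoop cluster this
        // points at the task node's own localhost, where no Cassandra node may be listening.
        cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        session = cluster.connect();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
        // Per-record lookup (placeholder) used to decide how to reduce this record, e.g.:
        // Row row = session.execute("SELECT ... FROM ks.table WHERE id = '" + key + "'").one();
        for (Text value : values) {
            context.write(key, value);
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (cluster != null) {
            cluster.close(); // shutdown() on driver 1.x
        }
    }
}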

Related

Write to HDFS/Hive using NiFi

I'm using NiFi 1.6.0.
I'm trying to write to HDFS and to Hive (Cloudera) with NiFi.
On "PutHDFS" I configured the "Hadoop Configuration Resources" with the hdfs-site.xml and core-site.xml files and set the directories, but when I try to start it I get the following error:
"Failed to properly initialize Processor. If still scheduled to run,
NiFi will attempt to initialize and run the Processor again after the
'Administrative Yield Duration' has elapsed. Failure is due to
java.lang.reflect.InvocationTargetException:
java.lang.reflect.InvocationTargetException"
On "PutHiveStreaming" I configured the "Hive Metastore URI" with
thrift://..., the database, and the table name, and in "Hadoop
Configuration Resources" I put the hive-site.xml location, but when
I try to start it I get the following error:
"Hive streaming connect/write error, flow file will be penalized and routed to retry.
org.apache.nifi.util.hive.HiveWriter$ConnectFailure: Failed connecting to EndPoint {metaStoreUri='thrift://myserver:9083', database='mydbname', table='mytablename', partitionVals=[]}:".
How can I solve the errors?
Thanks.
For #1, if you got your *-site.xml files from the cluster, it's possible that they use internal IPs to refer to components like the DataNodes, and you won't be able to reach those addresses directly from the NiFi host. Try setting dfs.client.use.datanode.hostname to true in the hdfs-site.xml on the client.
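For illustration, the setting referred to above would look roughly like this inside the client-side hdfs-site.xml (a sketch of the single property, not a complete file):

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>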
For #2, I'm not sure PutHiveStreaming will work against Cloudera, IIRC they use Hive 1.1.x and PutHiveStreaming is based on 1.2.x, so there may be some Thrift incompatibilities. If that doesn't seem to be the issue, make sure the client can connect to the metastore port (looks like 9083).
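As a quick sanity check for the second point, a small sketch in Java (host and port taken from the error message above; adjust as needed) that verifies the NiFi host can reach the metastore port:

import java.net.InetSocketAddress;
import java.net.Socket;

public class MetastorePortCheck {
    public static void main(String[] args) throws Exception {
        // Hostname and port come from the HiveWriter error message; adjust for your cluster.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("myserver", 9083), 5000);
            System.out.println("Metastore port is reachable");
        }
    }
}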

Hadoop Job fails when using PIG

I am trying to run a basic Pig script on the cluster, but I am not able to submit the job; it fails with the following warning message:
WARN
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Ooops! Some job has failed! Specify -stop_on_failure if you want
Pig to stop immediately on failure.
I am not able to read a file in the cluster using Pig even when specifying the correct location. Can anyone help me with this?
ERROR
Failed to read data from "hdfs://172.16.30.119:8020/user/hadoop/wordfile"
CODE
data = load 'hdfs://172.16.30.119:8020/user/hadoop/wordfile';

Unable to schedule falcon process - Could not perform authorization operation, java.io.IOException: Couldn't set up IO streams

Hi,
I am trying to schedule a Falcon process using the Falcon CLI and the falcon service user on a Kerberized cluster. I am getting the following error message:
ERROR: Bad Request;default/org.apache.falcon.FalconWebException::org.apache.falcon.FalconException: Entity schedule failed for process: testHiveProc
The Falcon application logs show the following:
Caused by: org.apache.falcon.FalconException: E0501 : E0501: Could not perform authorization operation, Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details :
Any suggestions?
Thanks.
Root cause:
Oozie was running out of processes due to the large number of scheduled jobs.
Short-term solution:
Restart Oozie server
Long-term solution:
- Increase ulimit
- Limit the number of scheduled jobs in Oozie

Pig, Oozie, and HBase - java.io.IOException: No FileSystem for scheme: hbase

My Pig script works fine on its own, but when I put it in an Oozie workflow I receive the following error:
ERROR 2043: Unexpected error during execution.
org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
...
Caused by: java.io.IOException: No FileSystem for scheme: hbase
I registered the HBase and Zookeeper jars successfully, but received the same error.
I also attempted to set the Zookeeper quorum by adding variations of this line to the Pig script:
SET hbase.zookeeper.quorum 'vm-myhost-001,vm-myhost-002,vm-myhost-003'
Some searching on the internet instructed me to add this to the beginning of my workflow.xml:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false
This solved the problem. I was even able to remove the registration of the HBase and Zookeeper jars and the Zookeeper quorum.
Now after double checking, I noticed that my jobs actually do their job: they store the results in HBase as expected. But, Oozie claims that a failure occurred, when it didn't.
I don't think that setting mapreduce.fileoutputcommitter.marksuccessfuljobs to false constitutes a real solution.
Are there any other solutions?
It seems that there is currently no real solution for this.
However, this answer to a different question seems to indicate that the best workaround is to create the success flag 'manually'.
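As a rough sketch of that workaround (the output path is a hypothetical placeholder), the _SUCCESS marker can be created with the Hadoop FileSystem API once the Pig action has finished:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessFlagWriter {
    public static void main(String[] args) throws Exception {
        // Output directory of the job; placeholder path, adjust to your workflow.
        Path outputDir = new Path("hdfs://namenode:8020/user/me/job-output");
        FileSystem fs = outputDir.getFileSystem(new Configuration());
        // Create an empty _SUCCESS marker, mirroring what FileOutputCommitter normally writes.
        fs.create(new Path(outputDir, "_SUCCESS"), true).close();
    }
}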

UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation

I'd like to know your input as to why this error is happening. On the production environment onshore we're using CDH4; on our local testing environment we're using plain Apache Hadoop v2.2.0. When I run the same jar compiled on CDH4, the MR jobs execute fine. But when I run the jar on Hadoop v2.2.0 (YARN enabled), I get this error:
INFO mapreduce.Job: Task Id : attempt_1391062333435_0001_m_000000_0, Status : FAILED
Error: java.lang.UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation
The log showed that the map jobs ran successfully, but all of the reduce jobs failed with the above error. There aren't many hits on Google for this error, so I have nowhere to turn but here.
Any thoughts guys? Thanks.
Sorry for the lateness of this reply.
This problem was solved when we synched our environment with the one onshore. That is, instead of using plain Apache Hadoop, we used the Cloudera distribution.
