Interfacing Hadoop to Cassandra on Amazon AWS - netty version conflict?

I've got a Hadoop map reduce class that runs on Amazon EMR and outputs to an HDFS flatfile. All well and good, but now I need to output to a Cassandra DB, also running on AWS. I built and ran a local client and got that working, then moved the Cassandra writing code to my Hadoop project. The problem, it seems, is that Amazon draws in /home/hadoop/lib/netty-3.2.4.Final.jar for Hadoop 1.0.3, but the Cassandra that runs on AWS is 1.2.6 and uses netty-3.5.9.Final.jar.
What can I do to prevent or circumvent this conflict? Can I draw in my newer version of netty alongside the one Amazon EMR draws in?
The error I get from running the jar on EMR is as follows:
Exception in thread "main" java.lang.NoSuchMethodError: org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.<init>(IIIIIZ)V
at org.apache.cassandra.transport.Frame$Decoder.<init>(
at com.datastax.driver.core.Connection$PipelineFactory.getPipeline(
at org.jboss.netty.bootstrap.ClientBootstrap.connect(
at org.jboss.netty.bootstrap.ClientBootstrap.connect(
at com.datastax.driver.core.Connection.<init>(
at com.datastax.driver.core.Connection.<init>(
at com.datastax.driver.core.Connection$
at com.datastax.driver.core.ControlConnection.tryConnect(
at com.datastax.driver.core.ControlConnection.reconnectInternal(
at com.datastax.driver.core.ControlConnection.connect(
at com.datastax.driver.core.Cluster$Manager.init(
at com.datastax.driver.core.Cluster$Manager.access$100(
at com.datastax.driver.core.Cluster.<init>(
at com.datastax.driver.core.Cluster.buildFrom(
at com.datastax.driver.core.Cluster$
at GraphAnalysis.MatrixBuilder.compileOutput(
at GraphAnalysis.MatrixBuilder.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.util.RunJar.main(

I had a similar problem with one of our internal modules, which had used netty-3.2.4.jar, a version that lacked a method we needed.
Ultimately I had to do open-heart surgery on the jar file.
Expand the jar file that embeds the old netty library:
    jar -xvf oldlibrary.jar
Delete the folder org/jboss/netty.
Then expand the new netty jar file into a separate folder.
Copy the new org/jboss/netty folder, along with its contents, to the old location:
    cp -prf netty .
Create a new JAR package:
    jar -cvf ../new_jarfile.jar *
This is a rare case, but you can apply this method to overcome library incompatibilities whenever they occur. Make sure that your program still runs afterwards: if you change the underlying library this way, there is a chance that the original program will no longer work. A well-designed library should be agnostic of such changes, though.
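Before repacking anything, it may help to confirm which jar actually supplies the conflicting class at runtime. The following is only a rough sketch: the class name ClasspathProbe and its helper methods are made up for illustration, and it relies on nothing beyond standard JDK reflection. The class name and the six-argument signature are taken from the NoSuchMethodError in the question.

```java
import java.security.CodeSource;

public class ClasspathProbe {

    /** Returns where the class was loaded from, or a note if that cannot be determined. */
    public static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader have no code source.
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    /** True if the class is loadable and has a public constructor with these parameter types. */
    public static boolean hasConstructor(String className, Class<?>... params) {
        try {
            Class.forName(className).getConstructor(params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String decoder = "org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder";
        System.out.println(decoder + " loaded from: " + locate(decoder));
        // (IIIIIZ)V in the stack trace means five ints and one boolean.
        System.out.println("six-argument constructor present: "
                + hasConstructor(decoder, int.class, int.class, int.class,
                                 int.class, int.class, boolean.class));
    }
}
```

Run it with the same classpath as the failing job (for example through hadoop jar). If the reported location is the older netty jar, or the constructor check prints false, the older jar is the one being picked up.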


Storm Topology deployment using pre deployed jar

We currently have a jar that contains 6 topologies. To deploy these topologies we currently do 6 separate calls using
Looking at the log output, each time a topology is submitted the jar is also uploaded to nimbus, i.e. there are 6 lines like this:
9937 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location:...
I want to avoid uploading the jar multiple times. I have tried uploading the jar via scp and placing it at "uploadedJarLocation" on the nimbus node (I do this once).
I then changed my deployment code to use the following for each of the topologies:
nimbusClient = NimbusClient.getConfiguredClient(storm_conf);
client = nimbusClient.getClient();
client.submitTopology(topologyName, uploadedJarLocation, jsonConf, topology.buildTopology());
This has sped things up and seems to work fine, but I want to ask:
Is this a safe approach, can I safely reference the uploadedJarLocation I pre-uploaded to nimbus via scp?
Are there any alternative methods to avoid the multiple jar upload?
I know about the StormSubmitter.submitJar as an alternative but have found this to be slow.

Flink error - org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

I am trying to run a Flink job using a file from HDFS. I have created a dataset as follows:
DataSource<Tuple2<LongWritable, Text>> visits = env.readHadoopFile(new TextInputFormat(), LongWritable.class,Text.class, Config.pathToVisits());
I am using flink's latest version - 0.9.0-milestone-1-hadoop1
(I have also tried with 0.9.0-milestone-1)
whereas my Hadoop version is 2.6.0
However, I get the following exception when I try to execute the job. I have searched for similar problems, and it appears to be a version incompatibility between the client and HDFS.
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
at org.apache.hadoop.ipc.RPC$Invoker.invoke(
at com.sun.proxy.$Proxy5.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Can you please let me know what changes I should make in my pom so that it points to the correct Hadoop/HDFS version? Or are changes needed elsewhere?
Or do I need to downgrade the Hadoop installation?
Have you tried the Hadoop-2 build of Flink? Have a look at the downloads page. There is a build called flink-0.9.0-milestone-1-bin-hadoop2.tgz that should work with Hadoop 2.

Is it possible to run Hadoop jobs (like the WordCount sample) in the local mode on Windows without Cygwin?

I have Windows 7, Java 8, Maven and Eclipse.
I've created a Maven project and used almost exactly the same code as here.
It's just a simple "word count" sample.
I try to launch the "driver" program from Eclipse, I provide command line arguments (the input file and the output directory) and get the following error:
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(
at org.apache.hadoop.util.Shell.runCommand(
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
at org.apache.hadoop.util.Shell.execCommand(
at org.apache.hadoop.util.Shell.execCommand(
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
at org.apache.hadoop.mapreduce.Job$
at org.apache.hadoop.mapreduce.Job$
at org.apache.hadoop.mapreduce.Job.submit(
at org.apache.hadoop.mapreduce.Job.waitForCompletion(
The failing line ( contains the command to launch the job:
I want to make it work and therefore I want to understand something:
Do I have to provide any hdfs-site.xml, yarn-site.xml, ... all this, if I want just the local mode (without any cluster)?
I don't have these XML config files now. As far as I remember, the defaults are all OK for the local mode, maybe I am wrong.
Is it possible at all under Windows (to launch any Hadoop jobs whatsoever) or the whole Hadoop thing is Linux-only?
The Hadoop dependency is the following:
Download Hadoop 2.6.0 or 2.7.1 compiled for Windows
Create HADOOP_HOME environment variable pointing to the unzipped dir
Add %HADOOP_HOME%\bin to PATH env var
Hadoop runs on Windows, it is possible, but you'll grow white hair if you try to pull it off on your own.
To start with, all filesystem operations in Windows Hadoop are routed either through NativeIO, if available, or via winutils if NativeIO is not loaded. In your case it took the winutils path. You could make NativeIO available if you tell Eclipse where to find it. See How to add native library to “java.library.path” with Eclipse launch (instead of overriding it): you need to add the location of the hadoop-common-project project target's bin, where you'll find hadoop.dll, which hosts NativeIO.
But even after that, you'll still need winutils for container launch. The winutils.exe will be in that same location (the hadoop-common target/bin), but the code looks for it based on %HADOOP_HOME%, so you'll have to define that. And it will go uphill from there. I intentionally omitted the details of how to configure all this because I don't think you should do it, or to be more precise, you should only do it if you understand how.
It would be much, much easier to take an off-the-shelf Hadoop distribution for Windows, of which there is exactly one: the HDP from Hortonworks. Download it, install it, configure it and then run against the 'cluster'.
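If you do go the manual route, a quick pre-flight check can show whether winutils.exe is where Hadoop will look for it. This is only a sketch: the class WinutilsCheck and its methods are hypothetical names. It also assumes that the hadoop.home.dir system property takes precedence over the HADOOP_HOME environment variable, which is how I understand Hadoop 2.x resolves its home directory, so verify that against your version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WinutilsCheck {

    /** Picks the Hadoop home: system property first (assumed precedence), then env var. */
    public static String resolveHadoopHome(String sysProp, String envVar) {
        if (sysProp != null && !sysProp.isEmpty()) return sysProp;
        if (envVar != null && !envVar.isEmpty()) return envVar;
        return null;
    }

    /** Location where winutils.exe is expected, relative to the Hadoop home. */
    public static Path winutilsPath(String hadoopHome) {
        return Paths.get(hadoopHome, "bin", "winutils.exe");
    }

    public static void main(String[] args) {
        String home = resolveHadoopHome(System.getProperty("hadoop.home.dir"),
                                        System.getenv("HADOOP_HOME"));
        if (home == null) {
            System.out.println("Neither hadoop.home.dir nor HADOOP_HOME is set");
            return;
        }
        Path winutils = winutilsPath(home);
        System.out.println(winutils + (Files.isRegularFile(winutils) ? " found" : " is missing"));
    }
}
```

Run it from the same Eclipse launch configuration as the driver, so that it sees the same environment variables and system properties.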

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the problem only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce.

Hue Hive -- Beeswax Server Can't Find JDBC Driver for MySQL

We're using Cloudera 3.7.5 and having a tough time configuring the Beeswax server so that Hue can access the Hive databases. I followed all the instructions in the Cloudera documentation to set up MySQL as Hive's metastore, but when I restart the Hue services and check the Beeswax server's StdErr logs, I still see the painful "javax.jdo.JDOFatalInternalException: Error creating transactional connection factory", which is caused by
org.datanucleus.exceptions.NucleusException: Attempt to invoke the "DBCP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
This is bizarre to me, because the logs also indicate that the environment variable HIVE_HOME is equal to "/usr/lib/hive", and sure enough I have copied "mysql-connector-java-5.1.15-bin.jar" into the /usr/lib/hive/lib directory, as the documents dictate.
I have also tried the instructions in the blog post, which involved copying the mysql-connector jar into "/usr/share/hue/apps/beeswax/hive/lib/". Unfortunately I did not have a hive/lib subdirectory in the beeswax folder, so I attempted to make one. This also did not work.
Any advice how I can get the MySQL JDBC library onto Beeswax's classpath?
We finally decided to just bite the bullet and upgrade to CDH4. Placing the JDBC jar in /usr/share/hive/lib allowed the Beeswax server to function perfectly without issue.
If anyone else is experiencing this issue, I recommend upgrading from CDH3 to CDH4: the UI is much cleaner and smoother, and we had far fewer installation and maintenance bugs with CDH4.
You have to paste your mysql connector in HUE_HOME/apps/beeswax/hive/lib.
If this path doesn't exist, create hive/lib and then paste the mysql connector. I hope your problem will be solved.
When you start using cloudera 4.5 they move everything into parcels, so this exact problem on my hive meta server was fixed by this command (below). Essentially you're just re-adding modules. I'm sure you can modify the extra classpath in the hive config file to make this oblivious to parcel updates.
cp /usr/lib/hive/lib/mysql-connector-java-5.1.17-bin.jar /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/lib/.
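So a real fix might be something like this:
for d in /opt/cloudera/parcels/*/lib/hive/lib; do cp "$(locate mysql-connector | grep jar | head -n 1)" "$d"/; done
which would copy the jar into every parcel. (A single cp with the wildcard as the destination only works when exactly one parcel directory matches.)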
