Read from Cassandra with Cloudera Hadoop using Spark

The scope is to read from HDFS, filter in Spark and write results to Cassandra.
I am packaging and running with SBT.
Here is the problem:
Reading from HDFS to Spark requires the following line in my sbt build file.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.5.0"
However, reading and writing to Cassandra via
val casRdd = sc.newAPIHadoopRDD(
  job.getConfiguration(),
  classOf[ColumnFamilyInputFormat],
  classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])
only works if the hadoop-client library dependency is either left out or changed to 0.1, 1.2.0 or 2.2.0 (non-CDH); unfortunately, the HDFS read is then not possible.
If the hadoop-client line is added, the following error is thrown when trying to read from Cassandra:
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
I therefore suspect that the Cassandra read/write problem is Cloudera-related. Please note that the Cassandra read/write works simply by deleting the libraryDependencies line.
Since the HDFS and Cassandra read need to work in the same project, how can this issue be resolved?

It seems you are trying to use the Apache Hadoop distribution from a Spark build compiled against CDH.
Your project should never have to depend on hadoop-client, as Spark already does. In our Spark + Cassandra integration library, Calliope, we have a dependency on Spark -
"org.apache.spark" %% "spark-core" % SPARK_VERSION % "provided"
We have been using this library with Apache Hadoop HDFS, CDH HDFS and our own SnackFS. All you need to ensure is that you deploy on the correct build of Spark.
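For illustration, a minimal build.sbt along those lines might look like this (a sketch only; the Spark and cassandra-all versions below are placeholder assumptions, not taken from the question or from Calliope):

libraryDependencies ++= Seq(
  // Spark marked "provided": the Spark build deployed on the cluster (compiled against CDH)
  // supplies it, together with the matching Hadoop/HDFS client, at runtime.
  "org.apache.spark" %% "spark-core" % "0.9.1" % "provided",
  // Cassandra's Hadoop input formats (ColumnFamilyInputFormat etc.); artifact/version are assumptions.
  "org.apache.cassandra" % "cassandra-all" % "1.2.6"
)
// Note there is no explicit hadoop-client entry; the HDFS client comes in transitively
// from the Spark build you deploy against.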

Related

How to check the hadoop distribution used in my cluster?

How can I know whether my cluster was set up using Hortonworks, Cloudera or a plain installation of the Hadoop components?
Also, how can I know the port numbers of the various services?
It is difficult to identify a Hadoop distribution from port numbers, since the Apache, Hortonworks and Cloudera distros use different port numbers.
Another option is to check for cluster-management service agents (Cloudera Manager agent start-up script: /etc/init.d/cloudera-scm-agent; Hortonworks Ambari agent start-up script: /etc/init.d/ambari-agent). Vanilla Apache Hadoop will not have any such agents on the server.
Another option is to check the Hadoop classpath; the command below can be used to get it.
`hadoop classpath`
Most Hadoop distributions include the distro name in the classpath. If the classpath doesn't contain any of the keywords below, the distribution/setup is an Apache/plain installation (a small sketch that automates this check follows the list):
hdp - (Hortonworks)
cdh - (Cloudera)
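As a small illustration of this keyword check, here is a sketch (not part of the original answer) that shells out to hadoop classpath and looks for the distro markers; it assumes the hadoop command is on the PATH:

import scala.sys.process._

object DetectHadoopDistro {
  def main(args: Array[String]): Unit = {
    // Run `hadoop classpath` and inspect the output for distribution keywords.
    val classpath = "hadoop classpath".!!.toLowerCase
    val distro =
      if (classpath.contains("cdh")) "Cloudera (CDH)"
      else if (classpath.contains("hdp")) "Hortonworks (HDP)"
      else "Apache / plain installation"
    println(s"Detected distribution: $distro")
  }
}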
The simplest way is to run the hadoop version command: the output shows which version of Hadoop you have and also which distribution (and its version) you are running. If you find words like cdh or hdp, cdh stands for Cloudera and hdp for Hortonworks.
For example, on a Cloudera cluster the hadoop version command shows the Hadoop version in the first line, followed by the Hadoop distribution and its version.
Hope this will help.
The hdfs version command will also give you the version of Hadoop and its distribution.

Create hdfs when using integrated spark build

I'm working with Windows and trying to set up Spark.
Previously I installed Hadoop in addition to Spark, edited the config files, ran hadoop namenode -format and away we went.
I'm now trying to achieve the same using the bundled version of Spark that is pre-built with Hadoop - spark-1.6.1-bin-hadoop2.6.tgz.
So far it's been a much cleaner, simpler process; however, I no longer have access to the command that creates the HDFS, the config files for HDFS are no longer present, and there is no 'hadoop' in any of the bin folders.
There wasn't a Hadoop folder in the Spark install, so I created one for winutils.exe.
It feels like I've missed something. Do the pre-built versions of Spark not include Hadoop? Is this functionality missing from this variant, or is there something else I'm overlooking?
Thanks for any help.
By saying that Spark is built with Hadoop, what is meant is that Spark is built with the Hadoop dependencies, i.e. with the clients for accessing Hadoop (or HDFS, to be more precise).
Thus, if you use a version of Spark built for Hadoop 2.6, you will be able to access the HDFS filesystem of a cluster running Hadoop 2.6 via Spark.
It doesn't mean that Hadoop is part of the package, or that by downloading it Hadoop is installed as well. You have to install Hadoop separately.
If you download a Spark release without Hadoop support, you'll need to include the Hadoop client libraries in all the applications you write which are supposed to access HDFS (via textFile, for instance).
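To make the point concrete, here is a minimal sketch of such an access, assuming a Spark build pre-built for Hadoop 2.6 and a separately installed HDFS cluster; the namenode host, port and path below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hdfs-read-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Spark ships only the HDFS client; the cluster behind hdfs://namenode:8020 must already exist.
    val lines = sc.textFile("hdfs://namenode:8020/path/to/input.txt")
    println(s"Line count: ${lines.count()}")
    sc.stop()
  }
}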
I am also using the same Spark on my Windows 10 machine. What I did was create a C:\winutils\bin directory and put winutils.exe there, then create the HADOOP_HOME=C:\winutils variable. If you have set all the
env variables and PATH entries like SPARK_HOME, HADOOP_HOME etc., then it should work.
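As a variation on the same idea, the hadoop.home.dir system property can be set from code instead of (or in addition to) the HADOOP_HOME environment variable; a sketch for local Windows development, where the C:\winutils path is the one assumed in the answer above:

object WindowsSparkSetup {
  def main(args: Array[String]): Unit = {
    // Point Hadoop's native-utility lookup at the directory containing bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\winutils")
    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]")
      .appName("winutils-check")
      .getOrCreate()
    println(spark.version)
    spark.stop()
  }
}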

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:
deploy with hadoop-aws and aws-java-sdk => cannot read environment variable for credentials
add hadoop-aws into maven => various transitive dependency conflicts
Has anyone successfully made both work?
Having experienced first-hand the difference between s3a and s3n (7.9 GB of data transferred over s3a took around 7 minutes, while the same 7.9 GB over s3n took 73 minutes; us-east-1 to us-west-1 unfortunately in both cases, Redshift and Lambda being us-east-1 at this time), this is a very important piece of the stack to get right, and it's worth the frustration.
Here are the key parts, as of December 2015:
Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
In spark.properties you probably want some settings that look like this:
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
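The same settings can also be applied programmatically when building the context; a sketch (the key values are placeholders and would normally come from the environment or a credentials provider rather than source code):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("s3a-example")
  .set("spark.hadoop.fs.s3a.access.key", "ACCESSKEY") // placeholder
  .set("spark.hadoop.fs.s3a.secret.key", "SECRETKEY") // placeholder
val sc = new SparkContext(conf)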
If you are using Hadoop 2.7 with Spark, the AWS client uses V2 as the default auth signature, while all the new AWS regions support only the V4 protocol. To use V4, pass these confs in spark-submit; the endpoint (format: s3.<region>.amazonaws.com) must also be specified.
--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
I've covered this list in more detail in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.
I'm writing this answer to access files with S3A from Spark 2.0.1 on Hadoop 2.7.3
Copy the AWS jars (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar), which ship with Hadoop by default,
Hint: If you are unsure of the jar locations, running the find command as a privileged user can be helpful; the commands can be
find / -name hadoop-aws*.jar
find / -name aws-java-sdk*.jar
into the Spark classpath, which holds all the Spark jars.
Hint: We cannot point to the location directly (it must be in a property file), as I want to keep the answer generic across distributions and Linux flavors. The Spark classpath can be identified by the find command below:
find / -name spark-core*.jar
in spark-defaults.conf
Hint: it will mostly be located at /etc/spark/conf/spark-defaults.conf
#make sure jars are added to CLASSPATH
spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key={s3a.access.key}
spark.hadoop.fs.s3a.secret.key={s3a.secret.key}
#you can set above 3 properties in hadoop level `core-site.xml` as well by removing spark prefix.
in spark-submit, include the jars (aws-java-sdk and hadoop-aws) in --driver-class-path if needed.
spark-submit --master yarn \
  --driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar:{spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
  other options
Note:
Make sure the Linux user has read privileges before running the
find command, to avoid Permission denied errors.
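Once the jars and keys are in place, a quick sanity check could look like this (a sketch; the bucket and object names are placeholders):

import org.apache.spark.sql.SparkSession

object S3aSanityCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3a-sanity-check").getOrCreate()
    // If the classpath and credentials are set up correctly, this read succeeds.
    val lines = spark.sparkContext.textFile("s3a://my-bucket/path/to/object.txt")
    println(lines.take(5).mkString("\n"))
    spark.stop()
  }
}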
I got it working using the Spark 1.4.1 prebuilt binary with Hadoop 2.6.
Make sure you set both spark.driver.extraClassPath and spark.executor.extraClassPath to point to the two jars (hadoop-aws and aws-java-sdk).
If you run on a cluster, make sure your executors have access to the jar files on the cluster.
We're using Spark 1.6.1 with Mesos and we were getting lots of issues writing to S3 from Spark. I give credit to cfeduke for the answer. The slight change I made was adding Maven coordinates to the spark.jars.packages setting in the spark-defaults.conf file. I tried hadoop-aws:2.7.2 but was still getting lots of errors, so we went back to 2.7.1. Below are the changes in spark-defaults.conf that are working for us:
spark.jars.packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
spark.hadoop.fs.s3a.access.key <MY ACCESS KEY>
spark.hadoop.fs.s3a.secret.key <MY SECRET KEY>
spark.hadoop.fs.s3a.fast.upload true
Thank you cfeduke for taking the time to write up your post. It was very helpful.
Here are the details as of October 2016, as presented at Spark Summit EU: Apache Spark and Object Stores.
Key points
The direct output committer is gone from Spark 2.0 due to risk/experience of data corruption.
There are some settings on the FileOutputCommitter to reduce renames, but not eliminate them.
I'm working with some colleagues to do an O(1) committer, relying on Apache Dynamo to give us that consistency we need.
To use S3a, get your classpath right.
And be on Hadoop 2.7.z; 2.6.x had some problems which were addressed by HADOOP-11571.
There's a PR under SPARK-7481 to pull everything into a Spark distro you build yourself. Otherwise, ask whoever supplies the binaries to do the work.
Hadoop 2.8 is going to add major perf improvements (HADOOP-11694).
Product placement: the read-performance side of HADOOP-11694 is included in HDP 2.5; the Spark and S3 documentation there might be of interest, especially the tuning options.
As you said, Hadoop 2.6 doesn't support s3a, and the latest Spark release 1.6.1 doesn't support Hadoop 2.7, but Spark 2.0 is definitely no problem with Hadoop 2.7 and s3a.
For Spark 1.6.x, we made a somewhat dirty hack with the s3 driver from EMR... you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support
If you still want to try to use s3a in Spark 1.6.x, refer to the answer here: https://stackoverflow.com/a/37487407/5630352
Using Spark 1.4.1 pre-built with Hadoop 2.6, I am able to get s3a:// to work when deploying to a Spark Standalone cluster by adding the hadoop-aws and aws-java-sdk jar files from the Hadoop 2.7.1 distro (found under $HADOOP_HOME/share/hadoop/tools/lib of Hadoop 2.7.1) to my SPARK_CLASSPATH environment variable in my $SPARK_HOME/conf/spark-env.sh file.
You can also add the S3A dependencies to the classpath using spark-defaults.conf.
Example:
spark.driver.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar
spark.executor.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.5.jar
spark.driver.extraClassPath /usr/local/spark/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath /usr/local/spark/jars/aws-java-sdk-1.7.4.jar
Or just:
spark.jars /usr/local/spark/jars/hadoop-aws-2.7.5.jar,/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
Just make sure to match your AWS SDK version to the version of Hadoop. For more information about this, look at this answer: Unable to access S3 data using Spark 2.2
Here's a solution for pyspark (possibly with proxy):

import os

# `props` is assumed to be a dict with the proxy host/port and S3 endpoint already loaded.
def _configure_s3_protocol(spark, proxy=props["proxy"]["host"], port=props["proxy"]["port"], endpoint=props["s3endpoint"]["irland"]):
    """
    Configure access to the s3 protocol
    https://sparkour.urizone.net/recipes/using-s3/
    AWS Regions and Endpoints
    https://docs.aws.amazon.com/general/latest/gr/rande.html
    """
    sc = spark.sparkContext
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
    sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", proxy)
    sc._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", port)
    sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", endpoint)
    return spark

# usage: spark = _configure_s3_protocol(spark)
Here is a Scala version that works fine with Spark 3.2.1 (pre-built) with Hadoop 3.3.1, accessing an S3 bucket from a non-AWS machine [typically a local setup on a developer machine].
sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "3.3.1",
  "org.apache.hadoop" % "hadoop-common" % "3.3.1" % "provided"
)
spark program
val spark = SparkSession
  .builder()
  .master("local")
  .appName("Process parquet file")
  .config("spark.hadoop.fs.s3a.path.style.access", true)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config(
    "spark.hadoop.fs.s3a.impl",
    "org.apache.hadoop.fs.s3a.S3AFileSystem"
  )
  // The enable V4 does not seem necessary for the eu-west-3 region
  // see #stevel comment below
  // .config("com.amazonaws.services.s3.enableV4", true)
  // .config(
  //   "spark.driver.extraJavaOptions",
  //   "-Dcom.amazonaws.services.s3.enableV4=true"
  // )
  .config("spark.executor.instances", "4")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
val df = spark.read.parquet("s3a://[BUCKET NAME]/.../???.parquet")
df.show()
note: the endpoint is of the form s3.[REGION].amazonaws.com, e.g. s3.eu-west-3.amazonaws.com
s3 configuration
To make the bucket available from outside of AWS, add a Bucket Policy of the form:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::[ACCOUNT ID]:user/[IAM USERNAME]"
      },
      "Action": [
        "s3:Delete*",
        "s3:Get*",
        "s3:List*",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::[BUCKET NAME]/*"
    }
  ]
}
The ACCESS_KEY and SECRET_KEY supplied to the Spark configuration must be those of the IAM user configured in the bucket policy.
I am using spark version 2.3, and when I save a dataset using spark like:
dataset.write().format("hive").option("fileFormat", "orc").mode(SaveMode.Overwrite)
.option("path", "s3://reporting/default/temp/job_application")
.saveAsTable("job_application");
It works perfectly and saves my data into s3.

Which Hadoop 0.23.8 jars are needed for HBase 0.94.8

I'm using Hadoop 0.23.8 pseudo distributed and HBase 0.94.8. My HBase master is failing with:
Server IPC version 5 cannot communicate with client version 4
I think this is because HBase is using hadoop-core-1.0.4.jar in its lib folder.
Now http://cloudfront.blogspot.in/2012/06/how-to-configure-habse-in-pseudo.html#.UYfPYkAW38s suggests I should replace this jar by copying:
the hadoop-core-*.jar from your HADOOP_HOME ...
but there are no hadoop-core-*.jars in 0.23.8.
Will this process work for 0.23.8, and if so, which jars should I be using?
TIA!
I gave up on this and am using Hadoop 2.2.0, which works well(ish) with HBase.

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
I'm not sure how that would work, since when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce.
