SparkR 1.4.0: how to include jars - Elasticsearch

I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar file (found here). It requires a bit of hacking together, calling the SparkR:::callJMethod function. I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.apache.hadoop.io.NullWritable')
But for others, it does not:
SparkR:::callJStatic('java.lang.Class',
'forName',
'org.elasticsearch.hadoop.mr.LinkedMapWritable')
Yielding the error:
java.lang.ClassNotFoundException:org.elasticsearch.hadoop.mr.EsInputFormat
It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them with the command-line --jars argument and with the sparkR.init(sparkJars = ...) function.
Any help would be greatly appreciated. Also, if this question belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. And if someone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.
Thanks!
Ben

Here's how I've achieved it:
# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]", sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)
(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)
The key idea is to link to the ES-Hadoop package and not the jar file, and to use it to create a Spark SQL context directly.

Related

Loading protocol buffer in ruby or java similar to node

I have a .proto file that contains my schema and service definition. I'm looking for a method in ruby/java that is similar to how Node loads and parses it (code below). Looking at the grpc ruby gem, I don't see anything that can replicate how Node does it.
Digging around, I found this (https://github.com/grpc/grpc/issues/6708), which states that dynamically loading .proto files is only available in Node. Hopefully, someone can provide me with an alternative.
Use case: loading .proto files dynamically as in the client code below, but I can only use either Ruby or Java to do it.
let grpc = require("grpc");
let loader = require("@grpc/proto-loader");
let packageDefinition = loader.loadSync(file.file, {});
let parsed = grpc.loadPackageDefinition(packageDefinition);
I've been giving this a try for the past few months and it seems like Node is the only way to read a protobuf file at runtime. Hopefully this helps anyone in the future who needs it.
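One partial workaround on the JVM, if compiling the schema once up front is acceptable, is to have protoc emit a binary descriptor set and then load that at runtime with protobuf-java's dynamic-message APIs. Below is a minimal sketch under that assumption (shown in Scala; the protobuf-java calls are the same from Java), where schema.desc, my_service.proto, and MyRequest are hypothetical names:
import java.io.FileInputStream
import com.google.protobuf.DescriptorProtos.FileDescriptorSet
import com.google.protobuf.Descriptors.FileDescriptor
import com.google.protobuf.DynamicMessage

// Assumes the schema was compiled once with: protoc --descriptor_set_out=schema.desc my_service.proto
val descriptorSet = FileDescriptorSet.parseFrom(new FileInputStream("schema.desc"))
// For a self-contained .proto (no imports) the dependency array is empty.
val fileDescriptor = FileDescriptor.buildFrom(descriptorSet.getFile(0), Array.empty[FileDescriptor])
val messageType = fileDescriptor.findMessageTypeByName("MyRequest") // hypothetical message name
// Messages of a type discovered only at runtime can now be built or parsed dynamically.
val empty = DynamicMessage.newBuilder(messageType).build()
println(messageType.getFullName + " has fields: " + messageType.getFields)
This does not parse raw .proto text the way proto-loader does; it only defers the schema lookup to runtime, which may or may not fit the use case above.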

Elastic Search and Spark

I am trying to set up Spark and Elasticsearch using the elasticsearch-spark library with the sbt artifact: "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2". When I try to configure Elasticsearch with this code:
val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
.set("es.index.auto.create", "true")
.set("es.resource", "test")
.set("es.nodes", "test.com:9200")
I keep getting the error "illegal character" for all of the set statements above for Elasticsearch. Anyone know the issue?
You must have copied the code from a website or a blog. It contains unreadable characters that are actually giving you trouble.
Simple solution: delete all the content, retype it manually one line at a time, and run it. Let me know if you face any problems again and I will help you out.
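For reference, here is the same configuration retyped cleanly, with the context creation added so it compiles on its own; the "illegal character" errors usually come from smart quotes or non-breaking spaces picked up when copying from a web page:
import org.apache.spark.{SparkConf, SparkContext}

// Same settings as in the question, typed by hand so every quote and space is plain ASCII.
val sparkConf = new SparkConf()
  .setAppName("test")
  .setMaster("local[2]")
  .set("es.index.auto.create", "true")
  .set("es.resource", "test")
  .set("es.nodes", "test.com:9200")
val sc = new SparkContext(sparkConf)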
You might also want to set http.publish_host in your elasticsearch.yml to HOST_NAME. The es-hadoop connector sniffs the nodes from the _nodes/transport API, so it checks what the published HTTP address is.

I am trying to upgrade my script from the Cloudera HBase 4 (CDH4) version to CDH5

def getRegions(config, servername)
  connection = HConnectionManager::getConnection(config)
  parts = servername.split(',')
  puts parts
  rs = connection.getHRegionConnection(parts[0], parts[1].to_i)
  return rs.getOnlineRegions()
end
I am trying to make this code compatible with CDH5. I have looked into the CDH5 library but have been unable to find an exact solution.
I am now using
connection = ConnectionFactory::createConnection(config)
which returns a Connection object. I want the list of online regions on a given server.
Have a look at the following APIs:
Admin.getClusterStatus()
ClusterStatus.getServers()
Admin.getOnlineRegions(org.apache.hadoop.hbase.ServerName)
Note: in older versions, some of the Admin functions live in the HBaseAdmin class (the rest of the usage should be the same or similar).
Hopefully, that should help you.
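Putting those calls together, here is a minimal sketch of the CDH5 (HBase 1.x) equivalent. It is written in Scala, but the underlying Java API calls are the same from JRuby; the server-name string is a hypothetical placeholder in the usual host,port,startcode form:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, ServerName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin
// ServerName.valueOf accepts the full "host,port,startcode" string, so the manual split is no longer needed.
val server = ServerName.valueOf("regionserver1.example.com,60020,1400000000000")
val regions = admin.getOnlineRegions(server) // java.util.List[HRegionInfo]
regions.asScala.foreach(r => println(r.getRegionNameAsString))
admin.close()
connection.close()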

How to get all versions of an hbase cell in a spark newAPIHadoopRDD?

I know that when you use the Get API you can set MAX_VERSION_COUNT to get all versions of a cell. But I didn't find any documentation on how to get all versions of a cell with a map operation on a Spark newAPIHadoopRDD. I've tried a naive result.getColumnCells() and it returns only 1 result. How can I set MAX_VERSION_COUNT in Spark?
After taking a look at the source code of TableInputFormat, I found that it reads its configuration from hbase.mapreduce.scan.maxversions. So setting it like this works:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "your_table") // placeholder table name; TableInputFormat needs the table to scan
conf.set("hbase.mapreduce.scan.maxversions", "VERSION_YOU_WANT") // number of versions the scan returns per cell
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
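With maxversions set, each Result in the RDD carries one Cell per stored version, so getColumnCells then returns them all. A minimal follow-up sketch, where the column family "cf" and qualifier "qual" are hypothetical placeholders:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// One (timestamp, value) pair per stored version of the hypothetical cf:qual column.
val allVersions = hBaseRDD.map { case (_, result) =>
  result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("qual")).asScala
    .map(cell => (cell.getTimestamp, Bytes.toString(CellUtil.cloneValue(cell))))
}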

Hadoop new API - Set OutputFormat

I'm trying to set the OutputFormat of my job to MapFileOutputFormat using:
jobConf.setOutputFormat(MapFileOutputFormat.class);
I get this error: mapred.output.format.class is incompatible with new reduce API mode
I suppose I should use the setOutputFormatClass() method of the new Job class, but the problem is that when I try to do this:
job.setOutputFormatClass(MapFileOutputFormat.class);
it expects me to use this class: org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.
In Hadoop 1.0.x there is no such class; it only exists in earlier versions (e.g. 0.x).
How can I solve this problem?
Thank you!
This problem has no decently easy solution to implement.
I gave up and used sequence files, which fit my requirements too.
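For anyone taking the same route, a minimal sketch of that sequence-file fallback with the new mapreduce API follows (written in Scala here, though the Java calls are identical); the output path is a hypothetical placeholder:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, SequenceFileOutputFormat}

// New-API job writing SequenceFile output instead of MapFile output.
val job = new Job(new Configuration())
job.setOutputFormatClass(classOf[SequenceFileOutputFormat[_, _]])
FileOutputFormat.setOutputPath(job, new Path("/tmp/job-output")) // placeholder path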
Have you tried the following?
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
...
LazyOutputFormat.setOutputFormatClass(job, MapFileOutputFormat.class);
