Elasticsearch and Spark - elasticsearch

I am trying to set up Spark and Elasticsearch using the elasticsearch-spark library with the sbt artifact "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2". When I try to configure Elasticsearch with this code:
val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
  .set("es.index.auto.create", "true")
  .set("es.resource", "test")
  .set("es.nodes", "test.com:9200")
I keep getting an "illegal character" error on every one of the set statements above. Does anyone know the issue?

You must have copied the code from a website or blog post. It contains unreadable (non-printing) characters, and those are what is actually giving you trouble.
Simple solution: delete the pasted content, retype it manually, and run it again. Let me know if you run into any problems after that and I will help you out.
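For reference, here is the same configuration from the question typed out by hand; with plain ASCII quotes and spaces it compiles cleanly (a stray curly quote or non-breaking space anywhere on these lines is enough to trigger the "illegal character" error):
import org.apache.spark.SparkConf

// identical to the snippet in the question, just retyped with plain characters
val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
  .set("es.index.auto.create", "true")
  .set("es.resource", "test")
  .set("es.nodes", "test.com:9200")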

You might want to set http.publish_host in your elasticsearch.yml to HOST_NAME. The es-hadoop connector sniffs the cluster's nodes from the _nodes/transport API, so it checks what the published HTTP address is.
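If sniffing is indeed the problem, the change suggested above would look something like this in elasticsearch.yml (HOST_NAME standing in for whatever address the Spark machines can actually reach; a sketch, not a tested configuration):
# elasticsearch.yml
# advertise a reachable address to clients that sniff the cluster via _nodes/transport
http.publish_host: HOST_NAME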

Related

Loading a protocol buffer in Ruby or Java, similar to Node

I have a .proto file that contains my schema and service definition. I'm looking for a method in Ruby/Java that is similar to how Node loads and parses it (code below). Looking at the grpc Ruby gem, I don't see anything that can replicate how Node does it.
Digging around, I see this issue (https://github.com/grpc/grpc/issues/6708), which states that dynamically loading .proto files is only available in Node. Hopefully, someone can provide me with an alternative.
Use case: loading .proto files dynamically as they are provided by the client, but I can only use either Ruby or Java to do it.
let grpc = require("grpc");
let loader = require("@grpc/proto-loader");                   // the package name is @grpc/proto-loader
let packageDefinition = loader.loadSync(file.file, {});       // parse the .proto file at runtime
let parsed = grpc.loadPackageDefinition(packageDefinition);   // build service/client constructors from it
I've been giving this a try for the past few months, and it seems like Node is the only way to read a protobuf file at runtime. Hopefully this helps anyone in the future who needs it.

AWS copying one object to another

I'm trying to copy data from one bucket to another using the Ruby "aws-sdk" gem, version 3.
My code is shown below:
temporary_object = @temporary_bucket.object(temporary_path)
permanent_object = @permanent_bucket.object(permanent_path)
temporary_object.copy_to(permanent_object)
However, I keep getting the error Aws::S3::Errors::NoSuchKey: The specified key does not exist. That makes sense, as the permanent bucket doesn't exist at this point; however, I thought that copy_to would create the bucket if it does not exist.
Any advice would be very helpful.
Thanks

SparkR 1.4.0: how to include jars

I'm trying to hook SparkR 1.4.0 up to Elasticsearch using the elasticsearch-hadoop-2.1.0.rc1.jar jar file (found here). It requires a bit of hacking together, calling the SparkR:::callJMethod function. I need to get a jobj R object for a couple of Java classes. For some of the classes, this works:
SparkR:::callJStatic('java.lang.Class',
                     'forName',
                     'org.apache.hadoop.io.NullWritable')
But for others, it does not:
SparkR:::callJStatic('java.lang.Class',
                     'forName',
                     'org.elasticsearch.hadoop.mr.LinkedMapWritable')
Yielding the error:
java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.EsInputFormat
It seems like Java isn't finding the org.elasticsearch.* classes, even though I've tried including them with the command-line --jars argument and with the sparkR.init(sparkJars = ...) function.
Any help would be greatly appreciated. Also, if this is a question that more appropriately belongs on the actual SparkR issue tracker, could someone please point me to it? I looked and was not able to find it. Also, if someone knows an alternative way to hook SparkR up to Elasticsearch, I'd be happy to hear that as well.
Thanks!
Ben
Here's how I've achieved it:
# environments, packages, etc ----
Sys.setenv(SPARK_HOME = "/applications/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# connecting Elasticsearch to Spark via ES-Hadoop-2.1 ----
spark_context <- sparkR.init(master = "local[2]", sparkPackages = "org.elasticsearch:elasticsearch-spark_2.10:2.1.0")
spark_sql_context <- sparkRSQL.init(spark_context)
spark_es <- read.df(spark_sql_context, path = "index/type", source = "org.elasticsearch.spark.sql")
printSchema(spark_es)
(Spark 1.4.1, Elasticsearch 1.5.1, ES-Hadoop 2.1 on OS X Yosemite)
The key idea is to link to the ES-Hadoop package and not the jar file, and to use it to create a Spark SQL context directly.

How to get all versions of an HBase cell in a Spark newAPIHadoopRDD?

I know that when you use the Get API you can set MAX_VERSION_COUNT to get all versions of a cell. But I didn't find any documentation on how to get all versions of a cell with a map operation on a Spark newAPIHadoopRDD. I've tried a naive result.getColumnCells() and it returns only one result. How can I set MAX_VERSION_COUNT in Spark?
After taking a look at the source code of TableInputFormat, I found that it reads its configuration from hbase.mapreduce.scan.maxversions. So setting it like this works:
val conf = HBaseConfiguration.create()
// the value is the maximum number of versions to return per cell, passed as a string, e.g. "3"
conf.set("hbase.mapreduce.scan.maxversions", "VERSION_YOU_WANT")
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
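With maxversions raised, each Result then carries one Cell per (column, timestamp) pair. The following is a minimal, untested sketch of pulling every version out of the hBaseRDD built above, using the standard HBase CellUtil and Bytes helpers:
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// one tuple per stored version: (row key, column qualifier, timestamp, value)
val allVersions = hBaseRDD.flatMap { case (_, result) =>
  result.rawCells().map { cell =>
    (Bytes.toString(CellUtil.cloneRow(cell)),
     Bytes.toString(CellUtil.cloneQualifier(cell)),
     cell.getTimestamp,
     Bytes.toString(CellUtil.cloneValue(cell)))
  }
}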

Debugging Elasticsearch

I'm using Tire and Elasticsearch. The service has started on port 9200. However, it was returning two errors:
"org.elasticsearch.search.SearchParseException: [countries][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"query_string":{"query":"name:"}}}]]"
and
"Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse 'name:': Encountered "<EOF>" at line 1, column 5."
So I reinstalled Elasticsearch and the service container. The service starts fine.
Now, when I search using Tire, I get no results where results should appear, and I don't receive any error messages.
Does anybody have any idea how I might find out what is wrong, let alone fix it?
First of all, you don't need to reindex anything in the usual cases. It depends on how you installed and configured Elasticsearch, but when you install and upgrade with Homebrew, for example, the data is persisted safely.
Second, there's no need to reinstall anything. The error you're seeing means just what it says on the tin: a SearchParseException, i.e. your query is invalid:
{"query":{"query_string":{"query":"name:"}}}
Notice that you didn't pass any query string for the name qualifier. You have to pass something, e.g.:
{"query":{"query_string":{"query":"name:foo"}}}
or, in Ruby terms:
Tire.index('test') { query { string "name:hey" } }
See this update to the Railscasts episode on Tire for an example of how to catch errors caused by incorrect Lucene queries.
