Spark app unable to write to elasticsearch cluster running in docker - elasticsearch

I have an Elasticsearch Docker image listening on 127.0.0.1:9200. I tested it using Sense and Kibana, and it works fine; I am able to index and query documents. But when I try to write to it from a Spark app:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark._
val sparkConf = new SparkConf().setAppName("ES").setMaster("local")
sparkConf.set("es.index.auto.create", "true")
sparkConf.set("es.nodes", "127.0.0.1")
sparkConf.set("es.port", "9200")
sparkConf.set("es.resource", "spark/docs")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val rdd = sc.parallelize(Seq(numbers, airports))
rdd.saveToEs("spark/docs")
It fails to connect and keeps retrying:
16/07/11 17:20:07 INFO HttpMethodDirector: I/O exception (java.net.ConnectException) caught when processing request: Operation timed out
16/07/11 17:20:07 INFO HttpMethodDirector: Retrying request
I tried using the IP address given by docker inspect for the Elasticsearch container, but that does not work either. However, when I use a native installation of Elasticsearch, the Spark app runs fine. Any ideas?

Also, if you are having issues writing to ES, set the config es.nodes.wan.only to true, as mentioned in this answer.
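For reference, a minimal sketch of what that looks like on the SparkConf (the 127.0.0.1 host is just the setup from the question; adjust it to wherever the Docker port is actually published):
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._
val conf = new SparkConf().setAppName("ES").setMaster("local")
conf.set("es.nodes", "127.0.0.1")      // host where the ES HTTP port (9200) is published
conf.set("es.nodes.wan.only", "true")  // talk only to the declared node, do not discover cluster-internal addresses
val sc = new SparkContext(conf)
sc.parallelize(Seq(Map("one" -> 1))).saveToEs("spark/docs")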

A couple of things I would check:
The Elasticsearch-Hadoop Spark connector version you are working with. Make sure it is not a beta release; there was a bug related to IP resolution that has since been fixed.
Since 9200 is the default port, you can remove the line sparkConf.set("es.port", "9200") and check again.
Check that there is no proxy configured in your Spark environment or config files.
I assume you run Elasticsearch and Spark on the same machine. Can you try configuring your machine's IP address instead of 127.0.0.1? (A quick reachability check is sketched below.)
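As a quick sanity check for the last two points, you can confirm from the same environment that will run the driver whether the ES HTTP endpoint is reachable at all. A rough sketch (192.168.1.10 is only a placeholder for your machine's IP):
import scala.io.Source
// If this throws a ConnectException or times out, Spark will too: fix reachability
// (Docker port publishing, proxy, firewall) before tuning any es.* settings.
val esInfo = Source.fromURL("http://192.168.1.10:9200").mkString
println(esInfo)  // should print the ES banner JSON with cluster name and version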
Hope this helps! :)

I had the same problem, and a further issue was that the confs set using sparkConf.set() didn't have any effect. But supplying the confs with the saving function worked, like this:
rdd.saveToEs("spark/docs", Map("es.nodes" -> "127.0.0.1", "es.nodes.wan.only" -> "true"))

Related

Elasticsearch query in Julia

How do I connect Julia with Elasticsearch? Has anyone ever tried it, or found a package that is ready to use?
I know that in Julia we can use Python packages, but I still have no idea how to do that.
Here it is:
# Installation
using Conda
Conda.add("elasticsearch")
# Load the module and open a connection
using PyCall
elasticsearch = pyimport("elasticsearch")
es = elasticsearch.Elasticsearch()  # <== this is the connection to ES
es.info()  # connection information
# Put some data
dat = Dict("a1" => "blaaa", "a2" => "hello")
res = es.index(index="data", doc_type="data", id="1", body=dat)
# Fetch some data
q1 = Dict("query" => Dict("match" => Dict("a1" => Dict("query" => "blaaa"))))
es.search(index="data", body=q1)["hits"]["hits"]

how to change spark.r.backendConnectionTimeout value in RStudio?

I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. From Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is set to 6000s. I would like to change this value to something large enough that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
There is a similar post, but it involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: it is to modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT note - restart the Hadoop and YARN services, and then try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the setting took effect at http://localhost:4040/environment/
I hope this is useful for other people.

Cannot connect to amqp://user:**@ip-11-222-12-117:5672//: Couldn't log in: a socket error occurred

I am trying to scale Airflow using Celery and RabbitMQ on EC2.
I am following this guide:
http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
Following is the config on the master node:
sql_alchemy_conn = postgresql+psycopg2://user:gues@localhost:5432/airflow
executor = CeleryExecutor
broker_url = amqp://user:gues@ip-11-222-12-117:5672
celery_result_backend = db+postgresql://user:gues@localhost:5432/airflow
Following is the config on the slave node:
sql_alchemy_conn = postgresql+psycopg2://user:gues@ip-11-222-12-117:5432/airflow
executor = CeleryExecutor
broker_url = amqp://user:gues@ip-11-222-12-117:5672
celery_result_backend = db+postgresql://user:gues@localhost:5432/airflow
When I run the Airflow scheduler, it works fine. But on the slave node I get the following error:
[2017-05-23 21:47:44,385: ERROR/MainProcess] consumer: Cannot connect to amqp://user:**@ip-11-222-12-117:5672//: Couldn't log in: a socket error occurred.
Trying again in 2.00 seconds..
However, I am able to see both nodes connected in the RabbitMQ UI.
What am I doing wrong?
Have you checked that the AMQP server is allowed to listen on anything other than the loopback interface? Please check this answer: Can't access RabbitMQ web management interface after fresh install

What is the right configuration of titan db 1.0 running against ES deployed on google/aws cloud

I'm using Titan 1.0 with ES 1.5.1 running locally as a service (127.0.0.1), and it is working pretty well.
My working ES configuration is :
storage.backend=cassandra
storage.hostname=cassandraserver2-cassandra-00
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.25
query.fast-property=true
index.search.backend=elasticsearch
index.search.hostname=localhost
index.search.elasticsearch.interface=NODE
Now I want to redeploy ES to the cloud, but unfortunately Titan won't come up.
The exception I get is:
gremlin> tg = TitanFactory.open('../conf/titan-db.properties')
Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex
Display stack trace? [yN] y
java.lang.IllegalArgumentException: Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex
at com.thinkaurelius.titan.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:55)
at com.thinkaurelius.titan.diskstorage.Backend.getImplementationClass(Backend.java:473)
at com.thinkaurelius.titan.diskstorage.Backend.getIndexes(Backend.java:460)
at com.t...
Caused by: org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: []
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:279)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:198)
at org.elasticsearch.client.transport.support.InternalTransportClusterAdminClient.execute(InternalTransportClusterAdminClient.java:86)
What is the right configuration of Titan properties to run against an Elasticsearch service on Google/AWS cloud?
Suppose the external IP of the VM is 8.35.193.69 and I can reach this machine with ping.
I'm using these titan-db properties:
storage.backend=cassandra
storage.hostname=cassandraserver2-cassandra-00
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.25
query.fast-property=true
index.search.backend=elasticsearch
index.search.hostname=8.35.193.69
index.search.client-only=true
index.search.local-mode=false
index.search.elasticsearch.interface=NODE
Any solutions are welcome.
You need to make sure port 9300 is open on your instance. If it's not open, you need to:
Ensure the ES service is up: sudo service elasticsearch status
Ensure port 9300 is open and accepting requests. Check how here. (A quick probe is sketched after the configuration below.)
If the port is closed, you need to enable TCP transport communication. Check here.
Change your configuration to look like this:
# elasticsearch config
index.search.backend=elasticsearch
index.search.elasticsearch.interface=TRANSPORT_CLIENT
index.search.hostname=your_ip:9300
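To quickly verify from the Titan host that the transport port is actually reachable before changing the Titan config, a minimal TCP probe is enough. This is only a sketch using the example IP from the question, not part of Titan itself:
import java.net.{InetSocketAddress, Socket}
val probe = new Socket()
try {
  // Fails fast with a ConnectException or a timeout if 9300 is blocked by the cloud firewall
  probe.connect(new InetSocketAddress("8.35.193.69", 9300), 5000)
  println("port 9300 is reachable")
} finally {
  probe.close()
}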

Not able to connect to hive on AWS EMR using java

I have set up an AWS EMR cluster with Hive. I want to connect to the Hive Thrift server from my local machine using Java. I tried the following code:
Class.forName("com.amazon.hive.jdbc3.HS2Driver");
con = DriverManager.getConnection("jdbc:hive2://ec2XXXX.compute-1.amazonaws.com:10000/default","hadoop", "");
As mentioned in the developer guide (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html), I added the Hive JDBC driver jars to the classpath.
But I get an exception when trying to open the connection.
I was able to connect to the Hive server on a plain Hadoop cluster using the above code (with a different JDBC driver).
Can someone please suggest if I am missing something?
Is it possible to connect to the Hive server on AWS EMR from a local machine using Hive JDBC?
(Merged Answer from the comments)
Hive is running on port 10000, but only locally; you have to create an SSH tunnel to the EMR master node.
The following is from the documentation for Hive 0.13.1.
Create Tunnel
ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name
Connect to JDBC
jdbc:hive2://localhost:10000/default
You can do the port forwarding in code using the JSch library:
public static void portForwardForHive() {
    try {
        // Reuse an existing tunnel if it is already up
        if (session != null && session.isConnected()) {
            return;
        }
        JSch jsch = new JSch();
        jsch.addIdentity(PATH_TO_SSH_KEY_PEM);
        String host = REMOTE_HOST;
        session = jsch.getSession(USER, host, 22);
        // Username and password are supplied via the UserInfo interface
        UserInfo ui = new MyUserInfo();
        session.setUserInfo(ui);
        session.connect();
        // Forward local port LPORT to RPORT on RHOST through the SSH connection
        int assignedPort = session.setPortForwardingL(LPORT, RHOST, RPORT);
        System.out.println("Port forwarding done for the port: " + assignedPort);
    } catch (Exception e) {
        System.out.println(e);
    }
}
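Once the tunnel is up, the JDBC connection from the question can simply point at localhost:10000. A sketch of the connect step (driver class and credentials are taken from the question; the SHOW TABLES query is just an example):
import java.sql.DriverManager
// Goes through the SSH tunnel created above, so the host is localhost, not the EMR master
Class.forName("com.amazon.hive.jdbc3.HS2Driver")
val con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hadoop", "")
val rs = con.createStatement().executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))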
Not sure if you've resolved this yet, but it's a bug in EMR that has just bitten me.
For direct JDBC connectivity like you are doing, you must include the JDBC drivers in your shaded uber-jar. For JDBC access from within DataFrames, you cannot access the jar in your uber-jar (another unrelated bug), so you must specify it on the command line (S3 is a convenient place to keep them):
--files s3://mybucketJAR/postgresql-9.4-1201.jdbc4.jar
However, even after this you will run into another problem if you are specifically trying to access Hive. Amazon has built its own JDBC drivers with a different class hierarchy from the normal Hive driver (com.amazon.hive.jdbc41.HS2Driver), but the EMR cluster includes the standard Hive JDBC driver on its standard path (org.apache.hive.jdbc.HiveDriver).
This driver is automatically registered as being capable of handling the jdbc:hive and jdbc:hive2 URLs, so when you try to connect to a Hive URL it is found first and used - even if you specifically register the Amazon one. Unfortunately, it is not compatible with Amazon's EMR build of Hive.
There are two possible solutions:
1: Find the offending driver and unregister it:
Scala example:
import java.sql.DriverManager
import java.util.Collections
val jdbcDrv = Collections.list(DriverManager.getDrivers)
for (i <- 0 until jdbcDrv.size) {
  val drv = jdbcDrv.get(i)
  val drvName = drv.getClass.getName
  if (drvName == "org.apache.hive.jdbc.HiveDriver") {
    log.info(s"Deregistering JDBC Driver: ${drvName}")
    DriverManager.deregisterDriver(drv)
  }
}
Or
2: As I found out later, you can specify the driver as part of the connect properties when you attempt to connect:
Scala example:
val hiveCredentials = new java.util.Properties
hiveCredentials.setProperty("user", hiveDBUser)
hiveCredentials.setProperty("password", hiveDBPassword)
hiveCredentials.setProperty("driver", "com.amazon.hive.jdbc41.HS2Driver")
val conn = DriverManager.getConnection(hiveDBURL, hiveCredentials)
This is a more "correct" version as it should override any preregistered handlers even if they have completely different class hierarchies.
