Setting elasticsearch properties in spark-submit - elasticsearch

I'm trying to launch Spark jobs that use Elasticsearch input from the command line using spark-submit, as described in http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
I'm setting the properties in a file, but when launching spark-submit it gives the following warnings:
~/spark-1.0.1-bin-hadoop1/bin/spark-submit --class Main --properties-file spark.conf SparkES.jar
Warning: Ignoring non-spark config property: es.resource=myresource
Warning: Ignoring non-spark config property: es.nodes=mynode
Warning: Ignoring non-spark config property: es.query=myquery
...
Exception in thread "main" org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed
My config file looks like (with correct values):
es.nodes nodeip:port
es.resource index/type
es.query query
Setting the properties in the Configuration object in the code works, but I need to avoid this workaround.
Is there a way to set those properties via command line?

I don't know if you resolved your issue (if so, how?), but I found this solution:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "spark/docs", Map("es.nodes" -> "10.0.5.151"))
Bye

When you pass a properties file to spark-submit, it only loads properties whose names start with 'spark.'
So, in my config I simply use
spark.es.nodes <es-ip>
and in the code itself I have to do
val conf = new SparkConf()
conf.set("es.nodes", conf.get("spark.es.nodes"))

Related

Failed to set ConnectorClientConfigOverridePolicy to All in Debezium mysql connector in docker

I am trying to set ConnectorClientConfigOverridePolicy by adding CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY=All. During startup the Debezium connector fails with "matches all=All". It seems CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY duplicates the value: instead of "All", the value becomes "all=All".
Stopping due to error [org.apache.kafka.connect.cli.ConnectDistributed]
org.apache.kafka.connect.errors.ConnectException: Failed to find any class that implements interface org.apache.kafka.connect.connector.policy.ConnectorClientConfigOverridePolicy and which name matches all=All
Is it a bug, or am I doing something wrong?
Using the Debezium Docker image 1.5.
This is partly a bug and partly a misconfiguration.
The env var should be named CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY=All
But at the same time, the start script that processes all env vars prefixed with CONNECT_ does not check for the trailing underscore, so CONNECTOR... matches too, which breaks the logic further down.
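For example, the corrected variable could be passed when starting the Connect container like this (a sketch only; the other required Kafka/Connect settings are left out as ..., and the tag just matches the 1.5 mentioned above):
docker run -e CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY=All ... debezium/connect:1.5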

Kafka Connect Elasticsearch - NoSuchMethodError

I am trying to run the kafka-connect-elasticsearch plugin from Confluent in order to stream topics from Kafka (V0.11.0.1) directly into Elasticsearch (without putting Logstash in between).
I built the connector using Maven:
$ cd kafka-connect-elasticsearch
$ mvn clean package
I then created the required configuration file:
name=es-cluster-lab
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=filebeats-test
topic.index.map=filebeats-test:kafka_test_index
key.ignore=true
schema-ignore=true
connection.url=http://elastic:9200
type.name=log
As per the new Kafka Classpath Isolation spec, I also added the following line to my connect-standalone.properties file -
plugin.path=/home/kafka/kafka-connect-elasticsearch-3.3.0/target/kafka-connect-elasticsearch-3.3.0-development/share/java/kafka-connect-elasticsearch/
I go to run the script ...
bin/connect-standalone.sh config/connect-standalone.properties config/elasticsearch-connect.properties
... and receive the below error.
[2017-09-14 16:08:26,599] INFO Loading plugin from: /home/kafka/kafka-connect-elasticsearch-3.3.0/target/kafka-connect-elasticsearch-3.3.0-development/share/java/kafka-connect-elasticsearch/slf4j-api-1.7.25.jar (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:176)
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.collect.Sets$SetView.iterator()Lcom/google/common/collect/UnmodifiableIterator;
at org.reflections.Reflections.expandSuperTypes(Reflections.java:380)
at org.reflections.Reflections.<init>(Reflections.java:126)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:221)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:198)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:190)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:150)
at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:47)
at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:68)
I also tried to move the JAR files into the /app/kafka/libs directory (default CLASSPATH) and even tried to create a subdirectory /app/kafka/libs/connect_libs and add that manually to my CLASSPATH environment variable.
Not sure what my next step is besides putting Logstash between Kafka and Elastic.
Try changing the Guava version to 20 before you build it.
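For reference, one way to try that is to override the dependency in the connector's pom.xml before running mvn clean package (a sketch; whether 20.0 is the exact version your build needs is an assumption):
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>20.0</version>
</dependency>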
I think you are missing the star '*' at the end of the plugin path.
plugin.path=/home/kafka/kafka-connect-elasticsearch-3.3.0/target/kafka-connect-elasticsearch-3.3.0-development/share/java/kafka-connect-elasticsearch/*

Spring: Why can't I check the Spring Boot version without an error?

I am following the guide for setting up Spring Boot at the following link.
http://docs.spring.io/spring-boot/docs/1.4.1.RELEASE/reference/htmlsingle/#getting-started-installing-the-cli
section 10.2.2
When I type $ spring --version, I receive the error below.
/cygdrive/c/Users/Jesse/Documents/.sdkman/candidates/springboot/current/bin/spring: line 83: [: C:\Program: binary operator expected
Error: Could not find or load main class org.springframework.boot.loader.JarLauncher
You need to set the SPRING_HOME variable.
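For example, in Git Bash or Cygwin (a sketch; the install path is a placeholder for wherever the CLI was unpacked):
export SPRING_HOME=/path/to/spring-1.4.1.RELEASE
export PATH="$SPRING_HOME/bin:$PATH"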
After setting it, SPRING_HOME was not resolving correctly for me even though it was set in Windows as both a user and a system variable, and was also visible when running export via Git Bash. I ended up replacing the last line in my spring.sh file, essentially forcing the classpath for the java command:
"${JAVA_HOME}/bin/java" ${JAVA_OPTS} -cp "/drive_letter/dir/to/spring/spring-x.x.x.RELEASE/lib/spring-boot-cli-x.x.x.RELEASE.jar" org.springframework.boot.loader.JarLauncher "$@"

Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found (Spark 1.6 Windows)

I am trying to access s3 files from local spark context using pySpark.
I keep getting:
File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
I had set os.environ['AWS_ACCESS_KEY_ID'] and
os.environ['AWS_SECRET_ACCESS_KEY'] before I called df = sqc.read.parquet(input_path). I also added these lines:
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
I have also tried changing s3 to s3n and s3a; neither worked.
Any idea how to make it work?
I am on Windows 10, pySpark, Spark 1.6.1 built for Hadoop 2.6.0
I'm running pyspark with the libraries from hadoop-aws appended.
You will need to use s3n in your input path. I'm running this on macOS, so I'm not sure whether it will work on Windows.
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
This package declaration works even in spark-shell
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1
and specify in the shell
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxxxxx")

How to set a system environment variable from a Hadoop Mapper?

The problem below the line is solved but I am facing another problem.
I am doing this:
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(new URI("hdfs:/user/hadoop/harsh/libnative1.so"), conf.getConfiguration());
and in the mapper:
System.loadLibrary("libnative1.so");
I also tried:
System.loadLibrary("libnative1");
System.loadLibrary("native1");
But I am getting this error:
java.lang.UnsatisfiedLinkError: no libnative1.so in java.library.path
I am totally clueless about what I should set java.library.path to.
I tried setting it to /home and copied every .so from the distributed cache to /home/, but it still didn't work :(
Any suggestions/solutions, please?
--------------------------------------------------------------------
I want to set the system environment variable (specifically, LD_LIBRARY_PATH) of the machine where the mapper is running.
I tried :
Runtime run = Runtime.getRuntime();
Process pr=run.exec("export LD_LIBRARY_PATH=/usr/local/:$LD_LIBRARY_PATH");
But it throws IOException.
I also know about
JobConf.MAPRED_MAP_TASK_ENV
But I am using Hadoop version 0.20.2, which has Job & Configuration instead of JobConf.
I am unable to find any such variable there, and this is also not a Hadoop-specific environment variable but a system environment variable.
Any solution/suggestion?
Thanks in advance.
Why don't you export this variable on all nodes of the cluster?
Anyway, use the Configuration class as below while submitting the job:
Configuration conf = new Configuration();
conf.set("mapred.map.child.env",<string value>);
Job job = new Job(conf);
The format of the value is k1=v1,k2=v2
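For this question specifically, that means something like the following (the library directory is a placeholder for wherever libnative1.so ends up on the task nodes):
Configuration conf = new Configuration();
// value uses the k1=v1,k2=v2 format; only LD_LIBRARY_PATH is set here
conf.set("mapred.map.child.env", "LD_LIBRARY_PATH=/usr/local/lib");
Job job = new Job(conf);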
