Is there a way to set the number of shards and replicas while using the JavaEsSpark API in Spark with Elasticsearch?

I am using JavaEsSpark to write to Elasticsearch from Spark, and I want to change the default number_of_shards that Elasticsearch uses when it creates the index.
Below is the line of code I am using; it creates the index as well as writes the RDD into it, but how can I set the number of shards here?
JavaEsSpark.saveToEs(stringRDD, "index_name/docs");
I have also tried setting it in the SparkConf object, but it is still not working.
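One commonly suggested workaround (sketched below, not verified against every elasticsearch-hadoop version) is to create the index, or an index template matching index_name*, with the desired number_of_shards and number_of_replicas before the job runs, since the connector itself exposes no es.* option for shard or replica counts. Connector settings can also be passed per saveToEs call instead of through SparkConf; stringRDD below is the RDD from the question.

import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

// Per-call connector settings; the index with the desired shard count is assumed
// to have been created beforehand (or to match an existing index template).
Map<String, String> cfg = new HashMap<>();
cfg.put("es.nodes", "localhost:9200");     // assumed cluster address
cfg.put("es.index.auto.create", "no");     // fail fast if the pre-created index is missing
JavaEsSpark.saveToEs(stringRDD, "index_name/docs", cfg);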

Related

How to handle dynamic index creation in Elasticsearch using Apache NiFi?

I am routing data to Elasticsearch using NiFi, and I'm using NiFi to dynamically create indices based on a set of attributes. I'm also using Index Lifecycle Management (ILM) in Elasticsearch, which requires every index to be manually bootstrapped beforehand for the ILM settings to be applied. Since my NiFi flow automatically ingests messages into Elasticsearch, any index created automatically will not have ILM policies applied.
Currently my flow is: ConsumeKafka --> UpdateAttribute --> PutElasticsearchRecord.
A solution (I think) would be to call the InvokeHTTP processor in front of the PutElasticsearchRecord processor to bootstrap the indices dynamically from the attributes extracted before ingesting into Elasticsearch. Indices are dynamically created using the syntax index_${attribute_1}_${attribute_2}. My only concern is that the InvokeHTTP processor would run for every new flowfile. That could mean thousands of calls to bootstrap an index, and if the index already exists there could be a collision.
Is this really the best way to do this? Perhaps I could run the QueryElasticsearchRecord processor to get a list of indices and somehow match it against incoming flowfiles on the attribute_1 and attribute_2 fields, but that would still require a continuous query, I think?
What you could do is have InvokeHTTP run if and only if it sees a specific attribute value signalling that a new (previously unseen) index needs to be created in Elasticsearch. Just an idea if you want to head down that route.
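To make that concrete, and assuming the ILM policy uses rollover, bootstrapping usually means creating the first backing index together with its write alias. A rough, hypothetical sketch of an idempotent version of that call (index and alias names and the host are placeholders); InvokeHTTP could issue the same two requests.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
String alias = "index_attr1_attr2";              // alias the ILM policy rolls over
String index = alias + "-000001";                // first backing index in the series
HttpRequest head = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/" + index))
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .build();
// Only bootstrap when the index does not exist yet, so repeated calls cannot collide.
if (client.send(head, HttpResponse.BodyHandlers.discarding()).statusCode() == 404) {
    String body = "{\"aliases\":{\"" + alias + "\":{\"is_write_index\":true}}}";
    client.send(HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/" + index))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build(),
            HttpResponse.BodyHandlers.ofString());
}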

Use a Clever Cloud drain with an Elasticsearch target

I'm using Clever Cloud to host my application and I want to store logs in an Elasticsearch index. I read the documentation here and tried to create a drain as explained. I was able to create the drain, but no data has been added to my Elasticsearch.
Does somebody have an idea?
I found the solution: I couldn't see the data because I was looking at the wrong ES index. Even if you specify an index in your URL, the logs are in Logstash format, so by default a new index is created per day, named logstash-YYYY-MM-DD. The data was in those indices.
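One quick way to confirm that (the host below is just an example) is to list the daily logstash-* indices and their document counts through the _cat API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// List all logstash-* indices; each daily index created by the drain shows up as one line.
HttpResponse<String> resp = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cat/indices/logstash-*?v"))
                .GET()
                .build(),
        HttpResponse.BodyHandlers.ofString());
System.out.println(resp.body());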

Kafka with ElasticsearchSinkConnector - Is it possible to define the data mapping in the connector?

I use Kafka with the following connector: connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
in order to send data to ElasticSearch. This combination already works well.
My problem:
My environment is fully Dockerized, and I need to set the whole system up multiple times per day. Each time, I need to define the mapping of the data for each index in ES before I can send any data from Kafka. Otherwise, ES uses the wrong data types and I can't work with the data without re-indexing. The dynamic mapping function sadly doesn't work well enough for me. Right now I use a bash script that sets the mappings for the index.
My question:
Is there a way to set/define the mapping in the connector itself, so that I don't need to run my bash script?
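As far as I know, the connector has no option for supplying an explicit mapping: with schema.ignore=false (the default) it registers a mapping derived from the record schema, which only helps when the Kafka records carry schemas. For fully custom mappings, an alternative to running a bash script on every setup is an index template, which Elasticsearch applies automatically whenever a matching index is created. A hypothetical sketch follows (template name, index pattern, and field types are made up; Elasticsearch 7.8+ uses the _index_template endpoint, older versions use _template).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Register the mapping once as a template; every index matching the pattern,
// including the ones the connector auto-creates, then picks it up.
String template = "{"
        + "\"index_patterns\":[\"my-topic-*\"],"
        + "\"template\":{\"mappings\":{\"properties\":{"
        + "\"timestamp\":{\"type\":\"date\"},"
        + "\"amount\":{\"type\":\"double\"}"
        + "}}}}";
HttpClient.newHttpClient().send(
        HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_index_template/my-topic-template"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(template))
                .build(),
        HttpResponse.BodyHandlers.ofString());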

Elasticsearch for indexing multiple databases

I'm fairly new to Elasticsearch and I have tried to see whether an answer to this question already exists, but I could not find one. My question is: I have data in multiple datastores (Hadoop, Cassandra, Oracle, and maybe more in the future). I want to use Elasticsearch to index all of these datastores and create a "master index". Is this possible? Also, would the indexing process "move" all my data into Elasticsearch?
For Hadoop data you can go for the ES-Hadoop connector. Create an index with mappings before dumping data into Elasticsearch, and then use that same index for holding your data.
import org.apache.hadoop.conf.Configuration;   // Hadoop job configuration carrying the es-hadoop settings

Configuration conf = new Configuration();
conf.set("es.nodes", "localhost:9200");                       // Elasticsearch node(s) to connect to
conf.set("es.resource.write", "Index_Name/Document_Type");    // target index/type for writes
Similarly, for all the remaining sources, use the same index as the sink. For each source, change the corresponding Document_Type while keeping the same index name, so that it becomes the master index of your entire data.
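A hypothetical continuation of that idea (the RDD variable names are made up, and multiple document types per index only work on Elasticsearch versions before 6.x): each additional source writes to the same master index through the es-hadoop Spark API, changing only the document type after the slash.

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

// Same master index for every source, different Document_Type per source.
JavaEsSpark.saveToEs(cassandraRowsRdd, "Index_Name/cassandra_docs");
JavaEsSpark.saveToEs(oracleRowsRdd, "Index_Name/oracle_docs");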

ElasticSearch on Cassandra data vs moving Cassandra data to ElasticSearch for Indexing

I'm new to ElasticSearch and am trying to figure out the best way to index 1 terabyte of data that lives in Cassandra.
Two options that I understand right now are:
Move the data periodically to ElasticSearch using the Cassandra-River plugin and then index it there.
Advantage: search queries create no load on Cassandra.
Disadvantage: the data has to be synced periodically.
Without moving the data, run ElasticSearch on top of Cassandra to index it in place (not sure how this would be done).
Advantage: the data is always in sync.
Disadvantage: impacts Cassandra performance?
Any thoughts would be appreciated.
Perhaps, in the context of ElasticSearch 1.4 and above, just using ElasticSearch as both the datastore and the search engine might be the simpler and more elegant option.
Add more nodes to scale.
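For the first option (the periodic sync), here is a very rough sketch of what the Spark job could look like, assuming the spark-cassandra-connector and elasticsearch-hadoop are on the classpath; keyspace, table, column names, and hosts are examples.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = new SparkConf().setAppName("cassandra-to-es")
        .set("spark.cassandra.connection.host", "localhost")   // Cassandra contact point
        .set("es.nodes", "localhost:9200");                    // Elasticsearch node(s)
JavaSparkContext sc = new JavaSparkContext(conf);

// Read the Cassandra table, convert each row to a plain map, and index it in Elasticsearch.
JavaRDD<Map<String, Object>> docs = javaFunctions(sc)
        .cassandraTable("my_keyspace", "my_table")
        .map(row -> {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", row.getString("id"));
            doc.put("body", row.getString("body"));
            return doc;
        });
JavaEsSpark.saveToEs(docs, "cassandra_index/docs");   // run on a schedule to keep ES in sync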
