Using the Spark Kinesis connector to connect to a specific shard - spark-streaming

I'm using KinesisUtils to consume a Kinesis stream in a Spark application. The code example I'm working with makes one 'createStream' call per shard on the Kinesis stream.
Is there a way to have KinesisUtils connect to a specific shard instead? I am implementing a design that requires specific Spark nodes to process events from specific shards.
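For reference, here is roughly what that per-shard pattern looks like (a minimal sketch in Scala; the app name, stream name, endpoint, region, and shard count are placeholders, and the createStream signature shown is the Spark 1.4+ one):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-example"), Seconds(10))

    val numShards = 4  // assumed shard count of the Kinesis stream
    // one receiver (and thus one DStream) per shard, then a union over all of them
    val shardStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "kinesis-example", "my-kinesis-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, Seconds(10),
        StorageLevel.MEMORY_AND_DISK_2)
    }
    val unified = ssc.union(shardStreams)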
Thanks,
Ranjit

Related

For the Kafka sink connector, can I send a single message to multiple index documents in Elasticsearch?

I am receiving a very complex JSON payload in a topic message, so I want to do some computations on it using SMTs and send the results to different Elasticsearch index documents. Is that possible?
I am not able to find a solution for this.
The Elasticsearch sink connector only writes to one index per record, based on the topic name. It's explicitly written in the Confluent documentation that topic-altering transforms such as RegexRouter will not work as expected.
I'd suggest looking at logstash Kafka input and Elasticsearch output as an alternative, however, I'm still not sure how you'd "split" a record into multiple documents there either.
You may need an intermediate Kafka consumer such as Kafka Streams or ksqlDB to extract your nested JSON and emit the multiple records you expect to land in Elasticsearch.
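For illustration, a rough Kafka Streams sketch of that fan-out step (topic names and the extraction logic are placeholders, not part of the connector itself):

    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.scala.serialization.Serdes._

    // placeholder: real logic would parse the nested JSON and return one
    // string per target Elasticsearch document
    def splitIntoDocuments(json: String): Iterable[String] = List(json)

    val builder = new StreamsBuilder()
    builder
      .stream[String, String]("complex-json-topic")   // assumed source topic
      .flatMapValues(json => splitIntoDocuments(json))
      .to("es-documents-topic")                       // assumed topic the ES sink connector reads

    val topology = builder.build()
    // new KafkaStreams(topology, props).start()      // props: bootstrap.servers, application.id, ...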

How to synchronize HBase with data in Elasticsearch (transactionally?), in real time

My company is in the Internet of Things industry. The data flow I am responsible for is EMQTT => Kafka => HBase => Phoenix => Spring Cloud REST => HTML view. The problem is that queries on non-rowkey fields are very slow in HBase, so I want to combine HBase with Elasticsearch to get fast multi-condition queries. The biggest obstacle is how to keep the data in HBase and Elasticsearch consistent (transactionally?), and it needs to be real-time.
I'd use the Logstash Kafka input plugin to read the stream as soon as possible.
In my opinion it's better to be as close as possible to the source of the event.

Transfer data from elasticsearch to kafka

I want to transfer data matching certain conditions from Elasticsearch to Kafka.
Is there any way to do that?
In the end I used Logstash to transfer the data from Elasticsearch to Kafka.
Logstash is a common framework for ingesting, transforming, and stashing data, and it offers a variety of input and output plugins, including an Elasticsearch input and a Kafka output.
Besides, the Elasticsearch input plugin also provides a schedule mechanism, which is very convenient for ingesting incremental data.
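For illustration, a minimal Logstash pipeline along those lines might look like this (hosts, index, topic, and the query are placeholders):

    input {
      elasticsearch {
        hosts    => ["localhost:9200"]
        index    => "source-index"
        query    => '{ "query": { "range": { "@timestamp": { "gte": "now-1h" } } } }'
        schedule => "* * * * *"    # re-run the query every minute to pick up incremental data
      }
    }
    output {
      kafka {
        bootstrap_servers => "localhost:9092"
        topic_id          => "es-export"
        codec             => json
      }
    }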

Index large amount of data into elasticsearch

I've got more than 6 billion social media records in HBase (including content/time/author and other possible fields), spread over 4100 regions on 48 servers, and I need to flush this data into Elasticsearch now.
I'm clear about the bulk API of ES, and using bulk from Java with MapReduce would still take many days (at least a week or so). I could use Spark instead, but I don't think it would help much.
I'm wondering whether there are any other tricks for writing this much data into Elasticsearch, like manually writing ES index files and using some kind of recovery to load them from the local file system?
I'd appreciate any advice, thanks.
==============
Some details about my cluster environment:
Spark 1.3.1 standalone (I can switch it to YARN and use Spark 1.6.2 or 1.6.3)
Hadoop 2.7.1 (HDP 2.4.2.258)
ElasticSearch 2.3.3
AFAIK, Spark is the better option for indexing of the two options below. Along with that, here are the approaches I'd offer:
Divide (by input scan criteria) and conquer the 6 billion social media records:
I'd recommend creating multiple Spark/MapReduce jobs with different scan criteria (to divide the 6 billion records into, say, 6 pieces based on category or something else) and triggering them in parallel.
For example, split by data capture time range (scan.setTimeRange(t1, t2)) or with some fuzzy row logic (FuzzyRowFilter); that should definitely speed things up. A sketch follows below.
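A rough Spark sketch of one such slice (table name, timestamps, and caching are placeholders; written against the HBase 1.x API that ships with HDP 2.4):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.util.Base64
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-slice"))

    // one "slice": only cells written in [t1, t2); run several jobs with
    // different windows in parallel
    val (t1, t2) = (1467331200000L, 1467936000000L)   // placeholder epoch millis
    val scan = new Scan()
    scan.setTimeRange(t1, t2)
    scan.setCaching(500)

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "social_media")
    conf.set(TableInputFormat.SCAN, Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

    val slice = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    // `slice` is then transformed into documents and written to ES (see Option 1 below)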
OR
A streaming-style approach:
You can also consider creating the indexes at the same time as you insert the data through Spark or MapReduce.
For example, in the case of Solr, Cloudera has the NRT HBase Lily indexer: as the HBase table is populated, it simultaneously creates Solr indexes based on WAL (write-ahead log) entries. Check whether anything similar exists for Elasticsearch.
Even if nothing like that exists for ES, don't worry; you can create the indexes yourself while ingesting the data with your Spark/MapReduce program.
Option 1:
If you are okay with Spark, I'd suggest it; it is a good solution. Spark supports native integration with ES via elasticsearch-hadoop (from elasticsearch-hadoop 2.1 onwards). See:
elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset) (or Pair RDD to be precise) that can read data from Elasticsearch. The RDD is offered in two flavors: one for Scala (which returns the data as Tuple2 with Scala collections) and one for Java (which returns the data as Tuple2 containing java.util collections).
Samples are available there for different Spark versions from 1.3 onwards, along with more samples beyond HBase.
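A minimal sketch of that native integration (ES nodes, batch size, and the 'social/posts' index/type are placeholders; the index/type form follows ES 2.x conventions):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._   // adds saveToEs to RDDs

    val conf = new SparkConf()
      .setAppName("hbase-to-es")
      .set("es.nodes", "es-node-1:9200")
      .set("es.batch.size.entries", "5000")   // bulk size per task, tune for your cluster
    val sc = new SparkContext(conf)

    // in practice these documents would come from the HBase RDD sketched above
    val docs = Seq(
      Map("author" -> "alice", "content" -> "hello", "time" -> "2016-07-01T00:00:00Z"),
      Map("author" -> "bob",   "content" -> "world", "time" -> "2016-07-01T00:01:00Z"))

    sc.makeRDD(docs).saveToEs("social/posts")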
Option 2: Map/Reduce (as you are aware, a bit slower than Spark)
Writing data to Elasticsearch:
With elasticsearch-hadoop, Map/Reduce jobs can write data to Elasticsearch, making it searchable through indexes. elasticsearch-hadoop supports both the (so-called) old and new Hadoop APIs.
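A driver-side sketch of that Map/Reduce path (the mapper, which would emit MapWritable documents read from HBase, is omitted; the ES endpoint and index/type are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{MapWritable, NullWritable}
    import org.apache.hadoop.mapreduce.Job
    import org.elasticsearch.hadoop.mr.EsOutputFormat

    val conf = new Configuration()
    conf.set("es.nodes", "es-node-1:9200")
    conf.set("es.resource", "social/posts")
    conf.setBoolean("mapreduce.map.speculative", false)     // es-hadoop recommends disabling speculative execution
    conf.setBoolean("mapreduce.reduce.speculative", false)

    val job = Job.getInstance(conf, "hbase-to-es-mr")
    job.setOutputFormatClass(classOf[EsOutputFormat])
    job.setMapOutputKeyClass(classOf[NullWritable])
    job.setMapOutputValueClass(classOf[MapWritable])
    // job.setMapperClass(classOf[MyHBaseToEsMapper])       // hypothetical mapper
    // job.waitForCompletion(true)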
I found a practical trick myself to improve bulk indexing performance.
I calculate the routing hash in my client and make sure that each bulk request contains only index requests with the same routing. Based on the routing result and the shard info (with node IPs), I send the bulk request directly to the corresponding shard node. This trick avoids the bulk reroute cost and cuts down the occupation of the bulk request thread pool, which might otherwise cause EsRejectedExecutionException.
For example, I have 48 nodes on different machines. Assuming I send a bulk request containing 3000 index requests to an arbitrary node, these index requests will be rerouted to other nodes (usually all of them) according to the routing, and the client thread has to wait for the whole process to finish, including processing the local bulk and waiting for the other nodes' bulk responses. Without the reroute phase, however, those network costs are gone (except for forwarding to the replica nodes), and the client simply has to wait less. Meanwhile, assuming I have only 1 replica, the total occupation of bulk threads is only 2 (client -> primary shard and primary shard -> replica shard).
Routing hash:
shard_num = murmur3_hash (_routing) % num_primary_shards
Take a look at org.elasticsearch.cluster.routing.Murmur3HashFunction.
The client can get the shard assignments and index aliases from the cat APIs:
shard info: GET _cat/shards
alias mappings: GET _cat/aliases
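A rough sketch of the grouping logic (the hash below is only a stand-in: a real client has to reproduce the exact Murmur3HashFunction of the target ES version, otherwise the computed shard won't match the server and requests get rerouted anyway; the shard count and node map are placeholders taken from the cat APIs):

    case class IndexRequest(routing: String, json: String)

    val numPrimaryShards = 2                                           // from _cat/shards (placeholder)
    val shardToNode = Map(0 -> "10.0.0.1:9200", 1 -> "10.0.0.2:9200")  // from _cat/shards (placeholder)

    // stand-in for org.elasticsearch.cluster.routing.Murmur3HashFunction
    def shardFor(routing: String): Int =
      Math.floorMod(routing.hashCode, numPrimaryShards)

    val pending = Seq(
      IndexRequest("user-1", """{"author":"alice"}"""),
      IndexRequest("user-2", """{"author":"bob"}"""))

    // one bulk request per primary shard, sent straight to the node hosting that shard
    pending.groupBy(r => shardFor(r.routing)).foreach { case (shard, reqs) =>
      val node = shardToNode(shard)
      println(s"POST http://$node/_bulk with ${reqs.size} index requests")  // real code: HTTP bulk call
    }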
Some caveats:
ES may change the default hash function between versions, which means the client code may not be version-compatible.
This trick is based on the assumption that the hash results are reasonably balanced.
The client should also handle fault tolerance, such as connection timeouts to the corresponding shard node.

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets (each about ~1 KB) every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I've never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I use another store for this use case?
(Writing to the local file system isn't safe, as our hard disks crash often. At first I thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases availability and provides failover support, but it won't protect you from accidentally deleting data.
A backup is a copy of the whole data set taken at backup time. It is used to restore data after the system has crashed.
Using Elasticsearch as a backup store is not a good idea. Elasticsearch is a search engine, not a database, and if you haven't configured the ES cluster carefully you can end up losing data.
So, in my opinion:
To store JSON objects we have plenty of databases. For example, MongoDB is a NoSQL DB; we can easily configure it with more replicas, which means high availability of the data and failover support. It is also open source and, as you asked, quite reliable.
For more info about MongoDB, see https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, it will be distributed among the nodes, and if a node fails that data can be lost. In MongoDB, more nodes mean that each MongoDB node holds its own copy of the data, so if one MongoDB node fails we can retrieve the data from the replicas. You need to be much more careful about replica setup and shard allocation in Elasticsearch, whereas in MongoDB it is easier and a sound architecture too.
Note: I'm not saying that storing data in Elasticsearch is unsafe; I mean that, compared to MongoDB, it is harder to configure and maintain replicas in Elasticsearch.
Hope it helps!
