Index a large amount of data into Elasticsearch

I've got more than 6 billion social media records in HBase (including content, time, author, and other possible fields), spread over 4100 regions on 48 servers, and I need to flush all of this data into Elasticsearch now.
I'm familiar with the ES bulk API, but bulk indexing from Java with MapReduce would still take many days (at least a week or so). I could use Spark instead, but I don't think it would help much.
I'm wondering whether there are any other tricks for writing this much data into Elasticsearch, such as writing ES index files directly and using some kind of recovery to load the files from the local file system.
Appreciate any possible advice, thanks.
==============
Some details about my cluster environments:
Spark 1.3.1 standalone (I can switch to Spark 1.6.2 or 1.6.3 on YARN)
Hadoop 2.7.1 (HDP 2.4.2.258)
ElasticSearch 2.3.3

AFAIK, Spark is the better of the two options below for indexing.
Along with that, here are the approaches I'd offer:
Divide (by input scan criteria) and conquer the 6 billion social media records:
I'd recommend creating multiple Spark/MapReduce jobs with different scan criteria (to divide the 6 billion records into, say, 6 pieces based on category or something else) and triggering them in parallel.
For example, splitting based on the data capture time range (scan.setTimeRange(t1, t2)) or with some fuzzy row logic (FuzzyRowFilter) should definitely speed things up; see the sketch below.
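Below is a minimal Java sketch of that idea, splitting one full-table scan into disjoint time slices; the column family name and the time boundaries are hypothetical, and each resulting Scan would drive its own Spark or MapReduce job.

```java
// Hypothetical sketch: split the HBase scan into disjoint time ranges so that
// several indexing jobs can run in parallel. Table/column names are made up.
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.ArrayList;
import java.util.List;

public class ScanSplitter {

    /** Builds one Scan per time slice; each Scan would back its own Spark/MapReduce job. */
    public static List<Scan> timeRangeScans(long start, long end, int slices) throws Exception {
        List<Scan> scans = new ArrayList<>();
        long step = (end - start) / slices;
        for (int i = 0; i < slices; i++) {
            long from = start + i * step;
            long to = (i == slices - 1) ? end : from + step;
            Scan scan = new Scan();
            scan.setTimeRange(from, to);          // only cells written in [from, to)
            scan.setCaching(500);                 // larger scanner batches for a full-table export
            scan.setCacheBlocks(false);           // don't pollute the block cache
            scan.addFamily(Bytes.toBytes("d"));   // hypothetical column family holding the post data
            scans.add(scan);
        }
        return scans;
    }

    public static void main(String[] args) throws Exception {
        // e.g. six one-year slices, launched as six independent jobs
        for (Scan s : timeRangeScans(1262304000000L, 1451606400000L, 6)) {
            System.out.println(s);
        }
    }
}
```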
OR
A kind of streaming approach:
You can also consider creating the Elasticsearch indexes at the same time as you insert the data through Spark or MapReduce.
For example, in the case of Solr, Cloudera has the NRT HBase Lily Indexer: as the HBase table is populated, it simultaneously creates Solr indexes based on WAL (write-ahead log) entries. Check whether anything like that exists for Elasticsearch.
Even if nothing like that exists for ES, don't worry: you can create the indexes yourself from the Spark/MapReduce program while ingesting the data.
Option 1:
I'd suggest Spark if you are okay with it; it is a good solution.
elasticsearch-hadoop has provided native Spark integration since version 2.1.
See:
elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD (Resilient
Distributed Dataset) (or Pair RDD to be precise) that can read data
from Elasticsearch. The RDD is offered in two flavors: one for Scala
(which returns the data as Tuple2 with Scala collections) and one for
Java (which returns the data as Tuple2 containing java.util
collections).
Samples are available for different Spark versions from 1.3 onwards, as well as more samples for sources other than HBase.
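As a rough illustration of Option 1, here is a minimal Java sketch that reads HBase rows with Spark and writes them to Elasticsearch through es-hadoop's JavaEsSpark helper; the table, column family, field names, and node address are all made up, and null checks are omitted.

```java
// Hypothetical sketch of Option 1: read HBase rows with Spark and bulk-write them to
// Elasticsearch through elasticsearch-hadoop (the es-hadoop jar must be on the classpath).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.HashMap;
import java.util.Map;

public class HBaseToEs {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setAppName("hbase-to-es")
                .set("es.nodes", "es-node1:9200")          // hypothetical ES endpoint
                .set("es.batch.size.entries", "3000");     // bulk size per task

        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "social_media");   // hypothetical table

        JavaRDD<Result> rows = sc
                .newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class)
                .values();

        // Convert each HBase Result into a Map that es-hadoop serializes as one JSON document.
        // (Null checks omitted for brevity.)
        JavaRDD<Map<String, Object>> docs = rows.map(result -> {
            Map<String, Object> doc = new HashMap<>();
            byte[] cf = Bytes.toBytes("d");
            doc.put("author", Bytes.toString(result.getValue(cf, Bytes.toBytes("author"))));
            doc.put("content", Bytes.toString(result.getValue(cf, Bytes.toBytes("content"))));
            doc.put("time", Bytes.toLong(result.getValue(cf, Bytes.toBytes("time"))));
            return doc;
        });

        JavaEsSpark.saveToEs(docs, "social/posts");        // target index/type
        sc.stop();
    }
}
```

In practice, tuning es.batch.size.entries / es.batch.size.bytes and the number of Spark partitions tends to matter more than the mapping code itself.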
Option 2: As you are aware, this is a bit slower than Spark.
Writing data to Elasticsearch
With elasticsearch-hadoop, Map/Reduce jobs can write data to Elasticsearch making it searchable through indexes. elasticsearch-hadoop supports both (so-called) old and new Hadoop APIs.
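For comparison, here is a hedged sketch of an Option 2 driver using es-hadoop's EsOutputFormat (new Hadoop API); the mapper and input format are omitted, and the index and node names are hypothetical.

```java
// Hypothetical sketch of Option 2: a MapReduce driver that writes to Elasticsearch via
// elasticsearch-hadoop's EsOutputFormat. The mapper (not shown) would emit one MapWritable
// per document; table/index names are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class MrToEs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "es-node1:9200");      // hypothetical ES endpoint
        conf.set("es.resource", "social/posts");    // target index/type
        // es-hadoop recommends disabling speculative execution so documents aren't written twice
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "hbase-to-es-mr");
        job.setJarByClass(MrToEs.class);
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setMapOutputValueClass(MapWritable.class);
        job.setNumReduceTasks(0);                    // map-only job: each mapper bulk-indexes directly
        // job.setMapperClass(...); the input format / HBase scan would be configured here

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```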

I found a practical trick myself to improve bulk indexing performance.
I calculate the routing hash in my client and make sure that each bulk request contains only index requests with the same routing. Based on the routing result and the shard info (with node IPs), I send the bulk request directly to the node holding the corresponding shard. This trick avoids the bulk reroute cost and cuts down the occupation of the bulk thread pool, which can otherwise cause EsRejectedExecutionException.
For example, I have 48 nodes on different machines. Suppose I send a bulk request containing 3000 index requests to an arbitrary node: those index requests will be rerouted to other nodes (usually all of them) according to the routing, and the client thread has to wait for the whole process to finish, including processing the local bulk and waiting for the other nodes' bulk responses. Without the reroute phase, however, those network costs disappear (except for forwarding to the replica shards), and the client waits for less time. Meanwhile, assuming I have only 1 replica, the total occupation of bulk threads is only 2 (client -> primary shard and primary shard -> replica shard).
Routing hash:
shard_num = murmur3_hash (_routing) % num_primary_shards
Take a look at org.elasticsearch.cluster.routing.Murmur3HashFunction.
The client can get the shard allocation and index aliases from the cat APIs, as in the sketch below:
shard info: GET _cat/shards
alias mapping: GET _cat/aliases
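A minimal sketch of the client-side routing trick, assuming the ES 2.x API in which Murmur3HashFunction is instantiated and exposes an instance hash(String) method (in 5.x and later it is a static Murmur3HashFunction.hash(String)); the grouping helper below is hypothetical.

```java
// Hypothetical sketch: group documents by the primary shard their routing value maps to,
// so that each bulk request can be sent straight to the node holding that shard.
// Assumes the ES 2.x Murmur3HashFunction API; signatures differ in later versions.
import org.elasticsearch.cluster.routing.Murmur3HashFunction;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardRouter {

    private final Murmur3HashFunction hashFunction = new Murmur3HashFunction();
    private final int numPrimaryShards;

    public ShardRouter(int numPrimaryShards) {
        this.numPrimaryShards = numPrimaryShards;   // from the index settings; fixed after creation
    }

    /** shard_num = murmur3_hash(_routing) % num_primary_shards, kept non-negative. */
    public int shardFor(String routing) {
        return Math.floorMod(hashFunction.hash(routing), numPrimaryShards);
    }

    /** Buckets routing keys (e.g. document ids) by target shard; one bulk request per bucket. */
    public Map<Integer, List<String>> groupByShard(Iterable<String> routings) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String routing : routings) {
            buckets.computeIfAbsent(shardFor(routing), k -> new ArrayList<>()).add(routing);
        }
        return buckets;
    }

    // The shard -> node IP mapping comes from GET _cat/shards, and index aliases from
    // GET _cat/aliases, refreshed periodically by the client.
}
```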
Some caveats:
ES may change the default hash function across versions, which means the client code may not be version compatible.
This trick assumes that the hash results are reasonably balanced.
The client should also handle fault tolerance, such as connection timeouts to the corresponding shard node.

Related

Should logs, metrics, and analytics all go to one data lake or be stored separately?

Background:
I am setting up my first Elastic Stack, and while I will be starting simple, I want to make sure I'm starting with good architecture. I would eventually like to have a solution for the following: hosting metrics, server logs (Express.js APM), single-page-app monitoring (APM RUM JS agent), Redis metrics, MongoDB metrics, and custom event analytics (e.g. sale, customer cancelled, etc.).
Question:
Should I store all of this on one Elasticsearch cluster and use search to filter out the different cases, OR should I create a separate instance for each and keep them clearly scoped to their roles?
(I would prefer the single data lake)
For the logging use case:
you can store all the logs on a file system share before ingesting them into any search solution, so that you can re-ingest them if needed
after storage, you can ingest them either into just one cluster with different indices or into multiple clusters; it's an open choice, but it depends on the amount of data
if the size and compute of each use case justify a separate ES cluster, then do it; otherwise, use a single cluster with a failover cluster
For metrics:
you can directly ingest them into one cluster with different index patterns
if size and compute requirements justify it, make separate clusters
make a failover/backup cluster if needed
In both cases, you will also need to store cluster snapshots.
I personally recommend ELK for the logging use case and Prometheus for metrics.
Reporting/Analytics:
For some use cases, like reporting/analytics on a monthly or yearly basis, the log data will be huge, and you will need to ingest the data from the file share into Hadoop to summarize/roll it up based on some fields, and then ingest the reduced data into ELK; this can reduce the size and compute requirements by a factor of 1000. A sketch of such a roll-up follows.
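Here is a hedged Spark sketch of that roll-up, assuming tab-separated raw logs whose first two fields are a day and a status code (both made up for illustration); the reduced per-day counts, not the raw lines, are what would be indexed into ELK.

```java
// Hypothetical sketch of the roll-up idea: collapse raw log lines from the file share into
// per-day, per-status counts before indexing the (much smaller) result.
// Paths and the log format are made up for illustration.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LogRollup {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("log-rollup"));

        JavaPairRDD<String, Long> dailyCounts = sc
                .textFile("hdfs:///logs/raw/2016/*")                 // raw logs on the share
                .mapToPair(line -> {
                    String[] parts = line.split("\t");
                    // key: "<day>|<status>", value: one occurrence
                    return new Tuple2<>(parts[0] + "|" + parts[1], 1L);
                })
                .reduceByKey(Long::sum);                             // the roll-up step

        // The reduced data set is what gets ingested into ELK instead of the raw lines.
        dailyCounts.saveAsTextFile("hdfs:///logs/rollup/2016");
        sc.stop();
    }
}
```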

Best ways to import data (initial load) from Oracle to Elastic Search

I am working on a project where I have two big tables (parent and child) in Oracle. One has 65 million records and the other 80 million. In total, data from 10 columns is required from these tables and saved as one document in Elasticsearch. The two tables can also be loaded separately. What are two comparable options to move data (a one-time load) from these tables into Elasticsearch, and which of the two would you recommend? The requirement is that it should be fast and simple, so that it can be used not only for the one-time data load but also in case of a failure, when the Elasticsearch index needs to be created again from scratch.
As already suggested, one option may be Logstash: its advantage is simplicity, but it can be complicated to monitor and difficult to configure if you have to transform some fields during ingestion.
An alternative can be NiFi: it offers JDBC and Elasticsearch processors, and you can monitor, start, and stop the ingestion directly from the web interface. With NiFi it is possible to build a more complex and robust pipeline: handling exceptions, translating data types, and performing data enrichment.

ElasticSearch on Cassandra data vs moving Cassandra data to ElasticSearch for Indexing

I'm new to Elasticsearch and am trying to figure out the optimal way to index 1 terabyte of data stored in Cassandra.
Two options that I understand right now are:
Move data periodically to Elasticsearch using the Cassandra-River plugin and then index the data.
Advantage: Search queries create no impact on Cassandra load
Disadvantage: Have to sync the data periodically
Without moving the data, run Elasticsearch on top of Cassandra to index it (not sure how this would be done).
Advantage: Data always in sync
Disadvantage: Impacts Cassandra performance?
Any thoughts would be appreciated.
Perhaps, in the context of Elasticsearch 1.4 and above, just using Elasticsearch as both the datastore and the search engine might be the simpler and more elegant option.
Add more nodes to scale.

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, each about 1 KB, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I've never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch if I use a cluster of servers and replicas, or should I rather use different storage for this use case?
(Writing it to the local file system isn't safe, as our hard disks crash often. I first thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases availability and provides failover support, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set at backup time; it is used to restore the system after a crash.
Elasticsearch for backup is not a good idea. Elasticsearch is a search engine, not a DB. If you have not configured the ES cluster carefully, you may end up losing data.
So, in my opinion:
To store JSON objects, there are plenty of databases. For example, MongoDB is a NoSQL DB that can easily be configured with more replicas, which means high availability of data and failover support. As you asked, it is also open source and more reliable.
For more info about MongoDB, see https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with multiple shards, it will be distributed among the nodes; if a node fails (and no replicas are configured), its data will be lost. In MongoDB, more nodes in a replica set means each node holds its own copy of the data, so if one MongoDB node fails we can retrieve the data from the replicas. We need to be more conscious about replica setup and shard allocation in Elasticsearch; in MongoDB it is easier, and it is a good architecture too.
Note: I didn't say that storing data in Elasticsearch is unsafe. I mean that, compared to MongoDB, it is more difficult to configure and maintain replicas in Elasticsearch.
Hope it helps..!

How to build distributed search based on Hadoop and Lucene

I'm preparing to build a distributed search module with Lucene and Hadoop but feel confused about a few things:
As we know, HDFS is a distributed file system; when I put a file into HDFS, it is divided into several blocks and stored on different slave machines in the cluster. But if I use Lucene to write an index on HDFS, I want to see the index on each machine; how can that be achieved?
I have read some of hadoop/contrib/index and some Katta, but I don't understand the idea of "shards, which look like part of the index". Is a shard stored on the local disk of one computer, or is it a single directory distributed across the cluster?
Thanks in advance.
As for your question 1:
You can implement the Lucene "Directory" interface to make it work with Hadoop and let Hadoop handle the files you submit to it. You could also provide your own implementations of "IndexWriter" and "IndexReader" and use your Hadoop client to write and read the index; this way you have more control over the format of the index you write. You can "see" or access the index on each machine via your Lucene/Hadoop implementation.
For your question 2:
A shard is a subset of the index. When you run a query, all shards are processed at the same time and the results of the index search on all shards are combined. On each machine of your cluster you will have a part of the index: a shard. So a part of the index is stored on a local machine, but the whole index appears to you as a single index distributed across the cluster.
I can also suggest checking out SolrCloud for distributed search.
It runs on Lucene as its indexing/search engine and already enables you to have a clustered index. It also provides an API for submitting files to index and for querying the index. Maybe it is sufficient for your use case.
