Multinode Couchbase cluster performance issue - Spring

We have a Couchbase cluster consisting of 3 nodes.
Two nodes have the data, index, query, and search services enabled;
the third node runs the data service only.
When a "larger" dataset of ~400 entries is created, it takes up to 15 minutes until the documents can be fully queried.
The cluster is accessed through Spring-Data repositories and the Couchbase-Java-Client shipped with Spring-Data-Couchbase only (see versions below).
When we perform the same request in our staging environment, a single-node cluster with the same GSI index, the data is available almost instantaneously compared to production. So my conclusion would be that there is an issue with the node sync or with the caching in Spring-Data-Couchbase.
Is there a configuration I am missing that would speed up the node sync, or is anyone else facing the same problem?
Versions:
Couchbase Server 6.0.0 Community
Spring-Boot 2.2.4
Spring-Data-Couchbase 3.2.4
Couchbase Java Client 2.7.11
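For what it's worth, here is a minimal sketch of where scan consistency can be forced in Spring-Data-Couchbase 3.x; the class and names below are placeholders, and whether consistency is actually the culprit here is only an assumption:

import java.util.Arrays;
import java.util.List;

import org.springframework.context.annotation.Configuration;
import org.springframework.data.couchbase.config.AbstractCouchbaseConfiguration;
import org.springframework.data.couchbase.core.query.Consistency;

@Configuration
public class CouchbaseConfig extends AbstractCouchbaseConfiguration {

    @Override
    protected List<String> getBootstrapHosts() {
        return Arrays.asList("node1", "node2", "node3"); // placeholder hosts
    }

    @Override
    protected String getBucketName() {
        return "myBucket"; // placeholder
    }

    @Override
    protected String getBucketPassword() {
        return "secret"; // placeholder
    }

    @Override
    public Consistency getDefaultConsistency() {
        // STRONGLY_CONSISTENT corresponds to N1QL REQUEST_PLUS: every repository
        // query waits until the GSI index has caught up with all mutations.
        // Slower per query, but it rules out stale index reads as the cause.
        return Consistency.STRONGLY_CONSISTENT;
    }
}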

I suggest that, for one node, you set only the data service and increase its memory quota as much as you can:
for node 1, set the data service only
for node 2, set the data and index services
for node 3, set the search and query services
If you are not using the search service, set the nodes up like this:
for node 1, set the data service only
for node 2, set the data service only
for node 3, set the index and query services

Related

Migrate indexes from an old version of Elasticsearch to Elasticsearch 7.9

We want to upgrade our Elasticsearch version from 5.6 to 7.9 in our project.
I have to migrate our indexes and docs to the new version, but I can't use reindex, so I use the REST high-level client to connect to Elasticsearch 7 and plain HTTP requests for Elasticsearch 5.
For the migration I fetch batches of docs from the old version with a match_all query and scroll, and index them into the new Elasticsearch with bulk requests.
Our old Elasticsearch cluster has 3 nodes. My question is whether I have to send requests to each node separately and process the docs, or whether, if I send the match_all search to one node, Elasticsearch will handle it (I read something about coordinating nodes that handle requests, and that every node is implicitly a coordinating node), or whether I have to send the request to a data node.
Adding more details to @saeednasehi's answer: it looks like you are getting confused about how Elasticsearch and its queries work internally; please refer to my answer on how search queries work in Elasticsearch.
Apart from this, while it's true that you can get data by connecting to any node, in your ES client (JHLRC or HTTP) you should list the IPs of all the nodes, so that your (coordinating) request load is distributed among all the data nodes. If you give just one node's IP, that node always acts as the coordinating node, in the absence of a dedicated coordinating node (the default).
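As an illustration of the "list all the nodes" point, here is a sketch of building the REST high-level client against all three nodes (host names are placeholders):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class EsClientFactory {

    public static RestHighLevelClient create() {
        // Listing every node lets the client round-robin requests, so no
        // single node is forced into the coordinating role for all traffic.
        return new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("es-node-1", 9200, "http"),
                        new HttpHost("es-node-2", 9200, "http"),
                        new HttpHost("es-node-3", 9200, "http")));
    }
}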
When you start an Elasticsearch cluster, you can see the whole cluster as a single database. It means that you can fetch from and insert into the whole cluster by sending your request to any one of the nodes. You just need to send your request to a node and fetch your data.
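Putting the answers together, here is a rough sketch of the scroll-and-bulk loop the question describes. It is only a sketch: it uses the 7.x REST high-level client on both sides for brevity (in the question's setup the 5.6 side actually uses plain HTTP), and the index names and page size are placeholders.

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ScrollMigration {

    public static void migrate(RestHighLevelClient source, RestHighLevelClient target) throws Exception {
        SearchRequest search = new SearchRequest("old-index"); // placeholder name
        search.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(1000));
        search.scroll(TimeValue.timeValueMinutes(2)); // keep the scroll context alive

        SearchResponse page = source.search(search, RequestOptions.DEFAULT);
        String scrollId = page.getScrollId();

        while (page.getHits().getHits().length > 0) {
            BulkRequest bulk = new BulkRequest();
            for (SearchHit hit : page.getHits().getHits()) {
                bulk.add(new IndexRequest("new-index") // placeholder name
                        .id(hit.getId())
                        .source(hit.getSourceAsString(), XContentType.JSON));
            }
            target.bulk(bulk, RequestOptions.DEFAULT);

            SearchScrollRequest next = new SearchScrollRequest(scrollId);
            next.scroll(TimeValue.timeValueMinutes(2));
            page = source.scroll(next, RequestOptions.DEFAULT);
            scrollId = page.getScrollId();
        }

        ClearScrollRequest clear = new ClearScrollRequest(); // free the scroll context
        clear.addScrollId(scrollId);
        source.clearScroll(clear, RequestOptions.DEFAULT);
    }
}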

Index a large amount of data into Elasticsearch

I've got more than 6 billion social media records in HBase (including content/time/author and other possible fields), with 4100 regions across 48 servers, and I need to flush these data into Elasticsearch now.
I'm clear about the bulk API of ES, and using bulk in Java with MapReduce would still cost many days (at least a week or so). I could use Spark instead, but I don't think it would help a lot.
I'm wondering if there are any other tricks to write this large amount of data into Elasticsearch, like manually writing ES index files and using some kind of recovery to load the files from the local file system?
I'd appreciate any possible advice, thanks.
==============
Some details about my cluster environment:
Spark 1.3.1 standalone (I can switch it to YARN and use Spark 1.6.2 or 1.6.3)
Hadoop 2.7.1 (HDP 2.4.2.258)
Elasticsearch 2.3.3
AFAIK, Spark is the better option for indexing, out of the 2 options below.
Along with that, here are the approaches I'd offer:
Divide and conquer (via input scan criteria) the 6 billion social media records:
I'd recommend creating multiple Spark/MapReduce jobs with different scan criteria (dividing the 6 billion records into, say, 6 pieces based on category or something else) and triggering them in parallel.
For example, split based on the data-capture time range (scan.setTimeRange(t1, t2)) or with some fuzzy row logic (FuzzyRowFilter); that should definitely speed things up, as in the sketch below.
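For instance, a sketch of splitting the scan by capture time so each job reads a disjoint slice (the table wiring is elided; boundaries and tuning values are made up):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;

public class ScanSlices {

    // Build one Scan per time slice; each Spark/MapReduce job gets its own
    // slice so the 6 billion rows are read in parallel, disjoint chunks.
    public static Scan sliceByTime(long sliceStartMs, long sliceEndMs) throws IOException {
        Scan scan = new Scan();
        scan.setTimeRange(sliceStartMs, sliceEndMs); // [start, end) on cell timestamps
        scan.setCaching(500);                        // rows per RPC; tune for your cluster
        scan.setCacheBlocks(false);                  // don't pollute the block cache in full scans
        return scan;
    }
}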
OR
A kind of streaming approach:
You can also consider, as and when you insert data through Spark or MapReduce, simultaneously creating indexes for it.
For example, in the case of Solr: Cloudera has the NRT HBase Lily indexer, i.e. as and when an HBase table is populated, based on WAL (write-ahead log) entries, it simultaneously creates Solr indexes. Check whether anything like that exists for Elasticsearch.
Even if there is nothing like that for ES, don't worry; while ingesting the data with your Spark/MapReduce program, you can create the indexes yourself.
Option 1:
I'd suggest, if you are okay with Spark, that it is a good solution.
Spark has supported native integration with ES since elasticsearch-hadoop 2.1.
See:
elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset) (or Pair RDD to be precise) that can read data from Elasticsearch. The RDD is offered in two flavors: one for Scala (which returns the data as Tuple2 with Scala collections) and one for Java (which returns the data as Tuple2 containing java.util collections).
Samples here cover the different Spark versions from 1.3 onwards.
More samples, beyond HBase.
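A minimal sketch of what Option 1 looks like with the Java API of elasticsearch-hadoop (the index name, node address, and batch size are invented; reading from HBase into the RDD is elided):

import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class HBaseToEs {

    public static JavaSparkContext context() {
        SparkConf conf = new SparkConf()
                .setAppName("hbase-to-es")
                .set("es.nodes", "es-node-1:9200")     // placeholder ES address
                .set("es.batch.size.entries", "3000"); // bulk size; tune for your cluster
        return new JavaSparkContext(conf);
    }

    public static void index(JavaRDD<Map<String, Object>> docs) {
        // docs would come from an HBase scan (e.g. newAPIHadoopRDD over
        // TableInputFormat), mapped to field -> value maps.
        JavaEsSpark.saveToEs(docs, "social/post"); // "index/type" resource, invented name
    }
}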
Option 2: as you are aware, a bit slower than Spark.
Writing data to Elasticsearch:
With elasticsearch-hadoop, Map/Reduce jobs can write data to Elasticsearch, making it searchable through indexes. elasticsearch-hadoop supports both the (so-called) old and the new Hadoop APIs.
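And for Option 2, a sketch of the job wiring with EsOutputFormat from elasticsearch-hadoop (names are placeholders; the mapper that emits the documents is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class EsIndexJob {

    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "es-node-1:9200"); // placeholder ES address
        conf.set("es.resource", "social/post"); // target "index/type", invented name
        conf.setBoolean("mapreduce.map.speculative", false);    // avoid duplicate writes
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "hbase-to-es");
        job.setOutputFormatClass(EsOutputFormat.class);
        // job.setMapperClass(...): emit writable values representing the docs
        return job;
    }
}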
I found a practical trick to improve bulk index performance myself.
I can calculate the routing hash in my client and make sure that each bulk request contains only index requests with the same routing. Using the routing result and the shard info (with IPs), I send the bulk request directly to the corresponding shard node. This trick avoids the bulk reroute cost and cuts down the occupation of the bulk-request thread pool, which may otherwise cause an EsRejectedException.
For example, I have 48 nodes on different machines. Assume I send a bulk request containing 3000 index requests to an arbitrary node; these index requests will be rerouted to other nodes (usually all the nodes) according to the routing. The client thread then has to wait for the whole process to finish, including processing the local bulk and waiting for the other nodes' bulk responses. However, without the reroute phase, the network costs are gone (except for forwarding to the replica nodes), and the client just needs to wait less time. Meanwhile, assuming that I have only 1 replica, the total occupation of bulk threads is only 2 (client -> primary shard and primary shard -> replica shard).
Routing hash:
shard_num = murmur3_hash (_routing) % num_primary_shards
Take a look at org.elasticsearch.cluster.routing.Murmur3HashFunction.
The client can get the shard info and the index aliases by requesting the cat APIs:
shard info: GET _cat/shards
alias mapping: GET _cat/aliases
Some caveats:
ES may change the default hash function between versions, which means the client code may not be version compatible.
This trick is based on the assumption that the hash results are basically balanced.
The client should also consider fault tolerance, such as connection timeouts to the corresponding shard node.
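For illustration, a sketch of the grouping step. Note that Murmur3HashFunction's visibility and signature differ between ES versions (static in recent versions, an instance method in older ones), so adjust for yours:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.cluster.routing.Murmur3HashFunction;

public class ShardGroupedBulk {

    // Group index requests so that each BulkRequest only targets one shard;
    // each bulk can then be sent straight to a node holding that shard's primary.
    public static Map<Integer, BulkRequest> groupByShard(List<IndexRequest> requests,
                                                         int numPrimaryShards) {
        Map<Integer, BulkRequest> byShard = new HashMap<>();
        for (IndexRequest req : requests) {
            // Same formula as the server: murmur3_hash(_routing) % num_primary_shards.
            // _routing defaults to the document id when no explicit routing is set.
            String routing = req.routing() != null ? req.routing() : req.id();
            int shard = Math.floorMod(Murmur3HashFunction.hash(routing), numPrimaryShards);
            byShard.computeIfAbsent(shard, s -> new BulkRequest()).add(req);
        }
        return byShard;
    }
}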

Adding a cluster to existing Elasticsearch in ELK

Currently I have an existing
1. Elasticsearch
2. Logstash
3. Kibana
with existing data on them.
Now I have set up an ELK cluster with 3 master nodes, 5 data nodes, and 3 client nodes,
but I am not sure how I can get the existing data into it.
Is it possible to make the existing ES node a data node and attach it to the cluster? Would the data then get replicated to the other data nodes as well, so that I could afterwards take that node offline?
Option 1
How about just trying it with fewer nodes? It is not hard to test whether this is supported: set up one node, feed in some data, then add one more node and configure the two as a cluster to see if the data gets synchronized.
Option 2
Another option is to use an Elasticsearch migration tool like https://github.com/taskrabbit/elasticsearch-dump; basically, you set up a clean cluster and migrate all the data from your old node into it.

ElasticSearch on Cassandra data vs moving Cassandra data to ElasticSearch for Indexing

I'm new to Elasticsearch and am trying to figure out the most optimal way to index 1 terabyte of data held in Cassandra.
Two options that I understand right now are:
Move the data periodically to Elasticsearch using the Cassandra-River plugin and then run the indexing on that data.
Advantage: search queries create no impact on the Cassandra load.
Disadvantage: the data has to be synced periodically.
Without moving the data, run Elasticsearch on Cassandra to index the data in place (not sure how this would be done).
Advantage: the data is always in sync.
Disadvantage: it impacts Cassandra's performance?
Any thoughts would be appreciated.
Perhaps, in the context of Elasticsearch 1.4 and above, just using Elasticsearch as the datastore and search engine might be the simpler and more elegant option.
Add more nodes to scale.

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, each about ~1 KB, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with e.g. "number_of_replicas": 4? I have never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I rather use another storage for this use case?
(Writing them to the local file system isn't safe, as our hard disks crash often. I first thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases high availability and failover support, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set at backup time. It is used to restore the system after a crash.
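To make the distinction concrete in API terms, a hedged sketch with the ES high-level client (the repository name, location, and index name are placeholders, and an fs repository also has to be registered via path.repo on every node):

import org.elasticsearch.action.admin.cluster.repositories.put.PutRepositoryRequest;
import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class ReplicaVsSnapshot {

    // Replicas: live redundant copies; gone if the documents are deleted.
    public static void createIndexWithReplicas(RestHighLevelClient client) throws Exception {
        CreateIndexRequest index = new CreateIndexRequest("json-snippets") // placeholder
                .settings(Settings.builder().put("index.number_of_replicas", 4));
        client.indices().create(index, RequestOptions.DEFAULT);
    }

    // Backup: a point-in-time snapshot to a registered repository, restorable
    // even after an accidental delete.
    public static void snapshot(RestHighLevelClient client) throws Exception {
        PutRepositoryRequest repo = new PutRepositoryRequest("backup-repo") // placeholder
                .type("fs")
                .settings(Settings.builder().put("location", "/mnt/backups")); // placeholder path
        client.snapshot().createRepository(repo, RequestOptions.DEFAULT);

        client.snapshot().create(
                new CreateSnapshotRequest("backup-repo", "snapshot-1")
                        .indices("json-snippets"),
                RequestOptions.DEFAULT);
    }
}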
Using Elasticsearch for backup is not a good idea. Elasticsearch is a search engine, not a DB. If you have not configured the ES cluster carefully, you will end up losing data.
So, in my opinion:
To store JSON objects, we have lots of DBs. For example, MongoDB is a NoSQL DB. We can easily configure it with more replicas, which means high availability of data and failover support. As you asked, it's also open source and more reliable.
For more info about MongoDB, see https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, they will be distributed among the nodes. If a node fails, the data on it will be lost. But in MongoDB, more nodes means that each MongoDB node contains its own copy of the data. If a MongoDB node fails, we can retrieve our data from the replica MongoDBs. We need to be much more conscious about replica setup and shard allocation in Elasticsearch, whereas in MongoDB it is easier, and a good architecture too.
Note: I didn't say that storing data in Elasticsearch is unsafe. I mean that, compared to MongoDB, it is more difficult to configure and maintain replicas in Elasticsearch.
Hope it helps!
