SolrCloud DIH indexing performance

I've got Solr 6.4.2 running in SolrCloud and have some doubts about indexing performance.
I am using MS SQL as the data source, with the newest JDBC driver for MSSQL.
When Solr is started as standalone my DataImport runs at 31250 docs/s
When Solr is started as SolrCloud (2 replicas) my DataImport runs at 10000 docs/s
Is there any config parameter which has an influence on this?

It is expected that indexing in SolrCloud will be slower than indexing in standalone Solr: it has to index into the replicas too, so there is additional network traffic and latency, and SolrCloud has other work to do as well. But you can do some things to make sure it goes as fast as possible:
You can shard the index. Indexing into several shards should be faster (test different shard counts; at some point there will be too many, so don't go crazy).
Send your docs to the shard leader. Indexing is done at the leader first, so if you send a doc to its leader you save some network traffic. Of course you have little control over this if you are using DIH, unless you customize your DIH setup and have several handlers, each one indexing only the docs for one shard, and you call each handler on that shard's node (see the sketch below).
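For illustration, here is a minimal sketch (plain JSON updates rather than DIH) of looking up each shard's leader via the Collections API and posting documents straight to it. The collection name, host, and the docs_for_shard helper are assumptions for the example, not part of the original setup:

    import requests

    SOLR = "http://solr-node1:8983/solr"   # any node in the cluster
    COLLECTION = "mycollection"            # hypothetical collection name

    # Ask the cluster which core is the leader of each shard
    status = requests.get(
        f"{SOLR}/admin/collections",
        params={"action": "CLUSTERSTATUS", "collection": COLLECTION, "wt": "json"},
    ).json()

    leaders = {}
    shards = status["cluster"]["collections"][COLLECTION]["shards"]
    for shard_name, shard in shards.items():
        for replica in shard["replicas"].values():
            if replica.get("leader") == "true":
                # e.g. "http://host:8983/solr" + "/" + "mycollection_shard1_replica1"
                leaders[shard_name] = replica["base_url"] + "/" + replica["core"]

    def docs_for_shard(shard_name):
        # placeholder: return only the documents that route to this shard
        return []

    # Post each shard's documents directly to its leader's update handler
    for shard_name, leader_url in leaders.items():
        docs = docs_for_shard(shard_name)
        if docs:
            requests.post(
                f"{leader_url}/update?commit=false",
                json=docs,
                headers={"Content-Type": "application/json"},
            )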

Related

How Lucene Data Replication Works on Technologies Like ElasticSearch and Apache Solr

In a high-availability environment, how do these technologies replicate Lucene data? How could I replicate my Lucene directories myself, considering that today I do not use either of these technologies?
That question is probably too broad to answer usefully, but in general you have two options:
Index the document to a master node, then replicate the index files that have changed to all other nodes. These are usually known as master/slave setups. The first versions of Solr used rsync to do this - that way Solr didn't have to know anything about replication itself. Later versions used HTTP to replicate the index files instead. If you already have a Lucene index that you want to make available on more nodes, this is the easiest solution that doesn't require fundamental changes to your project.
Distribute the document that's going to be added to the index to all known replicas of that index/shard. The indexing process happens on each node; what gets distributed is the document itself, before it has been added to an index. This is (simplified) what happens when Solr runs in cloud / cluster mode (and is what ES does as well, IIRC). There are also transaction logs etc. involved here to make it more resilient to failures across nodes.
So either distribute the updates themselves or distribute the updated index.
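As a toy illustration only (not actual Solr or ES code), the difference between the two models might look like this; the directory paths, replica URLs, and the /index endpoint are all hypothetical:

    import pathlib
    import shutil
    import requests

    # Option 1: master/slave style -- index locally, then ship the changed
    # index files (Lucene segments) to the other nodes.
    def replicate_index_files(master_dir, slave_dirs):
        for slave in slave_dirs:
            for f in pathlib.Path(master_dir).iterdir():
                target = pathlib.Path(slave) / f.name
                if not target.exists():           # only copy segment files the slave lacks
                    shutil.copy2(f, target)

    # Option 2: cloud/cluster style -- ship the document itself to every replica
    # and let each node do its own indexing (hypothetical /index endpoint).
    def replicate_document(doc, replica_urls):
        for url in replica_urls:
            requests.post(f"{url}/index", json=doc)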

setting up a basic elasticsearch cluster

I'm new to Elasticsearch and would like someone to help me clarify a few concepts.
I'm designing a small cluster with the following requirements:
everything should still work when restarting one of the machines, one at a time (e.g. OS updates)
a single disk failure is ok
heavy indexing should not impact query performance
How many master, data, ingest nodes should I have?
Or do I need 2 clusters?
The indexing workload is purely indexing structured text documents, with no processing/rules... do I even need an ingest node?
Also, does each node have a complete copy of all the data, or does only the cluster as a whole have the complete copy?
Be sure to read the documentation about Elasticsearch terminology at the very least.
With the default of 1 replica (primary shard and one replica shard) you can survive the failure of 1 Elasticsearch node (failed disk, restart, upgrade,...).
"heavy indexing should not impact query performance": You'll need to size your cluster correctly to handle both the indexing and searching. If you want to read current data and you do heavy updates, that will take up resources and you won't be able to fully decouple it.
By default every node is a data, ingest, and master-eligible node. The minimum HA setting needs 3 nodes. If you don't use ingest that's fine; it won't take up resources when you're not using it.
To understand which node has which data, you need to read up on the concept of shards. Basically every index is broken up into 1 to N shards (current default is 5) and there is one primary and one replica copy of each one of them (by default).
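As a small, hedged example of how those counts are set, here is how you might create an index with an explicit shard/replica layout and then check which node holds which shard; the index name and counts are just illustrations:

    import requests

    ES = "http://localhost:9200"

    # Create an index with 5 primary shards and 1 replica of each
    requests.put(
        f"{ES}/logs-example",
        json={
            "settings": {
                "number_of_shards": 5,    # primary shards (the old default mentioned above)
                "number_of_replicas": 1,  # one replica copy of each primary
            }
        },
        headers={"Content-Type": "application/json"},
    )

    # _cat/shards shows which node holds each primary (p) and replica (r) shard
    print(requests.get(f"{ES}/_cat/shards/logs-example?v").text)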

Shard Management in Lucene and Elasticsearch

I'm facing the problem of setting up a production-ready Elasticsearch cluster.
At the moment I'm storing only the testing log files in Elasticsearch.
So far so good, but since we have production logs of 1 TB per day,
I was wondering how to set up an Elasticsearch index properly for this use case.
We want to keep these logs for 30 days. The cluster setup has 100 TB of disk space.
I would like to choose a replica count of 3, so the used disk space should be around 90 TB.
But how many shards should I allocate?
Is there a difference between the shards in Elastic and the Lucene segments?
You should read the article that Val linked. But in the case of logs you can create one index per day; this strategy gives you the ability to try different configurations.
The replica count should depend on the number of Elasticsearch nodes you have.
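A minimal sketch of the one-index-per-day idea, assuming an Elasticsearch host on localhost, a "logs-" index prefix, and example shard/replica counts (none of which come from the original question):

    import datetime
    import requests

    ES = "http://localhost:9200"

    def daily_index_name(prefix="logs"):
        # e.g. "logs-2017.06.01"
        return f"{prefix}-{datetime.date.today():%Y.%m.%d}"

    # Create today's index with example settings
    index = daily_index_name()
    requests.put(
        f"{ES}/{index}",
        json={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
        headers={"Content-Type": "application/json"},
    )

    # Indices older than the 30-day retention window can simply be deleted
    cutoff = datetime.date.today() - datetime.timedelta(days=30)
    requests.delete(f"{ES}/logs-{cutoff:%Y.%m.%d}")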
You can also read this short article:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_how_primary_and_replica_shards_interact.html
And if you have doubts about count of replicas, this one can also help you:
https://codingexplained.com/coding/elasticsearch/understanding-replication-in-elasticsearch

Faster Logstash to Elastic indexing from flat files

I'm indexing JSON files out of S3 into Elastic with Logstash's S3 input plugin running on an EC2 T2.Medium instance. This works fine, but it's incredibly slow. I'm looking for some advice on faster ways of doing this as I realise multithreading with multiple Logstash instances out of S3 isn't an option.
My source data is actually in Google BigQuery tables, so if there was a way I could index from there that would be great, but I can't find a plugin or an obvious way of doing this. I've been exploring the idea of pushing the BigQuery data into Redis first, but with the volume of data I'm looking to index I'm concerned this adds extra overhead, both technical and cost-wise, that could be avoided.
My Elastic cluster is very simple, single node / single shard. I ran a test on a multi-node cluster to see if there were any indexing speed increases, and it stayed the same. I'm using Elastic's hosted cloud service, formerly Found, so I'm not sure if that would have any bearing on this.
At present I'm happily indexing around 5M rows a day, albeit slowly. I'm aiming to be able to index around 100M per day in as quick a time as possible. At the current EPS, it'll take days!
Any general pointers would be much appreciated.

Which is better, Apache Solr or Elasticsearch?

I started creating my new search application. In my earlier application I used Apache Solr. Now I want to know which is better in terms of performance and usability.
Personally, I want to know the performance benchmarks of Elasticsearch and Solr. If there are other alternatives, suggestions are most welcome.
Disclaimer: I work at elasticsearch.com
I would just say: give Elasticsearch a try. I think that after some hours (minutes?), you will already have formed an opinion.
Start 2 or 3 or 4 nodes, and you will see how things are rebalanced nicely.
About performance, I'd say that Elasticsearch will give you consistent query throughput even while you are doing massive indexing operations.
I have used both quite a bit, and much prefer ElasticSearch. The API is more flexible and accessible. It is easier to get started with. Replication happens automatically by default. In general all the defaults are easier to work with. Everything generally works out of the box (safe defaults) and you only need to tune what you find needs to work better.
I have not worked much with SOLR 4, only with 3.x. Once I switched I never looked back, but I hear that there are many improvements in 4 with regards to replication and clustering that make it a usable competitor.
With regards to performance, I think that generally they are comparable as they both rely on Lucene. That is why there is a lack of valid benchmarks that make this general comparison. That said, there are certainly use cases where one will perform better than the other.
If you look at usage trends, while there are many more people currently using SOLR, it is in decline. That decline correlates strongly with the increase in Elasticsearch users, which is very much on the rise. As Dadoonet said, give ElasticSearch a try; it won't take long, and you won't want to use SOLR again.
UPDATE
I just spent two weeks on a client site consulting on a SOLR Cloud installation. I am now much more familiar with the updates to SOLR, and say quite confidently, I still prefer ElasticSearch, but it seems SOLR has some momentum again.
ElasticSearch is hands down more elastic. That is, having an elastic cluster where nodes come and go, or even where you just need to add nodes, is much, much easier in ElasticSearch than in SOLR. Anyone who tells you it is easy in SOLR has not done it in ElasticSearch. ElasticSearch will automatically join a cluster and assume an active role in that cluster, taking over serving available shards and replicas. Over the last week I decommissioned a 2-node cluster, replacing it with two new nodes. I simply added the 2 new nodes and, one at a time, marked the other two nodes as non-data nodes. Once the shard migration completed I decommissioned the old nodes. I had set minimum_master_nodes = 2 ((2/2)+1), and had no issue with split brain.
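As a rough sketch of that drain-and-decommission procedure (not the exact commands used; the host and node names are made up, and discovery.zen.minimum_master_nodes applies to pre-7.x clusters only):

    import requests

    ES = "http://localhost:9200"

    # Tell the cluster to move all shards off the node being retired
    requests.put(
        f"{ES}/_cluster/settings",
        json={"transient": {"cluster.routing.allocation.exclude._name": "old-node-1"}},
        headers={"Content-Type": "application/json"},
    )

    # Guard against split brain with a quorum of master-eligible nodes
    requests.put(
        f"{ES}/_cluster/settings",
        json={"persistent": {"discovery.zen.minimum_master_nodes": 2}},
        headers={"Content-Type": "application/json"},
    )

    # Once _cat/shards shows nothing left on old-node-1, it is safe to shut it down
    print(requests.get(f"{ES}/_cat/shards?v").text)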
During the same week, I had to add a node to a SOLR cluster. The process was poorly documented, especially considering the changes from 4.1 to 4.3 and the mishmash of existing documentation, much of which says you can't even do it, based on old versions of SOLR. I finally found documentation which clarified the process. It requires manually adding a core to the collection and then adding replicas to existing shards within the cluster. Finally, you manually decommission the redundant shards on some other node. At some point this node may become master for one of those shards, but not immediately.
With SOLR, if you do not have sufficient shards to distribute, you can just add replicas, or you can go through a shard split to create two new shards. Again this is a poorly documented feature, but it is functionality that does not exist in ElasticSearch. You must split and then remove the original shard, something none of the documentation clearly explains.
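For reference, the split-then-remove sequence via the Collections API looks roughly like this (this is the present-day API form, not necessarily what was available in SOLR 4.1-4.3, and the collection/shard names are examples):

    import requests

    SOLR = "http://solr-node1:8983/solr"
    COLLECTION = "mycollection"

    # Split shard1 into two new sub-shards (shard1_0 and shard1_1)
    requests.get(
        f"{SOLR}/admin/collections",
        params={"action": "SPLITSHARD", "collection": COLLECTION, "shard": "shard1"},
    )

    # After the split finishes, the parent shard becomes inactive and can be removed
    requests.get(
        f"{SOLR}/admin/collections",
        params={"action": "DELETESHARD", "collection": COLLECTION, "shard": "shard1"},
    )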
SolrCloud has a couple of other advantages as well if you are integrating with Hadoop. If you are indexing data in HDFS or HBase, there are now both MapReduce and real-time methods of ingesting data into SOLR. This provides some real power to your Big Data platform and allows you to do full-text search over data that is otherwise barely accessible.
While you can index Hadoop data into ElasticSearch, the implementation is not as clean as the SolrCloud/Cloudera Search implementations. Having MapReduce directly build the shards is a far superior solution with significant performance benefits. Reducers talking directly to a cluster works, but it is not the same. I do not know if anything similar to the Lily connector for HBase exists for ElasticSearch; if not, I may look into writing one. It allows indexing directly from the HBase replication logs.
So in summary, there are certainly situations where either is beneficial. If you are looking for tight integration with Hadoop, SOLR (Cloudera Search specifically) is a good option. If you are looking for ease in managing an elastic cluster, Elasticsearch will be a much better option. For me, I'll continue with my hacky Hadoop integrations to make it work with Elasticsearch, until something better emerges.

Resources