Below are our master and slave (14 slaves) configurations.
Master
<requestHandler name="/replication"
class="MyCustomizedReplicationHandler" >
<lst name="master">
<str name="replicateAfter">optimize</str>
<str name="replicateAfter">startup</str>
</lst>
</requestHandler>
Slave
<requestHandler name="/replication"
class="MyCustomizedReplicationHandler" >
<lst name="slave">
<str name="masterUrl">http://${masterHostName}/solr-master-4.2.0/${solr.core.name}</str>
<str name="pollInterval">04:00:00</str>
</lst>
</requestHandler>
We have scheduled a batch to run every day. Query response time is high during the very first replication of the day, since the subsequent replications of the same day have no changes to transfer.
Solr Query
(AC_SEARCH:(belfast*) AND (TYPE:(ARP) OR HAS_AIRPORTS:(true)))
I gave the direct Solr query here, but in the application we use a Solr client to communicate with Solr.
The same kind of query has a different response time during replication than outside of it.
Please help me fix this: is there anything I need to change in the configuration?
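For reference, the raw query above can be reproduced over plain HTTP, independent of the client library. A minimal Python sketch (the host name and collection name are placeholders for our environment):

```python
import json
import urllib.request
from urllib.parse import urlencode

def build_solr_select_url(base_url: str, collection: str, query: str) -> str:
    """Build the /select URL for a raw Solr query string."""
    params = urlencode({"q": query, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"

def num_found(url: str) -> int:
    """Execute the query and return the hit count (requires a running Solr)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["response"]["numFound"]

# Example URL (host and collection are placeholders):
url = build_solr_select_url(
    "http://localhost:8983", "mycollection",
    "(AC_SEARCH:(belfast*) AND (TYPE:(ARP) OR HAS_AIRPORTS:(true)))")
```

Timing such a call during and outside the replication window is a simple way to quantify the slowdown being described.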
Context
We have two Elasticsearch clusters with 6 and 3 nodes each. The 6-node cluster is the one we use in the production environment, and we use the 3-node one for testing purposes. (We have the same problem in both clusters.) All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start reindexing, internally or from a remote source, the task starts working correctly, but it reaches a point where it stops reindexing for no apparent reason. We can't see anything in the logs. The task is not cancelled or anything; it simply stops reindexing documents, as if the task were stuck. We tried changing GC strategies (we used CMS and Shenandoah), but nothing changes.
Has anyone run into the same problem?
It's difficult to find the root cause of issues like this without debugging them, and with the little information you provided (missing cluster and index configuration, index slow log information, Elasticsearch error logs, and Elasticsearch hot threads, to name a few).
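As a first debugging step for a seemingly stuck reindex, the tasks API shows whether the task's document counters are still advancing. A minimal sketch, assuming a reachable cluster (the URL is a placeholder):

```python
from urllib.parse import urlencode

def build_reindex_tasks_url(base_url: str) -> str:
    """URL that lists running reindex tasks with per-task progress counters."""
    params = urlencode({"detailed": "true", "actions": "*reindex"})
    return f"{base_url}/_tasks?{params}"

def task_progress(tasks_response: dict) -> list:
    """Pull (task_id, status) pairs out of a GET _tasks response body."""
    out = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            out.append((task_id, task.get("status", {})))
    return out
```

If the `created` counter in a task's `status` stops growing while the task stays listed, the task is genuinely stalled rather than just slow, which narrows down where to look next.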
I have a 3-node Solr data center. I am trying to redirect all queries to node1 using the Solr HTTP API, because I think I have problems with node2 and node3 and will replace them. I enabled the DataStax Solr slow query metric. I see two main problems.
Even though I set shard.shuffling.strategy=host, and the documentation says:
host
Shards are selected based on the host that received the query.
I expect that when I request http://node1:8983/solr/...., the coordinator_ip and node_ip columns in the solr_slow_sub_query_log table will be the same. When I fetch the records, I see node1 only 80% of the time. Isn't that wrong? I expect 100% of requests to use node1.
When I fetch records from solr_slow_sub_query_log, I see that rows with coordinator_id=node1 and node_ip=node2 or node3 have a very high elapsed_millis, such as 1300 seconds, even though the documentation says netty_client_request_timeout is 60 seconds.
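One way to test whether sub-queries really stay on node1 is to name the shard replicas explicitly instead of relying on the shuffling strategy. A sketch that builds such a request; note that whether an explicit `shards` parameter overrides DSE's shard shuffling in your version is an assumption you would need to verify:

```python
from urllib.parse import urlencode

def build_pinned_select_url(node: str, collection: str, query: str, shard_hosts) -> str:
    """Query one coordinator and list the shard replicas explicitly.

    Standard Solr distributed search honours an explicit `shards`
    parameter; the hosts in `shard_hosts` must each hold a complete
    set of the collection's shards for results to be complete.
    """
    shards = ",".join(f"{h}:8983/solr/{collection}" for h in shard_hosts)
    params = urlencode({"q": query, "shards": shards})
    return f"http://{node}:8983/solr/{collection}/select?{params}"
```

Comparing the slow-query log after sending pinned requests like this would show whether the 20% of sub-queries on node2/node3 come from shuffling or from something else.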
I am currently assessing if we can move our Solr based backend to Elasticsearch.
However, something I can't seem to work out is whether Elasticsearch has an equivalent of a custom request handler as configured in Solr (i.e., in solrconfig.xml).
For context, in our Solr configuration, we have a number of statically defined request handlers with a set of pre-configured facets, ranged facets, facet pivots. Something akin to the below, configured in solrconfig.xml:
<requestHandler name="/foo" class="solr.SearchHandler">
<lst name="defaults">
<str name="fl">
field1,
field2
</str>
<str name="facet.field">bar</str>
<str name="facet.range">range_facet</str>
<str name="f.range_facet.facet.range.start">0</str>
<str name="f.range_facet.facet.range.end">10</str>
<str name="f.range_facet.facet.range.gap">1</str>
</lst>
</requestHandler>
I could then GET a set of documents directly from that request handler with something like this: http://solr-host:8983/solr/collection-name/foo?q=*:*
and Solr would return a document set with only the desired fields and facets. Fundamentally, the application executing the query does not need to be aware of (or configured to request) all the returned elements at query time.
My question is this - in Elasticsearch, is there an ability to configure an endpoint that would return only the desired aggregations and/or fields without having to post those to the API at the time of the query?
There is a good article on this: https://sematext.com/blog/2014/04/29/parametrizing-queries-in-solr-and-elasticsearch/ . Elasticsearch basically uses search templates in place of request handlers for pre-parameterized search calls. There are a number of stored templates available for use too; see the Template Query documentation.
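Concretely, the Solr handler above maps onto a stored search template: the `_source` filtering and aggregations are registered once under an id, and callers pass only parameters. A minimal sketch of the two request bodies (the field and aggregation names mirror the Solr example; the template id `foo-handler` is made up):

```python
import json

def stored_template_body(fields: list, facet_field: str) -> dict:
    """Body for PUT /_scripts/<id>: the query shape lives server-side."""
    source = {
        "_source": fields,                                   # like Solr's fl
        "query": {"query_string": {"query": "{{q}}"}},       # mustache param
        "aggs": {facet_field: {"terms": {"field": facet_field}}},  # like facet.field
    }
    # Search template sources are stored as a mustache string.
    return {"script": {"lang": "mustache", "source": json.dumps(source)}}

def template_search_body(template_id: str, params: dict) -> dict:
    """Body for POST /<index>/_search/template: callers pass only params."""
    return {"id": template_id, "params": params}
```

You would PUT the first body to `/_scripts/foo-handler` and POST the second to `/<index>/_search/template`; a `range` aggregation can stand in for `facet.range` in the same way.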
I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs are in different data sources. I am confused about whether Solr should search, index, and store these docs within itself, or whether Solr should just be used for indexing, with the docs and their metadata residing in the HBase/HDFS layer.
I have tried to find out how Solr/HBase integration works best (i.e., what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big data search before and can give some pointers? Thanks
Solr provides fast search via its indexes; it uses inverted indexes for this. You index documents into Solr, and it creates the indexes. Based on how you have defined schema.xml, Solr decides how the indexes have to be created. The indexes and the stored field values can be kept in HDFS (based on your config in solrconfig.xml).
With respect to HBase, you can run your queries directly on HBase without having to use Solr. SolrBase is a Solr and HBase integration that is available; also have a look at Lily.
The commonly followed design is: search for things in Solr, quickly get the ids of the matching records, and then, if needed, fetch the entire record from HBase. You need to make sure the entire data set is in HBase and only the fields needed for search are indexed. Needless to say, Solr and HBase should be kept in sync. One ready-made framework is the NGDATA hbase-indexer.
Solr works wonders for counts, grouped counts, and stats. Once you get those numbers and their ids, HBase can take over: once you have the row key (id) in HBase, you get low-latency lookups, which suits web applications well.
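The id-then-fetch pattern described above can be sketched as two small steps. The HBase call is left as a placeholder, since it depends on your access layer (e.g. Thrift via happybase):

```python
from urllib.parse import urlencode

def build_id_only_query(base_url: str, collection: str, query: str, rows: int = 50) -> str:
    """Ask Solr only for the row keys; full records come from HBase afterwards."""
    params = urlencode({"q": query, "fl": "id", "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"

def fetch_records(ids, hbase_get):
    """`hbase_get` is whatever single-row lookup your HBase client exposes
    (e.g. a happybase table.row call); it is a placeholder here."""
    return [hbase_get(doc_id) for doc_id in ids]
```

Keeping `fl` limited to the id (plus any fields needed for display) keeps the Solr index small, which is exactly the division of labour the answer recommends.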
We use Elasticsearch for our tool's real-time metrics and analytics. Elasticsearch is very cool and fast when we query our data (statistical facets and terms facets).
But we have a problem when we try to index our hourly data. Every hour we collect metric data from other services and push it to RabbitMQ. But when the queue worker runs, not all of the hourly data gets indexed into ES: usually only about 40% of the data ends up in ES and the rest is lost.
So what is your advice on indexing into ES under high traffic?
I've posted answers to other similar questions:
Ways to improve first time indexing in ElasticSearch
Performance issues using Elasticsearch as a time window storage (latter part of my answer applies)
Additionally, instead of a custom 'queue worker' have you considered using a 'river'? For more information see:
http://www.elasticsearch.org/blog/the-river/
http://www.elasticsearch.org/guide/reference/river/
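Note that rivers were removed in Elasticsearch 2.0, so on current versions the usual replacement is a queue worker that sends `_bulk` requests and checks per-item errors before acknowledging the RabbitMQ messages; silently dropped documents (like the ~60% loss described above) are often bulk rejections that go unnoticed when the response's `errors` flag isn't inspected. A minimal sketch of building the bulk body (the index name is a placeholder; the action line assumes a typeless 7.x-style index):

```python
import json

def build_bulk_body(index: str, docs: list) -> bytes:
    """Serialize documents into the newline-delimited _bulk request body:
    one action line, then one document line, per document, with a
    trailing newline as the bulk API requires."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return ("\n".join(lines) + "\n").encode()
```

After POSTing this to `/_bulk`, check `errors` in the JSON response and retry (or dead-letter) the failed items instead of acking the whole batch; that alone usually accounts for "lost" documents under load.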