How A Solr Request Takes Too Much Time - performance

I have a 3-node Solr data center. I am trying to redirect all queries to node1 using the Solr HTTP API because I think I have problems with node2 and node3, which I will replace. I enabled the DataStax Solr slow query metric, and I see two main problems.
First: I set shard.shuffling.strategy=host, and the documentation says:
host
Shards are selected based on the host that received the query.
So I expect that when I request http://node1:8983/solr/..., the coordinator_ip and node_ip columns in the solr_slow_sub_query_log table will always be the same. But when I look at the records, only about 80% of them are node1. Isn't that wrong? I expect 100% of the requests to use node1.
Second: when I read records from solr_slow_sub_query_log, I see that rows with coordinator_id=node1 and node_ip=node2 or node3 have very large elapsed_millis values, such as 1300 seconds, even though the documentation says netty_client_request_timeout is 60 seconds.
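To quantify how often node1 actually ends up doing the work itself, the slow query log table can be aggregated directly. Below is a minimal sketch (not from the original post) using the Python cassandra-driver; the dse_perf keyspace, the node address and the column names (coordinator_ip, node_ip, elapsed_millis, as used above) are assumptions that may differ per DSE version.

    # Sketch only: tally which nodes served the sub-queries coordinated by node1.
    # Assumes the table lives in the dse_perf keyspace and has the coordinator_ip,
    # node_ip and elapsed_millis columns mentioned above; adjust for your DSE version.
    from collections import Counter
    from cassandra.cluster import Cluster

    NODE1_IP = "10.0.0.1"                               # placeholder: node1's address
    session = Cluster(["node1"]).connect("dse_perf")    # hypothetical contact point / keyspace

    per_node = Counter()
    slowest_ms = 0
    rows = session.execute("SELECT coordinator_ip, node_ip, elapsed_millis "
                           "FROM solr_slow_sub_query_log")
    for row in rows:
        if str(row.coordinator_ip) == NODE1_IP:
            per_node[str(row.node_ip)] += 1
            slowest_ms = max(slowest_ms, row.elapsed_millis or 0)

    total = sum(per_node.values()) or 1
    for node, count in per_node.most_common():
        print(f"{node}: {count} sub-queries ({100.0 * count / total:.1f}%)")
    print(f"slowest sub-query coordinated by node1: {slowest_ms} ms")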

Related

Stormcrawler, the status index and re-crawling

So we have StormCrawler running successfully, with the main index currently having a little over 2 million URLs from our various websites indexed into it. That is working well; however, SC doesn't seem to be re-indexing URLs it indexed previously, and I am trying to sort out why.
I have tried searching for details on exactly how SC chooses its next URL from the status index. It does not seem to choose the oldest nextFetchDate, because we have docs in the status index with a nextFetchDate of Feb 3rd, 2019.
Looking through the logs, I see entries like:
2019-03-20 09:21:17.221 c.d.s.e.p.AggregationSpout Thread-29-spout-executor[17 17] [INFO] [spout #5] Populating buffer with nextFetchDate <= 2019-03-20T09:21:17-04:00
and that seems to imply that SC does not look at any URL in the status index with a date in the past. Is that correct? If SC gets overwhelmed with a slew of URLs and cannot crawl all of them by their nextFetchDate, do some fall through the cracks?
Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.
It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.
How do I re-queue the URLs that were missed on their nextFetchDate?
By default, the ES spouts will get the oldest records. What the logs are showing does not contradict that: the spout asks for the records with a nextFetchDate lower than the 20th of March for shard #5.
The nextFetchDate should actually be thought of as 'don't crawl before date D'; nothing falls through the cracks.
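For reference, what the spout asks ES for is conceptually close to the hand-written sketch below (an approximation, not StormCrawler's actual code): everything with nextFetchDate <= now, oldest first, restricted to one shard per spout instance. The index name 'status', the host and the shard number are assumptions.

    # Approximation of a status-spout query: oldest eligible URLs first,
    # i.e. nextFetchDate <= now, sorted ascending, pinned to a single shard.
    import datetime
    import requests

    now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    body = {
        "query": {"range": {"nextFetchDate": {"lte": now}}},
        "sort": [{"nextFetchDate": {"order": "asc"}}],
        "size": 10,
    }
    # preference=_shards:5 restricts the search to shard #5, like "spout #5" in the log line
    resp = requests.get("http://localhost:9200/status/_search",
                        params={"preference": "_shards:5"},
                        json=body)
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"].get("nextFetchDate"), hit["_id"])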
Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.
yep, that's normal.
It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.
that's what it does
How do I re-queue the URLs that were missed on their nextFetchDate?
They are not missed; they should be picked up by the spouts.
Maybe check that the number of spouts matches the number of shards you have on the status index. Each spout instance is in charge of one shard; if you have fewer instances than shards, those shards will never be queried.
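A quick way to check this (a sketch, assuming the index is called 'status' and ES listens on localhost:9200): read the index settings and compare the shard count with the spout parallelism configured in your topology.

    # Print the number of primary shards of the status index so it can be
    # compared with the spout parallelism configured in the topology.
    import requests

    settings = requests.get("http://localhost:9200/status/_settings").json()
    index_name = next(iter(settings))   # resolves the concrete index if 'status' is an alias
    shards = settings[index_name]["settings"]["index"]["number_of_shards"]
    print(f"{index_name}: {shards} primary shards -> configure at least {shards} spout instances")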
Inspect the logs for those particular URLs which should be fetched first: do they get sent by the spouts at all? You might need to turn the logs to DEBUG for that.

Investigating slow queries in ElasticSearch

We are using Elasticsearch version 5.4.1 in our production environment. The cluster setup is 3 data, 3 query, and 3 master nodes. Of late we have been observing a lot of slow queries on a particular data node, and the [index][shard] copies present on it are just replicas.
I don't find many deleted docs or memory issues that could directly cause the slowness.
Any pointers on how to go about the investigation here would be helpful.
Thanks!
Many things are happening during one ES query. First, check the took field returned by Elasticsearch.
took – time in milliseconds for Elasticsearch to execute the search
However, the took field is the time it took ES to process the query on its side. It doesn't include:
serializing the request into JSON on the client
sending the request over the network
deserializing the request from JSON on the server
serializing the response into JSON on the server
sending the response over the network
deserializing the response from JSON on the client
As such, I think you should try to identify the exact step that is slow.
Reference: Query timing: ‘took’ value and what I’m measuring
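A simple first check is to compare took with the wall-clock time measured on the client; the gap is everything outside ES (serialization, network, client overhead). A minimal sketch, with the host, index name and query as placeholders:

    # Compare client-side wall-clock time with Elasticsearch's own 'took' value;
    # the difference is spent on (de)serialization, the network and the client.
    import time
    import requests

    body = {"query": {"match_all": {}}}   # placeholder query
    start = time.monotonic()
    resp = requests.get("http://localhost:9200/my-index/_search", json=body)
    wall_ms = (time.monotonic() - start) * 1000.0
    took_ms = resp.json()["took"]
    print(f"wall clock: {wall_ms:.0f} ms, took: {took_ms} ms, "
          f"outside ES: {wall_ms - took_ms:.0f} ms")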

Request Highlighting only for the final set of rows

In a multi-node Solr installation (without SolrCloud), during a paging scenario (e.g., start=1000, rows=200), the primary node asks for 1200 rows from each shard. If highlighting is ON, the primary node asks each shard to highlight all 1200 of its results, which doesn't scale well. Is there a way to break the shard query into two steps: first ask for the 1200 rows, and after sorting the responses from each shard and finding the final rows to return (1001 to 1200), issue another query to the shards asking for the highlighted response only for the relevant docs?
So, it turns out that Solr's behavior has changed between my old version and 6.6, although this seems like a bug to me after the initial investigation.
I found that if I specify fl=* in the query, then it does the right thing (a two-pass process, as it used to do in Solr 4.5). However, my queries use fl=id+score, in which case the shards are asked to highlight all the results on the first request (and there is no second request).
The fl=* query is (in my sample case) finishing in 100 ms, while the same query with fl=id+score finishes in 1200 ms.
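For anyone who wants to reproduce the comparison, the sketch below simply times the same paged, highlighted, distributed query with the two fl variants; the collection name, shard list, query and highlight field are placeholders.

    # Time the same paged, highlighted, distributed query with fl=* vs fl=id,score
    # to reproduce the one-pass vs two-pass difference described above.
    import time
    import requests

    base = "http://node1:8983/solr/mycollection/select"   # hypothetical collection
    common = {
        "q": "some query",
        "start": 1000,
        "rows": 200,
        "hl": "true",
        "hl.fl": "content",
        "shards": "node1:8983/solr/mycollection,node2:8983/solr/mycollection",
        "wt": "json",
    }

    for fl in ("*", "id,score"):
        begin = time.monotonic()
        resp = requests.get(base, params=dict(common, fl=fl))
        elapsed_ms = (time.monotonic() - begin) * 1000.0
        qtime = resp.json()["responseHeader"]["QTime"]
        print(f"fl={fl}: wall {elapsed_ms:.0f} ms, QTime {qtime} ms")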

How does Elasticsearch bring back a node which is down

I was going through Elasticsearch and wanted to get consistent responses from ES clusters.
I read Elasticsearch read and write consistency
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-index_.html
and some other posts, and conclude that ES returns success for a write operation after completing the write on all shards (primary + replicas), irrespective of the consistency param.
Let me know if my understanding is wrong.
I am wondering if anyone knows how Elasticsearch adds a node/shard back into a cluster after it was down transiently. Will it start serving read requests immediately after it becomes available, or does it ensure it has up-to-date data before serving read requests?
I looked for the answer to the above question, but could not find any.
Thanks
Gopal
If a node is removed from the cluster and joins again, Elasticsearch checks whether its data is up to date. If it is not, it will not be made available for search until it is brought up to date again (which could mean the whole shard gets copied again).
The consistency parameter is just an additional pre-index check that the expected number of shards is available in the cluster (if the index is configured with 4 replicas, then the primary shard plus two replicas need to be available when set to quorum). However, this parameter never changes the behaviour that a write needs to be written to all available shards before returning to the client.
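You can actually watch this happening after a node rejoins: recovering shards show up as initializing and only become searchable once they are in sync. A small sketch using the cluster health and cat recovery APIs (the host is an assumption):

    # Watch shard recovery after a node rejoins: shards only serve searches
    # once they have finished recovering and are in sync again.
    import requests

    health = requests.get("http://localhost:9200/_cluster/health").json()
    print("cluster status:", health["status"],
          "- initializing shards:", health["initializing_shards"])

    recoveries = requests.get("http://localhost:9200/_cat/recovery",
                              params={"format": "json", "active_only": "true"}).json()
    for rec in recoveries:
        print(rec["index"], rec["shard"], rec["type"], rec["stage"])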

elasticsearch read timeout, seems to have too many shards?

I'm using both Elasticsearch 1.4.4 and 2.1.0 with a cluster of 5 hosts on AWS. In the config file, I've set the number of shards per index to 10.
So here comes the strange behavior: I create an index every day, and once there are about 400 shards or more, the whole cluster returns a read timeout when using the Bulk index API.
If I delete some indices, the timeout error disappears.
Has anyone met a similar problem? This is really a big obstacle to storing more data.
