Stormcrawler, the status index and re-crawling - elasticsearch

So we have StormCrawler running successfully, with the main index currently holding a little over 2 million URLs from our various websites. That is working well; however, SC does not seem to be re-indexing URLs it has indexed previously, and I am trying to work out why.
I have tried searching for details on exactly how SC chooses its next URL from the status index. It does not appear to pick the oldest nextFetchDate, because we have docs in the status index with a nextFetchDate of Feb 3rd, 2019.
Looking through the logs, I see entries like:
2019-03-20 09:21:17.221 c.d.s.e.p.AggregationSpout Thread-29-spout-executor[17 17] [INFO] [spout #5] Populating buffer with nextFetchDate <= 2019-03-20T09:21:17-04:00
and that seems to imply that SC does not look at any URL in the status index with a date in the past. Is that correct? If SC gets overwhelmed with a slew of URLs and cannot crawl all of them by their nextFetchDate, do some fall through the cracks?
Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.
It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.
How do I re-queue those URLs that were missed on their nextFetchDate?

By default, the ES spouts will get the oldest records. What the logs are showing does not contradict that: it asks for the records with a nextFetchDate lower than 20th March for the shard #5.
The nextFetchDate should actually be thought of as 'don't crawl before date D', nothing falls through the cracks.
Doing a query for documents in the status index with a nextFetchDate older than today, I see that 1.4 million of the 2 million URLs have a nextFetchDate in the past.
yep, that's normal.
It would be nice if the crawler could fetch the URL with the oldest nextFetchDate and start crawling there.
that's what it does
How do I re-queue those URLs that were missed on their nextFetchDate?
they are not missed. They should be picked by the spouts
Maybe check that the number of spouts matches the number of shards you have on the status index. Each spout instance is in charge of a shard; if you have fewer instances than shards, those shards will never be queried.
Inspect the logs for those particular URLs which should be fetched first: do they get sent by the spouts at all? You might need to turn the logs to DEBUG for that.
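If it helps, here is a rough way to check what a given spout instance should be seeing, by querying a single shard of the status index directly. This is only a diagnostic sketch (not the exact aggregation query the spout builds) and assumes the status index is named status and the @elastic/elasticsearch v7 client:

```typescript
// Diagnostic sketch: list the URLs a single spout instance (e.g. the one
// assigned to shard #5) should currently be eligible to emit.
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function eligibleUrlsForShard(shard: number) {
  const { body } = await client.search({
    index: "status",
    preference: `_shards:${shard}`, // restrict the search to one shard, as each spout does
    body: {
      query: { range: { nextFetchDate: { lte: "now" } } }, // "don't crawl before date D"
      sort: [{ nextFetchDate: { order: "asc" } }],         // oldest first
      size: 10,
    },
  });
  for (const hit of body.hits.hits) {
    console.log(hit._id, hit._source.nextFetchDate, hit._source.status);
  }
}

eligibleUrlsForShard(5).catch(console.error);
```

If URLs with very old nextFetchDate values show up here but never appear in the spout's DEBUG logs, the shard/spout mismatch above is the first thing to rule out.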

Related

How to keep track of elasticsearch requests

In my Elastic cluster I have 2 indices, and I need to keep track of the requests that come to them. For example, I have customer and product indices. When a new customer document is added to the customer index, I need to get the id of the added document and its body.
Another example: when a product document is updated, I also need the id of that product and its body, or what changed in that document.
My Elasticsearch version is 7.17.
(I am writing in node.js; if you have any code examples or a solution I would appreciate it.)
You can do this via the Elasticsearch slow log, by reducing the thresholds to 0 so it logs everything, or via some other proxy that intercepts the requests. Unfortunately Elasticsearch doesn't do this out of the box.
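For example, something along these lines drops the slow-log thresholds to zero on the two indices mentioned (ES 7.17, @elastic/elasticsearch v7 client); the exact threshold and source-size values are up to you:

```typescript
// Sketch: turn the slow logs into an "everything log" by dropping the trace
// thresholds to 0ms on the indices of interest.
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function logEverything() {
  await client.indices.putSettings({
    index: "customer,product",
    body: {
      // log every indexing operation, including up to 1000 chars of the source
      "index.indexing.slowlog.threshold.index.trace": "0ms",
      "index.indexing.slowlog.source": "1000",
      // log every search request
      "index.search.slowlog.threshold.query.trace": "0ms",
      "index.search.slowlog.threshold.fetch.trace": "0ms",
    },
  });
}

logEverything().catch(console.error);
```

Keep in mind the slow logs are written to each node's log files, so your node.js application would still have to tail and parse those files rather than receive the events directly.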

Elasticsearch: UpdateByQuery API response returns wrong status

I am facing an issue with the UpdateByQuery API while trying to update a document which doesn't exist in Elasticsearch.
Problem description
We are creating one index for each day, like test_index-2020.03.11, test_index-2020.03.12…, and we maintain eight days of indexes (today's plus the previous seven days).
When data arrives (read one by one or in bulk from a Kafka topic), we either need to update the document if one already exists with the given ID (it may be in any of the 8 daily indexes), or save it to the current day's index if it does not.
The solution I am currently trying when data arrives one by one:
Using UpdateByQuery with an inline script to update the doc
If BulkByScrollResponse returns an updated count of 0, then save the doc
Issues:
Even if the doc doesn't exist, I can still see BulkByScrollResponse returning a non-zero updated count (1, 2, 3, 4…), as follows:
BulkIndexByScrollResponse[sliceId=null,updated=1,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
Because of this, I am unable to trigger the document save request.
How should I approach this if a bulk of documents (with a set of different doc IDs) needs to be updated with their respective content in a single request? Will I be able to achieve that with UpdateByQuery?
Note: Considering the amount of data to be processed per hour we need to avoid multiple hits to Elasticsearch.
Doc ID is in the format of
str1:str2:Used:Sat Mar 14 23:34:39 IST 2020
But even if the doc doesn't exist, I can still see a non-zero updated count.
Adding a couple more points about the approach I am trying:
- In my case there is always only one doc to update per request, as I am trying to update the doc matching the given ID
- We have configured shards and replicas as
"number_of_shards": 10,
"number_of_replicas": 1
- We are going with this approach because we don't know in which index the actual doc resides
If at most one document matches, the updated field of the response should never be more than 1.
Here are a couple of the outputs I get as part of the response:
BulkIndexByScrollResponse[sliceId=null,updated=9,created=0,deleted=0,batches=1,versionConflicts=1,noops=0,retries=0,throttledUntil=0s]
BulkIndexByScrollResponse[sliceId=null,updated=10,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s]
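For reference, a minimal sketch of the update-then-save-if-zero flow described above, written with the Elasticsearch JavaScript client for brevity (the same flow applies to the Java client's UpdateByQueryRequest); the index pattern, date format and script are placeholders for the real ones. Since an ids query matches only the _id field, at most one document per daily index should ever match it:

```typescript
// Sketch: 1) try to update the doc wherever it lives across the daily indices,
// 2) if nothing was updated, index it into today's index.
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function upsertAcrossDailyIndices(id: string, doc: Record<string, unknown>) {
  const { body: resp } = await client.updateByQuery({
    index: "test_index-*",              // all eight daily indices (placeholder pattern)
    refresh: true,
    body: {
      query: { ids: { values: [id] } }, // match exactly this doc ID
      script: {
        lang: "painless",
        source: "ctx._source.putAll(params.doc)",
        params: { doc },
      },
    },
  });

  if (resp.updated === 0) {
    // the doc does not exist anywhere yet: save it into today's index
    const today = new Date().toISOString().slice(0, 10).replace(/-/g, ".");
    await client.index({ index: `test_index-${today}`, id, body: doc });
  }
  return resp;
}
```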

How to check whether an index is used for searching or indexing

I have a lot of Elasticsearch clusters which hold historical indices (more than 10 years old). Some of these indices were created anew with the latest settings and fields, but the old ones were not deleted.
Now I need to delete the old indices which are not receiving any search or indexing requests.
I've already looked at Elasticsearch Curator, but it does not work with older versions of ES.
Is there any API which gives the last time an index or search request was received? That would serve my purpose very well.
EDIT: I've also checked https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html but this also doesn't give the last time an indexing or search request came in; all it gives is the number of these requests since the last restart.
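Since _stats only exposes counters accumulated since the last restart, one workaround (a sketch, not a built-in feature) is to snapshot those counters periodically and treat indices whose totals never change as unused. Roughly, with the JavaScript client (for very old clusters the same two GET _stats calls can be made over plain HTTP):

```typescript
// Sketch: snapshot per-index search/indexing counters from _stats and compare
// them over time; indices whose totals never move receive no traffic.
// (Counters reset on node restart, so compare within a restart-free window.)
import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

type Counters = { search: number; indexing: number };

async function snapshot(): Promise<Record<string, Counters>> {
  const { body } = await client.indices.stats({ metric: ["search", "indexing"] });
  const out: Record<string, Counters> = {};
  for (const name of Object.keys(body.indices)) {
    const total = body.indices[name].total;
    out[name] = {
      search: total.search.query_total,
      indexing: total.indexing.index_total,
    };
  }
  return out;
}

async function findIdleIndices(intervalMs: number): Promise<string[]> {
  const before = await snapshot();
  await new Promise((resolve) => setTimeout(resolve, intervalMs));
  const after = await snapshot();
  return Object.keys(after).filter(
    (name) =>
      before[name] !== undefined &&
      after[name].search === before[name].search &&
      after[name].indexing === before[name].indexing
  );
}

// Compare snapshots 24 hours apart; anything listed saw no traffic in between.
findIdleIndices(24 * 60 * 60 * 1000).then(console.log).catch(console.error);
```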

Request Highlighting only for the final set of rows

In a multi-node Solr installation (without SolrCloud), in a paging scenario (e.g., start=1000, rows=200), the primary node asks for 1200 rows from each shard. If highlighting is ON, the primary node asks each shard to highlight all 1200 results, which doesn't scale well. Is there a way to break the shard query into two steps, e.g. ask for the 1200 rows, and after sorting the 1200 responses from each shard and finding the final rows to return (1001 to 1200), issue another query to the shards asking for the highlighted response for only the relevant docs?
So, it turns out that Solr's behavior changed between my old version and 6.6, although after the initial investigation this seems like a bug to me.
I found that if I specify fl=* in the query then it does the right thing (a two-pass process, as it used to in Solr 4.5). However, my queries use fl=id+score, in which case the shards are asked to highlight all the results on the first request (and there is no second request).
The fl=* query (in my sample case) finishes in 100 msec, while the same query with fl=id+score takes 1200 msec.
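For anyone wanting to reproduce the difference, a quick comparison of the two variants against the distributed endpoint looks roughly like this (collection name, query string and highlight field are placeholders; assumes Node 18+ for the global fetch):

```typescript
// Sketch: time the same deep-paging, highlighted query with fl=* vs fl=id,score.
async function timeQuery(fl: string): Promise<number> {
  const params = new URLSearchParams({
    q: "some query",
    start: "1000",
    rows: "200",
    fl,
    hl: "true",
    "hl.fl": "content", // whichever field highlighting runs on
    wt: "json",
  });
  const started = Date.now();
  const res = await fetch(`http://solr-host:8983/solr/mycollection/select?${params}`);
  await res.json();
  return Date.now() - started;
}

async function main() {
  console.log("fl=*        :", await timeQuery("*"), "ms");
  console.log("fl=id,score :", await timeQuery("id,score"), "ms");
}

main().catch(console.error);
```

The QTime field in the response header could be used instead of wall-clock time if network overhead is a concern.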

How A Solr Request Takes Too Much Time

I have a 3-node Solr data center. I am trying to redirect all queries to node1 using the Solr HTTP API, because I think I have problems with node2 and node3 and I will replace them. I enabled the DataStax Solr slow query metric. I see two main problems.
Even though I set shard.shuffling.strategy=host, and the documentation says:
host
Shards are selected based on the host that received the query.
I expect that when I request http://node1:8983/solr/...., the coordinator_ip and node_ip columns in the solr_slow_sub_query_log table will be the same. When I get the records, I see that only 80% are node1. Isn't that wrong? I expect 100% of requests to use node1.
When I get records from solr_slow_sub_query_log, I see that rows with coordinator_id=node1 and node_ip=node2 or node3 have very high elapsed_millis, such as 1300 seconds, even though the documentation says netty_client_request_timeout is 60 seconds.
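To quantify the shard-distribution observation, the slow-query table can be aggregated client-side; a rough sketch with the DataStax Node.js driver, assuming the table lives in the dse_perf keyspace and using placeholder contact point, data center and IP values:

```typescript
// Sketch: count how many slow sub-queries coordinated by node1 were actually
// served by each node_ip.
import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["node1"],    // placeholder
  localDataCenter: "SearchDC", // placeholder
});

async function subQueryDistribution() {
  const rs = await client.execute(
    "SELECT coordinator_ip, node_ip, elapsed_millis FROM dse_perf.solr_slow_sub_query_log"
  );
  const byNode = new Map<string, number>();
  for (const row of rs.rows) {
    if (String(row.get("coordinator_ip")) !== "10.0.0.1") continue; // node1's IP, placeholder
    const node = String(row.get("node_ip"));
    byNode.set(node, (byNode.get(node) ?? 0) + 1);
  }
  console.log(byNode);
  await client.shutdown();
}

subQueryDistribution().catch(console.error);
```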
