Signalling end of indexing in elastic for ReactiveElasticsearchRepository?

I insert multiple documents using ReactiveElasticsearchRepository.saveAll.
However, when I search immediately afterwards using findByName on my repository, I get no results.
Is it possible to somehow wait for the indexing process to finish, or to query its progress?
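For background, Elasticsearch searches are near-real-time: documents only become visible to search after the index has been refreshed (by default, roughly every second), which is likely why the immediate findByName returns nothing. A minimal sketch of forcing a refresh through the REST API, assuming a local cluster and a hypothetical my-index index name:

```python
import requests

# Force a refresh so that everything indexed so far becomes searchable
requests.post("http://localhost:9200/my-index/_refresh")
```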

Related

How about including a JSON doc version? Is it possible for Elasticsearch to include different versions of JSON docs, to save and to search?

We are using Elasticsearch to save and manage information on complex transactions. We might need to add more information for every transaction in the near future.
How about including a JSON doc version?
Is it possible for Elasticsearch to include different versions of JSON docs, to save and to search?
How does this affect performance in Elasticsearch?
It's completely possible. By default, Elasticsearch uses dynamic mapping to index every new document, including your JSON documents. For each field in your documents, Elasticsearch builds a structure called an inverted index, and search queries are executed against those structures. So regardless of the variation in fields between versions, as long as you know which field you want to query, data throughput and performance will not be affected.
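A minimal sketch of this behaviour with Python and requests, assuming a local cluster and a hypothetical tx index: two versions of a transaction document with different fields are indexed, and a query on a field shared by both versions matches both documents.

```python
import requests

ES = "http://localhost:9200"

# Version 1 of the transaction document
requests.put(f"{ES}/tx/_doc/1",
             json={"version": 1, "account": "a-42", "amount": 10})

# Version 2 adds a field; dynamic mapping indexes it automatically
requests.put(f"{ES}/tx/_doc/2",
             json={"version": 2, "account": "a-42", "amount": 7, "currency": "EUR"})

# Make both documents searchable, then query a field shared by both versions
requests.post(f"{ES}/tx/_refresh")
hits = requests.get(f"{ES}/tx/_search",
                    json={"query": {"match": {"account": "a-42"}}}).json()
print(hits["hits"]["total"])  # both versions match
```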

Stormcrawler - how does the es.status.filterQuery work?

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with a variety of statuses - DISCOVERED, FETCHED, ERROR, etc.
I was wondering if I could tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have the es-conf.yaml set as follows:
es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"
Is that correct? How does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?
See the code of the AggregationSpout.

"how does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?"

Yes, it filters the queries sent to the ES shards. This is useful, for instance, to process a subset of a crawl.
Note that it is a positive filter, i.e. the documents must match the query in order to be retrieved; you'd need to remove the leading - for it to do what you described, as in the corrected entry below.
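A minimal sketch of the corrected es-conf.yaml entry, assuming the goal is to fetch only https URLs with status DISCOVERED:

```yaml
# Positive filter: only URLs matching this query are eligible for fetching
es.status.filterQuery: "url:https* AND status:DISCOVERED"
```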

Using a job to delete data from Elasticsearch asynchronously

I would like to delete data from Elasticsearch using the API (curl).
I would like to start the deletion process and later query the progress of the deletion.
Is it possible to use a job to do this?
I tried looking at the relevant documentation, but there are very few examples.
I would appreciate any relevant information or links.
You have two solutions:
Use the delete-by-query API with a range query, which you can then monitor using the Task API (see the sketch below)
Use daily indices (e.g. my-logs-2018-09-10, my-logs-2018-09-11, etc.), so that deleting data in the past is simply a matter of deleting the indices for the days you want to ditch. No need to monitor, as this happens near-instantaneously
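A minimal sketch of the first option with Python and requests, assuming a local cluster, a hypothetical my-logs index, and a timestamp field (all placeholder names):

```python
import requests

ES = "http://localhost:9200"

# Start the deletion asynchronously; wait_for_completion=false returns a task id
resp = requests.post(
    f"{ES}/my-logs/_delete_by_query",
    params={"wait_for_completion": "false"},
    json={"query": {"range": {"timestamp": {"lt": "now-30d"}}}},
).json()
task_id = resp["task"]

# Later: poll the Task API for the progress of the deletion
status = requests.get(f"{ES}/_tasks/{task_id}").json()
print(status["completed"], status["task"]["status"])
```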

How to create an index from search results, all on the server?

I will be getting documents from a filtered query (quite a lot of documents). I will then immediately create an index from them (in Python, using requests to directly query the REST API), without any modification.
Is it possible to make this operation directly on the server, without the round-trip of data to the script and back?
Another question was similar in intent, and the only answer was to go via Logstash (equivalent to using my code, though possibly more efficient).
Refer to http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
In short, what you need to do is:
1.) ensure you have _source set to true
2.) use the scan and scroll API, passing your filtered query with search type scan
3.) fetch the documents using the scroll id
4.) bulk index the results using the _source field, which returns the JSON used to index the data (a sketch follows the references below)
Refer to:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/bulk.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
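A rough sketch of those steps with Python and requests. Note that the scan search type from the old guide was removed in later Elasticsearch versions, so this uses the plain scroll API; the index names and query are placeholders:

```python
import json
import requests

ES = "http://localhost:9200"

# Open a scroll over the filtered query (_source is returned by default)
resp = requests.post(f"{ES}/source-index/_search",
                     params={"scroll": "2m"},
                     json={"size": 500,
                           "query": {"term": {"status": "active"}}}).json()

while resp["hits"]["hits"]:
    # Bulk index the batch into the new index, reusing _source verbatim
    lines = []
    for hit in resp["hits"]["hits"]:
        lines.append(json.dumps({"index": {"_index": "target-index",
                                           "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    requests.post(f"{ES}/_bulk",
                  data="\n".join(lines) + "\n",
                  headers={"Content-Type": "application/x-ndjson"})

    # Fetch the next batch with the scroll id
    resp = requests.post(f"{ES}/_search/scroll",
                         json={"scroll": "2m",
                               "scroll_id": resp["_scroll_id"]}).json()
```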
ES 2.3 has an experimental feature that allows reindexing from a query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
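With the reindex API the whole copy runs server side in a single call; a minimal sketch, again with placeholder index names and query:

```python
import requests

# Server-side copy: documents never round-trip through the client
requests.post("http://localhost:9200/_reindex",
              json={"source": {"index": "source-index",
                               "query": {"term": {"status": "active"}}},
                    "dest": {"index": "target-index"}})
```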

solr query- get results without scanning files

I would like to execute a Solr query and get only the uniqueKey I've defined.
The documents are very big, so defining fl='my_key' is not fast enough - all the matching documents are still scanned, and the query can take hours (even though the search itself is fast - numFound takes a few seconds to return).
I should mention that all the data is stored, and creating a new index is not an option.
One idea I had was to get the docIds of the results and map them to my_key in the code.
I used fl=[docid], thinking it wouldn't need to scan the documents to get this info, but it still takes too long to return.
Is there a better way to get the docIds?
Or a way to unstore certain fields without reindexing?
Or perhaps a completely different way to get the results without scanning all the fields?
Thanks,
Dafna
Sorry, but the only way is to break your gigantic documents into more than one. I don't see how it would be possible to match only the fields you specified and leave the documents alone. That is not how Lucene works.
You could make a document that uses only the indexed fields needed for querying to make the job easier, or break the document up based on the queries that are needed, or simply add other documents with the structure needed for these new queries. It's up to you.
