How to create an index from search results, all on the server? - elasticsearch

I will be getting documents from a filtered query (quite a lot of documents). I will then immediately create an index from them (in Python, using requests to directly query the REST API), without any modification.
Is it possible to make this operation directly on the server, without the round-trip of data to the script and back?
Another question was similar (in the intent) and the only answer is to go via Logstash (equivalent to using my code, though possibly more efficient)

refer http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
in short what you need to do is
0.) ensure you have _source set to true
1.) use scan and scroll API , pass your filtered query with search type scan,
2.)fetch documents using scroll id
2.) bulk index the result using the source field which returns you the json used to index data
refer:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
guide/en/elasticsearch/guide/current/bulk.html
guide/en/elasticsearch/guide/current/reindex.html

es 2.3 has an experimental feature that allows reindex from a query
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Related

Query ElasticSearch after the index operation

I have the eservice A that executes some text processing. After it, service B has to execute some set of Elasticsearch queries on the document. The connectivity between the services provided by Kafka. The solution is tightly coupled to ES free text search capabilities, so I can't query in another way.
Possible solution:
To store the document in ES and query it. The problem is that ES is eventually consistent and I don't know if the document already indexed or not.
Is there some API to ensure that the document is already indexed?
Another option is to publish a message from service A with delay X+5 seconds, where X is the refresh interval of the index, where the document should be stored. Seems to me an unreliable solution. What do you think?
Another direction that I thought about, is some way to query the document with ES queries where the document is in memory. For example, if I will have some magic way to convert the ES query to Luciene DSL, so I don't need to deal with the eventual consistent behavior of Elasticsearch and I can query Lucine directly.
Maybe there are some other solutions?
take a look at the ?refresh flag so that an indexing request will only return once a refresh has happened. otherwise you can use the GET API to see if the document exists or not
however there is no magic options here, Elasticsearch is eventually consistent and you need to factor that in

Using stored_fields for retrieving a subset of the fields in Elastic Search

The documentation and recommendation for using stored_fields feature in ElasticSearch has been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Where as in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance in using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?
source filtering is the recommended way to fetch the fields and you are getting confused due to the blog, but you seem to miss the very important concept and use-case where it is applicable. Please read the below statement carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, elasticsearch returns only 10 hits/search results which can be changed based on the size parameter and if in your search results, you want to fetch few fields value than using source_filter makes perfect sense as it's done on the final result set(not all the documents matching search results),
While if you use the script, and using source value try to read field-value and filter the search result, this will cause queries to scan all the index which is the second part of the above-mentioned statement(not when processing millions of documents.)
Apart from the above, as all the field values are already stored as part of _source field which is enabled by default, you need not allocate extra space if you explicitly mark few fields as stored(disabled by default to save the index size) to retrieve field-values.

Set up Elasticsearch suggesters that can return suggestions from different data types

We're in the process of setting up Amazon Elasticsearch Service (running Elasticsearch version 2.3).
We have different types of data (that I'm currently thinking of as different document types within the same index).
We have a generic search in an app where we want an inline autocomplete function, that is, a completion suggester returning hits from all different data (document) types. How can that be set up?
When querying suggesters you have to specify an index, so that's why I wanted to keep all the data in the same index. According to the documentation, the completion suggester considers all documents in the index.
Setting up the completion suggester for the first document type was pretty straight forward and is working great. However, as far as I can see you to specify a suggest field when querying. That would be all good hadn't it been for the error message we get when setting up the mapping for the second document type:
Type: illegal_argument_exception Reason: "[suggest] is defined as an object in mapping [name_of_document_type] but this name is already used for a field in other types"
Writing this question I see that it's possible to specify more than one suggester in a single suggest query. Maybe that is what we have to solve it? (I.e. get X results from Y suggesters where we compare the scores to get the 1 suggestion we want to present to the user.)
One of the core principles of good data design for Elasticsearch (as with many data stores) is to optimise your data storage for ease of reading. Usually, this means embracing duplication.
With this in mind, I'd suggest having a separate autocomplete index with a mapping that's designed specifically for the suggester queries.
Whenever you insert or write one of your other documents, map it to your autocomplete type and add or update it in your autocomplete index at the same time (or, depending on how up-to-date it needs to be, create an offline process to update your autocomplete index e.g. every day).
Then, when you do your suggest query, you can just use your autocomplete index and not worry about dealing with different types of documents with different fields.

Using ElasticSearch Bulk to update and create documents dynamically?

I'm currently using elasticsearch and running a cron job every 10 minutes that will find newly created/updated data from my DB and sync it with elasticsearch. However, I want to use bulk to sync instead of making and arbitrary amount of requests to update/create documents in an index. I'm using the elasticsearch.js library created by elasticsearch.
I face 2 challenges that I'm uncertain about how to handle:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
The best option when trying to stream in data from an SQL database is to use Logstash's JDBC Input to do it for you (the documentation). This can hopefully just do it all for you.
Not all SQL schemes make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index, is the most commonly used option. It means that you want to index (the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id, then it will replace it. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual option that they're associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail it if is does exist".
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.

Do elasticsearch queries touch the DB?

Just starting to use elasticsearch with haystack in django using postgres, and I'm pretty happy with it so far.
I'm wondering if the search queries (filters) through ES will submit a query to the DB or do they use data gathered during indexing?
Given that I can delete the data in the DB and still search, the answer seems to be yes, the queries do not touch the DB but only touch the index.
Also, I found this documentation on the matter:
http://django-haystack.readthedocs.org/en/latest/best_practices.html#avoid-hitting-the-database
Further, this is also from the docs:
For example, one great way to leverage this is to pre-rendering an
object’s search result template DURING indexing. You define an
additional field, render a template with it and it follows the main
indexed record into the index. Then, when that record is pulled when
it matches a query, you can simply display the contents of that field,
which avoids the database hit.:

Resources