Best solution for changing settings in Elasticsearch

I have indexed two documents in Elasticsearch: FIRE_DETECTED and SMOKE:DETECTED.
Goal
I want to search with query = 'fire' -> result: FIRE_DETECTED
query = 'detected' -> result: FIRE_DETECTED and SMOKE:DETECTED
Some solutions
1. Add more settings to the analyzer
We need to create a new index with new settings (add the token filter word_delimiter_graph)
Reindex
Problem: how to change settings in production without affecting customers?
2. Add one more field, filtered_question, into Elasticsearch
Split the data on : and _
Save the split data in this filtered_question field
Problem: we need one more field
What is the best solution for this? (Suggest more solutions if needed.)

This is a very common scenario when working with Elasticsearch: as requirements keep changing, we have to change the way we index data in ES.
Both approaches you mention are used by companies, and each has its trade-offs; you have to choose the one that suits your requirements.
Changing or adding an analyzer requires the steps below:
Close the index.
Add/edit the analyzer definition.
Open the index.
Reindex all the documents (use an index alias to do this with zero downtime and minimize the impact on end users).
After step 4, your new searches will work.
Pros: it doesn't create new fields, so it saves space; it is the more efficient and cleaner way of making this change.
Cons: reindexing might take a long time, depending on the number of documents, and it is a comparatively complex process.
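The four steps above can be sketched as the request bodies involved. This is a minimal sketch, not the asker's actual setup: the index name `questions`, the filter name `split_words`, and the analyzer name `delimiter_analyzer` are all placeholders I chose for illustration.

```python
# Sketch of the close -> update settings -> open -> reindex flow.
# All index/analyzer/filter names are illustrative placeholders.

# Step 1: POST /questions/_close  (no body)

# Step 2: PUT /questions/_settings while the index is closed,
# adding an analyzer that uses the word_delimiter_graph token filter
# (it splits tokens like FIRE_DETECTED into "fire" and "detected").
settings_update = {
    "analysis": {
        "filter": {
            "split_words": {
                "type": "word_delimiter_graph",
            }
        },
        "analyzer": {
            "delimiter_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["split_words", "lowercase"],
            }
        },
    }
}

# Step 3: POST /questions/_open  (no body)

# Step 4: POST /_reindex into a fresh index (an alias can then be
# switched from the old index to the new one with zero downtime).
reindex_body = {
    "source": {"index": "questions"},
    "dest": {"index": "questions_v2"},
}
```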
Add a custom analyzer and then add a new field that uses it
This also requires closing/opening the index, unless you use a built-in analyzer. Only new or updated documents will have the new field, so searches using the new analyzer/logic will bring partial results at first; this may be fine depending on your use case.
Pros: relatively simpler approach, and it doesn't require a full reindex in all cases.
Cons: extra space if the old field is no longer used; complexity varies by use case.
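One way to sketch this second approach is a multi-field on the existing field, so the original data stays queryable as before while the sub-field applies the new analyzer. The field name `question`, sub-field name `split`, and analyzer name `delimiter_analyzer` are assumptions for illustration.

```python
# Sketch: PUT /questions/_mapping adding a sub-field analyzed with the
# custom analyzer; the original "question" field is left untouched.
# All names are illustrative placeholders.
mapping_update = {
    "properties": {
        "question": {
            "type": "text",
            "fields": {
                "split": {  # queried as question.split
                    "type": "text",
                    "analyzer": "delimiter_analyzer",
                }
            },
        }
    }
}
```

Existing documents only pick up the sub-field once they are reindexed or updated, which is exactly the "partial results" caveat mentioned above.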

If you don't want to change or add an analyzer, you can try a wildcard query, although the downside is performance.
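For completeness, a wildcard query over the existing field might look like the sketch below (the field name `question` is an assumption). Wildcards with a leading `*` scan many terms, which is why performance is the concern.

```python
# Sketch of a wildcard query that matches tokens containing "fire"
# without any reindexing; the field name is a placeholder.
wildcard_query = {
    "query": {
        "wildcard": {
            "question": {"value": "*fire*"}
        }
    }
}
```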


Is it possible to update indices in Elasticsearch with zero downtime?

I am trying to update the data in Elasticsearch indices with zero downtime, but I am not sure how to achieve this. Can anyone assist me with how I can do so?
For example: I have an index named my_es_index and I want to update the data in that particular index with zero downtime, so that the old data is still available while someone is running queries, while in parallel, in the background, we update the data on that index.
Is it possible to achieve? If yes, please help me with how I can proceed.
You build another index (call it the new index), switch from the old index to the new index, then delete the old index.
Read more at https://medium.com/craftsmenltd/rebuild-elasticsearch-index-without-downtime-168363829ea4
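The switch described above is usually done with an index alias, so queries keep hitting the alias name while the underlying index changes atomically. A minimal sketch of the alias-swap request body, with placeholder index/alias names:

```python
# Sketch: POST /_aliases with this body atomically moves the alias
# "my_es_index" from the old physical index to the new one.
# Index and alias names are illustrative placeholders.
alias_swap = {
    "actions": [
        {"remove": {"index": "my_es_index_v1", "alias": "my_es_index"}},
        {"add": {"index": "my_es_index_v2", "alias": "my_es_index"}},
    ]
}
```

Because both actions run in one request, there is no window in which the alias points at neither index, which is what gives you the zero downtime.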
Unless you need to update the mapping of an existing field and must preserve the field name, I don't think taking the cluster down is needed.
While the above article is a good read and can be treated as best practice, ES is a lot more flexible: unlike MySQL/SQL, it allows you to update existing documents in place.
Adding a new field
Let's call the new field to be added as x.
add mapping to the index for x.
make the code changes such that going forward, all the new documents have this new field x.
while all the new documents have the field x, write a script that updates the older documents and adds this field x.
once you are sure that all the documents have the field x, you may enable the feature you added this field for.
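The backfill step above can be sketched with an `_update_by_query` request that targets only documents still missing the field. The field name `x` comes from the answer; the Painless script and default value are my illustrative assumptions.

```python
# Sketch: POST /my_es_index/_update_by_query with this body backfills
# the new field x on older documents that don't have it yet.
# The script body and default value are illustrative.
backfill = {
    "script": {
        "source": "ctx._source.x = params.default_value",
        "lang": "painless",
        "params": {"default_value": "unknown"},
    },
    "query": {
        "bool": {"must_not": {"exists": {"field": "x"}}}
    },
}
```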
Updating mapping of a field
Let's again call the field to be updated as x (assuming the name of the field is not the prime concern).
create a new field, say new_x (add correct mapping to the index).
follow the above steps to ensure new_x has the data (slight change that we need to ensure both x and new_x have this data).
once all the documents in the index have the field new_x, simply refactor the code to use new_x instead of x.
While one might argue that the above two approaches are hacks of a sort, they save you the time, effort, and cost of booting up a new ES instance and managing the aliases.

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that querying all of it under one index would be rough. Each row is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and whether their results are relevant enough. If not, change the index mappings or queries you're using to achieve faster results, and if needed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
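The last two suggestions can be sketched in a single index-creation body: dated index names for time-based splitting, and `keyword` for exact-match fields. The field names below are illustrative, not from the question.

```python
# Sketch: PUT /index-2019-08-11 with this body. Field names are
# illustrative placeholders.
index_body = {
    "mappings": {
        "properties": {
            # keyword fields: exact matches only, no analysis
            "status": {"type": "keyword"},  # e.g. "draft", "review", "published"
            "tags": {"type": "keyword"},
            # text field: analyzed, full-text searchable
            "message": {"type": "text"},
        }
    }
}
```

With daily indices like this, Index Lifecycle Management can roll over and eventually delete old indices automatically, keeping recent-data queries fast.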
Good luck!

ElasticSearch multiple types with same mapping in single index

I am designing an e-Commerce site with multiple warehouse. All the warehouses have same set of products.
I am using ElasticSearch for my search engine.
There are 40 fields in each ES document. 20 of them will differ in value per warehouse; the remaining 20 will contain the same values for all warehouses.
I want to use multiple types (one type per warehouse) in one index. All of the types will have the same mapping. Please advise whether my approach is correct for such a scenario.
Few things not clear to me,
Will the inverted index be created only once for all types in same index?
If new type (new warehouse) is added in future how it will be merged with the previously stored data.
How it will impact the query time if I would have used only one type in one index.
Since all types are assigned to the same index, the inverted index will only be created once.
If a new type is added, its information is added to the existing inverted index as well: new terms are added to the index, pointers are added to existing terms, and doc values are added per newly inserted document.
I honestly can't answer that one, though it is simple to test in a proof of concept.
In a previous project I faced the same situation, implementing a search engine with Elasticsearch on a multi-shop platform. In that case we kept all shops in one type and applied the relevant filters when searching per shop. That said, the approach of separating shop data by "_type" seems pretty clean to me; we went the other way only because my implementation could already cover the feature request with filters.
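The "one type plus per-shop filter" approach described above can be sketched as a bool query. The field names `product_name` and `shop_id` and the example values are my assumptions, not from the question.

```python
# Sketch: search one shared mapping, restricting results to a single
# warehouse/shop with a term filter. Field names are placeholders.
per_shop_search = {
    "query": {
        "bool": {
            "must": {"match": {"product_name": "shoes"}},
            "filter": {"term": {"shop_id": "warehouse_42"}},
        }
    }
}
```

Using `filter` rather than `must` for the shop restriction means it doesn't affect scoring and can be cached.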
Cheers, Dominik

Set up Elasticsearch suggesters that can return suggestions from different data types

We're in the process of setting up Amazon Elasticsearch Service (running Elasticsearch version 2.3).
We have different types of data (that I'm currently thinking of as different document types within the same index).
We have a generic search in an app where we want an inline autocomplete function, that is, a completion suggester returning hits from all different data (document) types. How can that be set up?
When querying suggesters you have to specify an index, so that's why I wanted to keep all the data in the same index. According to the documentation, the completion suggester considers all documents in the index.
Setting up the completion suggester for the first document type was pretty straightforward and is working great. However, as far as I can see, you have to specify a suggest field when querying. That would all be fine had it not been for the error message we get when setting up the mapping for the second document type:
Type: illegal_argument_exception Reason: "[suggest] is defined as an object in mapping [name_of_document_type] but this name is already used for a field in other types"
Writing this question, I see that it's possible to specify more than one suggester in a single suggest query. Maybe that is how we have to solve it? (I.e., get X results from Y suggesters and compare the scores to get the one suggestion we want to present to the user.)
One of the core principles of good data design for Elasticsearch (as with many data stores) is to optimise your data storage for ease of reading. Usually, this means embracing duplication.
With this in mind, I'd suggest having a separate autocomplete index with a mapping that's designed specifically for the suggester queries.
Whenever you insert or write one of your other documents, map it to your autocomplete type and add or update it in your autocomplete index at the same time (or, depending on how up-to-date it needs to be, create an offline process to update your autocomplete index e.g. every day).
Then, when you do your suggest query, you can just use your autocomplete index and not worry about dealing with different types of documents with different fields.
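A sketch of the dedicated autocomplete index and the suggest query against it follows. All names (`suggest`, `source_type`, `autocomplete`) are placeholders, and note that the question targets Elasticsearch 2.3, whose suggest request syntax differs slightly from the modern form shown here.

```python
# Sketch: a separate index whose only job is autocomplete.
# Every document type maps into this one shape. Names are placeholders.
autocomplete_mapping = {
    "mappings": {
        "properties": {
            "suggest": {"type": "completion"},
            "source_type": {"type": "keyword"},  # where the entry came from
        }
    }
}

# Sketch of the suggest query against that index.
suggest_query = {
    "suggest": {
        "autocomplete": {
            "prefix": "fir",
            "completion": {"field": "suggest"},
        }
    }
}
```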

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that many operations for modifying indices require reindexing all documents, such as adding a field to all documents. From my understanding, this means retrieving each document, performing the desired operation, deleting the original document from the index, and indexing it again. This seems somewhat dangerous, and a backup of the original index seems preferable before doing it (obviously).
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or whether I should keep the raw documents that make up an index stored separately, to be able to recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note: this operation still means a full reindex of the document; it just removes some network round trips and reduces the chance of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
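A minimal sketch of such a scripted partial update follows; the field name `views`, the increment logic, and the Painless script are my illustrative assumptions, not from the quoted documentation.

```python
# Sketch: POST /my_index/_update/<doc_id> with this body increments a
# counter field, creating it if missing. Names are placeholders.
update_body = {
    "script": {
        "source": (
            "ctx._source.views = "
            "(ctx._source.containsKey('views') ? ctx._source.views : 0) "
            "+ params.inc"
        ),
        "lang": "painless",
        "params": {"inc": 1},
    }
}
```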
Using Elasticsearch as a final storage solution: it depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore, or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML, etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of the practice of backing up a search engine's index.
The usual scheme when using Elasticsearch or Solr or whatever search engine you have:
You have some kind of a data source (it could be a database, a legacy mainframe, Excel spreadsheets, a REST service with data, or whatever).
You have a search engine that indexes this data source to add search capability to your system. When the data source changes, you can reindex it fully, or index only the changed part with incremental indexing.
If something happens to the search engine index, you can easily reindex all your data.
