Elasticsearch - Test new analyzers against an existing data set

New to Elasticsearch.
I need to update an index so that plurals and singulars are treated as matches. So green apple should match green apples as well (and vice versa).
Through my research, I understand I need to recreate the index with a stemmer filter.
So:
"analysis": {
"analyzer": {
"std_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stemmer" ]
}
}
}
Can anyone confirm if the above is correct? If not, what will I need to use?
I also understand that I cannot modify the existing index, but rather I will need to create a new one with this analyzer, and then re-add all the documents to the new index. Is that correct? If so, is there a shortcut or easy way to tell it to "add all documents from index X to new index Y?"
Thank you for your help

Please find my answers inline:
In most cases it should work, but it is really difficult to cover every future use-case, and in your case we do not even know your current ones. You can use the Analyze API to test some of your use-cases before pushing these analyzer-related changes to production.
Adding or changing an analyzer is a breaking change, since it controls how tokens are generated and stored in Elasticsearch's inverted index. You therefore have to reindex all the documents with the updated analyzer setting; you can use the reindex API together with an alias to do it with zero downtime.
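As a quick sketch of that workflow (the index names my_new_index and my_old_index are placeholders, and I am assuming the std_analyzer above has been defined on the new index), the Analyze API lets you see the tokens the analyzer would produce before you commit to it:

POST my_new_index/_analyze
{
  "analyzer": "std_analyzer",
  "text": "green apples"
}

With the default English stemmer, green apple and green apples should reduce to the same stems, which is what makes the plural/singular matching work. Once the tokens look right, the existing documents can be copied across with the reindex API:

POST _reindex
{
  "source": { "index": "my_old_index" },
  "dest": { "index": "my_new_index" }
}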

Related

Elasticsearch: How does search work when using combination of analyzers?

I'm a novice to Elasticsearch (ES), messing around with analyzers. As the documentation states, an analyzer can be specified at "index time" and at "search time", depending on the use case.
My document has a text field title, and I have defined the following mapping that introduces a sub-field custom:
PUT index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "custom": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}
So if I have the text "email-id is someid#someprovider.com", the standard analyzer would analyze it into the following tokens during indexing:
[email, id, is, someid, someprovider.com].
However, whenever I query the title.custom field (with different variations of query terms), I get no hits.
This is what I think is happening when I query with the keyword email:
It gets analyzed by the keyword analyzer.
The field title.custom's value is also analyzed by the keyword analyzer (analysis on the tokens), resulting in the same set of tokens as mentioned earlier.
An exact match should happen on the email token, returning the document.
Clearly this is not the case and there are gaps in my understanding.
I would like to know what exactly is happening during search.
At a generic level, I would like to know how analysis and search happen when a combination of search and index analyzers is specified.
search_analyzer is set to "keyword" for title.custom, making the whole string work as a single search keyword.
So, in order to get a match on title.custom, you need to search for "email-id is someid#someprovider.com", not just a part of it.
search_analyzer is applied at search time to override the default behavior of the analyzer applied at indexing time.
Good question. To keep it simple, let me explain the different cases one by one:
Analyzers play a role based on three points:
1. The type of query (a match query is analyzed, while a term query is not).
2. By default, an analyzed query such as match applies the same analyzer to the search term that was used on the field at index time.
3. If you override the default behavior by specifying a search_analyzer on a field, then at query time that analyzer is used to create the query tokens, which are matched against the tokens generated at index time (the standard analyzer is the default).
Now, using the above three points and the Explain API, you can figure out what is happening in your case.
Let me know if you need further information and I would be happy to explain further.
Reading up on the difference between match and term queries, and using the Analyze API to see the tokens, will be helpful as well.
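As a quick, hedged illustration (the index and field names follow the question, and the text is the one from the question), the Analyze API shows both sides of the comparison:

GET index/_analyze
{
  "field": "title.custom",
  "text": "email-id is someid#someprovider.com"
}

GET index/_analyze
{
  "analyzer": "keyword",
  "text": "email"
}

The first request returns the tokens that were indexed for title.custom (its index-time analyzer is standard), while the second returns the single token the keyword search analyzer produces for a given query string. Comparing the two outputs, together with the Explain API on a real query, shows exactly which tokens are being matched against each other.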

Elasticsearch: Set type of field while reindexing? (can it be done with _reindex alone)

Question: Can the Elasticsearch _reindex API be used to set/reset the "field datatypes" of fields that are copied through it?
This question comes from looking at Elastics docs for reindex: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/docs-reindex.html
Those docs show the _reindex API can modify things while they are being copied. They give the example of changing a field name:
POST _reindex
{
  "source": {
    "index": "from-index"
  },
  "dest": {
    "index": "new-index"
  },
  "script": {
    "source": "ctx._source['New-field-name'] = ctx._source.remove('field-to-change-name-of')"
  }
}
The script clause will cause "new-index" to have a field called New-field-name instead of the field named field-to-change-name-of from "from-index".
The documentation implies there is a great deal of flexibility available in the "script" functionality, but it's not clear to me whether that includes projecting datatypes (for instance, quoting data to turn it into strings/text/keywords, and/or treating things as literals to attempt to turn string data into non-strings (obviously fraught with danger)).
If setting the datatypes in a _reindex is possible, I'm not assuming it will be efficient and/or be without (perhaps harsh) limits - I just want to better understand the limits of the _reindex functionality (and figure out whether I can force a datatype in just one interaction, instead of setting the mapping on the new index before I run the reindex command).
(P.S. I happen to be working on Elasticsearch 6.2, but I think my question holds for all versions that have had the _reindex API (sounds like everything 2.3.0 and greater).)
Maybe you are confusing some terms. The part of the documentation you are pointing at refers to the metadata associated with a document; in this case the _type meta field just tells Elasticsearch that a particular document belongs to a specific type (e.g. a user type), and it is not related to the datatype of a field (e.g. integer or boolean).
If you want to set/reset the mapping of particular fields, you don't even need to use scripting, depending on your case. You just have to create the destination index with the new mapping and execute the _reindex API.
But if you want to change the mapping between incompatible values (e.g. a non-numerical string into a field with an "integer" datatype), you will need to do some transformation through scripting or through an ingest node.
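A minimal sketch of that workflow on 6.x, assuming a hypothetical price field that is currently mapped as text in from-index and should become an integer in new-index (the field name, values, and the _doc mapping type are all assumptions): first create the destination index with the desired mapping, then reindex, converting the value in a script where necessary.

PUT new-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "price": { "type": "integer" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "from-index" },
  "dest": { "index": "new-index", "type": "_doc" },
  "script": {
    "source": "ctx._source.price = Integer.parseInt(ctx._source.price.toString())"
  }
}

Note that it is the mapping on the destination index that actually controls the datatype; the script is only there to massage values that would not otherwise parse.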

difference between a field and the field.keyword

If I add a document with several fields to an Elasticsearch index and then view it in Kibana, I see each field twice. One of them will be called
some_field
and the other one will be called
some_field.keyword
Where does this behaviour come from and what is the difference between both of them?
PS: one of them is aggregatable (not sure what that means) and the other (without keyword) is not.
Update: A short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored, and allows for free-text searches on one or more words in the field. The .keyword field takes the same input and keeps it as one large string, meaning it can be aggregated on, and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resemble a SQL GROUP BY if you are familiar with that. In Kibana you would probably use the .keyword field with aggregations to count distinct values etc.
Please take a look at this article about text vs. keyword.
Briefly: since Elasticsearch 5.0 the string type has been replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, for a simple document with a string field:
{
  "some_field": "string value"
}
the following dynamic mapping will be created:
{
  "some_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, you can perform full-text search on some_field, and exact keyword search and aggregations using the some_field.keyword field.
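A brief, hedged illustration of that difference (the index name my_index is a placeholder; the document is the one above): a full-text match on the text field versus a terms aggregation on the keyword sub-field.

GET my_index/_search
{
  "query": {
    "match": { "some_field": "string" }
  }
}

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "terms": { "field": "some_field.keyword" }
    }
  }
}

The match query matches on individual tokens (here the token string), while the aggregation buckets documents by the exact, unanalyzed value "string value".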
I hope this answers your question.
Look at this issue; it contains some explanation relevant to your question. Roughly speaking, some_field is analyzed and can be used for full-text search. On the other hand, some_field.keyword is not analyzed and can be used in term queries or in aggregations.
I will try to answer your questions one by one.
Where does this behavior come from?
It was introduced in Elasticsearch 5.0.
What is the difference between the two?
some_field is used for full-text search and some_field.keyword is used for keyword (exact) searching.
Full-text searching is used when we want individual tokens of a field's value to be searchable. For instance, if you are searching for all the hotel names that have "farm" in them, such as hay farm house, Windy harbour farm house, etc.
Keyword searching is used when we want to match the whole value of the field, not individual tokens from it. For example, suppose you are indexing documents with a city field. Aggregating on the analyzed field would give separate counts for "new" and "york" instead of a single "new york" bucket, which is usually what you actually want.
From Elasticsearch 5.0 onwards, strings are mapped as both keyword and text by default.
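To make the city example concrete, here is a hedged sketch (the index name cities and a document with "city": "new york" are assumptions). An exact term query for "new york" on the analyzed city field finds nothing, because the index only contains the separate tokens new and york:

GET cities/_search
{
  "query": {
    "term": { "city": "new york" }
  }
}

The same query against the unanalyzed sub-field matches, because city.keyword stores the whole value as a single term:

GET cities/_search
{
  "query": {
    "term": { "city.keyword": "new york" }
  }
}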

ElasticSearch Index Modeling

I am new to Elasticsearch (you will figure that out after reading the question!) and I need help designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for companies in the Russell 2000 Index. To define an index for these companies, I have the following mapping:
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "text" },
        "name": { "type": "text" },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
As the CEO of a company changes, I want to update the end date on the existing document and add a new document with the new start date.
Here,
(1) For such a dataset, what is an ideal ID scheme? Since I want to keep multiple documents per company, should I consider a (company_id + date) combination as the ID?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. Keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar; for example, text is lowercased and split into words, and some analyzers drop common words like "the" or strip word endings like "ing". Depending on how you want to search for names in your index, you may also want to switch that field to keyword. Also note that you have the option of indexing a field twice, using BOTH keyword and text, if you need to support both search methods; a sketch of that is shown below.
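A hedged sketch of Recommendation 1 applied to the proposed mapping (only the relevant fields are shown; indexing name both ways is optional):

{
  "mappings": {
    "company": {
      "properties": {
        "ticker": { "type": "keyword" },
        "name": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        }
      }
    }
  }
}

With this mapping, a term query on ticker does exact matching, while name still supports full-text search and name.keyword supports exact matching and aggregations.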
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get much more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). I also find that loading/updating data in ES is slower than in an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then an ID that combines the company identifier and the date would probably work fine, as in the sketch below.
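A hedged example of that scheme (the index name, ticker, and dates are all made up), using the ticker plus the start date of the CEO's tenure as the document _id:

PUT companies/company/ABCD_2020-01-01
{
  "ticker": "ABCD",
  "name": "Example Corp",
  "CEO": "Jane Doe",
  "CEO_start_date": "2020-01-01"
}

When the CEO changes, you would set CEO_end_date on this document and index a new one, e.g. with the _id ABCD_2023-06-01, so every tenure keeps its own document.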
Question 2: Time based index
Time based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes to 2000 companies you probably won't have very many events. I would probably skip them since it adds a little bit of complexity that doesn't buy you much in this use case.

Is there a need for a boolean field index in elasticsearch

I am trying to optimize my elasticsearch.
I have several boolean fields that I use in queries.
I could dispense with them, but that would give my client side a hard time.
My question is whether or not setting those fields to "index":"yes" will actually have a significant negative effect on my index's performance, such as indexing time and size (other than the obvious "store" space it would take)?
Does a boolean indexed field really take up more space? It seems like it shouldn't. Moreover, I don't see any benefit in creating such an index in any DB, not only Elasticsearch.
But, I have to specify "index":"yes" to be able to filter by it, right?
If you want to search against a field you have to index it. By default a boolean field is indexed, and will take a small amount of space to do so. There will be a list of docs where "myfield": true and "myfield": false.
If you didn't want to maintain this index, then when you wanted to find docs where "myfield": true you would have to go through every doc to check the field.
If you don't want to search/filter with that field, by all means set "index": "no", just be warned you will need to re-index everything if you change your mind about this field in the future!
Have a look at the Elasticsearch docs on mappings; in the core types section, scroll down to the boolean type.
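As a hedged sketch of both options (the index name flags and the field names are made up, and the syntax shown is for recent Elasticsearch versions, which use "index": false rather than "index": "no"):

PUT flags
{
  "mappings": {
    "properties": {
      "is_active": { "type": "boolean" },
      "is_archived": { "type": "boolean", "index": false }
    }
  }
}

GET flags/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "is_active": true } }
      ]
    }
  }
}

Here is_active remains filterable, while is_archived is kept in _source but is not indexed for search; changing your mind about that later means reindexing.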
