Difference between a field and the field.keyword in Elasticsearch

If I add a document with several fields to an Elasticsearch index, when I view it in Kibana I see each field twice. One of them is called
some_field
and the other one is called
some_field.keyword
Where does this behaviour come from, and what is the difference between the two?
PS: one of them is aggregatable (not sure what that means) and the other (the one without .keyword) is not.

Update: a short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored and allows for free-text searches on one or more words in the field. The .keyword field takes the same input but keeps it as one large string, meaning it can be aggregated on, and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resemble a SQL GROUP BY if you are familiar with that. In Kibana you would typically use the .keyword field with aggregations to count distinct values etc.
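For instance, a wildcard query only makes sense against the unanalyzed .keyword field; a minimal sketch (the index name my-index is made up):
GET my-index/_search
{
  "query": {
    "wildcard": { "some_field.keyword": "*value*" }
  }
}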
Please take a look at this article about text vs. keyword.
Briefly: as of Elasticsearch 5.0, the string type was replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, a simple document with a string field:
{
  "some_field": "string value"
}
will get the following dynamic mapping:
{
  "some_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, it is possible both to perform full-text search on some_field and to run keyword searches and aggregations using the some_field.keyword field.
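For example, both variants can be combined in a single request: a full-text match on the text field plus a terms aggregation on the keyword sub-field (a sketch; my-index is a made-up index name):
GET my-index/_search
{
  "query": {
    "match": { "some_field": "string" }
  },
  "aggs": {
    "distinct_values": {
      "terms": { "field": "some_field.keyword" }
    }
  }
}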
I hope this answers your question.

Look at this issue. There is some explanation of your question in it. Roughly speaking, some_field is analyzed and can be used for full-text search. On the other hand, some_field.keyword is not analyzed and can be used in term queries or in aggregations.

I will try to answer your questions one by one.
Where does this behavior come from?
It was introduced in Elasticsearch 5.0.
What is the difference between the two?
some_field is used for full-text search and some_field.keyword is used for keyword searching.
Full-text searching is used when we want individual tokens of a field's value to be included in the search. For instance, if you are searching for all the hotel names that have "farm" in them, such as hay farm house, Windy harbour farm house, etc.
Keyword searching is used when we want the whole value of the field to take part in the search, not individual tokens from it. For example, suppose you are indexing documents with a city field. Aggregating on an analyzed version of this field would give separate counts for "new" and "york" instead of a single count for "new york", which is usually the expected behavior.
From Elasticsearch 5.0 onwards, strings are mapped as both text and keyword by default.
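A term query against the keyword sub-field illustrates whole-value matching: it only hits documents whose field equals the exact string (a sketch; the index name cities is made up):
GET cities/_search
{
  "query": {
    "term": { "city.keyword": "new york" }
  }
}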

Related

Elasticsearch: How does search work when using combination of analyzers?

I'm a novice to Elasticsearch (ES), messing around with analyzers. As the documentation states, an analyzer can be specified at "index time" or "search time", depending on the use case.
My document has a text field title, and I have defined the following mapping that introduces a sub-field custom:
PUT index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "custom": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}
So if I have the text "email-id is someid#someprovider.com", the standard analyzer would analyze it into the following tokens during indexing:
[email, id, is, someid, someprovider.com]
However, whenever I try to query on the field title.custom (with different variations of query terms), I get no hits.
This is what I think is happening when I query with the keyword email:
It gets analyzed by the keyword analyzer.
The field title.custom's value is also analyzed by the keyword analyzer (analysis on tokens), resulting in the same set of tokens as mentioned earlier.
An exact match should happen on the email token, returning the document.
Clearly this is not the case, and there are gaps in my understanding.
I would like to know what exactly is happening during search.
On a generic level, I would like to know how analysis and search happen when a combination of search and index analyzers is specified.
search_analyzer is set to "keyword" for title.custom, which makes the whole search string work as a single keyword token.
So, in order to get a match on title.custom, you need to search for "email-id is someid#someprovider.com", not for a part of it.
search_analyzer is applied at search time and overrides the default behavior of reusing the index-time analyzer for the query.
Good question, but to keep it simple let me explain the different cases one by one.
The role an analyzer plays depends on:
The type of query (match is an analyzed query, while term is not analyzed).
By default, an analyzed query such as match applies the same analyzer to the search term that was used on the field at index time.
If you override that default by specifying a search_analyzer on a field, then at query time that analyzer is used to create the tokens, which are matched against the tokens generated at index time (standard is the default analyzer).
Now, using the above three points and the Explain API, you can figure out what is happening in your case.
Let me know if you need further information and I would be happy to explain further.
The match vs. term query difference and the Analyze API (to see the generated tokens) will be helpful as well.
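For instance, the Analyze API shows exactly which tokens each analyzer emits, which makes mismatches like this easy to spot (a minimal sketch):
POST _analyze
{
  "analyzer": "standard",
  "text": "email-id is someid#someprovider.com"
}
POST _analyze
{
  "analyzer": "keyword",
  "text": "email-id is someid#someprovider.com"
}
The first call returns the tokens [email, id, is, someid, someprovider.com]; the second returns the entire string as a single token.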

How to query for alternative spellings and representations of words in elasticsearch?

I'm using elasticsearch to query on the theme field in documents. For example:
[
  { theme: 'landcover' },
  { theme: 'land cover' },
  { theme: 'land-cover' },
  etc
]
I would like a search for the term landcover to match all of these documents. How do I do this?
So far I've tried using the fuzziness parameter in a match query, and also a fuzzy query. Neither of these approaches seems to work, which surprised me, because my understanding of fuzzy searches is that they provide a means of inexact matching.
What am I missing? From the docs I see that fuzziness definitely looks for close approximations to a search term:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
I would consider 'landcover' and 'land cover' to be close. Is this not the case? (This is the first I have heard of the Levenshtein edit distance, so I don't know what extra or missing characters mean in terms of this measurement.)
An example of a match query where this doesn't seem to work:
{
  query: {
    match: {
      'theme': {
        query: 'landcover',
        fuzziness: 'AUTO' // I've tried 2, '2', 6, '6', etc.
      },
    },
  },
}
// When the term is 'land-cover' and fuzziness is auto, then 'land cover' is matched. But 'landcover' is not.
And an example of a fuzzy query that doesn't seem to work:
{
  query: {
    fuzzy: {
      'theme': {
        value: query,
        fuzziness: 'AUTO', // Tried other values
      },
    },
  },
}
// When the term is 'land-cover' and fuzziness is auto, then 'landcover' is matched. But 'land cover' is not. So it works almost opposite to the match query in this regard.
(NOTE: these queries are converted to JSON, do run, and return sensible results; just the fuzziness doesn't seem to work as I would have expected.)
Looking around StackOverflow, I see some questions that seem to indicate that querying an index is in some way related to how the index was created, i.e. that I cannot just run ad-hoc queries on any existing index and expect results. Is this correct? (Sorry, I'm new to Elasticsearch and I'm querying an index that already exists.)
This answer seems related (how to find near matches for a search term): https://stackoverflow.com/a/55772800/3114742 - it mentions that I should do something referred to as 'field mapping' prior to indexing data, but then the example query doesn't include the fuzziness operator. So in this case I'm confused as to what the fuzziness operator is actually for.
Looking more into the documentation I've found the following:
Elasticsearch uses the concept of an 'index' rather than a database, but from the perspective of someone familiar with CouchDB and MongoDB (both JSON stores), there is definitely some similarity between a CouchDB database and an Elasticsearch index. Note that the Elasticsearch index is not an authoritative data store in itself (it is 'built' from a source of data).
For a given index called, for example, my-index, you can insert JSON strings (documents) into my-index by PUTting to Elasticsearch:
PUT /... '{... json string ...}'
The JSON string can come directly from a JSON store (Mongo, Couch, etc.) or be cobbled together from a variety of sources. I guess.
Elasticsearch will process the document on insert and append to the inverted index. For text fields this means K:V pairs will be created from the JSON document's text, with the keys being fragments of the text and the values being references to where that text fragment is found in the source (the JSON document).
In other words, when inserting documents into an Elasticsearch index, the content is 'analyzed' to create K:V pairs that are added to the index.
I guess, then, that searching Elasticsearch means looking up search terms that are keys in the index, comparing the values (the source of the key) to the source defined in the search (I think), and returning the source documents where a search term is present for a particular field.
So:
Text is analyzed on insertion into an index
Queries are analyzed (using the same analyzer that was used to build the index)
So in my case (as mentioned above) the default analyzer is good enough for basic fuzzy matching (i.e. in the match query, "land-cover" is matched to "land cover", and in the fuzzy query, "land-cover" is matched to "landcover" - I have no idea why these match differently!)
But to improve the search results, I think I need to adjust the analyzer/tokenizer both when inserting documents into an index and when parsing queries to apply to it.
My understanding of analysis/tokenization is that it is the configuration by which inverted indexes are built from source documents, i.e. it defines what the keys of the inverted index will be. As far as I can tell there is no magic in searching the index: search terms have to match keys in the inverted index, otherwise there will be no results.
I'm still not sure what fuzziness is actually doing in this context.
So in short, querying Elasticsearch seems to require a 'holistic perspective' over both how source data is indexed and how queries are designed.
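Following that idea, one possible approach (untested, and entirely my own sketch with made-up names) is a sub-field whose analyzer strips spaces and hyphens before tokenizing, so that all three spellings collapse to the same token:
PUT themes
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_joiners": {
          "type": "pattern_replace",
          "pattern": "[-\\s]+",
          "replacement": ""
        }
      },
      "analyzer": {
        "joined": {
          "type": "custom",
          "char_filter": ["strip_joiners"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "theme": {
        "type": "text",
        "fields": {
          "joined": { "type": "text", "analyzer": "joined" }
        }
      }
    }
  }
}
With that mapping, a plain match on theme.joined for landcover should hit 'landcover', 'land cover', and 'land-cover' alike, since each is indexed as the single token landcover (reasonable for short tag-like values, too aggressive for long text).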
As a disclaimer, though, I'm not exactly an authoritative source on this subject with less than one day of Elasticsearch experience, so a better answer would still be appreciated!

Changing field properties

I am using packetbeat to monitor the mysql port 3306 and it is working very well.
I can easily search for any word on the Discover tab. For example:
method:SET
This works as expected. But if I change it to
query:SET
then it does not return the documents with the word "SET" in the query field. Is the query field indexed differently?
How do I make the "query" field searchable?
Update:
Is this because of the "ignore_above" parameter that is used for all string fields? I checked the mapping using this API...
GET /packetbeat-2018.02.01/_mapping/mysql/
How do I remove this restriction and make all future beats index the query field?
Update 2:
If I put the entire string in a search on the "query" field, it works as expected...
query:"SELECT name, type, comment FROM mysql.proc WHERE name like 'residentDetails_get' and db <=> 'portal' ORDER BY name, type"
This returns all 688 records from the last 15 minutes. When I search for the following, I expect to get more...
query:"SELECT"
But I do not get a single record. I guess this is because of the way the document is indexed. I would prefer to get the equivalent of the SQL: query like '%SELECT%'
That's the correct behavior, given the query and the mapping of the field, and it's not about the 1024 limit. You could omit the query: part so that Elasticsearch uses the _all field (which will be removed in the near future), but whether that works depends on the version of the Stack you use.
A better and more correct approach is to configure the query field differently in the packetbeat template (so that the next indices will use the new mapping), like this:
"query": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 1024
}
}
}
The main idea is that ES does not split the values in the query field (since it's keyword), and you need a way to make it do so. You could use wildcards, but ES doesn't like them (especially leading wildcards) and you could run into performance issues with such a query. The "correct" approach from the ES point of view is the one I already mentioned: make the field analyzed, keep a raw version of it (for sorting and aggregations) and the analyzed version of it for searches.
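With a mapping like that in place, a token-level search and an exact-value aggregation can both be served; a rough sketch (assuming the new template has already been applied to the index):
GET packetbeat-*/_search
{
  "query": {
    "match": { "query": "SELECT" }
  },
  "aggs": {
    "top_queries": {
      "terms": { "field": "query.raw" }
    }
  }
}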
The query field of packetbeat is declared as "keyword", therefore you can only search for the entire query. For example:
query: "select * from mytable"
But what if we need to search for query: "mytable"?
You need to make the query field searchable by modifying the fields.yml file: add the type: text parameter to the query field in the MySQL section of fields.yml, found in /etc/packetbeat.
The relevant section of the file will look like this...
- name: query
  type: text
  description: >
    The query in a human readable format. For HTTP, it will typically be
    something like `GET /users/_search?name=test`. For MySQL, it is
    something like `SELECT id from users where name=test`.

Elasticsearch - use a "tags" index to discover all tags in a given string

I have an Elasticsearch v2.x cluster with a "tags" index that contains about 5000 tags: {tagName, tagID}. Given a string, is it possible to query the tags index to get all tags that are found in that string? Not only do I want exact matches, I also want to control for fuzzy matches without being too generous. By too generous I mean that a tag should only match if all tokens in the tag are found within a certain proximity of each other (say 5 words).
For example, given the string:
Model 22340 Sound Spectrum Analyzer
The following tags should match:
sound analyzer
sound spectrum analyzer
BUT NOT:
sound meter
light spectrum
chemical analyzer
I don't think it's possible to create an accurate Elasticsearch query that will auto-tag a random string; that's basically a reverse query. The most accurate way to match a tag to a document is to construct a query for the tag and then run it against the document. Obviously this would be terribly inefficient if you had to iterate over every tag to auto-tag a document.
To do a reverse query, you want to use the Elasticsearch Percolator API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
The API is very flexible and allows you to create fairly complex queries into documents with multiple fields.
The basic concept is this (assuming your tags have an app-specific ID field):
For each tag, create a query for it, and register the query with the percolator (using the tag's ID field).
To auto-tag a string, pass your string (as a document) to the percolator, which will match it against all registered queries.
Iterate over the matches. Each match includes the _id of the query; use the _id to reference the tag.
This is also a good article to read: https://www.elastic.co/blog/percolator-redesign-blog-post
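A rough sketch of those three steps using the modern percolate query (the v2.x percolator API differed slightly; all index, ID, and tag names here are made up):
PUT tag-queries
{
  "mappings": {
    "properties": {
      "query": { "type": "percolator" },
      "tagName": { "type": "text" }
    }
  }
}
PUT tag-queries/_doc/tag-42
{
  "query": {
    "match": {
      "tagName": {
        "query": "sound spectrum analyzer",
        "operator": "and"
      }
    }
  }
}
GET tag-queries/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": { "tagName": "Model 22340 Sound Spectrum Analyzer" }
    }
  }
}
The operator: and requirement means a registered tag only matches when all of its tokens appear in the incoming string.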
"query": {
"match": {
"tagName": {
"query": "Model 22340 Sound Spectrum Analyzer",
"fuzziness": "AUTO",
"operator": "or"
}
}
}
If you want an exact match, so that "sound meter" will not match, you will have to add another field to each tag containing the term count of the tag name, add a script that counts the terms in the query, and compare the two in the match query; see Finding Multiple Exact Values.
Regarding the proximity issue: since you require fuzziness, you cannot control proximity, because the match_phrase query is not integrated with fuzziness, as stated in the Elastic docs on the fuzzy match query:
Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches.
So you need to decide: fuzziness vs. proximity.
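For reference, proximity control without fuzziness is done with a match_phrase query and slop, which allows the phrase's tokens to sit within a given distance of each other (a syntax sketch):
GET tags/_search
{
  "query": {
    "match_phrase": {
      "tagName": {
        "query": "sound spectrum analyzer",
        "slop": 5
      }
    }
  }
}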
Of course you can. You can achieve what you want using just a match query with the standard analyzer.
curl -XGET "http://localhost:9200/tags/_search?pretty" -d '{
  "query": {
    "match" : {
      "tagName" : "Model 22340 Sound Spectrum Analyzer"
    }
  }
}'

ElasticSearch: How to specify specific fields to search at?

Right now in my mapping I am setting "include_in_all" to true, which means all the fields are included in the _all field.
However, when searching, instead of wasting space by putting everything into the _all field, I want to specify the specific fields to search (taking into account the boost scores in the mapping).
How do I create a query that tells Elasticsearch to only look at specific fields (not just one) and to take into account the boosting I gave them in my mapping?
Start with a multi_match query. It allows you to query multiple fields, giving them different weights, and it's usually the way to go when you have a search box.
{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^2", "message" ]
  }
}
The query_string query is more powerful, but more dangerous too, since it is parsed and can break. Use it only if you need it.
You don't need to keep data in the _all field to query for a field.
You can use query_string or bool queries to search over multiple fields.
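For instance, a query_string query can target several fields with boosts as well (a sketch reusing the field names from the multi_match example above):
{
  "query_string" : {
    "query" : "this is a test",
    "fields" : [ "subject^2", "message" ]
  }
}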
