Difference between keyword and text in Elasticsearch

Can someone explain the difference between keyword and text in Elasticsearch with an example?

keyword type:
Suppose you define a field of type keyword, like this:
PUT products
{
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "keyword"
}
}
}
}
}
Then a search on this field has to supply the whole value exactly (a keyword search), because a keyword field is indexed as a single term. For example, index this document:
POST products/_doc
{
"name": "washing machine"
}
When you then run a search like this:
GET products/_search
{
"query": {
"match": {
"name": "washing"
}
}
}
it will not match any documents. You have to search with the whole value, "washing machine".
The text type, on the other hand, is analyzed, so you can search using individual tokens from the field value (a full-text search):
PUT products
{
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text"
}
}
}
}
}
and the search:
GET products/_search
{
"query": {
"match": {
"name": "washing"
}
}
}
will return the matching document.
For more details, see keyword vs. text.

The primary difference between the text datatype and the keyword datatype is that text fields are analyzed at the time of indexing, and keyword fields are not.
What that means is, text fields are broken down into their individual terms at indexing to allow for partial matching, while keyword fields are indexed as is.
Keyword mapping:
"channel" : {
"type" : "keyword"
},
"product_image" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
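As a rough illustration of "analyzed" versus "indexed as is", here is a minimal Python sketch. It does not talk to Elasticsearch; analyze_text is only an approximation of the standard analyzer (split on non-alphanumeric characters, then lowercase), and all function names are made up for this example.

```python
import re

def analyze_text(value):
    """Approximation of the standard analyzer: split into words, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", str(value))]

def analyze_keyword(value):
    """A keyword field keeps the whole value as one single term."""
    return [str(value)]

def match(query, indexed_terms, analyzer):
    """Simplified match query: analyze the query, hit if any query term is indexed."""
    return any(term in indexed_terms for term in analyzer(query))

text_terms = analyze_text("washing machine")
keyword_terms = analyze_keyword("washing machine")

print(text_terms)                                                 # ['washing', 'machine']
print(keyword_terms)                                              # ['washing machine']
print(match("washing", text_terms, analyze_text))                 # True
print(match("washing", keyword_terms, analyze_keyword))           # False
print(match("washing machine", keyword_terms, analyze_keyword))   # True
```

The partial query only hits the text field because "washing" exists there as a term of its own; in the keyword field the only term is the full string.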

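Along with the other advantages of the keyword type in Elasticsearch, a keyword field also accepts values that look like other data types, such as strings, numbers, or dates. Each value is indexed as a single exact term rather than being parsed as a number or a date.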
PUT /demo-index/
{
"mappings": {
"properties": {
"name": { "type": "keyword" }
}
}
}
POST /demo-index/_doc
{
"name": "2021-02-21"
}
POST /demo-index/_doc
{
"name": 100
}
POST /demo-index/_doc
{
"name": "Jhon"
}

Related

Wildcard doesn't work as expected when querying by more than a word

If I search for documents containing e.g. "called" in the "message" field I get the expected result, but when I search for "was called", "was called*" or "*was called*" I get nothing, although I have a lot of documents whose message field contains the following content: "Application was called by REST API".
Here is a part of a query I send:
"wildcard": {
"message": {
"wildcard": "was called",
"boost": 1.0
}
}
Here is a part of the mapping:
"mappings": {
"doc": {
"dynamic_templates": [
{
"message_field": {
"path_match": "message",
"match_mapping_type": "string",
"mapping": {
"norms": false,
"type": "text"
}
}
},
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"norms": false,
"type": "text"
}
}
}
],
"properties": {
...
"message": {
"type": "text",
"norms": false
}
}
}
}
The indexes I search are created automatically by Logstash.
I have a similar problem with another field that holds the value "NP-00121": *00121 works, but *-00121 doesn't.
Edit: one more example. I have a "requestUri" field containing "/api/v1/log/rest", "/api/v1/log/notification", etc. When I send the wildcard query "/api/v1*" I get nothing.
So it looks like the problem appears when using spaces and dashes. Could anyone help me solve this problem?
Wildcard patterns are matched against individual tokens, not against the whole field value. Your message field is indexed as text, so it is tokenized into words, and no single token contains "was called".
Basically, you don't need wildcards for a query like "was called". Simply use a phrase query like:
"query": {
"match_phrase" : {
"message" : "was called"
}
}
or if you prefer a query string query:
"query": {
"query_string" : {
"query" : "message:\"was called\""
}
}
A wildcard query would be useful for searching for partial terms, something like:
"query": {
"wildcard" : { "message" : "call*" }
}
This would find all docs that contain "call", "called" or "calling".
For values like NP-00121, or for URIs, it would likely be more useful if those fields were not analyzed. As it is, these values are split into tokens ('np' and '00121'), hence the problem you are seeing. You can index these fields as the "keyword" type instead of "text" so that the whole value is indexed as a single, unanalyzed token.
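To make the "wildcards work within tokens" point concrete, here is a small Python simulation. It is only a sketch under assumptions: standard_tokens approximates the standard analyzer, fnmatchcase from the Python standard library stands in for Elasticsearch's own wildcard matching (both treat * and ? as wildcards), and the function names are invented for this example.

```python
import re
from fnmatch import fnmatchcase

def standard_tokens(value):
    """Approximation of the standard analyzer: split into words, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", value)]

def wildcard_hits(pattern, terms):
    """The pattern is not analyzed; it is tested against each indexed term."""
    return [t for t in terms if fnmatchcase(t, pattern)]

message_terms = standard_tokens("Application was called by REST API")
print(wildcard_hits("was called", message_terms))   # []
print(wildcard_hits("call*", message_terms))        # ['called']

code_terms = standard_tokens("NP-00121")            # ['np', '00121']
print(wildcard_hits("*00121", code_terms))          # ['00121']
print(wildcard_hits("*-00121", code_terms))         # []

uri_terms = standard_tokens("/api/v1/log/rest")     # ['api', 'v1', 'log', 'rest']
print(wildcard_hits("/api/v1*", uri_terms))         # []

# With a keyword field the whole value is the only term, so the patterns match.
print(wildcard_hits("*-00121", ["NP-00121"]))            # ['NP-00121']
print(wildcard_hits("/api/v1*", ["/api/v1/log/rest"]))   # ['/api/v1/log/rest']
```

The space, the dash and the slashes never make it into any token of the text field, which is why patterns containing them find nothing there.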

Is there a way to make elasticsearch case-insensitive without altering the existing documents?

Does Elasticsearch allow us to query documents case-insensitive? Or should I save them as case-insensitive before querying? Or is there some setting that I should set for the whole index to make it case-insensitive?
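Can you clarify this, please?
By default, searches on string fields are case-insensitive because of the mapping Elasticsearch applies: the field is mapped as text (with a keyword sub-field), and the text analyzer lowercases terms at both index and query time.
Try the following: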
PUT myindex/doc/1
{
"name":"TEST"
}
GET myindex/_mapping
It should return :
{
"myindex": {
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Now if you query as below, it will return a match (notice that the mapping contains both text and keyword):
POST myindex/_search
{
"query": {
"match": {
"name": "test"
}
}
}
Now, if you explicitly map the field as keyword, the search becomes case-sensitive. Try the following; it will not return any results.
PUT myindex/_mapping/doc
{
"properties": {
"name2": {
"type": "keyword"
}
}
}
PUT myindex/doc/1
{
"name2":"TEST"
}
POST myindex/_search
{
"query": {
"match": {
"name2": "test"
}
}
}
TL;DR: Use the default mapping or the text type. If you map the field as keyword only, searches on it will be case-sensitive.
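The reason behind this behaviour is that a match query runs the query string through the same analysis as the field. Below is a minimal Python sketch of that idea; it is a simplified model with made-up function names, not Elasticsearch code.

```python
def text_analyzer(value):
    """Simplified text analysis: lowercase, then split on whitespace."""
    return value.lower().split()

def keyword_analyzer(value):
    """Keyword fields are not analyzed: one term, case preserved."""
    return [value]

def match(query, stored_value, analyzer):
    """The same analyzer is applied to the stored value and to the query."""
    indexed = set(analyzer(stored_value))
    return any(term in indexed for term in analyzer(query))

print(match("test", "TEST", text_analyzer))      # True: both sides become "test"
print(match("test", "TEST", keyword_analyzer))   # False: "test" != "TEST"
print(match("TEST", "TEST", keyword_analyzer))   # True: exact value
```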

Elasticsearch 5.X Percolate: How to autogenerate copy_to fields?

In ES 2.3.3, many queries in the system I'm working on use the _all field. Sometimes these are registered to a percolate index, and when running percolator on the doc, _all is generated automatically.
In converting to ES 5.X, _all is being deprecated, so _all has been replaced with a copy_to field that contains the components we actually care about, and it works great for those searches.
Registering the same query to a percolate index with the same document mapping including copy_to fields works fine. Sending a percolate query with the document never results in a hit for a copy_to field however.
Manually building the copy_to field via simple string concatenation seems to work, it's just that I'd expect to be able to Query -> DocIndex and get the same result as Doc -> PercolateQuery... So I'm just looking for a way to have ES generate the copy_to fields automatically on a document being percolated.
It turned out there was nothing wrong with ES, of course; I'm posting here in case it helps someone else. I figured it out while trying to build a simpler example to post here. The issue was that percolating a document of a type that doesn't exist in the percolate index doesn't return any error, but seems to apply all percolate queries without applying any mappings. That was confusing because it worked for simple test cases but not for complex ones. Here's an example:
From the copy_to docs, generate an index with a copy_to mapping. See that a query to the copy_to field works.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
}
}
}
PUT my_index/my_type/1
{
"first_name": "John",
"last_name": "Smith"
}
GET my_index/_search
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
}
}
Create a percolate index with the same type
PUT /my_percolate_index
{
"mappings": {
"my_type": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
},
"queries": {
"properties": {
"query": {
"type": "percolator"
}
}
}
}
}
Create a percolate query that matches the search above on the copy_to field, and a second query that just queries a basic unmodified field
PUT /my_percolate_index/queries/1?refresh
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
}
}
PUT /my_percolate_index/queries/2?refresh
{
"query": {
"match": {
"first_name": {
"query": "John"
}
}
}
}
Search, but with the wrong type... there will be a hit on the basic field (first_name: John) even though no document mappings match the request
GET /my_percolate_index/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document_type" : "non_type",
"document" : {
"first_name": "John",
"last_name": "Smith"
}
}
}
}
{"took":7,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1,"max_score":0.2876821,"hits":[{"_index":"my_percolate_index","_type":"queries","_id":"2","_score":0.2876821,"_source":{
"query": {
"match": {
"first_name": {
"query": "John"
}
}
}
}}]}}
Send in the correct document_type and see both matches as expected
GET /my_percolate_index/_search
{
"query" : {
"percolate" : {
"field" : "query",
"document_type" : "my_type",
"document" : {
"first_name": "John",
"last_name": "Smith"
}
}
}
}
{"took":7,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":0.51623213,"hits":[{"_index":"my_percolate_index","_type":"queries","_id":"1","_score":0.51623213,"_source":{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
}
}},{"_index":"my_percolate_index","_type":"queries","_id":"2","_score":0.2876821,"_source":{
"query": {
"match": {
"first_name": {
"query": "John"
}
}
}
}}]}}
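Why the wrong document_type still hits query 2 but not query 1 can be modelled in a few lines of Python. This is only a sketch of the observed behaviour under simplifying assumptions (whitespace tokenization, a dictionary standing in for the mapping, invented function names), not a description of Elasticsearch internals.

```python
def tokens(value):
    """Simplified analysis: lowercase, then split on whitespace."""
    return value.lower().split()

# Hypothetical stand-in for the my_type mapping shown above.
MAPPING = {
    "first_name": {"copy_to": "full_name"},
    "last_name": {"copy_to": "full_name"},
    "full_name": {},
}

def index_document(doc, mapping):
    """Return {field: set of terms}; copy_to also adds the terms to the target field."""
    indexed = {}
    for field, value in doc.items():
        terms = tokens(value)
        indexed.setdefault(field, set()).update(terms)
        target = mapping.get(field, {}).get("copy_to")
        if target:
            indexed.setdefault(target, set()).update(terms)
    return indexed

def match_all_terms(indexed, field, query):
    """match query with operator "and": every query term must be present."""
    return all(t in indexed.get(field, set()) for t in tokens(query))

doc = {"first_name": "John", "last_name": "Smith"}
with_mapping = index_document(doc, MAPPING)   # correct document_type
without_mapping = index_document(doc, {})     # unknown type: no copy_to applied

print(match_all_terms(with_mapping, "full_name", "John Smith"))      # True
print(match_all_terms(without_mapping, "full_name", "John Smith"))   # False
print(match_all_terms(without_mapping, "first_name", "John"))        # True
```

Without the mapping, full_name is never populated, so only the query on the plain first_name field can match.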

Elasticsearch query_string query with multiple default fields

I would like to use the features of the query_string query, but I need the query to search by default across a subset of fields (not all, but also not just one). When I try to pass many default fields, the query fails. Any suggestions?
I am not specifying a field in the query itself, so I want to search three fields by default:
{
"query": {
"query_string" : {
"query" : "some search using advanced operators OR dog",
"default_field": ["Title", "Description", "DesiredOutcomeDescription"]
}
}
}
If you want to create a query on 3 specific fields as above, just use the fields parameter.
{
"query": {
"query_string" : {
"query" : "some search using advanced operators OR dog",
"fields": ["Title", "Description", "DesiredOutcomeDescription"]
}
}
}
Alternatively, if you want to search by default on those 3 fields without specifying them, you will have to use the copy_to parameter when you set up the mapping. Then set the default field to be the concatenated field.
PUT my_index
{
"settings": {
"index.query.default_field": "full_name"
},
"mappings": {
"my_type": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name"
},
"last_name": {
"type": "text",
"copy_to": "full_name"
},
"full_name": {
"type": "text"
}
}
}
}
}
I have used this and don't recommend it because the control over the tokenization can be limiting, as you can only specify one tokenizer for the concatenated field.
Here is the page on copy_to.
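If the request body is built in application code, the fields variant can be wrapped in a small helper. The Python sketch below only builds the JSON body; the helper name query_string_body is made up for this example, and sending the body to Elasticsearch is left out.

```python
import json

def query_string_body(query, fields):
    """Build a query_string request body that searches several fields by default."""
    if not fields:
        raise ValueError("at least one field is required")
    return {"query": {"query_string": {"query": query, "fields": list(fields)}}}

body = query_string_body(
    "some search using advanced operators OR dog",
    ["Title", "Description", "DesiredOutcomeDescription"],
)
print(json.dumps(body, indent=2))
```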

Why prefix returns documents without the specific prefix?

I want to return only documents whose name starts with "pizza". This is what I've done:
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name": "pizza"
}
}
}
}
}
But I've got these 3 documents:
{
"name": "Viana Pizza",
"city": "Mashhad",
"address": "Vakil abad",
"foods": ["Pizza"],
"salad": true,
"rate": 5.0
}
{
"name": "Pizza Pizza",
"city": "Mashhad",
"address": "Bahar st",
"foods": ["Pizza"],
"salad": true,
"rate": 8.5
}
{
"name": "Reza Pizza",
"city": "Tehran",
"address": "Vali Asr",
"foods": ["Pizza"],
"salad": true,
"rate": 7.5
}
As you can see, only one of them has "pizza" at the beginning of the name field.
What's wrong?
Probably the simplest explanation, given that you didn't provide the actual mapping, is that your "name" field is a "string" and "analyzed" (the default), which means that "Reza Pizza" will be transformed into the terms "reza" and "pizza".
Your filter matches against terms, not against entire field values, because ES analyzes fields and produces terms when the default mapping is used.
You need to either change your "name" field to "not_analyzed" or add another field that mirrors "name" and is "not_analyzed". Also, for the lowercase text "pizza" to work in this case, you need to create a custom analyzer.
Below you have the solution with the mirror field:
PUT /pizza
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "my_keyword_lowercase_analyzer"
}
}
}
}
}
}
}
And in searching you need to use the mirror field:
GET /pizza/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"prefix": {
"name.raw": "pizza"
}
}
}
}
}
That's all about Elasticsearch analyzers. Let's read the documentation on prefix filter:
Filters documents that have fields containing terms with a specified prefix (not analyzed).
Here we can see that this filter matches terms, not the whole field value. When you index a document, ES splits your field values into terms using analyzers. The default analyzer splits the value into words and converts them to lowercase. So all three results have the term pizza in the name field, and the term pizza perfectly matches the prefix pizza. If you want to match the field value as is, I'd suggest mapping the name field as not_analyzed.
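The effect can be reproduced with a short Python simulation of the three restaurant names. It is a sketch under assumptions: analyzed_terms approximates the default analyzer, lowercase_keyword_terms imitates a keyword tokenizer followed by a lowercase filter, and all names are invented for this example.

```python
import re

NAMES = ["Viana Pizza", "Pizza Pizza", "Reza Pizza"]

def analyzed_terms(value):
    """Approximation of the default analyzer: split into words, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", value)]

def lowercase_keyword_terms(value):
    """Whole value as one lowercased term (keyword tokenizer plus lowercase filter)."""
    return [value.lower()]

def prefix_hits(prefix, names, analyzer):
    """A prefix filter hits a document if any of its terms starts with the prefix."""
    return [n for n in names if any(t.startswith(prefix) for t in analyzer(n))]

print(prefix_hits("pizza", NAMES, analyzed_terms))
# ['Viana Pizza', 'Pizza Pizza', 'Reza Pizza']
print(prefix_hits("pizza", NAMES, lowercase_keyword_terms))
# ['Pizza Pizza']
```

Only when the whole name is kept as a single term does "starts with pizza" mean what the question intended.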
