Cannot retrieve data which includes specific symbols in Kibana - elasticsearch

I'm trying to use Kibana to retrieve comment data that includes some specific symbols like ? and 。, which are not ordinary symbols.
I tried using the escape character \ for them, with KQL like comment:\? or comment:\\?, but it doesn't work. Can anyone help?

When you create a sample doc and let ES auto-generate the mapping for you,
POST comments/_doc
{
"comment": "?"
}
running
GET comments/_mapping
will get you
"comment":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
Now, the text type's analyzer is usually standard by default.
When we attempt to see how our non-standard chars got analyzed
GET comments/_analyze
{
"text": "?",
"analyzer": "standard"
}
the result is
{
"tokens" : [ ]
}
meaning we cannot search for this character through the standard-analyzed text field. We need to
either define a different default analyzer
or define a suitable analyzer on one of the comment field's sub-fields.
Going with the 2nd approach (since it's good practice to keep differently analyzed fields separate),
PUT comments2
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "fields": {
          "whitespace_analyzed": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }
  }
}
POST comments2/_doc
{
"comment": "?"
}
After verifying
GET comments2/_analyze
{
"text": "?",
"analyzer": "whitespace"
}
we can do the following in KQL
comment.whitespace_analyzed:"?"
Note that there are a bunch of built-in analyzers to choose from but you're more than welcome to create your own.
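If none of the built-in analyzers fits, a custom analyzer can be declared in the index settings and referenced from the mapping. Here is a minimal sketch; the index name comments3 and the analyzer name symbol_preserving are made up for illustration, and the whitespace tokenizer plus lowercase filter mirror the behavior used above:
PUT comments3
{
  // hypothetical index and analyzer names, just for illustration
  "settings": {
    "analysis": {
      "analyzer": {
        "symbol_preserving": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "analyzer": "symbol_preserving"
      }
    }
  }
}
A KQL query like comment:"?" should then go through this custom analyzer instead of standard, so the symbol is kept as part of the token.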

Related

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for the keyword value of it, but I'm having some issues:
Some records saying UK and some other saying GB
Some records saying US and some other USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way for your tokens to be treated as equivalent to other tokens that may not share any characters with them. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your query use them so that results are returned accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. I have created a sample mapping and query so that you can play with it and see if that fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
"name": "uk is pretty cool country"
}
And when you run the query below, it returns the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
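You can also run the analyzer through the _analyze API to see the synonym expansion in action; with the equivalent synonyms configured above, the token uk should produce both uk and gb at the same position:
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "uk is pretty cool country"
}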
Refer to their official documentation to understand more on this. Hope this helps!
To handle this within ES itself, without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
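For completeness, this is roughly how that processor could be wired into a pipeline and applied at index time; the pipeline name normalize-country-code and the index name countries are made up for illustration, and this assumes a version with ingest pipeline support (5.0+):
PUT _ingest/pipeline/normalize-country-code
{
  // hypothetical pipeline name
  "description": "Rewrite GB to UK in countryCode before indexing",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    }
  ]
}

PUT countries/_doc/1?pipeline=normalize-country-code
{
  "countryCode": "GB"
}
The stored document should then contain countryCode: UK, so the GB variant never reaches the index.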

Simplest lowercase example for es keyword type

I would like to ignore case on searches. What would be the least verbose way to do this? For example, something like:
"mappings": {
"_doc": {
"properties": {
"name": {"type": "keyword", "analyzer": "ignore_case"}
}
}
}
The above is pseudo-code but what would be the best way to do this? Basically I want to have a word like:
"Hello"
And have "Hello" or "HELLO" or "hELlo" or "hello" match it.
The keyword datatype doesn't use analyzers; you need to use a normalizer instead.
If you intend to stick with keyword, create a custom normalizer with a lowercase filter, and your mapping should be as follows:
PUT <your_index_name>
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_custom_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "mytext": {
          "type": "keyword",
          "normalizer": "my_custom_normalizer"
        }
      }
    }
  }
}
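To sanity-check this (the index name my_index below is just a stand-in for <your_index_name>), index a mixed-case value and search for it with a different casing; the normalizer lowercases the value at both index and search time, so the match query should still hit:
POST my_index/mydocs/1
{
  "mytext": "Hello"
}

GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "mytext": "hELlo"
    }
  }
}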
I think that is the least verbose way! Hope it helps!

What is the best way to handle common term which contains special chars, like C#, C++

I have some documents that contain c# or c++ in the title field, which uses the standard analyzer.
When I query c# on the title field, I get all c# and c++ documents, and the c++ documents can even score higher than the c# documents. That makes sense, since both '#' and '++' are stripped from the tokens by the standard analyzer.
What is the best way to handle this kind of special term? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".
Here is an approach you can use:
Introduce a copy-field that holds the values with the special characters preserved. For that you'll need to:
Introduce a custom analyzer (the whitespace tokenizer is important here; it preserves your special characters):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Create the copy-field mapping on the same index (the _wcc suffix stands for 'with special characters'):
PUT my_index/_mapping/my_type
{
  "properties": {
    "prog_lang": {
      "type": "text",
      "copy_to": "prog_lang_wcc",
      "analyzer": "standard"
    },
    "prog_lang_wcc": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
When issuing the query itself, you combine the query with a boost on the prog_lang_wcc field like this (it could be either a multi_match or a pure bool query plus boost):
GET /_search
{
  "query": {
    "multi_match": {
      "query": "c#",
      "type": "phrase",
      "fields": [ "prog_lang_wcc^3", "prog_lang" ]
    }
  }
}
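You can confirm that the custom analyzer keeps the symbol by running it through the _analyze API; this should return the single token c#, whereas the standard analyzer would return only c:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "C#"
}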

Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for the Brazilian Portuguese language) when I index documents in Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, because I've copied this by hand from my code on the server):
{
  "analysis": {
    "filter": {
      "my_asciifolding": {
        "type": "asciifolding",
        "preserve_original": true
      },
      "stop_pt": {
        "type": "stop",
        "ignore_case": true,
        "stopwords": "_brazilian_"
      },
      "stemmer_pt": {
        "type": "stemmer",
        "language": "brazilian"
      }
    },
    "analyzer": {
      "default": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_asciifolding",
          "stop_pt",
          "stemmer_pt"
        ]
      }
    }
  }
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared as "type": "long"), so I expect most fields to be using the default analyzer I've specified above.
This works as expected, but some users are frustrated because, since tokens are being stemmed, the query "vulnerabilities" and the query "vulnerable" return the same results, which is misleading: they expect results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.
I ended up using the "fields" property (multi-fields) to index my attributes in different ways. Not sure whether this is optimal, but this is how I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters the "default" analyzer has, minus the stemmer; a sketch is shown below.
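Roughly, the analyzer section of the settings would then look like this (reusing the filter names from the configuration above; exact details depend on your setup):
"analyzer": {
  "default": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "my_asciifolding", "stop_pt", "stemmer_pt"]
  },
  "no_stem_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["lowercase", "my_asciifolding", "stop_pt"]
  }
}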
For each attribute where I want to keep both the non-stemmed and stemmed variants, I did this (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using a query_string query) I must also indicate, via the "fields" parameter, that I want to search all sub-fields (e.g. "DESCRIPTION.*"), as in the sketch below.
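A minimal sketch of such a search, assuming the index is called my_index (the name is just for illustration):
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "vulnerabilities",
      "fields": ["DESCRIPTION.*"]
    }
  }
}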
I also based my approach on this answer: elasticsearch customize score for synonyms/stemming.

Overriding default keyword analysis for elasticsearch

I am trying to configure an Elasticsearch index with a default indexing policy of analysis with the keyword analyzer, and then override it on some fields to allow them to be analyzed as free text. So effectively opt-in free-text analysis, where I explicitly specify in the mapping which fields are analyzed for free-text matching. My mapping definition looks like this:
PUT test_index
{
  "mappings": {
    "test_type": {
      "index_analyzer": "keyword",
      "search_analyzer": "standard",
      "properties": {
        "standard": {
          "type": "string",
          "index_analyzer": "standard"
        },
        "keyword": {
          "type": "string"
        }
      }
    }
  }
}
So standard should be an analyzed field, and keyword should be exact match only. However when I insert some sample data with the following command:
POST test_index/test_type
{
"standard":"a dog in a rug",
"keyword":"sheepdog"
}
I am not getting any matches against the following query:
GET test_index/test_type/_search?q=dog
However I do get matches against:
GET test_index/test_type/_search?q=*dog*
Which makes me think that the standard field is not being analyzed. Does anyone know what I am doing wrong?
Nothing's wrong with the index created. A query like ?q=dog without a field name runs against the _all field, which here is indexed with the type-level keyword analyzer, so each value is stored as a single token and dog on its own doesn't match (while the wildcard *dog* does). Change your query to GET test_index/test_type/_search?q=standard:dog and it should return the expected results.
If you do not want to specify the field name in the query, update your mapping so that the index_analyzer and search_analyzer values are provided explicitly for each field, with no type-level defaults. See below:
PUT test_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "standard": {
          "type": "string",
          "index_analyzer": "standard",
          "search_analyzer": "standard"
        },
        "keyword": {
          "type": "string",
          "index_analyzer": "keyword",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Now if you try GET test_index/test_type/_search?q=dog, you'll get the desired results.
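To see why the type-level default behaved that way, you can compare the keyword and standard analyzers with the _analyze API (on older versions you may need to pass these as query-string parameters instead of a JSON body):
GET test_index/_analyze
{
  "analyzer": "keyword",
  "text": "a dog in a rug"
}

GET test_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog in a rug"
}
The keyword analyzer should return the whole phrase as one token, while the standard analyzer splits it into a, dog, in, a, rug; only the latter lets a plain dog query match.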
