I created a custom analyzer in Elasticsearch. I want to split the words in my defined field, "my_field", on whitespace only, and my search needs to be case-insensitive, so I used the lowercase filter.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "trim"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}
After creating the index, I analyze my sample data:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "my_Sample_TEXT"
}
and the output is:
{
  "tokens": [
    {
      "token": "my_sample_text",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
Many of my documents contain "my_Sample_TEXT" in "my_field", but when I search for this text using query_string, the result returns 0 hits:
GET my_index/_search
{
  "query": {
    "query_string": {
      "default_field": "my_type",
      "query": "*my_sample_text*",
      "analyzer": "my_custom_analyzer",
      "enable_position_increments": true,
      "default_operator": "AND"
    }
  }
}
My result is:
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
I found that this problem happens when my text contains underscores and uppercase letters. Can anyone help me fix this?
Could you try changing the mapping of "my_field" as below:
"my_field" : {
"type" : "string",
"analyzer" : "my_custom_analyzer"
"search_analyzer": "my_custom_analyzer"
}
This is because Elasticsearch uses the standard analyzer at search time when you do not set one explicitly, and the standard analyzer can create multiple tokens from your underscored text.
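To see the difference, you can run the _analyze API with both analyzers on the same sample text and compare the tokens (a quick check; the output is whatever your cluster actually produces):
POST my_index/_analyze
{
  "analyzer": "standard",
  "text": "my_Sample_TEXT"
}
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "my_Sample_TEXT"
}
If the two calls return different tokens, that mismatch explains why the indexed terms and the query terms do not line up.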
Related
I have stopwords for Brazilian Portuguese configured in my index, but if I search for the term "ios" (it's an iOS course), a bunch of other documents are returned, because the term "nos" (a Brazilian stopword) seems to be identified as a valid term for the fuzzy query.
But if I search just for the term "nos", nothing is returned. Wouldn't I expect the iOS course to be returned by the fuzzy query? I'm confused.
Is there any alternative to this? The main purpose here is that when a user searches for "ios", documents with a stopword like "nos" won't be returned, while I can maintain the fuzziness for other, more complex searches made by users.
An example of query:
GET /index/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "document_type": [
              "COURSE"
            ],
            "boost": 1.0
          }
        },
        {
          "multi_match": {
            "query": "ios",
            "type": "best_fields",
            "operator": "OR",
            "slop": 0,
            "fuzziness": "AUTO",
            "prefix_length": 0,
            "max_expansions": 50,
            "zero_terms_query": "NONE",
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": true,
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
Part of the explain output:
"description": "weight(corpo:nos in 52) [PerFieldSimilarity], result of:",
[image with the stopword configuration]
Thanks.
I tried adding a prefix length, but I want the stopwords to be ignored.
I believe the correct way to handle stopwords per language is as below:
PUT idx_teste
{
  "settings": {
    "analysis": {
      "filter": {
        "brazilian_stop_filter": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "teste_analyzer": {
          "tokenizer": "standard",
          "filter": ["brazilian_stop_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "teste_analyzer"
      }
    }
  }
}
POST idx_teste/_analyze
{
  "analyzer": "teste_analyzer",
  "text": "course nos advanced"
}
Note that the term "nos" was removed:
{
  "tokens": [
    {
      "token": "course",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "advanced",
      "start_offset": 11,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
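With this analyzer in place, a stopword-only query should match nothing at search time, since a match query analyzes its text with the field's analyzer before any fuzzy expansion. A minimal sketch against the idx_teste index above (assuming a course document has been indexed into "name"):
GET idx_teste/_search
{
  "query": {
    "match": {
      "name": {
        "query": "nos",
        "fuzziness": "AUTO"
      }
    }
  }
}
Because "nos" is removed during analysis, no terms remain, and with the default zero_terms_query of "none" the query returns no hits, so "nos" can no longer fuzzy-match documents the way it did before.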
I want to perform a query that returns a result if and only if all the words in the query are present in the given string, in order, as a prefix of that string.
For example -
let text = "garbage can"
so if I query
"garb"
it should return "garbage can"
if I query
"garbage ca"
it should return "garbage can"
but if I query
"garbage b"
it should not return anything
I tried using substring and also match queries, but neither quite did the job for me.
You may use an Edge N-gram tokenizer to index your data.
You can also have custom token_chars in the latest 7.8 version!
Have a look at the documentation for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
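As a rough sketch (index, analyzer, and field names here are illustrative, not from your setup), an edge_ngram tokenizer that keeps whitespace inside tokens indexes every prefix of the whole phrase, so "garbage ca" matches but "garbage b" does not:
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "whitespace"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
The keyword-based search analyzer keeps the query as a single lowercase token, so a plain match query such as the following finds "garbage can" for "garbage ca" but returns nothing for "garbage b":
GET my_index/_search
{
  "query": {
    "match": {
      "my_text": "garbage ca"
    }
  }
}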
I guess you want to do a prefix query. Please try the following prefix query:
GET /test_index/_search
{
  "query": {
    "prefix": {
      "my_keyword": {
        "value": "garbage b"
      }
    }
  }
}
However, the performance of this kind of prefix query is not good. You could instead try the following approach, which uses a customized prefix analyzer.
First, create a new index:
PUT /test_index
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": "1",
            "max_gram": "20"
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      },
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "analyzer": "autocomplete",
        "type": "text"
      },
      "my_keyword": {
        "type": "keyword"
      }
    }
  }
}
Second, insert data into this index:
PUT /test_index/_doc/1
{
  "my_text": "garbage can",
  "my_keyword": "garbage can"
}
Query with "garbage c"
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "garbage c"
    }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.45802015,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.45802015,
        "_source" : {
          "my_text" : "garbage can",
          "my_keyword" : "garbage can"
        }
      }
    ]
  }
}
Query with "garbage b"
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "garbage b"
    }
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
If you don't want to do a prefix query, you could try the following wildcard query. Please remember that its performance is bad, and you could also try to use a customized analyzer to optimize it.
GET /test_index/_search
{
  "query": {
    "wildcard": {
      "my_keyword": {
        "value": "*garbage c*"
      }
    }
  }
}
New Edit Part
I'm not sure if I got what you really want this time...
Anyway, please try to use the following _mapping and queries:
1. Create Index
PUT /test_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 50,
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 51,
            "token_chars": [
              "letter",
              "digit"
            ]
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      },
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "analyzer": "autocomplete",
        "type": "text"
      },
      "my_keyword": {
        "type": "keyword"
      }
    }
  }
}
2. Insert some sample data
PUT /test_index/_doc/1
{
  "my_text": "test garbage can",
  "my_keyword": "test garbage can"
}
PUT /test_index/_doc/2
{
  "my_text": "garbage",
  "my_keyword": "garbage"
}
3. Query
GET /test_index/_search
{
  "query": {
    "term": {
      "my_text": "bage c"
    }
  }
}
Please Note:
This index only supports strings with a maximum length of 50; for longer strings you need to modify max_ngram_diff, min_gram, and max_gram.
It needs a lot of memory to build the inverted index.
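To sanity-check what gets indexed, you can inspect the tokens the autocomplete analyzer emits for a given input (the exact list depends on the min_gram/max_gram settings above):
POST /test_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "test garbage can"
}
Every substring it returns becomes a searchable term, including ones that span the spaces such as "bage c", which is why the term query above matches.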
I'm using Kibana to look at a geospatial dataset in Elasticsearch for a feature currently under development. There is an index of positions that contains the field "loc.coordinates", which is a geo_point and has data such as:
loc.coordinates 25.906958000000003, 51.776407000000006
However when I run the following query I get no results:
Query
GET /positions/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_distance": {
          "distance": "2000km",
          "loc.coordinates": {
            "lat": 25,
            "lon": 51
          }
        }
      }
    }
  }
}
Response
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
I'm trying to understand why this is, as there are over 250,000 datapoints in the index, and I'm getting no hits regardless of how big the search area is. When I look in the position index mapping I see the following:
"loc": {
"type": "nested",
"properties": {
"coordinates": {
"type": "geo_point"
},
"type": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
I'm new to Elasticsearch and have been making my way through the documentation, but so far I don't see why my geo queries aren't working as expected. What am I doing wrong?
Your loc field is of type nested, so you need to query that field accordingly with a nested query:
GET /positions/_search
{
  "query": {
    "bool": {
      "filter": {
        "nested": {
          "path": "loc",
          "query": {
            "geo_distance": {
              "distance": "2000km",
              "loc.coordinates": {
                "lat": 25,
                "lon": 51
              }
            }
          }
        }
      }
    }
  }
}
I'm sending this request
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query": {
    "query_string": {
      "query": "\"*cor interface*\"",
      "fields": ["title", "obj_id"]
    }
  }
}'
And I'm getting the correct result:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 5.421598,
    "hits": [
      {
        "_index": "process_test_3",
        "_type": "14",
        "_id": "141_dashboard_14",
        "_score": 5.421598,
        "_source": {
          "obj_type": "dashboard",
          "obj_id": "141",
          "title": "Cor Interface Monitoring"
        }
      }
    ]
  }
}
But when I want to search by a partial word, for example:
curl -XGET 'host/process_test_3/14/_search' -d '
{
  "query": {
    "query_string": {
      "query": "\"*cor inter*\"",
      "fields": ["title", "obj_id"]
    }
  }
}'
I'm getting no results back:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
What am I doing wrong?
This is because your title field has probably been analyzed by the standard analyzer (default setting) and the title Cor Interface Monitoring has been tokenized as the three tokens cor, interface and monitoring.
In order to search any substring of words, you need to create a custom analyzer which leverages the ngram token filter in order to also index all substrings of each of your tokens.
You can create your index like this:
curl -XPUT localhost:9200/process_test_3 -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "14": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}'
Then you can reindex your data. As a result, the title Cor Interface Monitoring will now be tokenized as:
co, cor, or
in, int, inte, inter, interf, etc
mo, mon, moni, etc
so that your second search query will now return the document you expect because the tokens cor and inter will now match.
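You can verify the tokenization with the _analyze API once the index exists. On the older Elasticsearch versions this question targets (the string type era), _analyze accepts query-string parameters; newer versions take a JSON body with "analyzer" and "text" instead:
curl -XGET 'localhost:9200/process_test_3/_analyze?analyzer=substring_analyzer&text=Cor+Interface+Monitoring'
The response lists the ngrams the analyzer produces, which are exactly the terms your queries can match against.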
+1 to Val's solution.
Just wanted to add something.
Since your query is relatively simple, you may want to have a look at match/match_phrase queries. Match queries don't do the regex/wildcard parsing that query_string does, and are thus lighter.
You can find the details here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
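For instance, a minimal sketch of the equivalent match_phrase query (note it only matches whole tokens, so partial words like "inter" still require the ngram analyzer from Val's answer):
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query": {
    "match_phrase": {
      "title": "cor interface"
    }
  }
}'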
I am using ElasticSearch to store the Tweets I receive from the Twitter Streaming API. Before storing them I'd like to apply an english stemmer to the Tweet content, and to do that I'm trying to use ElasticSearch analyzers with no luck.
This is the current template I am using:
PUT _template/twitter
{
  "template": "139*",
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "english": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "en_stemmer", "stop_english", "asciifolding"]
          }
        },
        "filter": {
          "stop_english": {
            "type": "stop",
            "stopwords": ["_english_"]
          },
          "en_stemmer": {
            "type": "stemmer",
            "name": "english"
          }
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "_timestamp": {
        "enabled": true,
        "store": true,
        "index": "analyzed"
      },
      "_index": {
        "enabled": true,
        "store": true,
        "index": "analyzed"
      },
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        },
        "text": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
When I start the Streaming and the index is created, all the mappings I've defined seem to apply correctly, but the text is stored as it comes from Twitter, completely raw. The index metadata shows:
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
What am I doing wrong? The analyzer seems to be applied correctly, but nothing is happening :/
Thank you!
PS: Here is the search query I use to see that the analyzer is not being applied:
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "query": "_index:1397574496990"
              }
            }
          ]
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "match_all": {}
            },
            {
              "exists": {
                "field": "geo.coordinates"
              }
            }
          ]
        }
      }
    }
  },
  "fields": [
    "geo.coordinates",
    "text"
  ],
  "size": 50000
}'
This should return the stemmed text as one of the fields, but the response is:
{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 47,
    "successful": 47,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.97402453,
    "hits": [
      {
        "_index": "1397574496990",
        "_type": "tweet",
        "_id": "456086643423068161",
        "_score": 0.97402453,
        "fields": {
          "geo.coordinates": [
            -118.21122533,
            33.79349318
          ],
          "text": [
            "Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone 🌊🐢🐢🐢☀️#turtles… http://t.co/wAVmcxnf76"
          ]
        }
      },
      {
        "_index": "1397574496990",
        "_type": "tweet",
        "_id": "456086701451259904",
        "_score": 0.97333175,
        "fields": {
          "geo.coordinates": [
            -81.017636,
            33.998741
          ],
          "text": [
            "Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
          ]
        }
      }
    ]
  }
}
The text field is exactly the same as it came from Twitter (I'm using the Streaming API). What I expected was the stemmed text, since the analyzer is applied.
Analyzers don't affect the way data is stored. So, no matter which analyzer you are using, you will get the same text back from the source and stored fields. Analyzers are applied when you index and when you search. So, by searching for something like text:twin and finding records with the word Twins, you will know that the stemmer was applied.
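A minimal check along those lines (assuming the english analyzer from your template is active on the text field):
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
  "query": {
    "match": {
      "text": "twin"
    }
  }
}'
If the tweet containing "Twins" comes back for the query term "twin", the stemmer is working; _source will still show the original, unstemmed text, because analysis only changes the indexed terms.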