How to match on prefix in Elasticsearch - elasticsearch

Let's say that in my Elasticsearch index I have a field called "dots" which contains a string of punctuation-separated words (e.g. "first.second.third").
I need to search for e.g. "first.second" and get all entries whose "dots" field is exactly "first.second" or starts with "first.second.".
I have trouble understanding how text querying works; at least I have not been able to create a query that does the job.

Elasticsearch has the Path Hierarchy Tokenizer, which was created for exactly this use case. Here is an example of how to set it up for your index:
# Create a new index with a custom path_hierarchy analyzer
# See http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer.html
# (older versions used "index_analyzer" instead of "analyzer"; that setting is deprecated)
curl -XPUT "localhost:9200/prefix-test" -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "type": "custom",
          "tokenizer": "prefix-test-tokenizer"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "."
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "dots": {
          "type": "string",
          "analyzer": "prefix-test-analyzer",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}'
echo
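# Optional sanity check: inspect the tokens the analyzer produces
# (parameter-style _analyze syntax as used by older 1.x/2.x versions)
curl -XGET "localhost:9200/prefix-test/_analyze?analyzer=prefix-test-analyzer&text=first.second.third&pretty"
# Expected tokens: "first", "first.second", "first.second.third"
echo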
# Put some test data
curl -XPUT "localhost:9200/prefix-test/doc/1" -d '{"dots": "first.second.third"}'
curl -XPUT "localhost:9200/prefix-test/doc/2" -d '{"dots": "first.second.foo-bar"}'
curl -XPUT "localhost:9200/prefix-test/doc/3" -d '{"dots": "first.baz.something"}'
curl -XPOST "localhost:9200/prefix-test/_refresh"
echo
# Test searches.
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first.second"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true" -d '{
"query": {
"term": {
"dots": "first.second.foo-bar"
}
}
}'
echo
curl -XPOST "localhost:9200/prefix-test/doc/_search?pretty=true&q=dots:first.second"
echo

There is also a much easier way, as pointed out in the Elasticsearch documentation:
just use:
{
  "text_phrase_prefix" : {
    "fieldname" : "yourprefix"
  }
}
or, since 0.19.9:
{
  "match_phrase_prefix" : {
    "fieldname" : "yourprefix"
  }
}
instead of:
{
  "prefix" : {
    "fieldname" : "yourprefix"
  }
}
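For completeness, a full search request wrapping one of these snippets could look like the following (just a sketch; the index, type and field names are placeholders):
curl -XPOST "localhost:9200/myindex/doc/_search?pretty=true" -d '{
  "query": {
    "match_phrase_prefix": {
      "fieldname": "yourprefix"
    }
  }
}'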

Have a look at prefix queries.
$ curl -XGET 'http://localhost:9200/index/type/_search' -d '{
  "query" : {
    "prefix" : { "dots" : "first.second" }
  }
}'

You should use wildcard characters in your query, for example via a query_string query:
$ curl -XGET 'http://localhost:9200/myapp/index/_search' -d '{
  "query": {
    "query_string": {
      "query": "dots:first.second*"
    }
  }
}'
More examples of the syntax at: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html

I was looking for a similar solution, but matching only on the prefix. I found @imtov's answer, which got me almost there, but with one change - switching the analyzers around:
"mappings": {
"doc": {
"properties": {
"dots": {
"type": "string",
"analyzer": "keyword",
"search_analyzer": "prefix-test-analyzer"
}
}
}
}
instead of
"mappings": {
"doc": {
"properties": {
"dots": {
"type": "string",
"index_analyzer": "prefix-test-analyzer",
"search_analyzer": "keyword"
}
}
}
}
This way, adding:
'{"dots": "first.second"}'
'{"dots": "first.third"}'
will index only these full tokens, without storing the first, second, third tokens.
Yet searching for either
first.second.anyotherstring
first.second
will correctly return only the first entry:
'{"dots": "first.second"}'
Not exactly what you asked for, but somehow related, so I thought it could help someone.
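To see why this works, you can run the search analyzer against a query string: with the swapped configuration it is the query, not the stored value, that gets broken into prefixes, so one of those prefixes equals the single keyword token that was indexed. A quick check (assuming the prefix-test-analyzer defined above and the older parameter-style _analyze syntax):
curl -XGET "localhost:9200/prefix-test/_analyze?analyzer=prefix-test-analyzer&text=first.second.anyotherstring&pretty"
# Expected tokens: "first", "first.second", "first.second.anyotherstring"
A query that applies this search analyzer (e.g. a match query) therefore hits the stored keyword token "first.second".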

Related

Elasticsearch not using synonyms from synonym file

I am new to Elasticsearch, so before downvoting or marking this as a duplicate, please read the question first.
I am testing synonyms in Elasticsearch (v 2.4.6), which I have installed on Ubuntu 16.04. I am supplying synonyms through a file named synonym.txt, which I have placed in the config directory. I have created an index synonym_test as follows:
curl -XPOST localhost:9200/synonym_test/ -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_synonym_filter"]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "ignore_case": true,
          "synonyms_path" : "synonym.txt"
        }
      }
    }
  }
}'
The index contains two fields: id and some_text. I configured the field some_text with the custom analyzer as follows:
curl -XPUT localhost:9200/synonym_test/rulers/_mapping -d '{
  "properties": {
    "id": {
      "type": "double"
    },
    "some_text": {
      "type": "string",
      "search_analyzer": "my_synonyms"
    }
  }
}'
Then I inserted some data:
curl -XPUT localhost:9200/synonym_test/external/5 -d '{
  "id" : "5",
  "some_text" : "apple is a fruit"
}'
curl -XPUT localhost:9200/synonym_test/external/7 -d '{
  "id" : "7",
  "some_text" : "english is spoken in england"
}'
curl -XPUT localhost:9200/synonym_test/external/8 -d '{
  "id" : "8",
  "some_text" : "Scotland Yard is a popular game."
}'
curl -XPUT localhost:9200/synonym_test/external/9 -d '{
  "id" : "9",
  "some_text" : "bananas contain potassium"
}'
The synonym.txt file contains the following:
"britain,england,scotland"
"fruit,bananas"
After doing all this, when I run a query for the term fruit (which should also return the text containing bananas, as they are synonyms in the file), I get only the text containing fruit.
{
  "took": 117,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8465736,
    "hits": [
      {
        "_index": "synonym_test",
        "_type": "external",
        "_id": "5",
        "_score": 0.8465736,
        "_source": {
          "id": "5",
          "some_text": "apple is a fruit"
        }
      }
    ]
  }
}
I have also tried the following links, but none of them seem to have helped me:
Synonym analyzer not working,
Elasticsearch synonym analyzer not working, How to apply synonyms at query time instead of index time in Elasticsearch, how to configure the synonyms_path in elasticsearch, and many other links.
So, can anyone please tell me if I am doing anything wrong? Is there anything wrong with the settings or the synonym file? I want the synonyms to work at query time, so that when I search for a term, I get all documents related to that term.
Please refer to the following url: Custom Analyzer, which explains how custom analyzers should be configured.
If we follow the guidance from the above documentation, the schema becomes:
curl -XPOST localhost:9200/synonym_test/ -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_synonyms": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_synonym_filter"]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "ignore_case": true,
          "synonyms_path" : "synonym.txt"
        }
      }
    }
  }
}'
This currently works on my Elasticsearch instance.
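As a quick check (assuming the "fruit,bananas" entry from the question's synonym.txt, and remembering that analysis settings only take effect when the index is created, or closed and reopened after a settings update), the _analyze API should show the expansion:
curl -XGET 'localhost:9200/synonym_test/_analyze?analyzer=my_synonyms&text=fruit&pretty'
If the synonym filter is picked up, the token list should contain both fruit and bananas.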

Elasticsearch does not filter as expected

I am using Elasticsearch 1.4
I have an index:
curl -XPUT "http://localhost:49200/customer" -d '{"mappings": {"venues": {"properties": {"party_id": {"type": "string"},"sup_party_id": {"type": "string"},"location": {"type": "geo_point"} } } }}'
And put in some data, for instance:
curl -XPOST "http://localhost:49200/customer/venues/RO2" -d '{ "party_id":"RO2", "sup_party_id": "SUP_GT_R1A_0001","location":{ "lat":"21.030347","lon":"105.842896" }}'
curl -XPOST "http://localhost:49200/customer/venues/RO3" -d '{ "party_id":"RO3", "sup_party_id": "SUP_GT_R1A_0004","location":{ "lat":"20.9602051","lon":"105.78709179999998" }}'
and my filter is:
{"constant_score":
{"filter":
{"and":
[{"terms":
{"sup_party_id":["SUP_GT_R1A_0004","SUP_GT_R1A_0001","RO2","RO3","RO4"]
}
},{"geo_bounding_box":
{"location":
{"top_left":{"lat":25.74546096707413,"lon":70.43503197075188},
"bottom_right":{"lat":6.342579199578783,"lon":168.96042259575188}
}
}
}]
}
}
}
The above query does not return data, but it does return data when I remove the following terms filter:
{
  "terms": {
    "sup_party_id": ["SUP_GT_R1A_0004", "SUP_GT_R1A_0001", "RO2", "RO3", "RO4"]
  }
}
Please show me the problem; any suggestion is appreciated!
That's because the sup_party_id field is an analyzed string. Change your mapping like this instead and it will work:
curl -XPUT "http://localhost:49200/customer" -d '{
"mappings": {
"venues": {
"properties": {
"party_id": {
"type": "string"
},
"sup_party_id": {
"type": "string",
"index": "not_analyzed" <--- add this
},
"location": {
"type": "geo_point"
}
}
}
}
}'
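The underlying reason is that the standard analyzer at the very least lowercases the value, so the indexed token (e.g. sup_gt_r1a_0001) never equals the uppercase string that the terms filter compares verbatim. A quick way to confirm this (an optional check, using the port from the question):
curl -XGET "http://localhost:49200/customer/_analyze?analyzer=standard&text=SUP_GT_R1A_0001&pretty"
Also note that the mapping of an existing field cannot be changed in place: the index has to be recreated with the not_analyzed mapping and the documents reindexed.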

elasticsearch mapping analyzer - GET not getting result

I am trying to create an analyzer which replaces special characters with whitespace and converts the text to uppercase. After that, a search with lowercase should also work.
Mapping Analyzer:
soundarya@soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XPUT 'http://localhost:9200/aida' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "uppercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1 "
        }
      }
    }
  }
}'
{"acknowledged":true}
soundarya@soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XPOST 'http://localhost:9200/aida/_analyze?pretty' -d '{
  "analyzer": "my_analyzer",
  "text": "My name is Soun*arya?jwnne&yuuk"
}'
It tokenizes the words properly, replacing the special characters with whitespace. But if I now search for a word from the text, it does not retrieve any result.
soundarya@soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XGET 'http://localhost:9200/aida/_search' -d '{
  "query": {
    "match": {
      "text": "My"
    }
  }
}'
I am not getting any result from the above GET query. The result I get looks like this:
soundarya@soundarya-VirtualBox:~/Downloads/elasticsearch-2.4.0/bin$ curl -XGET 'http://localhost:9200/aida/_search' -d '{
  "query": {
    "match": {
      "text": "my"
    }
  }
}'
{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Can anyone help me with this? Thank you!
You don't seem to have indexed any data after creating your index. The call to _analyze will not index anything but simply show you how the content you send to ES would be analyzed.
First, you need to create your index by specifying a mapping in which you use the analyzer you've defined:
curl -XPUT 'http://localhost:9200/aida' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "uppercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1 "
        }
      }
    }
  },
  "mappings": {                        <--- add a mapping type...
    "doc": {
      "properties": {
        "text": {                      <--- ...with a field...
          "type": "string",
          "analyzer": "my_analyzer"    <--- ...using your analyzer
        }
      }
    }
  }
}'
Then you can index a new real document:
curl -XPOST 'http://localhost:9200/aida/doc' -d '{
  "text": "My name is Soun*arya?jwnne&yuuk"
}'
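If you search immediately after indexing, the new document might not be visible yet, since it only shows up after the next refresh. When testing by hand you can force one (optional extra step):
curl -XPOST 'http://localhost:9200/aida/_refresh'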
Finally, you can search:
curl -XGET 'http://localhost:9200/aida/_search' -d '{
  "query": {
    "match": {
      "text": "My"
    }
  }
}'

Elasticsearch terms filter returns no results

I have a bunch of documents with an array field like this:
{ "feed_uids": ["math.CO", "cs.IT"] }
I would like to find all documents that contain some subset of these values, i.e. treat them as tags. The documentation leads me to believe a terms filter should work:
{ "query": { "filtered": { "filter": { "terms": { "feed_uids": [ "cs.IT" ] } } } } }
However, the query matches nothing. What am I doing wrong?
The terms-filter works as you expect. I guess your problem here is that you have a mapping where feed_uids is using the standard analyzer.
This is quite a common problem which is described in a bit more depth here: Troubleshooting Elasticsearch searches, for Beginners
Here is a runnable example showcasing how it works if you specify "index": "not_analyzed" for the field: https://www.found.no/play/gist/bc957d515597ec8262ab
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
"mappings": {
"type": {
"properties": {
"feed_uids": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"feed_uids":["math.CO","cs.IT"]}
{"index":{"_index":"play","_type":"type"}}
{"feed_uids":["cs.IT"]}
'
# Do searches
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
"query": {
"filtered": {
"filter": {
"terms": {
"feed_uids": [
"cs.IT"
]
}
}
}
}
}
'

Indexing website/url in Elastic Search

I have a website field in a document indexed in Elasticsearch. Example value: http://example.com. The problem is that when I search for example, the document is not included. How do I map the website/url field correctly?
I created the index below:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_html": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": "standard",
            "char_filter": "html_strip"
          }
        }
      }
    }
  },
  "mapping": {
    "blogshops": {
      "properties": {
        "category": {
          "properties": {
            "name": {
              "type": "string"
            }
          }
        },
        "reviews": {
          "properties": {
            "user": {
              "properties": {
                "_id": {
                  "type": "string"
                }
              }
            }
          }
        }
      }
    }
  }
}
I guess you are using the standard analyzer, which splits http://example.com into two tokens - http and example.com. You can take a look at http://localhost:9200/_analyze?text=http://example.com&analyzer=standard.
If you want to split the url further, you need to use a different analyzer or specify your own custom analyzer.
You can take a look at how the url would be indexed with the simple analyzer - http://localhost:9200/_analyze?text=http://example.com&analyzer=simple. As you can see, the url is now indexed as three tokens ['http', 'example', 'com']. If you don't want to index tokens like ['http', 'www'] etc., you can specify your own analyzer with the lowercase tokenizer (this is the one used in the simple analyzer) and a stop filter. For example something like this:
# Delete index
#
curl -s -XDELETE 'http://localhost:9200/url-test/' ; echo
# Create index with mapping and custom index
#
curl -s -XPUT 'http://localhost:9200/url-test/' -d '{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer" : "lowercase_with_stopwords"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "filter" : {
        "stopwords_filter" : {
          "type" : "stop",
          "stopwords" : ["http", "https", "ftp", "www"]
        }
      },
      "analyzer": {
        "lowercase_with_stopwords": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [ "stopwords_filter" ]
        }
      }
    }
  }
}' ; echo
curl -s -XGET 'http://localhost:9200/url-test/_analyze?text=http://example.com&analyzer=lowercase_with_stopwords&pretty'
# Index document
#
curl -s -XPUT 'http://localhost:9200/url-test/document/1?pretty=true' -d '{
"content" : "Small content with URL http://example.com."
}'
# Refresh index
#
curl -s -XPOST 'http://localhost:9200/url-test/_refresh'
# Try to search document
#
curl -s -XGET 'http://localhost:9200/url-test/_search?pretty' -d '{
  "query" : {
    "query_string" : {
      "query" : "content:example"
    }
  }
}'
NOTE: If you prefer not to use stopwords, here is an interesting article: stop stopping stop words: a look at common terms query
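As a rough sketch of that alternative (keeping terms like http in the index and letting the query handle them; the field name and cutoff_frequency value are only placeholders), a common terms query could look like this:
curl -s -XGET 'http://localhost:9200/url-test/_search?pretty' -d '{
  "query" : {
    "common" : {
      "content" : {
        "query" : "small content with url http example com",
        "cutoff_frequency" : 0.001
      }
    }
  }
}'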
