How to get an exact match for more than one phrase - Elasticsearch

Below is the query I use to get an exact match:
GET courses/_search
{
"query": {
"term" : {
"name.keyword": "Anthropology 230"
}
}
}
I need to find both Anthropology 230 and Anthropology 250.
How can I get an exact match for both?

You can try match, match_phrase, or match_phrase_prefix.
Using match,
GET courses/_search
{
"query": {
"match" : {
"name" : "Anthropology 230"
}
},
"_source": "name"
}
Using match_phrase,
GET courses/_search
{
"query": {
"match_phrase" : {
"name" : "Anthropology"
}
},
"_source": "name"
}
Or, using regexp on the keyword field,
GET courses/_search
{
"query": {
"regexp" : {
"name" : "Anthropology [0-9]{3}"
}
},
"_source": "name"
}
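The answer mentions match_phrase_prefix as well; a minimal sketch (not part of the original answer, the query text "Anthropology 2" is an assumption) that should likewise match both courses:
GET courses/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "Anthropology 2"
    }
  },
  "_source": "name"
}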

The mistake you are making is using a term query on a keyword field: neither of them is analyzed, so Elasticsearch looks for the exact search string in the inverted index.
What you should do instead is query a text field, which you will have anyway if you have not defined an explicit mapping; I am assuming dynamic mapping here, since your query mentions .keyword, the sub-field that gets created automatically when you don't define a mapping.
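For reference, the dynamic mapping Elasticsearch generates for such a field looks roughly like this (a sketch, not taken from your index):
"name": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}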
Now you can use the match query below, which is analyzed with the standard analyzer; it splits the text on whitespace, so the tokens anthropology, 230 and 250 are generated for your two sample docs.
This simple and efficient query brings back both documents:
{
"query": {
"match" : {
"name" : "Anthropology 230"
}
}
}
And search result
"hits": [
{
"_index": "matchterm",
"_type": "_doc",
"_id": "1",
"_score": 0.8754687,
"_source": {
"name": "Anthropology 230"
}
},
{
"_index": "matchterm",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"name": "Anthropology 250"
}
}
]
The query above matched both docs because it was analyzed into the two tokens anthropology and 230, and anthropology occurs in both documents.
You should definitely read about the analysis process; you can also use the analyze API to see the tokens generated for any text.
Analyze API output for your text:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"analyzer": "standard",
"text": "Anthropology 250"
}
{
"tokens": [
{
"token": "anthropology",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "250",
"start_offset": 13,
"end_offset": 16,
"type": "<NUM>",
"position": 1
}
]
}

Assuming you may have more 'Anthropology nnn' items, this should do what you need (note the should clause, so a document matches if either exact value matches):
"query":{
"bool":{
"should":[
{"term": {"name.keyword":"Anthropology 230"}},
{"term": {"name.keyword":"Anthropology 250"}}
]
}
}
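Equivalently, a terms query keeps it a bit more compact; a sketch using the same name.keyword field (extend the array with any further course names):
{
  "query": {
    "terms": {
      "name.keyword": ["Anthropology 230", "Anthropology 250"]
    }
  }
}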

Related

How to search over all fields and return every document containing that search in elasticsearch?

I have a problem with searching in Elasticsearch.
I have an index with multiple documents, each with several fields. I want to be able to search over all the fields with a single query and have it return every document that contains the specified value. I found that simple_query_string worked well for this; however, it does not return consistent results. In my index I have documents with several date fields, for example:
"revisionDate" : "2008-01-01T00:00:00",
"projectSmirCreationDate" : "2008-07-01T00:00:00",
"changedDate" : "1971-01-01T00:00:00",
"dueDate" : "0001-01-01T00:00:00",
Those are just a few examples. However, when I run a query such as:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "2008"
}
}
}
It only returns two documents. This is a problem because far more than two documents contain the value "2008" in their fields.
I also have a problem searching file names.
In my index there are fields that contain fileNames like this:
"fileName" : "testPDF.pdf",
"fileName" : "demo.pdf",
"fileName" : "demo.txt",
When I query:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "demo"
}
}
}
I get no results.
But if I query:
GET new_document-20_v2/_search
{
"size": 1000,
"query": {
"simple_query_string" : {
"query": "demo.txt"
}
}
}
I get the proper result.
Is there a better way to search across all documents and fields than what I did? I want it to return all documents matching the query, not just two or zero.
Any help would be greatly appreciated.
Elasticsearch uses the standard analyzer if no analyzer is specified. Since no analyzer is specified on "fileName", demo.txt gets tokenized to:
{
"tokens": [
{
"token": "demo.txt",
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
}
]
}
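You can reproduce that output with the analyze API; a sketch (the index name is taken from your queries):
POST new_document-20_v2/_analyze
{
  "analyzer": "standard",
  "text": "demo.txt"
}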
So searching for demo will not return any result, but searching for demo.txt will.
You can instead use a wildcard query to search for documents that have demo in fileName:
{
"query": {
"wildcard": {
"fileName": {
"value": "demo*"
}
}
}
}
Search Result will be
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"fileName": "demo.pdf"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"fileName": "demo.txt"
}
}
]
Since revisionDate, projectSmirCreationDate, changedDate and dueDate are all of type date, you cannot do a partial search on them.
You can use multi-fields to add an additional field of type text to each of them. Modify your index mapping as shown below:
{
"mappings": {
"properties": {
"changedDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"projectSmirCreationDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"dueDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
},
"revisionDate": {
"type": "date",
"fields": {
"raw": {
"type": "text"
}
}
}
}
}
}
Index Data:
{
"revisionDate": "2008-02-01T00:00:00",
"projectSmirCreationDate": "2008-02-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
{
"revisionDate": "2008-01-01T00:00:00",
"projectSmirCreationDate": "2008-07-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
Search Query:
{
"query": {
"multi_match": {
"query": "2008"
}
}
}
Search Result:
"hits": [
{
"_index": "67303015",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"revisionDate": "2008-01-01T00:00:00",
"projectSmirCreationDate": "2008-07-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
},
{
"_index": "67303015",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"revisionDate": "2008-02-01T00:00:00",
"projectSmirCreationDate": "2008-02-01T00:00:00",
"changedDate": "1971-01-01T00:00:00",
"dueDate": "0001-01-01T00:00:00"
}
}
]
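If you want the partial date search to target only those raw sub-fields, you can also list them explicitly in the query; a sketch based on the mapping above:
{
  "query": {
    "multi_match": {
      "query": "2008",
      "fields": [
        "revisionDate.raw",
        "projectSmirCreationDate.raw",
        "changedDate.raw",
        "dueDate.raw"
      ]
    }
  }
}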

Elasticsearch match vs. term in filter

I don't see any difference between term and match in filter:
POST /admin/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber": "j1knd"
}
}
]
}
}
}
And the result also contains part numbers that are not an exact match, e.g. "52527.J1KND-H".
Why?
Term queries are not analyzed: whatever you send is used as-is to match the tokens in the inverted index. Match queries are analyzed: the same analyzer that was applied to the field at index time is applied to the query text, and documents are matched accordingly.
Read more about the term query and the match query. As mentioned in the match query documentation:
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
You can also use the analyze API to see the tokens generated for a particular field.
Tokens generated by the standard analyzer for the text 52527.J1KND-H:
POST /_analyze
{
"text": "52527.J1KND-H",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "52527",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0
},
{
"token": "j1knd",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "h",
"start_offset": 12,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
The above explains why you are also getting part numbers that are not an exact match, e.g. "52527.J1KND-H". Taking your example, here is how you can make it work.
Index mapping
{
"mappings": {
"properties": {
"partnumber": {
"type": "text",
"fields": {
"raw": {
"type": "keyword" --> note this
}
}
}
}
}
}
Index docs
{
"partnumber" : "j1knd"
}
{
"partnumber" : "52527.J1KND-H"
}
Search query to return only the exact match
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber.raw": "j1knd" --> note `.raw` in field
}
}
]
}
}
Result
"hits": [
{
"_index": "so_match_term",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"partnumber": "j1knd"
}
}
]
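For contrast, a match (or term) query on the analyzed partnumber field, as in the original question, returns both documents, because the query text boils down to the token j1knd, which exists in the inverted index of both; a sketch against the same index:
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "partnumber": "j1knd"
          }
        }
      ]
    }
  }
}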

Getting data without matching the full string in an Elasticsearch query

My data is stored in Elasticsearch in the format below:
{
"_index": "wallet",
"_type": "wallet",
"_id": "5dfcbe0a6ca963f84470d852",
"_score": 0.69321066,
"_source": {
"email": "test20011#gmail.com",
"wallet": "test20011#operatorqa2.akeodev.com",
"countryCode": "+91",
"phone": "7916318809",
"name": "test20011"
}
},
{
"_index": "wallet",
"_type": "wallet",
"_id": "5dfcbe0a6ca9634d1c70d856",
"_score": 0.69321066,
"_source": {
"email": "test50011#gmail.com",
"wallet": "test50011#operatorqa2.akeodev.com",
"countryCode": "+91",
"phone": "3483330496",
"name": "test50011"
}
},
{
"_index": "wallet",
"_type": "wallet",
"_id": "5dfcbe0a6ca96304b370d857",
"_score": 0.69321066,
"_source": {
"email": "test110021#gmail.com",
"wallet": "test110021#operatorqa2.akeodev.com",
"countryCode": "+91",
"phone": "2744697207",
"name": "test110021"
}
}
The record should not be found when using the query below:
{
"query": {
"bool": {
"should": [
{
"match": {
"wallet": {
"query": "operatorqa2.akeodev.com",
"operator": "and"
}
}
},
{
"match": {
"email": {
"query": "operatorqa2.akeodev.com",
"operator": "and"
}
}
}
]
}
}
}
The record should be found when I pass the query below:
{
"query": {
"bool": {
"should": [
{
"match": {
"wallet": {
"query": "test20011#operatorqa2.akeodev.com",
"operator": "and"
}
}
},
{
"match": {
"email": {
"query": "test20011#operatorqa2.akeodev.com",
"operator": "and"
}
}
}
]
}
}
}
I have created the index on the email and wallet fields.
Users search by email or wallet, and I don't know whether the string the user sends is an email or a wallet, so I am using bool.
The record should be found if a user sends the full email address or the full wallet address.
Please help me find a solution.
As mentioned by the other community members, when asking questions like this you should specify the version of Elasticsearch you are using and also provide the mapping.
Starting with Elasticsearch version 5 with default mappings you would only need to change your query to query against the exact version of the field rather than the analyzed version. By default Elasticsearch maps strings to a multi-field of type text (analyzed, for full-text search) and keyword (not-analyzed, for exact match search). In your query you would then query against the <fieldname>.keyword-fields:
{
"query": {
"bool": {
"should": [
{
"match": {
"wallet.keyword": "test20011#operatorqa2.akeodev.com"
}
},
{
"match": {
"email.keyword": "test20011#operatorqa2.akeodev.com"
}
}
]
}
}
}
If you are on an Elasticsearch version prior to version 5, change the index-property from analyzed to not_analyzed and re-index your data.
Mapping snippet:
{
"email": {
"type" "string",
"index": "not_analyzed"
}
}
Your query would still not need to use the and-operator. It will look identical to the query I posted above, with the exception that you have to query against the email and wallet-fields, and not email.keyword and wallet.keyword.
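In other words, the pre-5.x variant would look roughly like this (a sketch, assuming the not_analyzed mapping above):
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "wallet": "test20011@operatorqa2.akeodev.com"
          }
        },
        {
          "match": {
            "email": "test20011@operatorqa2.akeodev.com"
          }
        }
      ]
    }
  }
}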
I can recommend you the following blog post from Elastic related to that topic: Strings are dead, long live strings!
As I don't have the mapping for your index (you can get it with the mapping API), I am assuming you are using the ES defaults, in which case the wallet and email fields would be defined as text with the default analyzer, which is the standard analyzer.
This analyzer doesn't recognize the text as an email address and creates three tokens for test50011@operatorqa2.akeodev.com, which you can check using the analyze API:
http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=standard
{
"tokens": [
{
"token": "test50011",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "operatorqa2",
"start_offset": 10,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "akeodev.com",
"start_offset": 22,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 3
}
]
}
What you need here is a custom analyzer for emails using the UAX URL email tokenizer (uax_url_email), which is designed for email and URL fields. This generates a single, proper token for test50011@operatorqa2.akeodev.com, as shown below:
http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=uax_url_email
{
"tokens": [
{
"token": "test50011#operatorqa2.akeodev.com",
"start_offset": 0,
"end_offset": 33,
"type": "<EMAIL>",
"position": 1
}
]
}
As you can see, it does not split test50011@operatorqa2.akeodev.com, so when you search with your same query the same single token is generated, and ES matches token to token.
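A sketch of how that setup could look (the index name, analyzer name and ES 7+ mapping syntax here are assumptions, not taken from the question):
PUT wallet
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": { "type": "text", "analyzer": "email_analyzer" },
      "wallet": { "type": "text", "analyzer": "email_analyzer" }
    }
  }
}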
Let me know if you need any help; it's very simple to set up and use.

Elasticsearch - multi_match together with short queries

I have a query like this (I've removed the sorting part because it doesn't matter):
GET _search
{
"query": {
"multi_match": {
"query": "somethi",
"fields": [ "title", "content"],
"fuzziness" : "AUTO",
"prefix_length" : 0
}
}
}
When running this I'm getting results like this:
"hits": [
{
"_index": "test_index",
"_type": "article",
"_id": "2",
"_score": 0.083934024,
"_source": {
"title": "Matching something abc",
"content": "This is a piece of content",
"categories": [
{
"name": "B",
"weight": 4
}
]
},
"sort": [
4,
0.083934024,
"article#2"
]
},
{
"_index": "test_index",
"_type": "article",
"_id": "3",
"_score": 0.18436861,
"_source": {
"title": "Matching something abc",
"content": "This is a piece of content containing something",
"categories": [
{
"name": "C",
"weight": 3
}
]
},
"sort": [
3,
0.18436861,
"article#3"
]
},
...
So far, no problem getting what is expected. However, I noticed that if I remove one letter from the query, leaving someth, Elasticsearch won't return any results.
This is quite strange to me. It seems multi_match does a partial match but somehow requires a minimum number of characters. Similarly, if I put for example omethin in the query I get results, but with only omethi I get none.
Is there a setting for the minimum number of characters in a query, or would I need to rewrite my query to achieve what I want? I would like to run a match on multiple fields (in the query above, the title and content fields) that allows partial matching together with fuzziness.
You get this behaviour because you have the "fuzziness": "AUTO" parameter set, which means that for a term with more than 5 characters a maximum of two edits is allowed. Generally, the fuzziness parameter tells Elasticsearch to find all terms within a maximum of two changes, where a change is the insertion, deletion or substitution of a single character; more than two changes are never possible. That is why somethi still matches something (two insertions), while someth does not (it would need three).
If you need to be able to search with partial matching, you could configure your index with an edge n-gram analyzer and apply it to your title and content fields. You can easily test how it works:
Create an index with the following settings:
PUT http://127.0.0.1:9200/test
{
"settings": {
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
And run this query:
curl -X POST \
'http://127.0.0.1:9200/test/_analyze?pretty=true' \
-d '{
"analyzer" : "edge_ngram_analyzer",
"text" : ["something"]
}'
As a result you'll get:
{
"tokens": [
{
"token": "so",
...
},
{
"token": "som",
...
},
{
"token": "some",
...
},
{
"token": "somet",
...
},
{
"token": "someth",
...
},
{
"token": "somethi",
...
},
{
"token": "somethin",
...
},
{
"token": "something",
...
}
]
}
These are the tokens the edge_ngram_analyzer produces. With min_gram and max_gram you can configure the minimum/maximum number of characters in a gram.
If you need to handle cases like omething (a missing letter at the beginning), try the same approach with the ngram tokenizer instead of edge_ngram.
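For the search itself, the analyzer still has to be attached to the fields in the mapping; a sketch of the mappings block to add alongside the settings above (field names taken from your query; the search_analyzer line is an extra assumption so the query text itself is not split into n-grams):
"mappings": {
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "standard"
    },
    "content": {
      "type": "text",
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "standard"
    }
  }
}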

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
This returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, both the cat and the dog get the same relevance score. I want the prefix of a word (and maybe a few other words inside that field) to be taken into consideration while still using cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get the cat (ID 1), but I did not.
I found that cross_fields handles multiple complete words, but not multiple incomplete words. And phrase_prefix handles an incomplete word, but not multiple incomplete words...
Sifting through the documentation really isn't helping me discover how to combine these two.
Yeah, I had to apply an analyzer...
The analyzer is applied to the fields when creating the index before you add any data. I couldn't find an easier way to do this after you add the data.
The solution I found is exploding every word into its individual prefixes so cross_fields can do its magic. You can learn more about the use of edge_ngram here.
So instead of cross_fields just searching for the word cats, it now searches c, ca, cat and cats, and likewise for every other word. The text1 field will look like this to Elasticsearch: c ca cat cats m me meo meow.
~~~
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more about what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
(I show text1 here; change the field name to whichever field you are applying it to.)
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two documents when you search cat toy, whereas before the scores were the same. Now the cat hit has a higher score, as it should. This is achieved by indexing every prefix (up to 20 characters per word in this case) of each word and then letting cross_fields score how relevant the data is.
