Disable Fuzziness in Elasticsearch MatchQuery - elasticsearch

Our team wants to query a referenceId in our Elasticsearch indices. We want to find the hit with referenceId that exactly matches our input.
We can't use TermQuery as this ID is stored as text. So we ended up using MatchQuery.
Here's the code for our ElasticSearchHelper:
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
.query(QueryBuilders.matchQuery(key, val).fuzziness(Fuzziness.ZERO))
.timeout(TIMEOUT_SECONDS);
SearchRequest searchRequest = new SearchRequest().indices(index).source(searchSourceBuilder);
return restHighLevelClient.search(searchRequest);
Although we have set fuzziness to zero, we are still getting fuzzy hits.
Here's the search query: referenceId: 106-0638778-542266
And here are the search hits:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 29.930355,
"hits": [
{
"_index": "XXX_V1",
"_type": "_doc",
"_id": "21-9689252-9991524",
"_score": 29.930355,
"_source": {
"id": "21-9689252-9991524",
"referenceId": "106-0638778-5422664",
},
{
"_index": "XXX_V1",
"_type": "_doc",
"_id": "21-3424596-5516719",
"_score": 19.949657,
"_source": {
"id": "P21-3424596-5516719",
"referenceId": "106-0638778-5422661",
},
{...}
}]
Note that all these hits have a different referenceId than 106-0638778-542266.
How do I disable fuzziness and get only the hit that matches exactly? I would really appreciate help.
Thanks!

The standard tokenizer will break the ID into numeric tokens as follows:
{
"tokens" : [
{
"token" : "106",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "0638778",
"start_offset" : 4,
"end_offset" : 11,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "54226",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 2
}
]
}
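A match query analyzes the input the same way and, by default, combines the resulting tokens with OR, so any document that shares at least one token (here 106 and 0638778) is returned as a hit. Fuzziness does not come into it.
I would suggest using a match phrase query instead, which requires all tokens to appear in the same order.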
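The behaviour described above can be illustrated with a minimal Python sketch. It makes some assumptions: `tokenize` is a simplified stand-in for the standard tokenizer (it only splits on non-alphanumeric characters), and `or_match`, `phrase_match` and `match_phrase_body` are hypothetical helpers that approximate the default OR semantics of match and the in-order semantics of match_phrase. It is not how Elasticsearch is implemented.

```python
import re

def tokenize(text):
    # Simplified stand-in for the standard tokenizer: split on anything
    # that is not a letter or digit. Good enough for hyphenated IDs.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def or_match(query, field_value):
    # Approximates match with the default OR operator:
    # a single shared token is enough for a hit.
    return bool(set(tokenize(query)) & set(tokenize(field_value)))

def phrase_match(query, field_value):
    # Approximates match_phrase: all query tokens must appear
    # consecutively and in the same order.
    q, f = tokenize(query), tokenize(field_value)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

def match_phrase_body(field, value):
    # Request body for a match_phrase query on the given field.
    return {"query": {"match_phrase": {field: value}}}

query = "106-0638778-542266"
stored = "106-0638778-5422664"
print(or_match(query, stored))      # True: shares the tokens 106 and 0638778
print(phrase_match(query, stored))  # False: the last token differs
print(match_phrase_body("referenceId", query))
```

The body returned by `match_phrase_body` could be sent as the search request in place of the match query.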

Related

where does the value for the _doc come from when sort in elastic search

I am using 7.10.1, and I have indexed the following data:
PUT /lib15/_doc/1
{
"price":32
}
PUT /lib15/_doc/2
{
"price":21
}
PUT /lib15/_doc/3
{
"price":48
}
PUT /lib15/_doc/4
{
"price":40
}
PUT /lib15/_doc/5
{
"price":42
}
Then I run the following query:
GET /lib15/_search
{
"size": 2,
"query": {
"match_all": {}
},
"sort": [
{
"price": "desc"
},
{
"_doc": "desc"
}
]
}
The result is:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "lib15",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"price" : 48
},
"sort" : [
48,
2
]
},
{
"_index" : "lib15",
"_type" : "_doc",
"_id" : "5",
"_score" : null,
"_source" : {
"price" : 42
},
"sort" : [
42,
4
]
}
]
}
}
Where do the 2 and 4 in "sort":[48,2] and "sort":[42,4] come from? Are they the _doc values? They are not equal to the _id.
As mentioned in the official documentation of sort, _doc is used to sort by index order.
This means the document with price 48 was the third one indexed (position 2, counting from zero) and the document with price 42 was the fifth (position 4).
Update: I indexed the documents in the same order you provided and got the same result. In the example, price:48 is indexed third and price:42 fifth, but when you fetch these documents by ID with the GET API, the response shows _seq_no values of 2 and 4, as shown below:
GET http://localhost:9900/lib15/_doc/3
{
"_index": "lib15",
"_type": "_doc",
"_id": "3",
"_version": 1,
"_seq_no": 2, // note for id 3, seq_no is 2
"_primary_term": 1,
"found": true,
"_source": {
"price": 48
}
}
And GET http://localhost:9900/lib15/_doc/5
{
"_index": "lib15",
"_type": "_doc",
"_id": "5",
"_version": 1,
"_seq_no": 4, // note for id 5, seq_no is 4
"_primary_term": 1,
"found": true,
"_source": {
"price": 42
}
}
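The ordering can be reproduced with a small Python sketch. It assumes the _doc sort value equals the zero-based position at which a document was indexed. That holds for this small example, but it is not guaranteed in general (for instance with several shards or segments). `top_hits` is a hypothetical helper, not part of any Elasticsearch client.

```python
# Documents in the order they were indexed: (_id, price).
indexed = [("1", 32), ("2", 21), ("3", 48), ("4", 40), ("5", 42)]

def top_hits(docs, size):
    # Pair every document with its zero-based index position, which stands
    # in for the _doc sort value under the assumption stated above.
    rows = [(doc_id, price, pos) for pos, (doc_id, price) in enumerate(docs)]
    # Sort by price descending, then by position descending, as in the query.
    rows.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return [{"_id": doc_id, "sort": [price, pos]} for doc_id, price, pos in rows[:size]]

for hit in top_hits(indexed, 2):
    print(hit)
# {'_id': '3', 'sort': [48, 2]}
# {'_id': '5', 'sort': [42, 4]}
```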

Elasticsearch: Count terms in document

I'm fairly new to Elasticsearch and use version 6.5. My database contains website pages and their content, like this:
Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.
I have been able to perform a simple query that returns all documents that contain the word "cars" in their content (using Python):
current_app.elasticsearch.search(index=index, doc_type=index,
body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}},
"from": 0, "size": 100})
Result looks something like this:
{'took': 2521,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 29, 'max_score': 3.0240571, 'hits': [{'_index':
'pages', '_type': 'pages', '_id': '17277', '_score': 3.0240571,
'_source': {'content': '....'}}]}}
The "_id"s are referring to a domain, so I basically get back:
abc.com
def.com
jkl.com
But I now want to know how often the searchterm ("cars") is present in each document, like:
abc.com: 2
def.com: 1
jkl.com: 2
I found several solutions for obtaining the number of documents that contain the search term, but none that show how to get the number of occurrences of a term within a document. I also couldn't find anything in the official documentation, although I'm pretty sure it is in there somewhere and I'm just not realising that it is the solution to my problem.
Update:
As suggested by @Curious_MInd I tried a terms aggregation:
current_app.elasticsearch.search(index=index, doc_type=index,
body={"aggs" : {"cars_count" : {"terms" : { "field" : "Content"
}}}})
Result:
{'took': 729, 'timed_out': False, '_shards': {'total': 5, 'successful':
5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 48, 'max_score': 1.0,
'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
'_score': 1.0, '_source': {'content': '...'}}]}, 'aggregations':
{'skala_count': {'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0, 'buckets': []}}}
I don't see where it would display the counts per document here, but I'm assuming that's because "buckets" is empty? On another note: The results found by term aggregation are significantly worse than those with multi_match query. Is there any way to combine those?
What you are trying to achieve can't be done in a single query. The first query filters and returns the document IDs for which the term counts are required.
Let's assume you have the following mapping:
{
"test": {
"mappings": {
"_doc": {
"properties": {
"details": {
"type": "text",
"store": true,
"term_vector": "with_positions_offsets_payloads"
},
"name": {
"type": "keyword"
}
}
}
}
}
}
Assume your query returns the following two docs:
{
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"details": "There is some content about cars here. Lots of cars!",
"name": "n1"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"details": "This page is all about cars",
"name": "n2"
}
}
]
}
}
From the above response you can get all the document IDs that matched your query. Here we have "_id": "1" and "_id": "2".
Now we use the _mtermvectors API to get the frequency (count) of each term in a given field:
POST test/_doc/_mtermvectors
{
"docs": [
{
"_id": "1",
"fields": [
"details"
]
},
{
"_id": "2",
"fields": [
"details"
]
}
]
}
The above returns the following result:
{
"docs": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_version": 1,
"found": true,
"took": 8,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 2,
"tokens": [
{
"position": 5,
"start_offset": 28,
"end_offset": 32
},
{
"position": 9,
"start_offset": 47,
"end_offset": 51
}
]
},
....
}
}
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_version": 1,
"found": true,
"took": 2,
"term_vectors": {
"details": {
"field_statistics": {
"sum_doc_freq": 15,
"doc_count": 2,
"sum_ttf": 16
},
"terms": {
....
,
"cars": {
"term_freq": 1,
"tokens": [
{
"position": 5,
"start_offset": 23,
"end_offset": 27
}
]
},
....
}
}
}
]
}
Note that I have used .... to denote the data for other terms in the field, since the term vectors API returns details for all the terms.
You can extract the info about the required term from the above response. Here I have shown it for cars, and the field you are interested in is term_freq.
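Pulling term_freq out of the response could look like the following Python sketch. `term_frequencies` is a hypothetical helper, and the `response` dict is a trimmed copy of the _mtermvectors result shown above, keeping only the parts the helper reads.

```python
def term_frequencies(mtermvectors_response, field, term):
    # Map each document _id to how often `term` occurs in `field`.
    # Documents that do not contain the term get a count of 0.
    counts = {}
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts

# Trimmed version of the response above.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}

print(term_frequencies(response, "details", "cars"))  # {'1': 2, '2': 1}
```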
I guess you need a terms aggregation here, like below:
GET /_search
{
"aggs" : {
"cars_count" : {
"terms" : { "field" : "Content" }
}
}
}

search data in elastic search based on some fields

I am new to Elasticsearch and want to search this data based on "type:": "load".
Please help.
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1109,
"max_score": 1,
"hits": [
{"_index": "4",
"_type": "aa",
"_id": "xx",
"_score": 1,
"_source": {
"useRange": false,
"Blueprint": 4,
"standardDeviation": 0,
"occurrences": 0,
"type:": "load",
}...
{
}
Elasticsearch Documentation will help you:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html
EDIT
The query is curl -XGET 'localhost:9200/sample/_search?q=type:load&pretty'
and the output will be:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.30685282,
"hits" : [ {
"_index" : "sample",
"_type" : "data",
"_id" : "1",
"_score" : 0.30685282,
"_source" : {
"useRange" : false,
"Blueprint" : 4,
"standardDeviation" : 0,
"occurrences" : 0,
"type" : "load"
}
} ]
}
}
The issue was with the field name 'type'. We changed the name to typemetrics and the query below works.
I think type might be acting as a keyword.
GET /4/_search
{
"query": {
"term" : { "typemetrics" : "load"}
}
}

MLT (More Like This) elasticsearch query

I'm trying to use elasticsearch MLT (More Like This) query.
Only one doc in store:
{
"_index": "monitors",
"_type": "monitor",
"_id": "AVTnvJ8SancUpEdFLMiq",
"_score": 1,
"_source": {
"ProcessGroup": "test",
"ProcessName": "test",
"OpName": "test",
"Domain": "test",
"LogLevel": "Info",
"StartDateTime": "2016-05-04 04:46:47",
"EndDateTime": "2016-05-04 04:47:47",
"MessageDateTime": "2016-05-04 04:46:47",
"ApplicationCode": "test",
"Status": "10",
}
}
Query:
POST /_search
{
"query": {
"more_like_this" : {
"fields" : ["ProcessName"],
"like" : "test",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
ProcessName is a not_analyzed field.
I expected to get this document as a response, but instead I got nothing:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Why is that?
Another question:
Suppose I have search engine docs, and I search for "stph". I expect to get a "Stephan Curry" suggestion because it's commonly searched. Fuzzy search doesn't fit because the edit distance is greater than 2, so is an MLT query a good option for this scenario?

Find actual matching word when using fuzzy query in elastic search

I am new to Elasticsearch and have been looking into fuzzy query search.
I have created a new index, products, with records like this:
{
"_index": "products",
"_type": "product",
"_id": "10",
"_score": 1,
"_source": {
"value": [
"Ipad",
"Apple",
"Air",
"32 GB"
]
}
}
Now when I perform a fuzzy query search in Elasticsearch like
{
"query": {
"fuzzy": {
"value": "tpad"
}
}
}
It returns the correct record (the product created above), as expected.
I know that the term tpad matches ipad, which is why the record was returned.
But technically, how would I know that it matched ipad? Elasticsearch just returns the full record (or records) like this:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.61489093,
"hits": [
{
"_index": "products",
"_type": "product",
"_id": "10",
"_score": 0.61489093,
"_source": {
"value": [
"Ipad",
"Apple",
"Air",
"32 GB"
]
}
}
]
}
}
Is there any way in Elasticsearch to find out that it matched tpad against ipad?
If you use highlighting, Elasticsearch will show the terms that matched:
curl -XGET http://localhost:9200/products/product/_search?pretty -d '{
"query" : {
"fuzzy" : {
"value" : "tpad"
}
},
"highlight": {
"fields" : {
"value" : {}
}
}
}'
Elasticsearch will return matching documents with the fragment highlighted:
{
"took" : 31,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.13424811,
"hits" : [ {
"_index" : "products",
"_type" : "product",
"_id" : "10",
"_score" : 0.13424811,
"_source":{
"value" : ["Ipad",
"Apple",
"Air",
"32 GB"
]
},
"highlight" : {
"value" : [ "<em>Ipad</em>" ]
}
} ]
}
}
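To do this programmatically, the matched terms can be read from the highlight fragments. The Python sketch below uses a hypothetical helper, `highlighted_terms`, and assumes the default <em> and </em> highlight tags seen in the response above. The `response` dict is a trimmed copy of that response.

```python
import re

def highlighted_terms(search_response, field, pre_tag="<em>", post_tag="</em>"):
    # Collect, per hit, the text wrapped in the highlight tags for `field`.
    pattern = re.compile(re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag))
    matched = {}
    for hit in search_response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        matched[hit["_id"]] = [m for frag in fragments for m in pattern.findall(frag)]
    return matched

# Trimmed version of the response above.
response = {
    "hits": {
        "hits": [
            {
                "_id": "10",
                "_source": {"value": ["Ipad", "Apple", "Air", "32 GB"]},
                "highlight": {"value": ["<em>Ipad</em>"]},
            }
        ]
    }
}

print(highlighted_terms(response, "value"))  # {'10': ['Ipad']}
```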
If you just want to analyze the result, you could use the Inquisitor plugin.
If you need to do this programmatically, I think the highlighting feature will help you:
Determining which words were matched in a fuzzy search
I know the question is older but I just ran into it. The way I do it is by setting the query name (the _name parameter) when building the query. The name then comes back in the matched queries field of each hit (matched_queries in the REST response). Hope this helps :)
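A minimal Python sketch of the named-query approach follows. `named_fuzzy_query` and `matched_names` are hypothetical helpers, the query name "fuzzy_value" is made up for illustration, and `sample_hit` is an invented hit showing only the shape I would expect. Note that the name tells you which query clause matched, not which indexed term it matched.

```python
def named_fuzzy_query(field, value, name):
    # Fuzzy query body with a _name, so matching hits report that name.
    return {"query": {"fuzzy": {field: {"value": value, "_name": name}}}}

def matched_names(hit):
    # Names of the queries that matched this hit (empty if none reported).
    return hit.get("matched_queries", [])

body = named_fuzzy_query("value", "tpad", "fuzzy_value")
print(body)

# Invented hit, for illustration only.
sample_hit = {"_id": "10", "matched_queries": ["fuzzy_value"]}
print(matched_names(sample_hit))  # ['fuzzy_value']
```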
