Why do all my Elasticsearch more-like-this hits have a score of zero?

I have a big feed of news articles that I'm indexing. I'd like to avoid indexing a lot of articles that are nearly the same (for example, articles from a news service might appear many times with slightly different date formats).
So I thought I'd do a more-like-this query with each article. If I get back a hit with a score > some cutoff, then I figure the article is already indexed, and I don't bother with it.
But when I run my more-like-this query, all the hits I get come back with a score of zero. I can't tell if that's expected, if I'm doing something wrong, or if I've discovered a bug.
My query looks like:
POST _search
{"query":
{"bool":
{"filter": [
{"more_like_this":
{"fields": ["text"],
"like": "Doctor Sentenced In $3.1M Health Care Fraud Scheme Justice Department Documents & Publications \nGreenbelt, Maryland - U.S. District Judge Deborah K. Chasanow sentenced physician [snip]"
}
}
]
}
}
}
And the results I get back are:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 390,
"max_score": 0,
"hits": [
[snip]

The reason is that your MLT query sits inside a filter clause. Filter clauses do not contribute to scoring, so every hit comes back with a score of zero. Put your MLT query inside a must or should clause and you will get back scores.

I was facing a similar issue today: my more_like_this query was not returning any results, because I was using non-default routing and not passing _routing.
My query looks like the one below. I had to search for articles in the default_11 index using the document fields keywords and contents.
GET localhost:9200/alias_default/articles/_search
{
"query": {
"more_like_this": {
"fields": [
"keywords",
"contents"
],
"like": {
"_index": "default_11",
"_type": "articles",
"_routing": "6",
"_id": "1000000000006000000000000000014"
},
"min_word_length": 2,
"min_term_freq": 2
}
}
}
Remember to pass the _routing parameter.
This issue typically occurs when documents are indexed with non-default routing.
See: ElasticSearch returns document in search but not in GET
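To see why a missing routing value makes the lookup miss, here is a rough sketch of routing-based shard selection. The function name, the default of 5 primary shards, and the use of zlib.crc32 are all assumptions made for illustration; Elasticsearch uses its own hash function internally, so the shard numbers will not match a real cluster.

```python
import zlib


def shard_for(doc_id, routing=None, num_primary_shards=5):
    """Pick a shard from the routing value, falling back to the document id.

    zlib.crc32 is a stand-in hash for illustration only; it is NOT the
    hash function Elasticsearch uses.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    doc_id = "1000000000006000000000000000014"
    # Indexed with routing "6": the shard depends on "6", not on the id.
    print("indexed on shard:", shard_for(doc_id, routing="6"))
    # Looked up without routing: the id is hashed instead, which may
    # point at a different shard, so the document is not found there.
    print("looked up on shard:", shard_for(doc_id))
```

When the two printed shard numbers differ, the like document cannot be fetched without its routing value, which matches the symptom described above.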

You get a zero score because the filter part of a bool query is not included in the score calculation. It is used only to filter results. Use a must clause instead to get a score.
POST _search
{"query":
{"bool":
{"must": [
{"more_like_this":
{"fields": ["text"],
"like": "Doctor Sentenced In $3.1M Health Care Fraud Scheme Justice Department Documents & Publications \nGreenbelt, Maryland - U.S. District Judge Deborah K. Chasanow sentenced physician [snip]"
}
}
]
}
}
}
For more information, see the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
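Putting the pieces together, the duplicate check from the question could be sketched as follows. This is a minimal sketch, not a tested integration: the helper names are hypothetical, the cutoff has to be tuned on your own data, and sending the request is left to whichever client you use.

```python
import json


def build_mlt_query(text, fields=("text",)):
    """Build a bool query that places more_like_this under `must` so hits are scored."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"more_like_this": {"fields": list(fields), "like": text}}
                ]
            }
        }
    }


def looks_already_indexed(response, cutoff):
    """Return True when the best hit in a search response scores above `cutoff`."""
    max_score = response.get("hits", {}).get("max_score")
    # max_score is null (None in Python) when nothing matched.
    return max_score is not None and max_score > cutoff


if __name__ == "__main__":
    body = build_mlt_query("text of the incoming article")
    print(json.dumps(body, indent=2))
```

With the original filter-based query, looks_already_indexed would always return False for any positive cutoff, because max_score is 0.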

Related

Get All records From Elastic Search without size

My query is something like this :
{ "from": 0, "size": 100,"track_total_hits": true, "query": {"bool": {"filter": [{
"bool": {
"must_not": {
"exists": {
"field": "deleted_at"
}
}
}}]}}, "sort": [{ "added_at" : {"order" : "desc"}}]}
Now if I don't specify size, it returns only 10 records, and I don't know how many records there are. Is there a way to retrieve all the data at once, or at least get the count?
hits.total.value gives you the total number of documents matching the search query.
By default, track_total_hits counts accurately only up to 10,000. If more than 10,000 documents match, the relation changes from eq to gte. Setting track_total_hits to true, as in your query, makes the count exact.
Refer to this official documentation, to know more about hits.total
"hits": {
"total": {
"value": 11, // note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
Elasticsearch limits the number of returned documents with the size option. If you want to fetch all the data,
you should paginate with the scroll API.
You can follow this link
What is the version of ES you are using? With ES 7.x count of matching documents will be present in the response under hits -> total -> value
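Since the shape of hits.total depends on the version, a small helper can read it either way. This is a sketch under the assumption that the response has already been parsed into a dict; the function name is made up for illustration.

```python
def total_hits(response):
    """Return (value, relation) from a parsed search response.

    Handles the object form ({"value": ..., "relation": ...}) shown above
    as well as the plain integer form returned by older versions.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return total["value"], total.get("relation", "eq")
    # Older responses carry a bare integer, which is always exact.
    return total, "eq"


if __name__ == "__main__":
    value, relation = total_hits({"hits": {"total": {"value": 11, "relation": "eq"}}})
    print(value, relation)
```

A relation of gte means the value is only a lower bound, so the caller should not treat it as the exact count.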

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently from the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
"settings": {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"doc": {
"properties": {
"name": { "type": "keyword" },
"description": { "type": "text" }
}
}
}
}
Two sample documents:
POST /sample/doc
{
"name": "Jack Beauregard",
"description": "An aging hero"
}
POST /sample/doc
{
"name": "Master Splinter",
"description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
"query": {
"match": { "description": "hero" }
},
"_source": false
}
... which gives me:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.22396864,
"hits": [
{
"_index": "sample",
"_type": "doc",
"_id": "hoDsm2oB22SyyA49oDe_",
"_score": 0.22396864
},
{
"_index": "sample",
"_type": "doc",
"_id": "h4Dsm2oB22SyyA49xDf8",
"_score": 0.22227617
}
]
}
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of these counts), which would be 3 in this example, because the second document contains the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be an information ES comes across during search anyway, right? I mean it ranks the documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights. Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results to do this by setting size to a high enough value, when I actually only want the number of results requested by the client. This would be a lot of overhead!
The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it persists on inner hit level and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
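As a rough illustration of that idea, the explanation tree returned for a hit can be walked to add up the term-frequency nodes. This sketch assumes the tree has the usual value / description / details shape, and that term-frequency nodes can be recognised by a description starting with "termFreq=" or "freq,"; those markers differ between versions, so check them against your own explain output first.

```python
def sum_term_freqs(explanation):
    """Recursively add up term-frequency nodes in a parsed explain tree."""
    description = explanation.get("description", "")
    # Assumed markers for a term-frequency node; verify against real output.
    if description.startswith("termFreq=") or description.startswith("freq,"):
        return explanation.get("value", 0)
    children = explanation.get("details") or []
    return sum(sum_term_freqs(child) for child in children)


def total_matches(hits):
    """Sum term frequencies over hits that carry an `_explanation` entry."""
    return sum(sum_term_freqs(hit["_explanation"]) for hit in hits if "_explanation" in hit)
```

Hits only carry an _explanation when the search request asks for it with "explain": true, and the same performance caveat applies: this is for inspection, not for production traffic.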

Unexpected Match query scoring on a FirstMiddleLast field

I am using a match query to search a fullName field which contains names in (first [middle] last) format. I have two documents, one with "Brady Holt" as the fullName and the other as "Brad von Holdt". When I search for "brady holt", the document with "Brad von Holdt" is scored higher than the document with "Brady Holt" even though it is an exact match. I would expect the document with "Brady Holt" to have the highest score. I am guessing it has something to do with the 'von' middle name causing the score to be higher?
These are my documents:
[
{
"id": 509631,
"fullName": "Brad von Holdt"
},
{
"id": 55425,
"fullName": "Brady Holt"
}
]
This is my query:
{
"query": {
"match": {
"fullName": {
"query": "brady holt",
"fuzziness": 1.0,
"prefix_length": 3,
"operator": "and"
}
}
}
}
This is the query result:
"hits": [
{
"_index": "demo",
"_type": "person",
"_id": "509631",
"_score": 2.4942014,
"_source": {
"id": 509631,
"fullName": "Brad von Holdt"
}
},
{
"_index": "demo",
"_type": "person",
"_id": "55425",
"_score": 2.1395948,
"_source": {
"id": 55425,
"fullName": "Brady Holt"
}
}
]
A good read on how Elasticsearch does scoring, and how to manipulate relevancy, can be found in the Elasticsearch Guide: What is Relevance?. In particular, you may want to experiment with the explain functionality of a search query.
The shortest answer for you here is that the score of a hit is the product of its best-matching term according to a TF/IDF calculation. The number of matching terms will affect which documents are matched, but it's the "best" term that determines a document's score. Your query doesn't have an "exact" match, per se: it has multiple matching terms, the scores of which are calculated independently.
Tuning relevancy can be a bit of a subtle art, and depends a lot on how the fields are being analyzed, the overall frequency distributions of various terms, the queries you're running, and even how you're sharding and distributing the index within a cluster (different shards will have different term frequencies).
(It may also be relevant, so to speak, that your example has two spellings of "Holt" and "Holdt".)
In any case, getting familiar with explain functionality and the underlying scoring mechanics is a helpful next step for you here.
Also, if you want an exact phrase match, you should read the ES guide on Phrase Matching.
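One way to combine the two ideas is to keep the fuzzy match as the required clause and add a phrase match as an optional clause that lifts exact matches. This is a sketch of a common pattern, not something stated in the answer above; the function name and the boost value of 2.0 are arbitrary assumptions to tune against your own data.

```python
import json


def name_query(name, field="fullName", phrase_boost=2.0):
    """Fuzzy match that must hold, plus an optional phrase match that boosts exact hits."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            field: {
                                "query": name,
                                "fuzziness": 1,
                                "prefix_length": 3,
                                "operator": "and",
                            }
                        }
                    }
                ],
                # Documents containing the exact phrase get extra score.
                "should": [
                    {"match_phrase": {field: {"query": name, "boost": phrase_boost}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(name_query("brady holt"), indent=2))
```

Running the result with explain enabled is still the best way to confirm that the phrase clause is what moves "Brady Holt" to the top.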

Is it possible to eliminate "empty" facets with Elastic Search?

I've finally managed to get Elastic Search indexing to work the way I want it to work, indexing the raw values of certain fields using subfields and not_analyzed. The facets are what I expect, however, in some cases, due to the source data having null/empty values for those fields, I get results like this in the facets section:
"things": {
"_type": "terms",
"missing": 187,
"total": 12214,
"other": 10608,
"terms": [
{
"term": "foo",
"count": 912
},
{
"term": "",
"count": 532
},
{
"term": "bar",
"count": 37
}
]
}
Note the "" in the second item. I can see why ElasticSearch wouldn't automatically exclude this, as one might want to know how many documents don't have the field. But for my purposes I'd like to just not have this returned.
Is there some way that I can configure ElasticSearch to ignore these, either in the indexing or in the query?
Try putting
"exclude" : ""
in your aggregation terms
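If changing the request is not an option, the empty entry can also be dropped after the response comes back. This is a client-side sketch that assumes the facet has been parsed into a dict shaped like the one in the question; the function name is hypothetical.

```python
def drop_empty_terms(facet):
    """Return a copy of a terms facet without entries whose term is the empty string."""
    cleaned = dict(facet)
    cleaned["terms"] = [
        entry for entry in facet.get("terms", []) if entry.get("term") != ""
    ]
    return cleaned


if __name__ == "__main__":
    things = {
        "_type": "terms",
        "missing": 187,
        "total": 12214,
        "other": 10608,
        "terms": [
            {"term": "foo", "count": 912},
            {"term": "", "count": 532},
            {"term": "bar", "count": 37},
        ],
    }
    print(drop_empty_terms(things)["terms"])
```

Note that the total and other counters are left untouched, so they still include the documents with an empty value.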

Can I use ElasticSearch Facets as an equivalent to GROUP BY and how?

I'm wondering if I can use the ElasticSearch Facets feature to replace the GROUP BY feature used in relational databases, or even in a Sphinx client?
If so, beside the official documentation, can someone point out a good tutorial to do so?
EDIT :
Let's consider an SQL table products in which I have the following fields :
id
title
description
price
etc.
I omitted the other fields in the table because I don't want to put them into my ES index.
I've indexed my database with ElasticSearch.
A product is not unique in the index. We can have the same product with different price offers and I wish to group them by price range.
Facets give you the number of documents in which a particular term is present for a particular field...
Now let's suppose you have an index named tweets, with type tweet and field "name"...
A facet query for the field "name" would be:
curl -XPOST "http://localhost:9200/tweets/tweet/_search?search_type=count" -d'
{
"facets": {
"name": {
"terms": {
"field": "name"
}
}
}
}'
Now the response you get is as below
"hits": {
"total": 3475368,
"max_score": 0,
"hits": []
},
"facets": {
"name": {
"_type": "terms",
"total": 3539206,
"other": 3460406,
"terms": [
{
"term": "brickeyee",
"count": 9205
},
{
"term": "ken_adrian",
"count": 9160
},
{
"term": "rhizo_1",
"count": 9143
},
{
"term": "purpleinopp",
"count": 8747
}
....
....
This is called a terms facet, as it is a term-based count... There are other facets as well, which can be seen here
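For the price-range grouping asked about in the question, a range-style grouping is closer to what is needed than a terms count. The sketch below builds a request body using the range aggregation from the aggregations framework that replaced facets in later Elasticsearch versions; the function name, the aggregation name price_ranges, and the boundary values are assumptions for illustration.

```python
import json


def price_range_aggregation(boundaries, field="price"):
    """Build a range aggregation with one bucket per interval between boundaries."""
    if not boundaries:
        raise ValueError("at least one boundary is required")
    edges = sorted(boundaries)
    # Open-ended bucket below the first boundary.
    ranges = [{"to": edges[0]}]
    # One bucket for each pair of neighbouring boundaries.
    ranges += [{"from": low, "to": high} for low, high in zip(edges, edges[1:])]
    # Open-ended bucket above the last boundary.
    ranges.append({"from": edges[-1]})
    return {
        "size": 0,  # only the bucket counts are wanted, not the documents
        "aggs": {"price_ranges": {"range": {"field": field, "ranges": ranges}}},
    }


if __name__ == "__main__":
    print(json.dumps(price_range_aggregation([50, 100]), indent=2))
```

Each bucket in the response then plays the role of one GROUP BY row, with the document count per price range.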
