Elasticsearch NEST: specifying Id explicitly seems to cause inconsistent search scores

I have a model class that looks like this:
public class MySearchDocument
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public int DBID { get; set; }
}
We always use bulk indexing. By default our searches do a relatively simple multi_match with more weight given to ID and Name, like this:
{
  "query": {
    "multi_match": {
      "query": "burger",
      "fields": [
        "ID^1.2",
        "Name^1.1",
        "Description"
      ],
      "auto_generate_synonyms_phrase_query": true
    }
  }
}
I was previously just relying on Id inference, allowing Elasticsearch to use my ID property for its _id purposes, but for a few reasons it has become preferable to use DBID as the Id property in Elasticsearch. I tried this three different ways, separately and in combination:
Explicitly when bulk indexing: new BulkIndexOperation<MySearchDocument>(d) { Id = d.DBID }
In the ConnectionSettings using DefaultMappingFor<MySearchDocument>(d => d.IdProperty(p => p.DBID))
Using an attribute on MySearchDocument: [ElasticsearchType(IdProperty = nameof(DBID))]
Any and all of these seem to work as expected; the _id field in the indexed documents is being set to my DBID property. However, in my integration tests, search results are anything but expected. Specifically, I have a test that:
Creates a new index from scratch.
Populates it with a handful of MySearchDocuments
Issues a Refresh on the index just to make sure it's ready.
Issues a search.
Asserts that the results come back in the expected order.
With Id inference, this test consistently passes. When switching the Id field using any or all of the techniques above, it passes maybe half the time. Looking at the raw results, the correct documents are always returned, but the _score often varies for the same document from test run to test run. Sometimes the varying score is the one associated with the document whose ID field matches the search term, other times it's the score of a different document.
I've tried coding the test to run repeatedly and in parallel. I've tried waiting several seconds after issuing Refresh, just to be sure the index is ready. None of these make a difference - the test passes consistently with Id inference, and is consistently inconsistent without. I know nothing in this world is truly random, so I feel like I must be missing something here. Let me know if more details would be helpful. Thanks in advance.

Search relevancy scores are calculated per shard, and a hashing algorithm on the value of _id determines into which primary shard a given document will be indexed.
It sounds like you may be seeing the effects of this when indexing a small sample of documents across N > 1 primary shards; in this case, the local relevancy scores may be different enough to manifest in some odd looking _scores returned. With a larger set of documents and even distribution, differences in local shard scores diminish.
There are a couple of approaches that you can take to overcome this for testing purposes:
Use a single primary shard
or
Use dfs_query_then_fetch when making the search request. This tells Elasticsearch to first collect term and document frequencies from all shards in order to calculate global relevancy scores, and then use those to calculate _score. There is a slight overhead to using dfs_query_then_fetch. Both options are sketched below.
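For testing purposes, a minimal sketch of both options (the index name my_index is an assumption; the search body is the one from the question):

PUT my_index
{
  "settings": {
    "number_of_shards": 1
  }
}

GET my_index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "multi_match": {
      "query": "burger",
      "fields": ["ID^1.2", "Name^1.1", "Description"]
    }
  }
}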
Take a look also at the section "Relevance Is Broken!" from the Elasticsearch Definitive Guide; although the guide refers to Elasticsearch 2.x, much of it is still very much relevant for later versions.

Related

Type of field for prefix search in Elastic Search

I'm confused about which index type I should apply to my field for prefix search. Many examples suggest search_as_you_type, but I think autocomplete is not what I'm going for.
I have a UUID field:
id: 34y72ca1-3739-41ff-bbec-f6d17479384c
The following terms should return the doc above:
3
34
34y72ca1
34y72ca1-3739
34y72ca1-3739-41ff-bbec-f6d17479384c
Searching for 3739 should not return it, as the ID doesn't start with 3739. Partial search is what I was initially going for, but the wildcard field type is not supported by Amazon AWS, so I compromised on prefix search instead of partial search.
I tried the search_as_you_type field, but it doesn't return the result when I use the whole ID. Actually, my use case is that the results are shown when the user presses Enter, rather than live as they type, so if speed is compromised that's OK; I just hope for something that will perform well over many rows of data.
Thanks
If you have not explicitly defined any index mapping, then you need to use the id.keyword field instead of the id field for the prefix query to show the appropriate results. This uses the keyword analyzer instead of the standard analyzer:
{
  "query": {
    "prefix": {
      "id.keyword": {
        "value": "34y72ca1"
      }
    }
  }
}
Otherwise, you can modify your index mapping by adding a multi-field for the id field, as sketched below.
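A minimal sketch of such a mapping (the index name is an assumption; a text field with a keyword sub-field is also what dynamic mapping produces by default):

PUT my_index
{
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}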

Discover historical trends in Elasticsearch (not visual)

I have some experience with Elastic as log storage, but I'm stuck on basic trend recognition (where I need to compare found documents to each other) over time periods.
An easy query would answer the following question:
Find all occurrences of document runs (a run is defined by a growing/continuous @timestamp value) where a specific field (e.g. threads_count) is growing for a fixed count of documents or a fixed time period.
So if I have the thread_count of some application, logged every minute over a day with a timestamp, and I specify that I'm looking for a growing trend over 10 minutes, the result should return documents or document sets where thread_count was greater than the one from the document a minute before, for at least 10 documents.
It is a task very similar to looking at a line graph and identifying the growing parts by eye.
Maybe I'm just missing the proper function name to search for. I'm not interested in visualization; I would like to search for similar situations over the API and take the needed actions.
Any reference to documentation or a simple example is welcome!
Well, a script cannot be used across documents, so you will have to use the watch payload.
In your query, sort the results by date.
https://www.elastic.co/guide/en/elastic-stack-overview/6.3/how-watcher-works.html
A script in the payload transform could tell you whether a field is increasing - something like the following sketch, which walks the sorted hits and reports whether thread_count (the field name from your example) ever stops growing (I don't have access to an ES index right now):
"transform": {
"script": {
"source": "ctx.payload.transform = []; def current_score = -1;
def current = []; for (int j=0;j<ctx.payload.hits.hits;j++){
//check in the loop if current_score increasing using ctx.payload.hits.hits[j]._source.message], if not return "FALSE"
} ; return "TRUE",
"lang": "painless"
}
}
If you use Logstash to index your documents, take a look at the elapsed filter; it could be nice too: https://www.elastic.co/guide/en/logstash/current/plugins-filters-elapsed.html

Elasticsearch - query primary and secondary attribute with different terms

I'm using Elasticsearch to query data that was originally exported out of several relational databases with a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:
Example:
I have a document with the full name and street name of a user, and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the fullname field (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should field.
That means, I want to query for John Doe Back Street with the following sample data:
{
  "fullname" : "John Doe John and Jane",
  "street" : "Main Street"
}
{
  "fullname" : "John Doe",
  "street" : "Back Street"
}
Long story short, I want to query for a main attribute fullname - John Doe and a secondary attribute street - Back Street, and I want the second document to be the best match, not the first one, which contains John multiple times.
Manipulating relevance in Elasticsearch is not the easiest task. Score calculation is based on three main parts:
Term frequency
Inverse document frequency
Field-length norm
In short:
the more often the term occurs in the field, the MORE relevant it is
the more often the term occurs in the entire index, the LESS relevant it is
the longer the field is, the LESS relevant a single match in it is
You can inspect exactly how these parts combine for a concrete hit with the explain option, sketched below.
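explain is a standard search option that returns the full score breakdown (term frequency, inverse document frequency, field-length norm) for each hit; the index name my_index is an assumption, the field comes from your example:

GET my_index/_search
{
  "explain": true,
  "query": {
    "match": { "fullname": "john doe" }
  }
}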
I recommend reading the materials below:
What Is Relevance?
Theory Behind Relevance Scoring
Controlling Relevance and subpages
If, in general, a match on fullname is more important in your case than one on street, you can boost the importance of the former. Below is example code based on my working code:
{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}
In this example, the result from fullname is ten times (^10) more important than the result from street. You can try adjusting the boost or use other ways to control relevance, but as I mentioned at the beginning, it is not the easiest task and everything depends on your particular situation - mostly because of the "inverse document frequency" part, which considers terms from the entire index: each document added to the index will likely change the score of the same search query.
I know that I did not answer directly, but I hope I helped you understand how this works.

Elasticsearch projections onto new type

Is it possible to get a projection as a query result in elasticsearch?
For example:
I have 3 types in my index:
User { Id, Name, Groups[], Location { Lat, Lon } }
Group { Id, Name, Topics[] }
Message { Id, UserId, GroupId, Content}
And I want to get the number of messages and users in a group in a given area, so my input would be:
{ Lat, Lon, Distance, GroupId }
and the output would be:
Group { Id, Name, Topics, NumberOfUsers, NumberOfMessages }
where the actual output of the query is a combination of data returned by the query and aggregations within that data.
Is this possible?
There are no JOINs in Elasticsearch (except for parent-child, but those shouldn't be used for heavy joining either). With your current data model you'll only be able to do application-side JOINs, sketched below, and depending on your actual data that might be a lot of roundtrips. I don't think this will work out too well.
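As a sketch of the application-side approach (index and field names are assumed from your type outlines; group-42 and the coordinates are made up), you would first fetch the users of the group within the area, then issue a second request for their messages - every further step is another roundtrip:

GET users/_search
{
  "size": 1000,
  "_source": ["Id"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "Groups": "group-42" } },
        {
          "geo_distance": {
            "distance": "10km",
            "Location": { "lat": 52.52, "lon": 13.41 }
          }
        }
      ]
    }
  }
}

GET messages/_count
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "GroupId": "group-42" } },
        { "terms": { "UserId": ["user-1", "user-7"] } }
      ]
    }
  }
}

The user IDs in the second request have to come from the hits of the first, which is exactly the roundtrip problem.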
PS: Generally, please provide some simple test documents with usable data. If I have to put together a test data set to try out your problem, the chances that anybody will actually try it get rather slim.

Why not use min_score with Elasticsearch?

New to Elasticsearch. I am interested in only returning the most relevant docs and came across min_score. The docs say "Note, most times, this does not make much sense" but don't provide a reason. So, why does it not make sense to use min_score?
EDIT: What I really want to do is only return documents that have a "score" higher than x. I have this:
data = {
    'min_score': 0.9,
    'query': {
        'match': {'field': 'michael brown'},
    }
}
Is there a better alternative to the above so that it only returns the most relevant docs?
thx!
EDIT #2:
I'm using minimum_should_match and it returns a 400 error:
"error": "SearchPhaseExecutionException[Failed to execute phase [query], all shards failed;"
data = {
    'query': {
        'match': {'keywords': 'michael brown'},
        'minimum_should_match': '90%',
    }
}
I've used min_score quite a lot for trying to find documents that are a definitive match to a given set of input data - which is used to generate the query.
The score you get for a document depends on the query, of course. So I'd say try your query in many permutations (different keywords, for example), decide for each which document is the first you would rather it didn't return, and make a note of each of their scores. If the scores are similar, this gives you a good guess at the value to use for your min_score.
However, you need to bear in mind that the score isn't just dependent on the query and the returned document; it considers all the other documents that have data in the fields you are querying. This means that if you test your min_score value with an index of 20 documents, the score will probably change greatly when you try it on a production index with, for example, a few thousand documents or more. This change could go either way, and is not easily predictable.
I've found that for my matching uses of min_score you need to create quite a complicated query and a set of analysers to tune the scores for the various components of the query. But what is and isn't included is vital to my application, so you may well be happy with what it gives you when keeping things simple.
I don't know if it's the best solution, but it works for me (Java):
// "tiny" search to discover maxScore
// it is fast, because it returns only 1 item
SearchResponse response = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setSize(1)
        .execute()
        .actionGet();
// get the maxScore
// and set minScore to 70% of it
float maxScore = response.getHits().maxScore();
float minScore = maxScore * 0.7f; // needs a float literal: 0.7 alone is a double
// second round with minimum score
SearchResponse scoredResponse = client.prepareSearch(INDEX_NAME)
        .setTypes(TYPE_NAME)
        .setQuery(queryBuilder)
        .setMinScore(minScore)
        .execute()
        .actionGet();
I search twice, but the first search is fast because it returns only one item; from it we can get the max_score.
NOTE: minimum_should_match works differently. If you have 4 should clauses and you say minimum_should_match = 70%, it doesn't mean that item.score should be > 70%. It means that the item should match 70% of the clauses; the number computed from the percentage is rounded down, so with 4 clauses that's a minimum of 2 (75% would require 3).
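A sketch of where the setting lives (the index name and values are made up; with four should clauses, "75%" computes to exactly 3 required matches):

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "keywords": "michael" } },
        { "match": { "keywords": "brown" } },
        { "match": { "keywords": "ferguson" } },
        { "match": { "keywords": "missouri" } }
      ],
      "minimum_should_match": "75%"
    }
  }
}

A document must match at least three of the four clauses to be returned at all; its _score is not affected by the setting.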
