Elasticsearch projections onto new type

Is it possible to get a projection as a query result in elasticsearch?
For example:
I have 3 types in my index:
User { Id, Name, Groups[], Location { Lat, Lon } }
Group { Id, Name, Topics[] }
Message { Id, UserId, GroupId, Content}
And I want to get the number of messages and users in a group in a given area, so my input would be:
{ Lat, Lon, Distance, GroupId }
and the output would be:
Group { Id, Name, Topics, NumberOfUsers, NumberOfMessages }
where the actual output of the query is a combination of data returned by the query and aggregations within that data.
Is this possible?

There are no JOINs in Elasticsearch (except for parent-child relationships, which shouldn't be used for heavy joining either). With your current data model you'll only be able to do application-side joins, and depending on your actual data that might mean a lot of round trips. I don't think this will work out too well.
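A sketch of one leg of that application-side join, using the field names from the question (the index name, group id, distance, and coordinates are assumptions): first count the users of the group inside the area, then repeat a similar filtered count against the messages index.

```json
POST /users/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "Groups": "group-42" } },
        {
          "geo_distance": {
            "distance": "10km",
            "Location": { "lat": 40.7, "lon": -74.0 }
          }
        }
      ]
    }
  }
}
```

`hits.total.value` would give NumberOfUsers; NumberOfMessages needs a second round trip against the messages index.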
PS: In general, please provide some simple test documents with usable data. If anyone who wants to try out your problem first has to put together a test data set, the chances that anybody actually will are rather slim.

Related

ElasticSearch - backward pagination with search_after when sorting value is null

I have an application which has a dashboard, basically a table with hundreds of thousands of records.
This table has up to 50 different columns. These columns have different types in mapping: keyword, text, boolean, integer.
As records in the table might have the same values, I use sorting as an array of 2 attributes:
The first attribute is whatever the client wants to sort by. It can be a simple sorting object or a sort query with a nested filter.
The second attribute is a default sort by id, needed to order documents that have identical values for the column the customer wants to sort by.
I checked multiple topics/issues on GitHub and here on the Elastic forum to understand how to implement the search_after mechanism for paging backwards, but it's not working for all the cases I need.
Please have a look at the image:
Imagine there is a limit = 3; the customer is currently on the 3rd page of the table, and all the data is sorted by name asc, _id asc.
The names are: A, B, C, D, E on the image.
The ids are numeric parts of the Doc word.
When the customer wants to go back to the previous page, which is page #2 in my picture, I pass the following to Elasticsearch:
sort: [
  { name: 'desc' },
  { _id: 'desc' }
],
search_after: [null, Doc7._id]
As a result, I get only one document, which is Doc6: null in my image. That seems logical, because I'm asking Elasticsearch to search descending after [null, Doc7._id], and only one document matches that: Doc6. But it's not what I need.
I can't make up the solution to get the data that I need.
Could anyone help, please?
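For what it's worth, Elasticsearch's sort does support a missing parameter that pins where documents without a value for the field land. A sketch of the sort clause only, as one ingredient to experiment with rather than a verified fix for the backward-paging case:

```json
"sort": [
  { "name": { "order": "desc", "missing": "_last" } },
  { "_id": { "order": "desc" } }
]
```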

How to group documents of different types according to the same matching field in Elasticsearch?

To preface, I'm working with Spring Data Elasticsearch.
Let's say I have about 10k documents each of a Car entity and an Owner entity:
Car: {VIN, make, model, color}
Owner: {VIN, owner}
Let's say that each car can have 0-many owners.
In the end, I want a bunch of CarProfile objects that consist of the matching Car data and Owner data together.
CarProfile: {VIN, make, model, color, List<String> owners}
I was thinking of two approaches to this:
Index all the Car and Owner data into ES. Group the documents by VIN and traverse through each group and convert each group into a CarProfile object.
Index all the Owner data into ES. Traverse through the Car data and for each Car, retrieve any matching Owner information with the VIN, then convert all the data to a CarProfile object.
Approach 1 would be more convenient, but I'm not sure if that approach is possible. It seems like aggregations can only give you a subset of the data (like how many owners per car) but not all the document data together. Any suggestions would be welcome.
You can check out field collapsing (https://www.elastic.co/guide/en/elasticsearch/reference/7.16/collapse-search-results.html).
You can keep the documents flat with the following document model:
CarProfile: {VIN, make, model, color, owner}
Then you can group by VIN with a query like the one below, which should give you documents grouped by VIN with all of their fields:
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "VIN"
  }
}
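Note that collapse by itself returns only the top hit per VIN. To get every owner in a group (so you can build List&lt;String&gt; owners), collapse supports inner_hits; a sketch, assuming VIN is indexed as a keyword field and "by_vin" is just a label:

```json
{
  "query": { "match_all": {} },
  "collapse": {
    "field": "VIN",
    "inner_hits": {
      "name": "by_vin",
      "size": 100
    }
  }
}
```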

Elasticsearch NEST: specifying Id explicitly seems to cause inconsistent search scores

I have a model class that looks like this:
public class MySearchDocument
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
    public int DBID { get; set; }
}
We always use bulk indexing. By default our searches do a relatively simple multi_match with more weight given to ID and Name, like this:
{
  "query": {
    "multi_match": {
      "query": "burger",
      "fields": [
        "ID^1.2",
        "Name^1.1",
        "Description"
      ],
      "auto_generate_synonyms_phrase_query": true
    }
  }
}
I was previously just relying on Id inference, allowing Elasticsearch to use my ID property for its Id purposes, but for a few reasons it has become preferable to use DBID as the Id property in Elasticsearch. I tried this 3 different ways, separately and in combo:
Explicitly when bulk indexing: new BulkIndexOperation<MySearchDocument>(d) { Id = d.DBID }
In the ConnectionSettings using DefaultMappingFor<MySearchDocument>(d => d.IdProperty(p => p.DBID))
Using an attribute on MySearchDocument: [ElasticsearchType(IdProperty = nameof(DBID))]
Any and all of these seem to work as expected; the _id field in the indexed documents is being set to my DBID property. However, in my integration tests, search results are anything but expected. Specifically, I have a test that:
Creates a new index from scratch.
Populates it with a handful of MySearchDocuments
Issues a Refresh on the index just to make sure it's ready.
Issues a search.
Asserts that the results come back in the expected order.
With Id inference, this test consistently passes. When switching the Id field using any or all of the techniques above, it passes maybe half the time. Looking at the raw results, the correct documents are always returned, but the _score often varies for the same document from test run to test run. Sometimes the varying score is the one associated with the document whose ID field matches the search term, other times it's the score of a different document.
I've tried coding the test to run repeatedly and in parallel. I've tried waiting several seconds after issuing Refresh, just to be sure the index is ready. None of these make a difference - the test passes consistently with Id inference, and is consistently inconsistent without. I know nothing in this world is truly random, so I feel like I must be missing something here. Let me know if more details would be helpful. Thanks in advance.
Search relevancy scores are calculated per shard, and a hashing algorithm on the value of _id determines into which primary shard a given document will be indexed.
It sounds like you may be seeing the effects of this when indexing a small sample of documents across N > 1 primary shards; in this case, the local relevancy scores may be different enough to manifest in some odd looking _scores returned. With a larger set of documents and even distribution, differences in local shard scores diminish.
There are a couple of approaches that you can take to overcome this for testing purposes:
Use a single primary shard
or
Use dfs_query_then_fetch when making the search request. This tells Elasticsearch to first collect term and document frequencies from all shards in order to calculate global relevancy scores, then use those global scores for _score. There is a slight overhead to using dfs_query_then_fetch.
Take a look also at the section "Relevance is Broken!" from the Elasticsearch Definitive guide; although the guide refers to Elasticsearch 2.x, much of it is still very much relevant for later versions.
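In NEST, both options can be sketched roughly like this (the index name is an assumption, and this assumes NEST 7.x):

```csharp
// Option 1: give the test index a single primary shard
var createResponse = client.Indices.Create("my-test-index", c => c
    .Settings(s => s.NumberOfShards(1)));

// Option 2: ask for global scoring on the search itself
var searchResponse = client.Search<MySearchDocument>(s => s
    .Index("my-test-index")
    .SearchType(SearchType.DfsQueryThenFetch)
    .Query(q => q.MultiMatch(m => m
        .Query("burger")
        .Fields(f => f
            .Field(p => p.ID, 1.2)
            .Field(p => p.Name, 1.1)
            .Field(p => p.Description)))));
```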

Single query to return documents sorted by distance based on one document's Id rather than its geopoint

I have an index in Elasticsearch whose documents contain an Id field and a geopoint.
Right now, in order to get the nearest documents, I have to make two queries: one to get the original document by its id, and a second that uses its coordinates to do a geo sort. I was wondering if there is any way to execute this as a single query.
public IEnumerable<RestaurantSearchItem> GetNearbyRestaurants(double latitude, double longitude)
{
    var query = _elasticClient.Search<RestaurantSearchItem>(s => s
        .Index(RestaurantSearchItem.IndexName)
        .Sort(ss => ss
            .GeoDistance(g => g
                .Field(p => p.Location)
                .DistanceType(GeoDistanceType.Plane)
                .Unit(DistanceUnit.Meters)
                .Order(SortOrder.Ascending)
                .Points(new GeoLocation(latitude, longitude)))));

    var nearbyRestaurants = query.Documents;
    foreach (var restaurant in nearbyRestaurants)
    {
        restaurant.Distance = Convert.ToDouble(query.Hits.Single(x => x.Id == restaurant.Id).Sorts.Single());
    }
    return nearbyRestaurants;
}
I don't think it's possible to do this in one query; the latitude and longitude used for sorting can't be looked up from elsewhere in the data, so they need to be supplied in the request.
To my knowledge, the only Elasticsearch query that accepts the id of a document as a parameter is the terms query (in its terms lookup form), which fetches the list of terms for the query from the given document.
But you want to find relevant documents based on location, not exact terms.
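For reference, the terms lookup form mentioned above looks like this (the index, id, and path values here are hypothetical):

```json
{
  "query": {
    "terms": {
      "tags": {
        "index": "restaurants",
        "id": "some-doc-id",
        "path": "tags"
      }
    }
  }
}
```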
This can be achieved by denormalizing your data, e.g. storing the list of nearby restaurants in a nested field.
With denormalization you will have to pre-compute all nearby restaurants before inserting each document into the index.

Filter ES query based on aggregation results

We have an index with the following document structure:
{
  "email": "test#test.com",
  ...
},
{
  "email": "test#test.com",
  ...
},
{
  "email": "anotherEmail#test.com",
  ...
}
We need to get all records where the count of records per email is > 2, for example. I know I can use an aggregation with a minimum doc count to find all emails that have at least 2 records.
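That aggregation would look like the following sketch, assuming email is indexed as a keyword field:

```json
{
  "size": 0,
  "aggs": {
    "emails": {
      "terms": {
        "field": "email",
        "min_doc_count": 2
      }
    }
  }
}
```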
But what we need to do is actually get all the records where the count of distinct email is > X. So we need our query to constrain our results to only those records that match an aggregation.
I know that we can have a nested TopHits aggregation, but that is not good enough for us, because we need to be able to page through these results... there could be records where an email has 10k records for example. We need to be able to get these results in the Hits collection so that we can page them.
How would we go about doing something like that?
