Aggregating on generic nested array in Elasticsearch with NEST - elasticsearch

I'm trying to analyse data with Elasticsearch. I've started working with Elasticsearch and Nest about four months ago, so I might have missed some obvious stuff. All examples are simplified or altered, but the core is the same.
The data contains an array of nested objects, each of which also contain an array of nested objects, and again, each contains an array of nested objects. The data is obtained from an information request which contains XML messages. The messages are parsed and each element containing (multiple) text elements is saved with their element name, location, and an array with all text element names and values under the message name. I'm thinking this set-up might make analyzing the data easier.
Mapping example:
{
"data" : {
"properties" : {
"id" : { "type" : "string" },
"action" : { "type" : "string" },
"result" : { "type" : "string" },
"details" : {
"type" : "nested",
"properties" : {
"description" : { "type" : "string" },
"message" : {
"type" : "nested",
"properties" : {
"name" : { "type" : "string" },
"nodes" : {
"type" : "nested",
"properties" : {
"name" : { "type" : "string" },
"value" : { "type" : "string" }
}
},
"source" : { "type" : "string" }
}
}
}
}
}
}
}
Data example:
{
"id" : "123456789",
"action" : "GetInformation",
"result" : "Success",
"details" : [{
"description" : "Request",
"message" : [{
"name" : "Body",
"source" : "Message|Body",
"nodes" : [{
"name" : "Action",
"value" : "GetInformation"
}, {
"name" : "Identity",
"value" : "1234"
}
]
}
]
}, {
"description" : "Response",
"message" : [{
"name" : "Object",
"source" : "Message|Body|Object",
"nodes" : [{
"name" : "ID",
"value" : "123"
}, {
"name" : "Name",
"value" : "Jim"
}
]
}, {
"name" : "Information",
"source" : "Message|Body|Information",
"nodes" : [{
"name" : "Type",
"value" : "Birth City"
}, {
"name" : "City",
"value" : "Los Angeles"
}
]
}, {
"name" : "Information",
"source" : "Message|Body|Information",
"nodes" : [{
"name" : "Type",
"value" : "City of Residence"
}, {
"name" : "City",
"value" : "New York"
}
]
}
]
}
]
}
XML Example:
<Message>
<Body>
<Object>
<ID>123</ID>
<Name>Jim</Name>
</Object>
<Information>
<Type>Birth City</Type>
<City>Los Angeles</City>
<Information>
<Information>
<Type>City of Residence</Type>
<City>New York</City>
<Information>
</Body>
</Message>
I want to analyse the Name and Value properties of Nodes so I can get an overview of each city within the index that functions as a birthplace and how many people were born in them. Something like:
Dictionary<string, int> birthCities = {
{"Los Angeles", 400}, {"New York", 800},
{"Detroit", 500}, {"Michigan", 700} };
The code I have so far:
var response = client.Search<Data>(search => search
.Query(query =>
query.Match(match=> match
.OnField(data=>data.Action)
.Query("GetInformation")
)
)
.Aggregations(a1 => a1
.Nested("Messages", messages => messages
.Path(data => data.details.FirstOrDefault().Message)
.Aggregations(a2 => a2
.Terms("Sources", termSource => termSource
.Field(data => data.details.FirstOrDefault().Message.FirstOrDefault().Source)
.Aggregations(a3 => a3
.Nested("Nodes", nodes => nodes
.Path(dat => data.details.FirstOrDefault().Message.FirstOrDefault().Nodes)
.Aggregations(a4 => a4
.Terms("Names", termName => termName
.Field(data => data.details.FirstOrDefault().Message.FirstOrDefault().Nodes.FirstOrDefault().Name)
.Aggregations(a5 => a5
.Terms("Values", termValue => termValue
.Field(data => data.details.FirstOrDefault().Message.FirstOrDefault().Nodes.FirstOrDefault().Value)
)
)
)
)
)
)
)
)
)
)
);
var dict = new Dictionary<string, long>();
var sAggr = response.Aggs.Nested("Messages").Terms("Sources");
foreach (var item in sAggr.Items)
{
if (item.Key.Equals("information"))
{
var nAggr = item.Nested("Nodes").Terms("Names");
foreach (var nItem in nAggr.Items)
{
if (nItem.Key.Equals("city"))
{
var vAgg = nItem.Terms("Values");
foreach (var vItem in vAgg.Items)
{
if (!dict.ContainsKey(vItem.Key))
{
dict.Add(vItem.Key, 0);
}
dict[vItem.Key] += vItem.DocCount;
}
}
}
}
}
This code gives me every city and how many times they occur, but since they're saved with the same element name and at the same location (both of which I'm not able to change), I've found no way to distinguish between birth cities and cities of residence.
Specific types for each action are sadly not an option. So my question is: How can I count all occurrences of a city name with Birth City type, preferably without having to import and go through all documents.

Related

Get document by min size of array in Mongodb

I have mongo collection:
{
"_id" : 123,
"index" : "111",
"students" : [
{
"firstname" : "Mark",
"lastname" : "Smith"),
}
],
}
{
"_id" : 456,
"index" : "222",
"students" : [
{
"firstname" : "Mark",
"lastname" : "Smith"),
}
],
}
{
"_id" : 789,
"index" : "333",
"students" : [
{
"firstname" : "Neil",
"lastname" : "Smith"),
},
{
"firstname" : "Sofia",
"lastname" : "Smith"),
}
],
}
I want to get document that has index that is in the set of the given indexes, for example givenSet = ["111","333"] and has min length of students array.
Result should be the first document with _id:123, because its index is in the givenSet and studentsArrayLength = 1, which is smaller than third.
I need to write custom JSON #Query for Spring Mongo Repository. I am new to Mongo and am stuck a bit with this problem.
I wrote something like this:
#Query("{'index':{$in : ?0}, length:{$size:$students}, $sort:{length:-1}, $limit:1}")
Department getByMinStudentsSize(Set<String> indexes);
And got error: error message '$size needs a number'
Should I just use .count() or something like that?
you should use the aggregation framework for this type of query.
filter the result based on your condition.
add a new field and assign the array size to it.
sort based on the new field.
limit the result.
the solution should look something like this:
db.collection.aggregate([
{
"$match": {
index: {
"$in": [
"111",
"333"
]
}
}
},
{
"$addFields": {
"students_size": {
"$size": "$students"
}
}
},
{
"$sort": {
students_size: 1
}
},
{
"$limit": 1
}
])
working example: https://mongoplayground.net/p/ih4KqGg25i6
You are getting the issue because the second param should be enclosed in curly braces. And second param is projection
#Query("{{'index':{$in : ?0}}, {length:{$size:'$students'}}, $sort:{length:1}, $limit:1}")
Department getByMinStudentsSize(Set<String> indexes);
Below is the mongodb query :
db.collection.aggregate(
[
{
"$match" : {
"index" : {
"$in" : [
"111",
"333"
]
}
}
},
{
"$project" : {
"studentsSize" : {
"$size" : "$students"
},
"students" : 1.0
}
},
{
"$sort" : {
"studentsSize" : 1.0
}
},
{
"$limit" : 1.0
}
],
{
"allowDiskUse" : false
}
);

Mongoose + GraphQL (Apollo Server) Schema

We have db collection which is little complicated. Many of our keys are JSON objects where fields aren't fixed and change based on input given by user on UI. How should we write mongoose and GraphQL Schema for such complex type ?
{
"_id" : ObjectId("5ababb359b3f180012762684"),
"item_type" : "Sample",
"title" : "This is sample title",
"sub_title" : "Sample sub title",
"byline" : "5c6ed39d6ed6def938b71562",
"lede" : "Sample description",
"promoted" : "",
"slug" : [
"myurl"
],
"categories" : [
"Technology"
],
"components" : [
{
"type" : "Slide",
"props" : {
"description" : {
"type" : "",
"props" : {
"value" : "Sample value"
}
},
"subHeader" : {
"type" : "",
"props" : {
"value" : ""
}
},
"ButtonWorld" : {
"type" : "a-button",
"props" : {
"buttonType" : "product",
"urlType" : "Internal Link",
"isServices" : false,
"title" : "Hello World",
"authors" : [
{
"__dataID__" : "Qm9va0F1dGhvcjo1YWJhYjI0YjllNDIxNDAwMTAxMGNkZmY=",
"_id" : null,
"First_Name" : "John",
"Last_Name" : "Doe",
"Display_Name" : "John Doe",
"Slug" : "john-doe",
"Role" : 1
}
],
"isbns" : [
"9781497603424"
],
"image" : "978-cover.jpg",
"price" : "8.99",
"bisacs" : [],
"customCategories" : [],
},
"salePrice" : {
"type" : "",
"props" : {
"value" : ""
}
}
}
},
"tags" : [
{
"id" : "5abab58042e2c90011875801",
"name" : "Tag Test 1"
},
{
"id" : "5abab5831242260011c248f9",
"name" : "Tag Test 2"
},
{
"id" : "592450e0b1be5055278eb5c6",
"name" : "horror",
},
{
"id" : "59244a96b1be5055278e9b0e",
"name" : "Special Report",
"_id" : "59244a96b1be5055278e9b0e"
}
],
"created_at" : ISODate("2018-03-27T21:44:21.412Z"),
"created_by" : ObjectId("591345bda429e90011c1797e")
}
I believe Mongoose have Mixed type but how do i represent such complex type in Apollo GraphQL Server and Mongoose Schema. Also, currently my resolver is just models.product.find(). So if i have such complex type, need to understand what update needs to make to my resolver.
It will be great if i get complete solution for GraphQL Apollo schema, mongoose schema and resolver for my data.
Finally found solution for problem.
You can declare new type and reference it in typeDef for GraphQL Schema.
In mongoose model, you can reference it as {type: Array}

Partial update overwriting whole structure

I'm indexing a new document with the following content
{
"lastUpdate" : "20180114144020452",
"name" : "My Process",
"startDate" : "20180114162356585",
"endData" : "",
"tasks" : [
{
"1" : {
"lastUpdate" : "20180114144020452",
"taskId" : "123",
"subject" : "Terceira Atividade",
"status" : "Active",
"type" : "userTask",
"assign" : [
{
"date" : "20180114144020452",
"type" : "role",
"name" : "Time 3",
"id" : "Team3_345"
}
],
"receivedDate" : "",
"readDate" : "",
"finishDate" : ""
}
}
]
}
And then I'm trying to change task.1.status value with the following update content
{
"doc" : {
"tasks" : [
{
"1" : {
"status" : "Closed"
}
}
]
}
}
But it's overwriting the whole task.1 structure, deleting other values and letting only status value to closed instead of keep other values and change only status value.
How can I solve this? Thanks
You need to do it via a scripted partial updated like this
POST updates/update/1/_update
{
"script": {
"source": "ctx._source.tasks[0].1.status = 'Closed'"
}
}

Update object in array with new fields mongodb

ai have some mongodb document
horses is array with id, name, type
{
"_id" : 33333333333,
"horses" : [
{
"id" : 72029,
"name" : "Awol",
"type" : "flat",
},
{
"id" : 822881,
"name" : "Give Us A Reason",
"type" : "flat",
},
{
"id" : 826474,
"name" : "Arabian Revolution",
"type" : "flat",
}
}
I need to add new fields
I thought something like that, but I did not go to his head
horse = {
"place" : 1,
"body" : 11
}
Card.where({'_id' => 33333333333}).find_and_modify({'$set' => {'horses.' + index.to_s => horse}}, upsert:true)
But all existing fields are removed and inserted new how to do that would be new fields added to existing
Indeed, this command will overwrite the subdocument
'$set': {
'horses.0': {
"place" : 1,
"body" : 11
}
}
You need to set individual fields:
'$set': {
'horses.0.place': 1,
'horses.0.body': 11
}

ElasticSerach - Statistical facets on length of the list

I have the following sample mappipng:
{
"book" : {
"properties" : {
"author" : { "type" : "string" },
"title" : { "type" : "string" },
"reviews" : {
"properties" : {
"url" : { "type" : "string" },
"score" : { "type" : "integer" }
}
},
"chapters" : {
"include_in_root" : 1,
"type" : "nested",
"properties" : {
"name" : { "type" : "string" }
}
}
}
}
}
I would like to get a facet on number of reviews - i.e. length of the "reviews" array.
For instance, verbally spoken results I need are: "100 documents with 10 reviews, 20 documents with 5 reviews, ..."
I'm trying the following statistical facet:
{
"query" : {
"match_all" : {}
},
"facets" : {
"stat1" : {
"statistical" : {"script" : "doc['reviews.score'].values.size()"}
}
}
}
but it keeps failing with:
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], total failure; shardFailures {[mDsNfjLhRIyPObaOcxQo2w][facettest][0]: QueryPhaseExecutionException[[facettest][0]: query[ConstantScore(NotDeleted(cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter#a2a5984b)))],from[0],size[10]: Query Failed [Failed to execute main query]]; nested: PropertyAccessException[[Error: could not access: reviews; in class: org.elasticsearch.search.lookup.DocLookup]
[Near : {... doc[reviews.score].values.size() ....}]
^
[Line: 1, Column: 5]]; }]",
"status" : 500
}
How can I achieve my goal?
ElasticSearch version is 0.19.9.
Here is my sample data:
{
"author" : "Mark Twain",
"title" : "The Adventures of Tom Sawyer",
"reviews" : [
{
"url" : "amazon.com",
"score" : 10
},
{
"url" : "www.barnesandnoble.com",
"score" : 9
}
],
"chapters" : [
{ "name" : "Chapter 1" }, { "name" : "Chapter 2" }
]
}
{
"author" : "Jack London",
"title" : "The Call of the Wild",
"reviews" : [
{
"url" : "amazon.com",
"score" : 8
},
{
"url" : "www.barnesandnoble.com",
"score" : 9
},
{
"url" : "www.books.com",
"score" : 5
}
],
"chapters" : [
{ "name" : "Chapter 1" }, { "name" : "Chapter 2" }
]
}
It looks like you are using curl to execute your query and this curl statement looks like this:
curl localhost:9200/my-index/book -d '{....}'
The problem here is that because you are using apostrophes to wrap the body of the request, you need to escape all apostrophes that it contains. So, you script should become:
{"script" : "doc['\''reviews.score'\''].values.size()"}
or
{"script" : "doc[\"reviews.score"].values.size()"}
The second issue is that from your description it looks like your are looking for a histogram facet or a range facet but not for a statistical facet. So, I would suggest trying something like this:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"key_script" : "doc[\"reviews.score\"].values.size()",
"value_script" : "doc[\"reviews.score\"].values.size()",
"interval" : 1
}
}
}
}'
The third problem is that the script in the facet will be called for every single record in the result list and if you have a lot of results it might take really long time. So, I would suggest indexing an additional field called number_of_reviews that should be populated with the number of reviews by your client. Then your query would simply become:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"field" : "number_of_reviews"
"interval" : 1
}
}
}
}'

Resources