Aggregations on PyElasticSearch (pyes) - elasticsearch

I want to calculate value-count aggregations on some indexed product data, but I seem to be passing the wrong parameters to the ValueCountAgg constructor.
An example of the indexed data:
{
"_index": "test-index",
"_type": "product_product",
"_id": "1",
"_score": 1,
"_source": {
"code": "SomeProductCode1",
"list_price": 10,
"description": null,
"displayed_on_eshop": "true",
"active": "true",
"tree_nodes": [],
"id": 1,
"category": {},
"name": "This is Product",
"price_lists": [
{
"price": 10,
"id": 1
},
{
"price": 10,
"id": 2
}
],
"attributes": {
"color": "blue",
"attrib": "something",
"size": "L"
},
"type": "goods"
}
}
I'm calculating the aggregations as follows:
for attribute in filterable_attributes:
    count = ValueCountAgg(
        name='count_' + attribute, field='attributes.' + attribute
    )
    query.agg.add(count)
where query is a pyes.query.Query object wrapped inside a pyes.query.Search object, and filterable_attributes is a list of attribute names, such as color and size.
I have also tried setting field=attribute, but it makes no difference. The result set returned by the search has the following as its aggs attribute:
{'count_size': {'value': 0}, 'count_color': {'value': 0}}
where size and color are indexed inside the attributes dictionary as shown above. These are evidently wrong results, and I think it is because I am not setting field properly.
Where am I going wrong?

I've found where I was going wrong.
According to Scoping Aggregations, an aggregation is by default scoped to the documents matched by its query. My query was returning zero results, so every count was zero; I had to modify the search phrase so that the query actually matched documents.
I got the required results after that, and aggregations are coming out right.
{'count_size': {'value': 3}, 'count_color': {'value': 3}}
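The scoping rule is easier to see in the raw request body that a client ultimately sends. The following is a minimal sketch that assumes the standard Elasticsearch query DSL rather than any particular pyes call; build_value_count_body is a hypothetical helper and is not part of pyes.

```python
import json


def build_value_count_body(attributes, query=None):
    """Build a search body with one value_count aggregation per attribute.

    The aggregations only see documents matched by `query`, so a query
    that matches nothing yields a count of 0 for every attribute.
    """
    if query is None:
        # match_all makes the aggregations run over every document.
        query = {"match_all": {}}
    aggs = {
        "count_" + attr: {"value_count": {"field": "attributes." + attr}}
        for attr in attributes
    }
    return {"query": query, "aggs": aggs}


body = build_value_count_body(["color", "size"])
print(json.dumps(body, indent=2))
```

Passing a restrictive query instead of the default narrows the set of documents that the counts are computed over, which is why a query with no hits produces zeros.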

Related

Indexing In ElasticSearch For Auditing

There is a microservice-based architecture wherein each service has a different type of entity. For example:
Service-1:
{
"entity_type": "SKU",
"sku": "123",
"ext_sku": "201",
"store": "1",
"product": "abc",
"timestamp": 1564484862000
}
Service-2:
{
"entity_type": "PRODUCT",
"product": "abc",
"parent": "xyz",
"description": "curd",
"unit_of_measure": "gm",
"quantity": "200",
"timestamp": 1564484863000
}
Service-3:
{
"entity_type": "PRICE",
"meta": {
"store": "1",
"sku": "123"
},
"price": "200",
"currency": "INR",
"timestamp": 1564484962000
}
Service-4:
{
"entity_type": "INVENTORY",
"meta": {
"store": "1",
"sku": "123"
},
"in_stock": true,
"inventory": 10,
"timestamp": 1564484864000
}
I want to write an Audit Service backed by elasticsearch, which will ingest all these entities and it will index based on entity_type, store, sku, timestamp.
Will elasticsearch be a good choice here? Also, how will the indexing work? So, for example, if I search for store=1, it should return all the different entities that have store as 1. Secondly, will I be able to get all the entities between 2 timestamps?
Will ES and Kibana (to visualize) be good choices here?
Yes. Your use case is pretty much exactly what is described in the docs under filter context:
In filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated. Filter context is mostly used for
filtering structured data, e.g.
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to published?
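As a rough illustration of filter context for the audit use case, the sketch below builds a search body that filters by store and by a timestamp range. It assumes that the audit service copies meta.store to a top-level store field at ingest time (the PRICE and INVENTORY entities nest it under meta) and that store is indexed as an exact-match value; audit_filter_body is a hypothetical helper name.

```python
def audit_filter_body(store=None, start_ms=None, end_ms=None):
    """Build a bool query whose clauses all run in filter context.

    Assumption: every audited entity carries top-level `store` and
    `timestamp` (epoch milliseconds) fields.
    """
    filters = []
    if store is not None:
        # Exact match on the store identifier; no scoring is involved.
        filters.append({"term": {"store": store}})
    if start_ms is not None or end_ms is not None:
        bounds = {}
        if start_ms is not None:
            bounds["gte"] = start_ms
        if end_ms is not None:
            bounds["lte"] = end_ms
        filters.append({"range": {"timestamp": bounds}})
    return {"query": {"bool": {"filter": filters}}}


# All entities for store 1 between two timestamps from the examples.
body = audit_filter_body(store="1", start_ms=1564484862000, end_ms=1564484864000)
```

Because the query does not mention entity_type, the same body would return SKU, PRICE and INVENTORY documents alike, provided they share the assumed field names.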

Elasticsearch record upsert with a complex _id field

I have to upsert bulk records into an Elasticsearch index, with the _id being a combination of more than one field from the message. Is that possible? If so, please give me a sample JSON for it.
Regards
A sample of the _id field I am looking for is shown below:
{
"_index": "kpi_aggr",
"_type": "KPIBackChannel",
"_id": "<<<combination of name , period_type>>>",
"_score": 1,
"_source": {
"name": "kpi-v1",
"period_type": "w",
"country": "AL",
"pg_name": "DENTAL CARE",
"panel_type": "retail",
"number_of_records_with_proposal": 10000,
"number_of_proposals": 80000,
"overall_number_of_records": 2000,
"#timestamp": 1442162810
}
}
Naturally, you can specify your own Elasticsearch document ids during a call to the Index API:
PUT kpi_aggr/KPIBackChannel/kpi-v1,w
{
"name": "kpi-v1",
"period_type": "w",
"country": "AL",
"pg_name": "DENTAL CARE",
"panel_type": "retail",
"number_of_records_with_proposal": 10000,
"number_of_proposals": 80000,
"overall_number_of_records": 2000,
"#timestamp": 1442162810
}
You can also do so during a _bulk API call:
POST _bulk
{ "index" : { "_index" : "kpi_aggr", "_type" : "KPIBackChannel", "_id" : "kpi-v1,w" } }
{"name":"kpi-v1","period_type":"w","country":"AL","pg_name":"DENTAL CARE","panel_type":"retail","number_of_records_with_proposal":10000,"number_of_proposals":80000,"overall_number_of_records":2000,"#timestamp":1442162810}
Notice that Elasticsearch will replace the document with the new version.
If you execute these two queries on an empty index, then querying by document id:
GET kpi_aggr/KPIBackChannel/kpi-v1,w
will give you the following:
{
"_index": "kpi_aggr",
"_type": "KPIBackChannel",
"_id": "kpi-v1,w",
"_version": 2,
"found": true,
"_source": {
"name": "kpi-v1",
"period_type": "w",
"country": "AL",
"pg_name": "DENTAL CARE",
"panel_type": "retail",
"number_of_records_with_proposal": 10000,
"number_of_proposals": 80000,
"overall_number_of_records": 2000,
"#timestamp": 1442162810
}
}
Notice "_version": 2, which in our case indicates that the document has been indexed twice, i.e. the second request performed an "upsert" (in general, the version is meant to be used for Optimistic Concurrency Control).
Hope that helps!
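For many records, the _bulk body can be generated programmatically. This is a minimal sketch that only builds the newline-delimited request body; composite_id and bulk_index_body are hypothetical helper names, and the comma separator simply mirrors the "kpi-v1,w" id used above.

```python
import json


def composite_id(doc, fields=("name", "period_type"), sep=","):
    """Join the chosen field values into a single document id."""
    return sep.join(str(doc[field]) for field in fields)


def bulk_index_body(docs, index, doc_type):
    """Return an NDJSON body for POST _bulk: one action line per document."""
    lines = []
    for doc in docs:
        action = {
            "index": {
                "_index": index,
                "_type": doc_type,
                "_id": composite_id(doc),
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [{"name": "kpi-v1", "period_type": "w", "country": "AL"}]
payload = bulk_index_body(docs, "kpi_aggr", "KPIBackChannel")
```

Sending the same payload twice would index each document under the same id, so the second request replaces the first version rather than creating a duplicate.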

ElasticSearch URI Search null field

I need to build a URI search query that returns all documents whose date falls between two dates, and also those where the date field is null.
For example:
Some objects have the field "creation_date", but I also want the results to include the objects that do not have this field.
I tried something like the following:
http://localhost//elasticsearch/channels/channel/_search?q=channel.schedule.creation_date:[2018-06-19 TO 2018-12-22] OR channel.schedule.creation_date: NULL
The date range comparison works. The problem is matching the NULL values.
Edited
Source sample:
"_source": {
"channel": {
"activated": false,
"approved": false,
"content": "Jvjv",
"creation_date": "2018-06-21T13:06:10.000Z",
"facebookLink": "J jv",
"id": "Kvjvjv",
"instagramId": "Jvjv",
"name": "Kbkbkvk",
"ownerId": "sZtxdhiNbNY9sr2DtiCzlgJfsqb2",
"plan": 0,
"purpose": "Jvjv",
"recurrence": 1,
"segment": "Jvjvjv",
"twitterId": "Jvjv",
"youtubeId": "Jvj"
}
}
}
You can do this using the NOT(_exists_:field_name) constraint.
Can you try this?
http://localhost//elasticsearch/channels/channel/_search?q=channel.schedule.creation_date:[2018-06-19 TO 2018-12-22] OR NOT(_exists_:channel.schedule.creation_date)
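Since the query string contains spaces, brackets and parentheses, it is usually safer to URL-encode it than to paste it into the URL by hand. The sketch below only builds the URL; build_search_url is a hypothetical helper, and the base URL is whatever your search endpoint is.

```python
from urllib.parse import urlencode


def build_search_url(base_url, field, start, end):
    """Build a URI search matching a date range OR a missing field."""
    q = "{f}:[{s} TO {e}] OR NOT(_exists_:{f})".format(f=field, s=start, e=end)
    # urlencode percent-encodes the brackets, colons and parentheses
    # and turns spaces into '+'.
    return base_url + "?" + urlencode({"q": q})


url = build_search_url(
    "http://localhost/elasticsearch/channels/channel/_search",
    "channel.schedule.creation_date",
    "2018-06-19",
    "2018-12-22",
)
print(url)
```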

mgo with aggregation and grouping

I am trying to perform a query using golang mgo to effectively get distinct values from a join. I understand that this might not be the best paradigm to work with in Mongo.
Something like this:
pipe := []bson.M{
	{
		"$group": bson.M{
			"_id": bson.M{"user": "$user"},
		},
	},
	{
		"$match": bson.M{
			"_id":  bson.M{"$exists": 1},
			"user": bson.M{"$exists": 1},
			"date_updated": bson.M{
				"$gt": durationDays,
			},
		},
	},
	{
		"$lookup": bson.M{
			"from":         "users",
			"localField":   "user",
			"foreignField": "_id",
			"as":           "user_details",
		},
	},
	{
		"$lookup": bson.M{
			"from":         "organizations",
			"localField":   "organization",
			"foreignField": "_id",
			"as":           "organization_details",
		},
	},
}
err := d.Pipe(pipe).All(&result)
If I comment out the $group section, the query returns the join as expected.
If I run it as is, I get NULL.
If I move the $group to the bottom of the pipe, I get an array response with Null values.
Is it possible to do an aggregation with a $group (with the goal of simulating DISTINCT)?
You're getting NULL because your $match stage filters out all of the documents after the $group stage.
After the first $group stage, the documents look only like the example below:
{"_id": { "user": "foo"}},
{"_id": { "user": "bar"}},
{"_id": { "user": "baz"}}
They no longer contain the other fields, i.e. user, date_updated and organization. If you would like to keep their values, you can use a Group Accumulator Operator. Depending on your use case, you may also benefit from using Aggregation Expression Variables.
As an example in the mongo shell, let's use the $first operator, which simply picks the first occurrence. This may make sense for organization but not for date_updated, so please choose a more appropriate accumulator operator there.
{"$group": {
    "_id": "$user",
    "date_updated": {"$first": "$date_updated"},
    "organization": {"$first": "$organization"}
  }
}
Note that the above also replaces {"_id":{"user":"$user"}} with simpler {"_id":"$user"}.
Next we'll add a $project stage to rename the _id field produced by the group operation back to user, and carry the other fields along without modification.
{"$project": {
    "user": "$_id",
    "date_updated": 1,
    "organization": 1
  }
}
Your $match stage can be simplified to just the date_updated filter. The _id condition can be removed because it is no longer relevant at this point in the pipeline. If you want to make sure that you only process documents with a user value, you should place that $match before the $group. See Aggregation Pipeline Optimization for more.
Combined, the pipeline looks something like this:
[
  {"$group": {
      "_id": "$user",
      "date_updated": {"$first": "$date_updated"},
      "organization": {"$first": "$organization"}
    }
  },
  {"$project": {
      "user": "$_id",
      "date_updated": 1,
      "organization": 1
    }
  },
  {"$match": {
      "date_updated": {"$gt": durationDays}
    }
  },
  {"$lookup": {
      "from": "users",
      "localField": "user",
      "foreignField": "_id",
      "as": "user_details"
    }
  },
  {"$lookup": {
      "from": "organizations",
      "localField": "organization",
      "foreignField": "_id",
      "as": "organization_details"
    }
  }
]
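To see why a $match placed after a bare $group finds nothing, it can help to emulate the stage on plain data. The following is a small pure-Python sketch, not MongoDB or mgo code; group_first is a hypothetical function that mimics a $group keyed on one field with optional $first accumulators, and the sample documents are made up.

```python
def group_first(docs, key, keep=()):
    """Emulate {"$group": {"_id": "$<key>", f: {"$first": "$<f>"}}}.

    Only `_id` and the fields named in `keep` survive the stage.
    """
    groups = {}
    for doc in docs:
        group_id = doc.get(key)
        if group_id not in groups:
            # $first: take the value from the first document in the group.
            first_values = {field: doc.get(field) for field in keep}
            groups[group_id] = {"_id": group_id, **first_values}
    return list(groups.values())


docs = [
    {"user": "foo", "date_updated": 10, "organization": "a"},
    {"user": "foo", "date_updated": 20, "organization": "a"},
    {"user": "bar", "date_updated": 30, "organization": "b"},
]

bare = group_first(docs, "user")
kept = group_first(docs, "user", keep=("date_updated", "organization"))

# A date_updated filter after the bare group has nothing left to match.
matched_bare = [d for d in bare if d.get("date_updated", 0) > 5]
matched_kept = [d for d in kept if d.get("date_updated", 0) > 5]
```

With the bare group, every output document has only _id, so the later filter on date_updated matches none of them; with the accumulators in place, both groups pass the filter.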
Lastly (I know you're aware of it already), given the database schema above with separate users and organizations collections, you may want to reconsider embedding some values, depending on your application's use case. You may find 6 Rules of Thumb for MongoDB Schema Design useful.

ElasticSearch _Source is always empty on the return

I am posting a query to http://localhost:9200/movie_db/movie/_search, but the _source attribute is always empty in the response. I enabled it, but that doesn't help.
Movie DB:
TRY DELETE /movie_db
PUT /movie_db {"mappings": {"movie": {"properties": {"title": {"type": "string", "analyzer": "snowball"}, "actors": {"type": "string", "position_offset_gap" : 100, "analyzer": "standard"}, "genre": {"type": "string", "index": "not_analyzed"}, "release_year": {"type": "integer", "index": "not_analyzed"}, "description": {"_source": true, "type": "string", "analyzer": "snowball"}}}}}
BULK INDEX movie_db/movie
{"_id": 1, "title": "Hackers", "release_year": 1995, "genre": ["Action", "Crime", "Drama"], "actors": ["Johnny Lee Miller", "Angelina Jolie"], "description": "High-school age computer expert Zero Cool and his hacker friends take on an evil corporation's computer virus with their hacking skills."}
{"_id": 2, "title": "Johnny Mnemonic", "release": 1995, "genre": ["Science Fiction", "Action"], "actors": ["Keanu Reeves", "Dolph Lundgren"], "description": "A guy with a chip in his head shouts incomprehensibly about room service in this dystopian vision of our future."}
{"_id": 3, "title": "Swordfish", "release_year": 2001, "genre": ["Action", "Crime"], "actors": ["John Travolta", "Hugh Jackman", "Halle Berry"], "description": "A cast of characters challenge society's commonly held view that computer experts are not the beautiful people. Somehow, the CIA is hacked in under 5 minutes."}
{"_id": 4, "title": "Tomb Raider", "release_year": 2001, "genre": ["Adventure", "Action", "Fantasy"], "actors": ["Angelina Jolie", "Jon Voigt"], "description": "The story of a girl and her quest for antiquities in the face of adversity. This epic is adapter from its traditional video-game format to the big screen"}
Query:
{
"query" :
{
"term" : { "genre" : "Crime" }
}
}
Results:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "movie_db",
"_type": "movie",
"_id": "3",
"_score": 0.30685282,
"_source": {}
},
{
"_index": "movie_db",
"_type": "movie",
"_id": "1",
"_score": 0.30685282,
"_source": {}
}
]
}
}
I had the same problem: despite enabling _source in my query as well as in my mappings, _source would always be {}.
Your proposed solution of setting cluster.name in elasticsearch.yml gave me the hint that the problem must be some hidden setting in the old cluster.
I found out that I had an index template definition that came with a plugin I installed (in my case elasticsearch-transport-couchbase), which said
"_source" : {
"includes" : [ "meta.*" ]
},
thereby implicitly excluding all fields other than meta.* from source.
Check your templates like this:
curl -XGET localhost:9200/_template/?pretty
I deleted the couchbase template like so
curl -XDELETE localhost:9200/_template/couchbase
and created a new, almost identical one but with source enabled.
Here is how:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
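If there are many templates, the output of the _template call can be scanned for _source restrictions instead of being read by eye. This is a minimal sketch that works on the already-parsed JSON response; find_source_restrictions is a hypothetical helper, and the sample data below is illustrative only, shaped after the snippet quoted above.

```python
def find_source_restrictions(templates):
    """Map template name -> list of restrictive _source settings.

    `templates` is the parsed JSON returned by GET /_template.
    """
    def walk(node):
        found = []
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "_source" and isinstance(value, dict):
                    restricted = (
                        value.get("enabled") is False
                        or "includes" in value
                        or "excludes" in value
                    )
                    if restricted:
                        found.append(value)
                else:
                    found.extend(walk(value))
        return found

    result = {}
    for name, template in templates.items():
        hits = walk(template.get("mappings", {}))
        if hits:
            result[name] = hits
    return result


# Illustrative sample, not the real couchbase template.
sample = {
    "couchbase": {
        "template": "*",
        "mappings": {"_default_": {"_source": {"includes": ["meta.*"]}}},
    },
    "plain": {
        "template": "logs-*",
        "mappings": {"_default_": {"properties": {"title": {"type": "string"}}}},
    },
}
restricted = find_source_restrictions(sample)
```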
Solution:
In elasticsearch config folder, open elasticsearch.yml and set cluster.name to a different value, then restart elasticsearch.bat
I once accidentally passed a single field in the _source array, and that field didn't exist, for example "_source": ["bazinga"]; the source in the aggregations result came back empty.
So maybe you could simply pass a totally unrelated string into the _source array. This may be a better solution than making changes to the elasticsearch.yml file.

Resources