N1QL vs ElasticSearch Join - elasticsearch

My documents are:
Iwp::1::Porcentaje::Period::1
{
"id": null,
"period": 1,
"type": "IwpCumulative",
"category": "Porcentaje",
"sumEarn": 0,
"sumActual": 0.2248520710059172,
"sumForecast": 0,
"sumPlanned": 0,
"sumValue": 0,
"parent": "Iwp::1"
}
Iwp::1
{
"name": "Iwp 1",
"description": "Iwp 1 Description",
"manyPeriods": 50,
"type": "Iwp",
"countCC": 0,
"costCode": [
"CostCode::3",
"CostCode::4"
],
"iwpCumulatives": [
"Iwp::1::Porcentaje::Period::1",
.......
"Iwp::1::Porcentaje::Period::50",
"Iwp::1::Qty::Period::1",
........
"Iwp::1::Qty::Period::50",
]
}
How could I write this query in Elasticsearch?
N1QL:
select
t.category,
t.period,
sum(t.sumActual)
from
default as q
inner join default as p on keys q.parent
inner join default as t on keys p.iwpCumulatives
where
q.type = 'IwpCumulative'
and q.period = 50
and q.sumActual > 0
and q.category = 'Porcentaje'
group by t.category,t.period
order by t.period,t.category;
I have these queries in Elasticsearch:
{
"query":{
"filtered":{
"query":{
"bool":{
"must":[
{"term":{"period":"5"}},
{"term":{"type":"iwpcumulative"}},
{"range":{"sumActual":{"gt":"0"}}},
{"term":{"category":"porcentaje"}}
]
}
}
}
}
}
and this:
{
"size":0,
"aggs":{
"group_by_state":{
"terms":{
"field":"category"
},
"aggs":{
"costars":{
"terms":{
"field":"period"
},
"aggs":{
"Suma":{
"sum":{
"field":"earn"
}
}
}
}
}
}
}
}
Now I need to take the IDs from the first result and use them in the second query.
Thanks in advance.

Because your select list only uses fields from one table (a semi join), you can use the siren-join plugin for Elasticsearch:
Look at this :
SIREn Plugin to add relational join capabilities to Elasticsearch
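If installing a plugin is not an option, the same semi join can be approximated on the client in two round trips, which is what the last paragraph of the question describes. The following is a minimal Python sketch, not the plugin's API. It assumes every cumulative document carries a `parent` field pointing at its Iwp document (as in the sample document), that `parent` and `category` are indexed as exact values (not analyzed), and that `bool`/`filter` is available (Elasticsearch 2.0 or later; on 1.x the `terms` clause would go into the filter of a `filtered` query instead). The function name `build_sum_by_parent_query` and the aggregation names are hypothetical.

```python
def build_sum_by_parent_query(parent_ids):
    """Build the second request: sum sumActual per category and period,
    restricted to cumulative documents whose parent is in parent_ids."""
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {
            "bool": {
                "filter": [
                    # Assumption: "parent" holds the Iwp key, e.g. "Iwp::1".
                    {"terms": {"parent": sorted(set(parent_ids))}}
                ]
            }
        },
        "aggs": {
            "by_category": {
                "terms": {"field": "category"},
                "aggs": {
                    "by_period": {
                        "terms": {"field": "period"},
                        "aggs": {
                            "sum_actual": {"sum": {"field": "sumActual"}}
                        },
                    }
                },
            }
        },
    }


# Parent values collected from the hits of the first (filtering) query.
body = build_sum_by_parent_query(["Iwp::1", "Iwp::1", "Iwp::7"])
```

The first query stays as it is and only needs to return the `parent` field of each hit; duplicates are removed before the IDs are sent back in the second request.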

Related

Indexing/search algorithm stability between versions

I'm migrating from Elasticsearch 1.5 to 7.10. Several changes are required; the most relevant one is the removal of the document type concept in version 6. To deal with it, I introduced a new field, doc_type, and I match on it when I search.
My question is: when I run the same (or an equivalent, since there are some changes) search query, should I expect exactly the same result set? I'm seeing some differences, so I would like to figure out whether I broke something in the new mappings or in the search query.
Thank you in advance
Edit after first question:
In general: I have a service that communicates with ES 1.5 and I have to migrate it to ES 7.10 keeping the external API as stable as possible.
I'm not using scoring.
Previously I had document types A and B, and I queried them with, for example, host/indexname/A,B/_search. After the migration I keep A or B in doc_type, and the query becomes host/indexname/_search with a "bool":{"should":[{"terms":{"doc_type":["A"],"boost":1.0}},{"terms":{"doc_type":["B"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0} in the body. If I put A and B in different indexes and the user wants to match both, I would have to "merge" the search responses of both queries, and I don't know which strategy to follow for that. Keeping everything together, I get a response with mixed (doc_type) results from ES. I followed this specific approach: https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch#custom-type-field
The differences are not big. It's difficult to show a concrete example because the data/doc structure is complex, but the idea is this: with 1.5 I get this response for a given query, for example:
[a, b, c, d, e, f, g, h, i, j] (where each one may have any of types A or B)
With 7.10 I'm having responses like:
[a, b, e, c, d, f, g, h, i, j] or [a, b, c, d, e, g, i, j, k]
Second edit:
This query has been generated from the java client.
{
"from":0,
"size":100,
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"mark_deleted:false",
"fields":[
],
"type":"best_fields",
"default_operator":"or",
"max_determinized_states":10000,
"enable_position_increments":true,
"fuzziness":"AUTO",
"fuzzy_prefix_length":0,
"fuzzy_max_expansions":50,
"phrase_slop":0,
"escape":false,
"auto_generate_synonyms_phrase_query":true,
"fuzzy_transpositions":true,
"boost":1.0
}
},
{
"bool":{
"should":[
{
"terms":{
"type":[
"A"
],
"boost":1.0
}
},
{
"terms":{
"type":[
"B"
],
"boost":1.0
}
},
{
"terms":{
"type":[
"D"
],
"boost":1.0
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
},
"post_filter":{
"term":{
"mark_deleted":{
"value":false,
"boost":1.0
}
}
},
"sort":[
{
"a_specific_date":{
"order":"desc"
}
}
],
"highlight":{
"pre_tags":[
"<b>"
],
"post_tags":[
"</b>"
],
"no_match_size":120,
"fields":{
"body":{
"fragment_size":120,
"number_of_fragments":1
}
}
}
}
First, since you don't care about scoring, you should use bool/filter instead of bool/must at the top level. Otherwise your results are sorted by _score by default, and between 1.5 and 7.10 there have been so many changes that this would explain the differences you get. So you're better off simply sorting the results by any field other than _score.
Second, instead of the bool/should on type you can use a simple terms query, which does exactly the same job, yet in a simpler way:
{
"from": 0,
"size": 100,
"query": {
"bool": {
"filter": [
{
"query_string": {
"query": "mark_deleted:false",
"fields": [],
"type": "best_fields",
"default_operator": "or",
"max_determinized_states": 10000,
"enable_position_increments": true,
"fuzziness": "AUTO",
"fuzzy_prefix_length": 0,
"fuzzy_max_expansions": 50,
"phrase_slop": 0,
"escape": false,
"auto_generate_synonyms_phrase_query": true,
"fuzzy_transpositions": true,
"boost": 1
}
},
{
"terms": {
"type": [
"A",
"B",
"D"
]
}
}
]
}
},
"post_filter": {
"term": {
"mark_deleted": {
"value": false,
"boost": 1
}
}
},
"sort": [
{
"a_specific_date": {
"order": "desc"
}
}
],
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"no_match_size": 120,
"fields": {
"body": {
"fragment_size": 120,
"number_of_fragments": 1
}
}
}
}
Finally, I'm not sure why you're using a query_string query to do an exact match on mark_deleted:false; it doesn't make sense to me. A simple term query would be simpler and more appropriate here.
It's also not clear why your post_filter keeps only the results that have mark_deleted:false, since that is the same condition as in your query_string constraint.
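Put together, the suggestions above lead to a much smaller request. Here is a minimal Python sketch that builds it, assuming `mark_deleted` is mapped as a boolean and `type` as a keyword field. The helper name `build_search_body` is hypothetical, and the highlight section is left out for brevity.

```python
def build_search_body(doc_types, size=100):
    """Filter-only search: no scoring, explicit sort, no post_filter."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    # Exact match replaces the query_string clause.
                    {"term": {"mark_deleted": False}},
                    # One terms query replaces the bool/should on type.
                    {"terms": {"type": list(doc_types)}},
                ]
            }
        },
        "sort": [{"a_specific_date": {"order": "desc"}}],
    }


body = build_search_body(["A", "B", "D"])
```

Because the `term` filter already restricts the hits to `mark_deleted: false`, the `post_filter` from the original request is not repeated here.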

Using a filtering bucket to filter a moving average result

I need to know whether I can filter based on my moving average results (I just want the ones that are = 100). I tried to use a filtering bucket but always receive this error:
"Unknown key for a START_OBJECT in [hostname_bucket_filter]."
I don't know what else to do. This is the query:
"aggs":{
"aggs_host":{
"terms":{
"field":"dimensions.hostname.value",
"size":10,
"order":{"_key":"desc"},
"min_doc_count":1},
"aggs":{
"aggs_fs":{
"terms":{
"field":"dimensions.mount.value",
"size":10,
"order":{"_key":"desc"},
"min_doc_count":1
},
"aggs":{
"agg_date":{
"date_histogram":{
"interval":"1h",
"field":"timestamp",
"min_doc_count":0
},
"aggs":{
"agg_ave":{
"avg":{
"field":"value.double"
}
},
"aggs_ma":{
"moving_avg":{
"buckets_path":"agg_ave",
"window":6,
"model":"holt_winters",
"settings" : {
"type": "mult",
"alpha": 0.6,
"beta": 0.8,
"gamma": 0.6,
"period" : 3
},
"predict":7,
"minimize":false
}
}
}
}
}
}
}
}
},
"hostname_bucket_filter" : {
"bucket_selector" : {
"bucket_path" : {
"the_hosts_filt": "aggs_host>aggs_fs>agg_date>agg_ave.aggs_ma"
},
"script" : "params.the_hosts_filt >= 100"
}
}
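One likely cause of the error, offered as an assumption since only part of the request is shown: `bucket_selector` takes a parameter spelled `buckets_path` (not `bucket_path`), and as a pipeline aggregation it has to sit inside the `aggs` of the multi-bucket aggregation whose buckets it filters, next to the metric it reads. The Python sketch below builds only the `agg_date` level with the selector as a sibling of `aggs_ma`. The names `build_date_histogram_aggs`, `ma_filter` and `ma` are hypothetical, the moving-average settings are trimmed down, and how the selector treats predicted or empty buckets has not been verified here.

```python
def build_date_histogram_aggs(threshold=100):
    """Date histogram with an average, its moving average, and a
    bucket_selector that keeps buckets whose moving average >= threshold."""
    return {
        "agg_date": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "1h",
                "min_doc_count": 0,
            },
            "aggs": {
                "agg_ave": {"avg": {"field": "value.double"}},
                "aggs_ma": {
                    "moving_avg": {"buckets_path": "agg_ave", "window": 6}
                },
                # Sibling of aggs_ma, so the path is just the aggregation name.
                "ma_filter": {
                    "bucket_selector": {
                        "buckets_path": {"ma": "aggs_ma"},
                        "script": "params.ma >= " + str(threshold),
                    }
                },
            },
        }
    }


date_aggs = build_date_histogram_aggs()
```

The resulting dictionary would take the place of the `agg_date` block inside `aggs_fs`; the top-level `hostname_bucket_filter` entry is then no longer needed.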

Elasticsearch query to return the most recent of 'each document' based on a condition

I am trying to retrieve the most recent version of each document in my dataset when the document is not already archived (archived: false). So when any version of the document has archived set to true, it should not appear in my result.
An example of my dataset:
{
name: "soccer game",
base_id: 1,
hours_remaining: 10,
updatedDate: "2019-03-10",
archived: false
}
{
name: "basketball game",
base_id: 2,
hours_remaining: 20,
updatedDate: "2019-03-10",
archived: false
}
{
name: "soccer game",
base_id: 1,
hours_remaining: 5,
updatedDate: "2019-03-14",
archived: true
}
The expected result is:
{
name: "basketball game",
base_id: 2,
hours_remaining: 20,
updatedDate: "2019-03-10",
archived: false
}
After writing several queries, I haven't been able to achieve my goal. This is one of my attempts.
{
"size": 10,
"query":{
"bool":{
"must":[
{
"query_string":{
"query": "*",
"fields":["name.keyword"]
}
},
{
"term":{
"archived": false
}
}
]
}
},
"collapse": {
"field": "base_id",
"inner_hits": {
"name": "most_recent",
"size": 1,
"sort": [{"updatedDate": "desc"}]
}
}
}
What am I doing wrong?
I believe your query_string can be avoided. If you only want the documents with archived: false, the condition on name is not necessary.
I think you should first use a must condition to keep only the documents that have the field archived set to false, then use a terms aggregation on the name field, which returns the unique names that fulfill the must condition.
You can then use a max aggregation as a sub-aggregation to return only the largest value of the updatedDate field. With size set to 0 (the hits themselves are not needed), a term condition that keeps only archived: false, and a terms aggregation on name.keyword for the unique names, the final query should look like this:
{
"size": 0,
"query": {
"bool": {
"must": {
"term": {
"archived": false
}
}
}
},
"aggs": {
"names": {
"terms": {
"field": "name.keyword"
},
"aggs": {
"most_recent": {
"max": {
"field": "updatedDate"
}
}
}
}
}
}
Hope this is helpful! :D
This is simpler. Query on archived, regardless of anything else,
then order by date descending and keep only the first hit (size = 1):
{
"size": 1,
"query": {
"bool": {
"must": {
"term": {
"archived": "false"
}
}
}
},
"sort": [
{
"updatedDate": {
"order": "desc"
}
}
]
}
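To check any of these queries against the sample data, it can help to write the requirement from the question out in plain Python first. The sketch below is a reference for the expected result, not an Elasticsearch query, and the function name `latest_unarchived` is hypothetical. Note that filtering on archived: false per document still matches the older, unarchived "soccer game" version, whereas the question wants the whole base_id dropped as soon as any version is archived.

```python
from datetime import date


def latest_unarchived(docs):
    """Return the newest version per base_id, skipping every base_id
    that has at least one archived version."""
    archived_ids = {d["base_id"] for d in docs if d["archived"]}
    latest = {}
    for d in docs:
        if d["base_id"] in archived_ids:
            continue  # some version of this document is archived
        current = latest.get(d["base_id"])
        if current is None or d["updatedDate"] > current["updatedDate"]:
            latest[d["base_id"]] = d
    return list(latest.values())


docs = [
    {"name": "soccer game", "base_id": 1, "hours_remaining": 10,
     "updatedDate": date(2019, 3, 10), "archived": False},
    {"name": "basketball game", "base_id": 2, "hours_remaining": 20,
     "updatedDate": date(2019, 3, 10), "archived": False},
    {"name": "soccer game", "base_id": 1, "hours_remaining": 5,
     "updatedDate": date(2019, 3, 14), "archived": True},
]
result = latest_unarchived(docs)
```

On the three sample documents, `result` contains only the "basketball game" document, which matches the expected result in the question.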

Elasticsearch Sorting by Likes and Dislikes

I've been struggling to express the current logic problem I'm trying to solve with Elasticsearch, and I think I have a good way to represent it.
Let's say I'm building out an API to sort Mario Kart characters in order of the user's preference. The user can list characters they like, and those they dislike. Here is the data set:
{character: {name: "Mario", weight: "Light"}},
{character: {name: "Luigi", weight: "Medium"}},
{character: {name: "Peach", weight: "Light"}},
{character: {name: "Bowser", weight: "Heavy"}},
{character: {name: "Toad", weight: "Light"}},
{character: {name: "Koopa", weight: "Medium"}}
The user inputs that they like Mario and Luigi and do not like Bowser. With Elasticsearch, how could I go about sorting this data for the user so the list is returned like so:
[Mario (+), Luigi (+), Peach, Toad, Koopa, Bowser (-)]
*Pluses and minuses in there for legibility.
This would return the user's top choices in front, the ones they are OK with in the middle, and the ones they don't prefer at the end. Having to use nested queries really trips me up here.
Evolving the query, let's say there's a team mode where the game puts the characters into teams of two, in the following pairs:
[Luigi (+), Bowser (-)]
[Mario (+), Peach]
[Toad, Koopa]
How do I ensure that I don't filter out teams that contain Bowser, yet still weight the results so that they come back like so:
[Mario (+), Peach]
[Toad, Koopa]
[Luigi (+), Bowser (-)]
Or, should [Luigi, Bowser] actually rank second?
I'm very confused about building complex queries like these in Elasticsearch and would appreciate any help.
Depending on your mapping, something along the lines of
GET /characters/_search
{
"sort":[
"_score"
],
"query":{
"bool":{
"should":[
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Mario"
}
},
"boost":2.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Luigi"
}
},
"boost":2.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Peach"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Toad"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Koopa"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Bowser"
}
},
"boost":0
}
}
]
}
}
}
should work.
PS: If you have a nested mapping, wrap the bool query in a nested query clause and adjust the field name paths. To return only the name field, add a _source clause before the query with the path to name as its value.
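Writing one constant_score clause per character by hand gets tedious, so the request body can be generated from the user's input. This is a minimal Python sketch of that idea under the same assumptions as the query above (a `name.keyword` field, boosts of 2 / 1 / 0); the function name `build_preference_query` is hypothetical.

```python
def build_preference_query(all_names, likes, dislikes, field="name.keyword"):
    """One constant_score clause per character: liked characters get
    boost 2, disliked ones boost 0, everyone else boost 1."""
    def clause(name):
        if name in likes:
            boost = 2.0
        elif name in dislikes:
            boost = 0.0
        else:
            boost = 1.0
        return {
            "constant_score": {
                "filter": {"term": {field: name}},
                "boost": boost,
            }
        }

    return {
        "sort": ["_score"],
        "query": {"bool": {"should": [clause(n) for n in all_names]}},
    }


names = ["Mario", "Luigi", "Peach", "Bowser", "Toad", "Koopa"]
body = build_preference_query(names, likes={"Mario", "Luigi"}, dislikes={"Bowser"})
```

Every character needs its own clause, including the neutral ones, because a document that matches none of the should clauses would not be returned at all.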
First off I gotta say - IMHO using Elasticsearch for this is major overkill. You should probably go with a much simpler in memory data structure for this calculation.
Assuming you do decide to implement this with Elasticsearch, I would do the following thing:
1) Represent each character as a document using this mapping -
PUT game/characters/_mapping
{
"properties": {
"name":{
"type": "keyword"
},
"weight": {
"type": "keyword"
}
}
}
2) Each character will look like so:
PUT game/characters/bowser
{
"name": "Bowser",
"weight": "Heavy"
}
3) And then you can fetch them ordered by likes, similarly to how @sramalingam24 suggested. Note that boosts must be non-negative, so you'd need to "normalize" the likeability of the characters to a non-negative range:
GET game/characters/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"constant_score": {
"filter": {
"term": {
"name": "Peach"
}
},
"boost": 2
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Mario"
}
},
"boost": 2
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Toad"
}
},
"boost": 1
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Bowser"
}
},
"boost": 0
}
}
]
}
}
}
Good luck!

Elastic Search get top grouped sums with additional filters (Elasticsearch version5.3)

This is my Mapping :
{
"settings" : {
"number_of_shards" : 2,
"number_of_replicas" : 1
},
"mappings" :{
"cpt_logs_mapping" : {
"properties" : {
"channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
"playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
"country_code" : {"type":"text","store":"yes","index":"analyzed"},
"playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
"channel_name" : {"type":"text","store":"yes","index":"analyzed"},
"device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
}
}
}
}
I want to query the index similar to the way I do using the following MySQL query :
SELECT
channel_name,
SUM(`playtime_in_sec`) as playtime_in_sec
FROM
channel_play_times_bar_chart
WHERE
country_code = 'country' AND
device_report_tag = 'device' AND
channel_name = 'channel' AND
playing_date BETWEEN 'date_range_start' AND 'date_range_end'
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;
So far my QueryDSL looks like this
{
"size": 0,
"aggs": {
"ch_agg": {
"terms": {
"field": "channel_id",
"size": 30 ,
"order": {
"sum_agg": "desc"
}
},
"aggs": {
"sum_agg": {
"sum": {
"field": "playtime_in_sec"
}
}
}
}
}
}
QUESTION 1
The QueryDSL I have written does return the top 30 channel_ids by playtime, but I am confused about how to add the other filters to the search as well, i.e. country_code, device_report_tag & playing_date.
QUESTION 2
Another issue is that the result set contains only the channel_id and playtime fields, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should return the corresponding channel_name of each group instead.
NOTE: Performance over here is a top priority as this is supposed to be running behind a graph generator querying millions or even more docs.
TEST DATA
hits: [
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 1453,
playtime_in_sec: 35,
device_report_tag: "mydev",
channel_report_tag: "Sony Six",
country_code: "SE",
@timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 145,
playtime_in_sec: 25,
device_report_tag: "mydev",
channel_report_tag: "Star Movies",
country_code: "US",
@timestamp: "2017-08-11",
}
},
{
_index: "cpt_logs_index",
_type: "cpt_logs_mapping",
_id: "",
_score: 1,
_source: {
ChID: 12,
playtime_in_sec: 15,
device_report_tag: "mydev",
channel_report_tag: "HBO",
country_code: "PK",
@timestamp: "2017-08-12",
}
}
]
QUESTION1:
Are you looking to add a filter/query to the example above? If so you can simply add a "query" node to the query document:
{
"size": 0,
"query":{
"bool":{
"must":[
{"terms": { "country_code": ["pk","us","se"] } },
{"range": { "@timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
]
}
},
"aggs": {
"ch_agg": {
"terms": {
"field": "ChID",
"size": 30
},
"aggs":{
"ch_report_tag_agg": {
"terms":{
"field" :"channel_report_tag.keyword"
},
"aggs":{
"sum_agg":{
"sum":{
"field":"playtime_in_sec"
}
}
}
}
}
}
}
}
You can use all of Elasticsearch's normal queries/filters to pre-filter your search before you start aggregating. (Regarding performance: Elasticsearch applies any filters/queries before it starts to aggregate, so any filtering you can do here will help a lot.)
Question2:
Off the top of my head, I would suggest one of two solutions (unless I'm completely misunderstanding the question):
Add aggs levels for the fields you want in the output in the order you want to drill down. (you can nest aggs within aggs quite deeply without issues and get the bonus of count on each level)
Use the top_hits aggregation on the "lowest" level of aggs, and specify which fields you want in the output using "_source": { "include": [/fields/] }
Can you provide a few records of test data?
Also, it is useful to know which version of ElasticSearch you're running as the syntax and behaviour change a lot between major versions.
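As a rough illustration of how the pre-filter and the top_hits suggestion could be combined with the ordering from the question, here is a minimal Python sketch that builds such a request body. It uses the field names from the question's mapping (`channel_id`, `channel_name`, `playing_date`) rather than those in the test data, assumes the filter values are passed in the form in which they were indexed (lowercased for analyzed text fields), and relies on `playing_date` holding ISO-formatted date strings so that a range comparison behaves as expected. The function name `build_top_channels_query` and the aggregation name `channel` are hypothetical.

```python
def build_top_channels_query(country, device, date_from, date_to, limit=30):
    """Top `limit` channels by summed playtime, pre-filtered by country,
    device and an inclusive date range, with channel_name per bucket."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"country_code": country}},
                    {"term": {"device_report_tag": device}},
                    {"range": {"playing_date": {"gte": date_from, "lte": date_to}}},
                ]
            }
        },
        "aggs": {
            "ch_agg": {
                "terms": {
                    "field": "channel_id",
                    "size": limit,
                    "order": {"sum_agg": "desc"},
                },
                "aggs": {
                    "sum_agg": {"sum": {"field": "playtime_in_sec"}},
                    # One sample hit per bucket, only to read channel_name.
                    "channel": {
                        "top_hits": {
                            "size": 1,
                            "_source": {"includes": ["channel_name"]},
                        }
                    },
                },
            }
        },
    }


body = build_top_channels_query("se", "mydev", "2017-01-01", "2017-08-11")
```

Using filter clauses instead of must skips scoring, which the aggregation does not need; whether top_hits or a nested terms aggregation on the channel name is cheaper would have to be measured on the real data.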

Resources