Filters with AND on nested resources - elasticsearch

My problem: the Elasticsearch count does not match the count in my database.
I indexed the "users" table; each user can have one or many apps_events:
curl localhost:9200/users/_count
{"count":190291,"_shards":{"total":5,"successful":5,"failed":0}}
SELECT COUNT(*) FROM users
count : 190291
=> Same count, everything is OK.
But when I search with two filters (one term and one terms) on the nested resource:
curl -X GET 'http://localhost:9200/users/user/_search?load=&size=10&pretty' -d '
{
"query": {
"match_all": {
}
},
"filter": {
"and": [
{
"terms": {
"apps_events.type": [
"sale"
]
}
},
{
"term": {
"apps_events.status": "active"
}
}
]
},
"size": 10
}'
total: 63756
And in my database :
SELECT
COUNT(DISTINCT(users_id))
FROM
apps_event
WHERE
apps_event_state_id = 1 AND apps_event_project_id = 2;
count : 63340
Because in fact, the SQL equivalent of the Elasticsearch query is:
SELECT
COUNT(DISTINCT(users_id))
FROM apps_event
WHERE apps_event_state_id = 1
AND users_id IN
(SELECT DISTINCT(users_id) FROM apps_event WHERE apps_event_project_id = 2)
count : 63756
===> How can I apply a simple "AND" within each nested resource?
Thanks

You've probably checked this, but is apps_event_project_id the right counterpart to apps_events.type? They don't look equivalent on the surface, but you would know for sure. Also, does users_id map directly to the ES _id? If not, duplicates in your index could be inflating its count.

Best resource ever for "nested resource":
http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/
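To make the mismatch in the question concrete, here is a minimal Python sketch (not Elasticsearch code) of the two matching behaviours. It assumes that apps_events is an array of objects with type and status fields; the sample users and the helper names are hypothetical. With the default object mapping, the values of all events are pooled per user, so the two filters can be satisfied by different events. With a nested mapping, both conditions must hold on the same event. The last helper builds a request body using the nested query of recent Elasticsearch versions, assuming apps_events is mapped as nested.

```python
# Sketch only: simulates matching semantics, it does not talk to Elasticsearch.

def flattened_match(user, type_values, status):
    """Default object mapping: field values are pooled across all events."""
    types = {e["type"] for e in user["apps_events"]}
    statuses = {e["status"] for e in user["apps_events"]}
    return bool(types & set(type_values)) and status in statuses


def nested_match(user, type_values, status):
    """Nested mapping: both conditions must hold on the same event."""
    return any(
        e["type"] in type_values and e["status"] == status
        for e in user["apps_events"]
    )


def nested_query(type_values, status, path="apps_events"):
    """Request body for a nested query (assumes `path` is mapped as nested)."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "bool": {
                        "filter": [
                            {"terms": {f"{path}.type": type_values}},
                            {"term": {f"{path}.status": status}},
                        ]
                    }
                },
            }
        }
    }


# Hypothetical sample data: user 2 has a "sale" event and an "active" event,
# but no single event that is both.
users = [
    {"id": 1, "apps_events": [{"type": "sale", "status": "active"}]},
    {"id": 2, "apps_events": [{"type": "sale", "status": "closed"},
                              {"type": "visit", "status": "active"}]},
]

flattened_ids = [u["id"] for u in users if flattened_match(u, ["sale"], "active")]
nested_ids = [u["id"] for u in users if nested_match(u, ["sale"], "active")]
```

In this sample, the pooled behaviour also selects user 2, which is the same kind of over-count as the 63756 versus 63340 difference in the question.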

Related

Use query result as parameter for another query in Elasticsearch DSL

I'm using the Elasticsearch DSL and trying to use a query result as a parameter for another query, like below:
{
"query": {
"bool": {
"must_not": {
"terms": {
"request_id": {
"query": {
"match": {
"processing.message": "OUT Followup Synthesis"
}
},
"fields": [
"request_id"
],
"_source": false
}
}
}
}
}
}
As you can see above, I'm trying to search for documents whose request_id is not one of the request_ids that have processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data with processing.message equal to 'IN Followup Sythesis' whose request_id doesn't appear in data with processing.message equal to 'OUT Followup Sythesis'. In SQL:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, SQL-style joins and subqueries are not supported in Elasticsearch.
So you'll have to run your first query, take the retrieved IDs and put them into a second query, ideally a terms query.
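The two-step approach can be sketched in Python as plain request bodies. This is a minimal sketch under assumptions, not a definitive implementation: the helper names are hypothetical, match_phrase is swapped in for match so that "IN ..." and "OUT ..." messages are not confused by shared words, and request_id is assumed to be searchable as an exact term (with a default dynamic mapping you may need request_id.keyword). Sending the bodies to the cluster is left to whatever client you use.

```python
def first_query(disallowed_msg, size=1000):
    """Step 1: find the documents whose request_id must be excluded.

    `size` is an assumption; large result sets need pagination instead.
    """
    return {
        "size": size,
        "_source": ["request_id"],
        "query": {"match_phrase": {"processing.message": disallowed_msg}},
    }


def extract_request_ids(response):
    """Collect the distinct request_ids from a search response."""
    hits = response["hits"]["hits"]
    return sorted({hit["_source"]["request_id"] for hit in hits})


def second_query(wanted_msg, excluded_ids):
    """Step 2: wanted message, minus the request_ids found in step 1."""
    return {
        "query": {
            "bool": {
                "must": {"match_phrase": {"processing.message": wanted_msg}},
                "must_not": {"terms": {"request_id": excluded_ids}},
            }
        }
    }


# Hypothetical response to step 1, shaped like a real search response.
step1_response = {
    "hits": {"hits": [{"_source": {"request_id": "abc"}}]}
}
excluded = extract_request_ids(step1_response)
step2_body = second_query("IN Followup Sythesis", excluded)
```

Keep in mind that the terms query accepts a limited number of terms per request (65536 by default), so very large exclusion lists need batching.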
Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
"size": 0,
"query": {
"match": {
"processing.message": "IN Followup Sythesis"
}
},
"aggs": {
"subquery_mock": {
"scripted_metric": {
"params": {
"disallowed_msg": "OUT Followup Synthesis"
},
"init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
"map_script": """
def req_id = params._source.request_id;
def msg = params._source.processing.message;
if (msg.contains(params.disallowed_msg)) {
state.disallowed_request_ids.add(req_id);
// won't need this particular doc so continue looping
return;
}
if (state.by_request_ids.containsKey(req_id)) {
// there may be multiple docs under the same ID
// so concatenate them
state.by_request_ids[req_id].add(params._source);
} else {
// initialize an appendable arraylist
state.by_request_ids[req_id] = [params._source];
}
""",
"combine_script": """
state.by_request_ids.entrySet()
.removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
return state.by_request_ids
""",
"reduce_script": "return states"
}
}
}
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend testing this script on a smaller set of documents before letting it target your whole index, for example by restricting it with a date range query or similar.
FYI, Elasticsearch exposes an SQL API as part of X-Pack (the SQL REST API is included in the free Basic license; the JDBC and ODBC drivers require a paid subscription).

Elastic search Group by count for particular field

I have an Elasticsearch index with the following documents:
{
"id": 1,
"mainid": "497940311988134801282012-04-10"
}
{
"id": 2,
"mainid": "497940311988134801282012-04-10"
}
I am looking for a query similar to the following SQL, given this example MySQL table:
id mainid
1 497940311988134801282012-04-10
2 497940311988134801282012-04-10
3 497940311988134801282012-04-10
4 something different
select id ,mainid ,count(mainid) as county from wfcharges group by mainid,id having county>1;
As far as I can tell there is no count aggregate function in Elasticsearch, so I am stuck here. This is what I have tried. Any suggestions or online resources are welcome. Thanks.
GET /wfcharges/_search
{
"aggs" : {
"countfield" : {
"count" : { "field" : "mainid" }
}
}
}
I think you'd want to use the terms aggregation. This groups documents by identical terms and returns a count for each term. Look at the linked URL for an example.
In your case, it would look like this:
GET /wfcharges/_search
{
"aggs" : {
"countfield" : {
"terms" : { "field" : "mainid" }
}
}
}
This query is going to be exactly what you need:
GET /wfcharges/_search
{
"aggs": {
"countfield": {
"terms": {
"field": "mainid",
"min_doc_count": 2
}
}
}
}
It aggregates by the mainid field and requires each bucket to contain at least 2 documents (that is, more than 1).
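As a rough illustration of what min_doc_count does, here is a small Python sketch that mimics the bucket output of a terms aggregation on in-memory documents. It is only a simulation: the function name is hypothetical, and a real terms aggregation returns only the top 10 buckets by default and needs a field that can be aggregated on (typically a keyword field).

```python
from collections import Counter


def terms_with_min_doc_count(docs, field, min_doc_count=1):
    """Group docs by `field` and keep buckets with enough documents."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [
        {"key": key, "doc_count": n}
        for key, n in counts.most_common()  # largest buckets first
        if n >= min_doc_count
    ]


# The example table from the question.
docs = [
    {"id": 1, "mainid": "497940311988134801282012-04-10"},
    {"id": 2, "mainid": "497940311988134801282012-04-10"},
    {"id": 3, "mainid": "497940311988134801282012-04-10"},
    {"id": 4, "mainid": "something different"},
]

buckets = terms_with_min_doc_count(docs, "mainid", min_doc_count=2)
```

Only the mainid shared by ids 1 to 3 survives the filter, which corresponds to the HAVING county > 1 clause in the SQL.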

Convert Sql to ES query

I want to get all the records that have the latest "created_on" time from my Elasticsearch documents.
In SQL, what I need is:
select * from table1
where created_on = (select max(created_on) from table1)
But I'm new to ES and don't know how to do it.
I can first get the max(created_on) date from ES and then query again to get all the records that have that created_on value.
Is there a way to do this with a single query?
{
"filter" : {
"match_all" : { }
},
"sort": [
{
"created_on": {
"order": "desc"
}
}
],
"size": 1
}
You can try this query and let me know if this works.
You can try
// descending order
var entity= ctx.Table1.OrderByDescending(s => s.Created_on).FirstOrDefault();
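Note that a sort with size 1 returns a single document even when several share the latest created_on. One possible single-request alternative, sketched here in Python as a request body, is a terms aggregation on created_on ordered by key descending with size 1, plus a top_hits sub-aggregation to return the documents in that bucket. This is a sketch under assumptions: the helper names are hypothetical, the _key ordering applies to recent Elasticsearch versions, top_hits is capped (100 by default), and a terms aggregation on a high-cardinality date field can be expensive.

```python
def latest_records_request(max_records=100):
    """One request: newest created_on bucket plus the docs inside it."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "latest": {
                "terms": {
                    "field": "created_on",
                    "order": {"_key": "desc"},  # newest value first
                    "size": 1,                  # keep only that bucket
                },
                "aggs": {
                    "records": {"top_hits": {"size": max_records}}
                },
            }
        },
    }


def extract_latest(response):
    """Return the documents of the newest bucket, or [] if the index is empty."""
    buckets = response["aggregations"]["latest"]["buckets"]
    if not buckets:
        return []
    return buckets[0]["records"]["hits"]["hits"]


# Hypothetical response fragment, shaped like a real aggregation response.
fake_response = {
    "aggregations": {
        "latest": {
            "buckets": [
                {"key": 1, "doc_count": 2,
                 "records": {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}]}}}
            ]
        }
    }
}
latest_docs = extract_latest(fake_response)
```

If more documents than the top_hits cap can share the latest timestamp, the two-query approach from the question (a max aggregation followed by a filter on that value) is the safer choice.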

Elasticsearch query returns 10 when expecting > 10,000

I want to retrieve all the JSON objects in Elasticsearch that have a null value for awsKafkaTimestamp. This is the query I have set up:
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "tracer.awsKafkaTimestamp"
}
}
}
}
}
When I curl my Elasticsearch endpoint with this DSL, I only get a few values back. I am expecting all (10000+) of them, because I know for sure that all the awsKafkaTimestamp values are null.
When I send the same request with Postman, only 10 JSON objects are returned.
This is the correct behaviour of Elasticsearch. By default, it returns only 10 records and reports the total number of documents matching the search criteria in the hits.total field. To retrieve more than 10, specify the size field in your query, as in the example below, raising the value as needed (you can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html):
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "user" : "kimchy" }
}
}
By default Elasticsearch gives you 10 results, even if the query matches 10212. You can set the size parameter, but from + size is limited to 10000 by default, so beyond that you should use the scroll API to get all the results.
Example from the Elasticsearch documentation on the Scroll API:
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'
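The scroll request above returns only the first page together with a _scroll_id; the client then has to keep asking for the next page until one comes back empty. A minimal Python sketch of that loop follows. The two callables are hypothetical stand-ins for however you send the initial search and the follow-up scroll requests; only the response fields _scroll_id and hits.hits are taken from the actual API.

```python
def iter_scroll(start_search, continue_scroll):
    """Yield every hit by following scroll ids until a page is empty.

    start_search():             returns the first response dict
    continue_scroll(scroll_id): returns the next response dict
    """
    response = start_search()
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        response = continue_scroll(response["_scroll_id"])


# Fake pages standing in for a cluster, keyed by the scroll id that
# fetches them (None is the initial search).
pages = {
    None: {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
}

all_ids = [
    hit["_id"]
    for hit in iter_scroll(lambda: pages[None], lambda sid: pages[sid])
]
```

Injecting the two callables keeps the loop independent of the HTTP client and makes it easy to test without a running cluster.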

ElasticSearch - Limit size of nested collection on Query Result

If I have a blog post with thousands (or hundreds of thousands) of nested comments and I want to retrieve just the top 10 blog posts, I can use size to control how many blog posts are returned, but I am not sure how to limit how many nested comments come back with each one.
e.g. this returns the top 10 blog posts with all of their comments:
GET myblog/_search
{
"size": 10,
"query": {
"match_all": {}
}
}
I tried inner_hits, but it doesn't work for me. To use it I have to run a query on the nested comments; I also disabled the source (to avoid retrieving each post with all of its comments), and the inner_hits result gives me each comment together with its post, which is redundant when several comments share the same parent post.
I also thought about the parent-child approach, but that means issuing multiple requests/queries.
Do you know how to limit the size of a nested collection in a query?
What I am looking for is a query that does something like: get the top 10 blog posts with the top 5 comments each.
Can you try this query:
{
"_source": false,
"fields":["your_fields"],
"size": 10,
"query": {
"match_all": {}
},
"inner_hits" : {
"comments" : {
"path" : {
"comments" : {
"size":5,
"query" : {
"match_all": {}
}
}
}
}
}
}
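The top-level inner_hits form used above belongs to older Elasticsearch releases. In more recent versions, inner_hits is placed inside a nested query instead. The following Python sketch builds such a request body; it assumes that comments is mapped as a nested field, and the helper names are hypothetical. Note that a nested query only matches posts that have at least one comment.

```python
def top_posts_with_comments(posts=10, comments=5, path="comments"):
    """Top `posts` blog posts, each with at most `comments` nested comments."""
    return {
        "size": posts,
        # Leave the full comment array out of each post's _source.
        "_source": {"excludes": [path]},
        "query": {
            "nested": {
                "path": path,
                "query": {"match_all": {}},
                # Return only the first few matching comments per post.
                "inner_hits": {"size": comments},
            }
        },
    }


def comments_of(hit, path="comments"):
    """Extract the limited comments from one search hit."""
    return [c["_source"] for c in hit["inner_hits"][path]["hits"]["hits"]]


body = top_posts_with_comments()

# Hypothetical hit, shaped like a real search hit with inner hits.
fake_hit = {
    "_id": "post-1",
    "inner_hits": {
        "comments": {"hits": {"hits": [{"_source": {"text": "first"}}]}}
    },
}
first_comments = comments_of(fake_hit)
```

Each post appears once in the result, with its comments available under inner_hits, which avoids the redundancy described in the question.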
