Elasticsearch result with group by filters - elasticsearch

I'm implementing Elasticsearch on my system and have a question.
I am building a job portal. I currently use the request below to list all available positions:
GET /companies/job/_search
{
"sort" : [
{ "post_date" : {"order" : "asc"}},
"_score"
]
}
With that, I get all available positions (2357). For example:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2537,
"max_score": 1.9790175,
"hits": [
{
"_index": "companies",
"_type": "job",
"_id": "2",
"_score": 1.9790175,
"_source": {
"name": "HTML Developer (1 - 2 Yrs Exp.)",
"category": "Graphic Designer",
"location": "Nolda",
"skills": "Javascript"
}
},
{
"_index": "companies",
"_type": "job",
"_id": "114",
"_score": 0.30432263,
"_source": {
"name": "PHP Developer (2 Yrs Exp.)",
"category": "Engineering Job",
"location": "Pune",
"skills": "PHP"
}
}
]
}
}
I want to display filters in a sidebar based on the returned list, similar to the attached prototype.
Example:
2357 vacancies were returned.
The sidebar next to the list shows how the total breaks down by category: 214 are for graphic designer, 514 are for engineering, etc.
Grouped by location, we have 1254 for Nolda, 221 for Pune, etc.
I would like to know whether the same request that returns all available jobs can also return the groupings,
or whether I have to make two requests: one for the jobs and another for the groupings (with the counter for each item in a grouping).

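Try this:
GET companies/job/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "category.keyword",
"size": 15
}
}
}
}
Here field is the field you want to group by (category, location or skills), and the size inside terms is the maximum number of groups (buckets) returned. The top-level size is the number of job documents returned in the hits: 0 returns only the group counts, so set it to your page size to get the jobs and the groupings in the same request.
Hope it helps :)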
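To answer the "one request or two" part more concretely, here is a minimal sketch in Python that only builds the request body and reads the bucket counts back; it does not talk to a cluster. The function names, the by_ prefix for aggregation names, and the assumption that category, location and skills each have a .keyword sub-field are illustrative and not taken from the original mapping.

```python
import json


def build_job_search(page_size=10, bucket_size=15,
                     facet_fields=("category", "location", "skills")):
    """Build one search body that returns a page of jobs plus sidebar counts."""
    return {
        # Top-level size: how many job documents come back in hits.hits.
        "size": page_size,
        "sort": [{"post_date": {"order": "asc"}}, "_score"],
        # One terms aggregation per sidebar group (assumes .keyword sub-fields).
        "aggs": {
            "by_" + field: {
                "terms": {"field": field + ".keyword", "size": bucket_size}
            }
            for field in facet_fields
        },
    }


def facet_counts(response):
    """Turn the aggregations part of a response into {group: {value: count}}."""
    return {
        name: {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}
        for name, agg in response.get("aggregations", {}).items()
    }


body = build_job_search()
print(json.dumps(body, indent=2))
```

The printed body can be sent as-is to the _search endpoint; the hits give the job list and facet_counts gives the numbers for the sidebar.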

Related

Elasticsearch search for a child and all his sibling documents grouped by parent

I would like to be able to submit a query which matches on child documents and returns the parent and all his child documents.
I have parent and child documents in my Elasticsearch index related through a join (https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html).
My items are divided into groups, and each item in my index is a separate child document. (Note: I need to be able to search children separately with different queries, so I can NOT use nested objects.) The parent document contains a few meaningful fields (name, sku, image), so I need to get the parent along with its children.
I've met my requirements with the following query:
GET my_index/_search
{
"query": {
"has_child": {
"type": "child",
"query": {
"has_parent": {
"parent_type": "parent",
"query": {
"has_child": {
"type": "child",
"query": {
"multi_match": {
"query": "NV1540JR",
"fields": [
"name",
"sku"
]
}
}
}
}
}
},
"inner_hits": {}
}
}
}
It returns the following result, which is exactly what I need:
{
"took": 301,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9GEAT",
"_score": 1.0,
"_source": {
"id": "Az9GEAT",
"name": "Gold Calacatta 2.0",
"sku": "NV1540",
"my_join-field": "parent"
},
"inner_hits": {
"child": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "zx9EEAR",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "zx9EEAR",
"name": "Gold Calacatta 12\" x 24\"",
"sku": "NV1540M-2",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "Az9NEAT",
"_score": 1.0,
"_routing": "Az9GEAT",
"_source": {
"id": "Az9NEAT",
"name": "Gold Calacatta 2.0, 24\" x 48\"",
"sku": "NV1540JR",
"familyName": "Gold Calacatta 2.0",
"familySku": "NV1540",
"my_join-field": {
"name": "child",
"parent": "Az9GEAT"
}
}
}
]
}
}
}
}
]
}
}
Alternatively, I could implement an application-side join by making three separate query calls (one to get all matching data, a second to get the siblings, a third to get the parents) and combining the results in my application. But I'm not sure that would be faster, because of the HTTP request time and the data processing time.
I'm very new to Elasticsearch and can't estimate how bad this query is. How does it affect query performance? Are there other ways to get the desired result? How could my query be improved? I'd be glad to hear any suggestions or thoughts! Thanks
With ES it's standard practice to retrieve a list of object ids and then perform a second request to return the complete document set.
You can implement your logic using 2 queries:
Request (1): all documents satisfying your child search criteria. Select only the child.id & child.parent_id fields, so that only index data is loaded and no full document _source is read. This request will be relatively fast.
In your application code, determine the unique list of parent_ids & orphaned_child_ids.
Request (2): all documents satisfying the criteria: parent_id in parent_ids OR (parent_id = NULL AND child_id in orphaned_child_ids).
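The two-request flow could be sketched roughly as follows in Python. This only builds request bodies and processes hits; it makes no HTTP calls. The helper names are hypothetical, the join field name my_join-field and the relation name child are taken from the question, and using the parent_id and ids queries for the second request is one possible choice, not necessarily what the answer had in mind.

```python
def build_child_query(text):
    """Request (1): match children, returning only the join field."""
    return {
        "_source": ["my_join-field"],
        "query": {"multi_match": {"query": text, "fields": ["name", "sku"]}},
    }


def split_child_hits(hits, join_field="my_join-field"):
    """Collect unique parent ids and the ids of hits that have no parent."""
    parent_ids, orphan_ids = set(), []
    for hit in hits:
        join = hit.get("_source", {}).get(join_field)
        parent = join.get("parent") if isinstance(join, dict) else None
        if parent:
            parent_ids.add(parent)
        else:
            orphan_ids.append(hit["_id"])
    return sorted(parent_ids), orphan_ids


def build_family_query(parent_ids, orphan_ids, child_type="child"):
    """Request (2): the parents, all of their children, and any orphans."""
    # One parent_id clause per parent fetches that parent's children.
    should = [{"parent_id": {"type": child_type, "id": pid}} for pid in parent_ids]
    ids = list(parent_ids) + list(orphan_ids)
    if not ids:
        return {"query": {"match_none": {}}}
    # The ids clause fetches the parent documents themselves plus orphans.
    should.append({"ids": {"values": ids}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

Typical use would be to run build_child_query("NV1540JR"), pass the returned hits to split_child_hits, and send the body from build_family_query as the second search. Whether this beats the nested has_child / has_parent query depends on the data and is worth measuring.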

Searching within a percentile in elastic [duplicate]

Say I want to filter documents by some field within the 10th to 20th percentile. I'm wondering whether that's possible with some simple query, something like {"fieldName":{"percentile": [0.1, 0.2]}}.
Say I have these documents:
[{"a":1,"b":101},{"a":2,"b":102},{"a":3,"b":103}, ..., {"a":100,"b":200}]
I need the first 10% of them by a (in ascending order), that is, a from 1 to 10. Then I need to sort those results by b in descending order and take a paginated result (like page 2, with 10 items per page).
One solution I have in mind would be:
get the total count of the documents.
sort the documents by a and take the corresponding _id values with limit 0.1 * total_count
write the final query, something like id in (...) order by b
But the shortcomings are pretty obvious too:
it seems inefficient if we're talking about subsecond latency
the second query might not work if the first query returns too many _id values (ES only allows 1000 by default. I can change the config of course, but there's always a limit).
I doubt there is a way to do this in one query if the exact values of a are not known beforehand, but I think a pretty efficient approach is feasible.
I would suggest running a percentiles aggregation as the first query and a range query as the second.
My sample index has only 14 documents, so for illustration I will find the documents between the 30th and 60th percentiles of field a and sort them by field b in descending order (to be sure that the sort worked).
Here are the docs I inserted:
{"a":1,"b":101}
{"a":5,"b":105}
{"a":10,"b":110}
{"a":2,"b":102}
{"a":6,"b":106}
{"a":7,"b":107}
{"a":9,"b":109}
{"a":4,"b":104}
{"a":8,"b":108}
{"a":12,"b":256}
{"a":13,"b":230}
{"a":14,"b":215}
{"a":3,"b":103}
{"a":11,"b":205}
Let's find the bounds of field a at the 30th and 60th percentiles:
POST my_percent/doc/_search
{
"size": 0,
"aggs" : {
"percentiles" : {
"percentiles" : {
"field" : "a",
"percents": [ 30, 60, 90 ]
}
}
}
}
With my sample index it looks like this:
{
...
"hits": {
"total": 14,
"max_score": 0,
"hits": []
},
"aggregations": {
"percentiles": {
"values": {
"30.0": 4.9,
"60.0": 8.8,
"90.0": 12.700000000000001
}
}
}
}
Now we can use the boundaries to do the range query:
POST my_percent/doc/_search
{
"query": {
"range": {
"a" : {
"gte" : 4.9,
"lte" : 8.8
}
}
},
"sort": {
"b": "desc"
}
}
And the result is:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": null,
"hits": [
{
"_index": "my_percent",
"_type": "doc",
"_id": "vkFvYGMB_zM1P5OLcYkS",
"_score": null,
"_source": {
"a": 8,
"b": 108
},
"sort": [
108
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "vUFvYGMB_zM1P5OLWYkM",
"_score": null,
"_source": {
"a": 7,
"b": 107
},
"sort": [
107
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "vEFvYGMB_zM1P5OLRok1",
"_score": null,
"_source": {
"a": 6,
"b": 106
},
"sort": [
106
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "u0FvYGMB_zM1P5OLJImy",
"_score": null,
"_source": {
"a": 5,
"b": 105
},
"sort": [
105
]
}
]
}
}
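The two steps above can be chained in application code. The sketch below is an illustration in Python that builds both request bodies and reads the bounds out of the first response; the function names, the aggregation name bounds, and the from/size pagination are my own assumptions added to cover the "page 2, 10 items per page" requirement from the question.

```python
def build_percentiles_request(field, low, high):
    """Step 1: ask only for the two percentile bounds, no hits."""
    return {
        "size": 0,
        "aggs": {"bounds": {"percentiles": {"field": field, "percents": [low, high]}}},
    }


def build_range_request(agg_response, field, sort_field, low, high,
                        page=1, page_size=10):
    """Step 2: range query between the bounds, sorted and paginated."""
    values = agg_response["aggregations"]["bounds"]["values"]
    # Percentile keys come back as strings such as "30.0" and "60.0".
    lower = values[str(float(low))]
    upper = values[str(float(high))]
    return {
        "from": (page - 1) * page_size,
        "size": page_size,
        "query": {"range": {field: {"gte": lower, "lte": upper}}},
        "sort": [{sort_field: "desc"}],
    }
```

With the sample response shown earlier, build_range_request(response, "a", "b", 30, 60) produces the same range query as the hand-written one, provided the aggregation is named bounds as in build_percentiles_request.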
Note that the results of percentiles aggregation are approximate.
In general, this looks like a task better solved by pandas or a Spark job.
Hope that helps!

GET query based on timestamp

I’m looking for help building a query that retrieves the number of documents from a given recent time frame, for example the last 30 minutes.
The documents are syslogs like:
{
"_index": "logstash-2017.01.16",
"_type": "syslog",
"_id": "AVmnIUFGd2leAWt2KJSr",
"_score": 1,
"_source": {
"#timestamp": "2017-01-16T11:54:48.318Z",
"syslog_severity_code": 5,
"syslog_facility": "user-level",
"#version": "1",
"host": "10.0.0.1",
"syslog_facility_code": 1,
"message": "Test Syslog Message",
"type": "syslog",
"syslog_severity": "notice",
tags": [
"_grokparsefailure"
]
}
My idea is to build this query into another script that will check for new items being added to ES.
Use Range Query:
GET index/type/_count
{
"query": {
"range": {
"#timestamp": {
"from": "now-30m",
"to" : "now"
}
}
}
}
This will give output like :
{
"count": 2,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
}
}
where count is the number of documents matched.
Read more in the Range Query documentation.
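Since the question mentions calling this from another script, here is a small Python sketch that builds the _count body for an arbitrary window and interprets the response. The function names are hypothetical, and the field name @timestamp is assumed to be the standard Logstash timestamp field; no request is actually sent.

```python
def recent_count_body(minutes=30, field="@timestamp"):
    """Body for a _count request covering the last `minutes` minutes."""
    return {
        "query": {
            "range": {field: {"gte": "now-%dm" % minutes, "lte": "now"}}
        }
    }


def has_new_documents(count_response):
    """True when the _count response reports at least one matching document."""
    return count_response.get("count", 0) > 0
```

A polling script could post recent_count_body(30) to the _count endpoint of the logstash indices and act whenever has_new_documents returns True.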

Finding multiple Elasticsearch documents with same ids, different types

I need to find out whether any documents with a certain id are already indexed in my ES database, so that I can delete them before indexing a new document.
The trouble is that I do not know a priori the type they were indexed as.
I found the _mget query, which sounds like it could be what I need, but this quote in the documentation says I only get 1 (random) hit when searching:
If you don’t set the type and have many documents sharing the same
_id, you will end up getting only the first matching document.
How can I get this behaviour, i.e. find all documents sharing an _id (possibly more than one, with different _type, in the same index) without an expensive _search query?
Thanks!
A simple term query on "_id" worked for me.
I created a trivial index and added two documents for each of two different types:
PUT /test_index
POST /test_index/_bulk
{"index":{"_type":"type1","_id":1}}
{"name":"type1 doc1"}
{"index":{"_type":"type1","_id":2}}
{"name":"type1 doc2"}
{"index":{"_type":"type2","_id":1}}
{"name":"type2 doc1"}
{"index":{"_type":"type2","_id":2}}
{"name":"type2 doc2"}
And this query will return both documents with id 1:
POST /test_index/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"_id": "1"
}
}
}
}
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "type1",
"_id": "1",
"_score": 1,
"_source": {
"name": "type1 doc1"
}
},
{
"_index": "test_index",
"_type": "type2",
"_id": "1",
"_score": 1,
"_source": {
"name": "type2 doc1"
}
}
]
}
}
Here's the code I used:
http://sense.qbox.io/gist/a8085b57c22631148dd4c67769307caf6425fd95
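Because the goal in the question is to delete the matches before re-indexing, the hits from the search above could be turned into a _bulk delete payload. The following Python sketch is an assumption about how one might do that; the function name is hypothetical, it relies on each hit carrying _index, _type and _id as in the response shown, and it only builds the newline-delimited body without sending it.

```python
import json


def bulk_delete_lines(search_response):
    """Build a newline-delimited _bulk body that deletes every hit."""
    lines = []
    for hit in search_response["hits"]["hits"]:
        action = {
            "delete": {
                "_index": hit["_index"],
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action))
    # The bulk API expects a trailing newline after the last action.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be posted to the _bulk endpoint; an empty string means there was nothing to delete.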

How to filter out elements from an array that doesn’t match the query?

Stack Overflow won't let me post that much example code, so I put it on gist.
I have this index
with this mapping.
Here is a sample document I inserted into the newly created mapping.
This is my query:
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
This is the unwanted result I get from the query above:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
And finally the wanted result, i.e. how I want the query result to look:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
}
]
}
}
]
}
}
What should the query look like to achieve the wanted result, with the array field filtered down to the items that match the query? In other words, the non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify result terms matching the query, or possibly roll your own client-side masking using the info returned from setting "explain": true in your query.
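As a rough illustration of the client-side masking option, the Python sketch below filters the params array of a hit after the response arrives. It does not use the explain output; instead it assumes, for simplicity, that an item matches when its paramName starts with the query text, which is only an approximation of whatever analyzer the index uses. The function name and parameters are hypothetical.

```python
def mask_params(hit, query_text, array_field="params", key="paramName"):
    """Return a copy of the hit whose array keeps only prefix-matching items."""
    needle = query_text.lower()
    source = dict(hit["_source"])  # shallow copy, original hit is untouched
    source[array_field] = [
        item for item in source.get(array_field, [])
        if str(item.get(key, "")).lower().startswith(needle)
    ]
    masked = dict(hit)
    masked["_source"] = source
    return masked
```

Applied to the sample hit with the query text "col", this keeps the color entry and drops capacity, which is the wanted result shown in the question.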
