Getting Distinct fields from Elasticsearch - elasticsearch

I have 1 million documents, each with a field called id. The id value is different in every document.
E.g.:
1. id: http://www.bing.com/search?q=malaysia
2. id: http://www.google.com/search?q=singapore
3. id: http://www.bing.com/search?q=india
4. id: http://www.google.com/search?q=america
5. id: http://www.duckduckgo.com/?q=africa
6. id: http://www.duckduckgo.com/?q=asia
Can someone help me form a query that returns only the 3 distinct domains here? I just want to get google.com, bing.com and duckduckgo.com.

I can't test the syntax right now, but this should work. Just use a script to extract the host from your URL string.
{
"aggs": {
"urls": {
"terms": {
"field": "id",
"script" : "def path = doc['id'].value; int currentSplit = path.indexOf('//'); if (currentSplit > 0) { path = path.substring(currentSplit + 2); currentSplit = path.indexOf('/'); if (currentSplit > 0) { path = path.substring(0, currentSplit); } } return path;"
}
}
}
}
The best practice should be to index the domain name on the document if you need this aggregation a lot :).
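To illustrate the index-time approach, here is a minimal Python sketch that derives the domain before the document is sent to Elasticsearch. The field name domain and the helper names are my own assumptions, not something from the question, and stripping a leading www. is a choice I made to match the output the asker wants.

```python
from urllib.parse import urlsplit


def extract_domain(url: str) -> str:
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def with_domain(doc: dict) -> dict:
    """Copy the document and add the (hypothetical) 'domain' field."""
    return {**doc, "domain": extract_domain(doc["id"])}


docs = [
    {"id": "http://www.bing.com/search?q=malaysia"},
    {"id": "http://www.google.com/search?q=singapore"},
    {"id": "http://www.duckduckgo.com/?q=asia"},
]
enriched = [with_domain(d) for d in docs]
distinct_domains = sorted({d["domain"] for d in enriched})
print(distinct_domains)
```

If domain is then mapped as a keyword field, a plain terms aggregation on it should give the distinct values without any script.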

Related

How to compare two date fields in same document in elasticsearch

In my Elasticsearch index, each document has two date fields, createdDate and modifiedDate. I'm trying to add a filter in Kibana to fetch the documents where modifiedDate is greater than createdDate. How do I create this filter in Kibana?
I tried the query below, but instead of greater-than it behaves like gte and fetches all records:
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script" : {
"inline" : "doc['modifiedTime'].value.getMillis() > doc['createdTime'].value.getMillis()",
"lang" : "painless"
}
}
}
}
}
}
There are a few options.
Option A: The easiest and most performant one is to store the difference of the two fields inside a new field of your document, e.g.
{
"createdDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:34:56Z",
"diffMillis": 0
}
{
"createdDate": "2022-01-11T12:34:56Z",
"modifiedDate": "2022-01-11T12:35:58Z",
"diffMillis": 62000
}
Then, in Kibana you can query on diffMillis > 0 and figure out all documents that have been modified after their creation.
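For Option A, the difference has to be computed by whatever writes the documents. A minimal sketch in Python, assuming ISO-8601 timestamps like the ones above (the helper names are hypothetical):

```python
from datetime import datetime, timedelta


def parse_iso(ts: str) -> datetime:
    # Older Python versions reject a trailing 'Z' in fromisoformat(),
    # so it is normalised to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def with_diff_millis(doc: dict) -> dict:
    """Copy the document and add diffMillis = modifiedDate - createdDate."""
    delta = parse_iso(doc["modifiedDate"]) - parse_iso(doc["createdDate"])
    return {**doc, "diffMillis": delta // timedelta(milliseconds=1)}


doc = with_diff_millis({
    "createdDate": "2022-01-11T12:34:56Z",
    "modifiedDate": "2022-01-11T12:35:58Z",
})
print(doc["diffMillis"])
```

The same computation could also live in an ingest pipeline; I show it client-side only because that is the simplest thing to test.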
Option B: You can use a script query
GET index/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": """
return doc['createdDate'].value.millis < doc['modifiedDate'].value.millis;
"""
}
}
}
}
}
Note: depending on the amount of data you have, this option can potentially have disastrous performance, because it needs to be evaluated on ALL of your documents.
Option C: If you're using ES 7.11+, you can use runtime fields directly from the Kibana Discover view.
You can use the following script in order to add a new runtime field (e.g. name it diffMillis) to your index pattern:
emit(doc['modifiedDate'].value.millis - doc['createdDate'].value.millis)
And then you can add the following query into your search bar
diffMillis > 0
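If you want the same thing outside of Kibana, my understanding is that ES 7.11+ also accepts runtime fields directly in the search request under runtime_mappings. The sketch below only builds such a request body in Python; it reuses the emit script from above, and the function name is my own.

```python
import json

EMIT_SCRIPT = (
    "emit(doc['modifiedDate'].value.millis - doc['createdDate'].value.millis)"
)


def modified_after_created_body(field_name: str = "diffMillis") -> dict:
    """Build a search body that defines the runtime field and filters on it."""
    return {
        "runtime_mappings": {
            field_name: {"type": "long", "script": {"source": EMIT_SCRIPT}}
        },
        "query": {"range": {field_name: {"gt": 0}}},
    }


body = modified_after_created_body()
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of GET index/_search. Like Option B, this is still evaluated per document at query time.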

Elasticsearch - get (unfiltered) aggregates for a (filtered) subset

I have an elasticsearch index containing "hit" documents (with fields like ip/timestamp/uri etc) which are populated from my nginx access logs.
I'm looking for a method of getting the total number of hits / ip - but for a subset of IPs, namely the ones that did a request today.
I know I can have a filtered aggregation by doing:
/search?size=0
{
'query': { 'bool': { 'must': [
{'range': { 'timestamp': { 'gte': $today}}},
{'query_string': {'query': 'status:200 OR status:404'}},
]}},
'aggregations': {'c': {'terms': {'field': 'ip', 'size': 99999}}}
}
but this will sum only the hits that were done today, I want the total number of hits in the index but only from IPs that have hits today. Is this possible?
-edit-
I've tried the global option but while
'aggregations': {'c': {'global': {}, 'aggs': {'c2': {'terms': {'field': 'remote_user', 'size': 99999}}}}}
returns counts from all IPs; it ignores my filter on timestamp (eg. it includes IPs that did hits a couple of days ago)
There is a way to achieve what you want in a single query, but since it involves scripting, performance might suffer depending on the volume of data you run this query on.
The idea is to leverage the scripted_metric aggregation in order to build your own aggregation logic over the whole document set.
What we do below is pretty simple:
- we don't give any query, so we consider the full document set
- Map phase: we build a map of all IPs and, for each IP,
  - we count the total number of hits
  - we flag it if it had hits today AND with one of the given statuses (same as what you do in your query)
- Reduce phase: we return the total hit count for each IP that was flagged as having hits today
Here is what the query looks like:
POST my-index/_search
{
"size": 0,
"aggs": {
"all_time_hits": {
"scripted_metric": {
"init_script": "state.ips = [:]",
"map_script": """
// initialize total hits count for each IP and increment
def ip = doc['ip.keyword'].value;
if (state.ips[ip] == null) {
state.ips[ip] = [
'total_hits': 0,
'hits_today': false
]
}
state.ips[ip].total_hits++;
// flag IP if:
// 1. it has hits today
// 2. the hit had one of the given statuses
def today = Instant.ofEpochMilli(new Date().getTime()).truncatedTo(ChronoUnit.DAYS);
def hitDate = doc['timestamp'].value.toInstant().truncatedTo(ChronoUnit.DAYS);
def hitToday = today.equals(hitDate);
def statusOk = params.statuses.indexOf((int) doc['status'].value) >= 0;
state.ips[ip].hits_today = state.ips[ip].hits_today || (hitToday && statusOk);
""",
"combine_script": "return state.ips;",
"reduce_script": """
def ips = [:];
for (state in states) {
for (ip in state.keySet()) {
// only consider IPs that had hits today
if (state[ip].hits_today) {
if (ips[ip] == null) {
ips[ip] = 0;
}
ips[ip] += state[ip].total_hits;
}
}
}
return ips;
""",
"params": {
"statuses": [200, 404]
}
}
}
}
}
And here is what the response looks like:
"aggregations" : {
"all_time_hits" : {
"value" : {
"123.123.123.125" : 1,
"123.123.123.123" : 4
}
}
}
I think that pretty much does what you expect.
The other option (more performant because no script) requires you to make two queries. First, a query with the date range and status check with a terms aggregation to retrieve all IPs that have hits today (like you do now), and then a second query where you filter on those IPs (using a terms query) over the whole index (no date range or status check) and get hits count for each of them using a terms aggregation.
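A rough sketch of that two-query flow in Python. It only builds the request bodies and reads bucket keys from a response, so the HTTP client is left out. The aggregation names, the bucket size of 10000 and the use of the field name ip (rather than ip.keyword) are assumptions on my part.

```python
def ips_today_request(today: str, statuses: list) -> dict:
    """Query 1: IPs that have hits today with one of the given statuses."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": today}}},
            {"terms": {"status": statuses}},
        ]}},
        "aggs": {"ips": {"terms": {"field": "ip", "size": 10000}}},
    }


def bucket_keys(response: dict, agg_name: str) -> list:
    """Extract the bucket keys of a terms aggregation from a search response."""
    return [b["key"] for b in response["aggregations"][agg_name]["buckets"]]


def all_time_hits_request(ips: list) -> dict:
    """Query 2: hit count per IP over the whole index, restricted to those IPs."""
    return {
        "size": 0,
        "query": {"terms": {"ip": ips}},
        "aggs": {"hits_per_ip": {"terms": {"field": "ip", "size": max(len(ips), 1)}}},
    }


# Simulated response of query 1, shaped like a real terms aggregation result.
first_response = {"aggregations": {"ips": {"buckets": [
    {"key": "123.123.123.123", "doc_count": 2},
    {"key": "123.123.123.125", "doc_count": 1},
]}}}
second_request = all_time_hits_request(bucket_keys(first_response, "ips"))
print(second_request)
```

Keep in mind that a terms query is limited to 65,536 terms by default (index.max_terms_count), so very large IP sets would need to be sent in batches.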
In the example you shared, the query filters your documents and the aggregation only sees the filtered set. But you want your aggregation to take all documents into account regardless of the query.
This is why the global aggregation exists.
Its context is defined by the indices and the document types you're searching on, but is not influenced by the search query itself.
Sample query example:
{
"query": {
"match": { "type": "t-shirt" }
},
"aggs": {
"all_products": {
"global": {},
"aggs": {
"avg_price": { "avg": { "field": "price" } }
}
}
}
}

Use query result as parameter for another query in Elasticsearch DSL

I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:
{
"query": {
"bool": {
"must_not": {
"terms": {
"request_id": {
"query": {
"match": {
"processing.message": "OUT Followup Synthesis"
}
},
"fields": [
"request_id"
],
"_source": false
}
}
}
}
}
}
As you can see above, I'm trying to search for documents whose request_id is not among the request_ids of documents with processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch documents whose processing.message equals 'IN Followup Sythesis' and whose request_id doesn't appear in any document whose processing.message equals 'OUT Followup Sythesis'. In SQL:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, neither SQL-style joins nor subqueries are supported in Elasticsearch, so the join has to happen on the application side.
So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.
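As a sketch of that application-side approach, the Python below builds both request bodies. It assumes request_id has a keyword sub-field called request_id.keyword and uses match_phrase for the message; both are my assumptions and may need adjusting to your mapping.

```python
ID_FIELD = "request_id.keyword"  # assumed keyword sub-field


def ids_to_exclude_request(message: str, max_ids: int = 10000) -> dict:
    """Query 1: collect the request_ids of documents carrying `message`."""
    return {
        "size": 0,
        "query": {"match_phrase": {"processing.message": message}},
        "aggs": {"ids": {"terms": {"field": ID_FIELD, "size": max_ids}}},
    }


def filtered_request(message: str, excluded_ids: list) -> dict:
    """Query 2: documents with `message` whose request_id is not excluded."""
    return {
        "query": {"bool": {
            "must": [{"match_phrase": {"processing.message": message}}],
            "must_not": [{"terms": {ID_FIELD: excluded_ids}}],
        }}
    }


first = ids_to_exclude_request("OUT Followup Synthesis")
# Pretend query 1 returned a single bucket with key "abc".
excluded = ["abc"]
second = filtered_request("IN Followup Sythesis", excluded)
print(second)
```

With the three sample documents used further down, the second query should return only the xyz document.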
Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
"size": 0,
"query": {
"match": {
"processing.message": "IN Followup Sythesis"
}
},
"aggs": {
"subquery_mock": {
"scripted_metric": {
"params": {
"disallowed_msg": "OUT Followup Synthesis"
},
"init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
"map_script": """
def req_id = params._source.request_id;
def msg = params._source.processing.message;
if (msg.contains(params.disallowed_msg)) {
state.disallowed_request_ids.add(req_id);
// won't need this particular doc so continue looping
return;
}
if (state.by_request_ids.containsKey(req_id)) {
// there may be multiple docs under the same ID
// so concatenate them
state.by_request_ids[req_id].add(params._source);
} else {
// initialize an appendable arraylist
state.by_request_ids[req_id] = [params._source];
}
""",
"combine_script": """
state.by_request_ids.entrySet()
.removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
return state.by_request_ids
""",
"reduce_script": "return states"
}
}
}
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend testing this script on a smaller set of documents before letting it target your whole index; you could restrict it with a date range query or similar.
FYI Elasticsearch exposes an SQL API as part of X-Pack. As far as I know it is included in the free Basic license in current versions, but check the subscription details for yours.

Elastic search Group by count for particular field

I have an Elasticsearch index with the following documents:
{
"id": 1,
"mainid": "497940311988134801282012-04-10"
}
{
"id": 2,
"mainid": "497940311988134801282012-04-10"
}
I am looking for a query similar to the following. Example MySQL table:
id  mainid
1   497940311988134801282012-04-10
2   497940311988134801282012-04-10
3   497940311988134801282012-04-10
4   something different
select id, mainid, count(mainid) as county from wfcharges group by mainid, id having county > 1;
As far as I can tell, there is no count aggregation available in Elasticsearch, so I am stuck here. This is what I have tried. Any suggestions or online resources would be appreciated. Thanks.
GET /wfcharges/_search
{
"aggs" : {
"countfield" : {
"count" : { "field" : "mainid" }
}
}
}
I think you want the terms aggregation. It groups documents by term and returns a count for each term. See the linked documentation for an example.
In your case, it would look like this:
GET /wfcharges/_search
{
"aggs" : {
"countfield" : {
"terms" : { "field" : "mainid" }
}
}
}
This query is going to be exactly what you need:
GET /wfcharges/_search
{
"aggs": {
"countfield": {
"terms": {
"field": "mainid",
"min_doc_count": 2
}
}
}
}
It aggregates by the mainid field and requires each returned bucket to contain at least 2 documents (i.e. more than 1).
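To make the semantics concrete, here is a small in-memory Python imitation of a terms aggregation with min_doc_count, run on the sample rows from the question. It is only meant to show what the buckets would contain, not how Elasticsearch computes them.

```python
from collections import Counter


def terms_buckets(values, min_doc_count=1):
    """Group values, count them, and keep groups with enough documents."""
    counts = Counter(values)
    return [
        {"key": key, "doc_count": n}
        for key, n in counts.most_common()
        if n >= min_doc_count
    ]


mainids = [
    "497940311988134801282012-04-10",
    "497940311988134801282012-04-10",
    "497940311988134801282012-04-10",
    "something different",
]
buckets = terms_buckets(mainids, min_doc_count=2)
print(buckets)
```

This corresponds to GROUP BY mainid HAVING count > 1. Note that the SQL in the question also groups by id, which would put every row into its own group.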

ElasticSearch: Using an existing field in script param

I am trying to create a nested object and set a field in it to the value of another field of the same document. I can create a non-nested field with my logic value, and I can create a nested field with a hard-coded value, but I cannot get the two to work together.
Here is what I have so far.
Create a nested field:
{
"script": "ctx._source.displayFields = displayField",
"params": {
"displayField": {
"displayField": 11
}
}
}
Or I can use a script to fetch the value and set a field like this:
{
"script": "if (ctx._source['fielda'] == 'term1') { ctx._source['displayField'] = ctx._source['field2']; } else if (ctx._source['fielda'] == 'term2') { ctx._source['displayFields.displayPrice'] = ctx._source['fieldb']; }"
}
But if I try to put a script in a params field, as in the example below, I always get an error. Any advice would be greatly appreciated.
Things I have tried that did not work:
{
"script": "ctx._source.displayFields = displayField",
"params": {
"displayField": {
"displayField": "tag"
},
"tag" : {
"script": "ctx._source['numberField']"
}
}
}
I also tried assigning a script as its subfield and putting the script directly as the value.
