Solr facet equivalent of group by? - sorting

If I have some data like this:
{"field1":"x", "field2":".."}
{"field1":"x", "field2":".."}
{"field1":"y", "field2":".."}
{"field1":"y", "field2":".."}
{"field1":"y", "field2":".."}
Using a simple group=true&group.field=field1&group.limit=0 I get results like this:
{
  "responseHeader": {..},
  "grouped": {
    "field1": {
      "matches": 5,
      "groups": [
        {"groupValue": "x", "doclist": {"numFound": 2, ...}},
        {"groupValue": "y", "doclist": {"numFound": 3, ...}}
      ]
    }
  }
}
Using this, I know the number of documents found for each groupValue (numFound). The problem is that I need to sort the resulting groups by that count in descending order, which doesn't seem possible with either sort parameter: a plain sort=numFound throws an exception saying the field numFound does not exist, and group.sort only sorts the documents inside each group.
Is there an equivalent of this using facets where I can sort the results by count?

You can try:
http://localhost:8983/solr/your_core/select?facet.field=field1&facet.sort=count&facet.limit=-1&facet=on&indent=on&q=*:*&rows=0&start=0&wt=json
The result will be something like:
{
  "responseHeader": {
    "status": 0,
    "QTime": 17,
    "params": {
      "q": "*:*",
      "facet.field": "field1",
      "indent": "on",
      "start": "0",
      "rows": "0",
      "facet": "on",
      "wt": "json"}},
  "response": {"numFound": 225364, "start": 0, "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "field1": [
        "x", 113550,
        "y", 111814]},
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  }
}
Just tested with Solr 6.3.0.
For more information you can check the relevant section of the Solr documentation.
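Alternatively, on Solr 5.x and later you can get the same per-value counts, sorted by count, through the JSON Facet API. A minimal sketch, assuming the same core name as above (special characters may need URL encoding; the default bucket limit is 10, so raise limit if you need all values):
http://localhost:8983/solr/your_core/select?q=*:*&rows=0&wt=json&json.facet={by_field1:{type:terms,field:field1,sort:"count desc"}}
Each bucket in the response carries the field value and its count, already ordered by count descending.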
If you also want to compute the number of distinct facet values in the same request, you can use the Solr Stats Component (as long as the field is of a numeric, string, or date type).
Keep in mind, though, that this can introduce server performance and memory overhead issues.
Running a query like:
http://localhost:8983/solr/your_core/select?facet.field=field1&facet.sort=count&facet.limit=10&facet=true&indent=on&q=*:*&rows=0&start=0&wt=json&stats=true&stats.field={!cardinality=true}field1
The response is something like:
{
  "responseHeader": {
    "status": 0,
    "QTime": 614,
    "params": {
      "facet.limit": "10",
      "q": "*:*",
      "facet.field": "field1",
      "indent": "on",
      "stats": "true",
      "start": "0",
      "rows": "0",
      "facet": "true",
      "wt": "json",
      "facet.sort": "count",
      "stats.field": "{!cardinality=true}field1"}},
  "response": {"numFound": 2336315, "start": 0, "docs": []
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "field1": [
        "Value1", 708116,
        "Value2", 607088,
        "Value3", 493949,
        "Value4", 314433,
        "Value5", 104478,
        "Value6", 41099,
        "Value7", 28879,
        "Value8", 18767,
        "Value9", 9308,
        "Value10", 4545]},
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}},
  "stats": {
    "stats_fields": {
      "field1": {
        "cardinality": 27}}}}
For more information about the Stats Component, you can check the Solr documentation.
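On versions with the JSON Facet API, another way to get the number of distinct values is the unique (or approximate hll) facet function. A sketch, not taken from the answer above:
http://localhost:8983/solr/your_core/select?q=*:*&rows=0&wt=json&json.facet={distinct_field1:"unique(field1)"}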

Related

$elemMatch with $in SpringData Mongo Query

I am in the process of creating a method that will compose a query using Spring Data, and I have a couple of questions. I am trying to perform a query using top-level attributes of a document (i.e. the id field) as well as attributes of a subarray.
To do so I am using a query similar to this:
db.getCollection("journeys").find({ "_id._id": "0104", "journeyDates": { $elemMatch: { "period": { $in: [ 1,2 ] } } } })
As you can see, I would also like to filter using $in on the values of the subarray. Running the above query, though, yields wrong results, as if the $elemMatch were ignored completely.
Running a similar but slightly different query like this:
db.getCollection("journeys").find({ "_id._id": { $in: [ "0104" ] } }, { journeyDates: { $elemMatch: { period: { $in: [ 1, 2 ] } } } })
does seem to yield better results, but it returns only the first found element matching the $in of the subarray filter.
Now my question is: how can I query using both top-level attributes as well as subarrays with $in? Preferably I would like to avoid aggregations. Secondly, how can I translate this native Mongo query to a Spring Data Query object?

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with an EFK stack for logging, as described here. I have never worked with any of these components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entries containing it and not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
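Note that %{POSINT:plz} simply captures the first run of digits found in message. If your log lines can contain other numbers, one option (a hedged variant, not from the original answer) is to anchor the pattern on the JSON key:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "\"plz\":\"%{POSINT:plz}\""
        ]
      }
    }
  ]
}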
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
  "index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
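One detail to be aware of (an assumption on my part, not covered above): for Kibana's map visualizations the location field has to be mapped as geo_point, so you would create that mapping before indexing any documents. A sketch for recent Elasticsearch versions (on 6.x you would add a _doc type level around properties):
PUT plz
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}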
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

Index main-object, sub-objects, and do a search on sub-objects (that returns sub-objects)

I have an object like this (simplified here): each strain has many chromosomes, each chromosome has many locus entries, each locus has many features, and each feature has many products, ... Here I just put one of each.
The structure in json is:
{
  "name": "my strain",
  "public": false,
  "authorized_users": [1, 23, 51],
  "chromosomes": [
    {
      "name": "C1",
      "locus": [
        {
          "name": "locus1",
          "features": [
            {
              "name": "feature1",
              "products": [
                {
                  "name": "product1"
                  //...
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
I want to add this object to Elasticsearch; for the moment I have added the objects separately: locus, features, and products. That works for searching (I want to type a keyword and look it up in the name of locus, the name of features, and the name of products), but I need to duplicate data like public and authorized_users in each subobject.
Can I index the whole object in Elasticsearch and just do a search at each level (locus, features, and products)? And get those back individually (not the whole Strain object)?
Yes, you can search at any level (i.e. with a query on a field like "chromosomes.locus.name").
But as you have arrays at each level, you will have to use nested objects (and nested query) to get exactly what you want, which is a bit more complex:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.3/query-dsl-nested-query.html
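For illustration, here is a sketch of what the nested mapping and a nested query could look like for this structure. Field names are taken from the question, only the chromosomes and locus levels are shown (features and products would follow the same pattern), and the index name strains is hypothetical; on older Elasticsearch versions the mapping would be wrapped in a type name:
PUT strains
{
  "mappings": {
    "properties": {
      "name":             { "type": "text" },
      "public":           { "type": "boolean" },
      "authorized_users": { "type": "integer" },
      "chromosomes": {
        "type": "nested",
        "properties": {
          "name": { "type": "text" },
          "locus": {
            "type": "nested",
            "properties": {
              "name": {
                "type": "text",
                "fields": { "keyword": { "type": "keyword" } }
              }
            }
          }
        }
      }
    }
  }
}

GET strains/_search
{
  "query": {
    "nested": {
      "path": "chromosomes",
      "query": {
        "nested": {
          "path": "chromosomes.locus",
          "query": {
            "match": { "chromosomes.locus.name": "locus1" }
          }
        }
      }
    }
  }
}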
For your last question: no, you cannot get subobjects individually; Elasticsearch returns the whole JSON source object.
If you want only data from subobjects, you will have to use nested aggregations.
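A sketch of such a nested aggregation, reusing the hypothetical mapping above (the terms aggregation runs on the keyword sub-field of the locus name):
GET strains/_search
{
  "size": 0,
  "aggs": {
    "chromosomes": {
      "nested": { "path": "chromosomes" },
      "aggs": {
        "locus": {
          "nested": { "path": "chromosomes.locus" },
          "aggs": {
            "locus_names": {
              "terms": { "field": "chromosomes.locus.name.keyword" }
            }
          }
        }
      }
    }
  }
}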

Using elastic search to build flow/funnel results based on unique identifiers

I want to be able to return a set of counts of individual documents from a single index based on a previous set of results, and am wondering if there is a way to do it without running a separate query for each.
So, given a data set like this (simplified version of my ES documents):
{
  "name": "visit",
  "sessionId": "session1"
},
{
  "name": "visit",
  "sessionId": "session2"
},
{
  "name": "visit",
  "sessionId": "session3"
},
{
  "name": "click",
  "sessionId": "session1"
},
{
  "name": "click",
  "sessionId": "session3"
}
What I would like to do is search for name: visit and get a count of all those. That part is easy. But I would also like to count the name: click docs whose sessionId appears in the name: visit result set, and return how many of those name: click docs there were as well as the name: visit count.
Is there an easy way to do this? I have looked at the aggregation APIs, but none of them seem to quite fit my needs. There also seems to be a parent/child relationship, but it doesn't apply to my situation since both kinds of documents I want individual counts of are of the same type.
Expected result would be something like this:
{
  "count": {
    // total number of visit events since this is my start point
    "visit": 3,
    // the amount of click results that have sessionId
    // matching my previous search's sessionId values
    "click": 2
  }
}
At first glance, you need to do this in two queries:
the first aggregation query to retrieve the sessionIds and
a second aggregation query filtered with those sessionIds to find the count of clicks.
I don't think it's a big deal to run those two queries, but that depends on how much data you have and how many sessionIds you want to retrieve at once.
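A rough sketch of those two requests, not part of the original answer, assuming an index called events and that name and sessionId are keyword (not analyzed) fields:
POST events/_search
{
  "size": 0,
  "query": { "term": { "name": "visit" } },
  "aggs": {
    "sessions": {
      "terms": { "field": "sessionId", "size": 10000 }
    }
  }
}

POST events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "click" } },
        { "terms": { "sessionId": ["session1", "session3"] } }
      ]
    }
  }
}
You would feed the sessionId bucket keys from the first response into the terms filter of the second; the total hits of the second query, together with the visit count from the first, give the expected result shown in the question.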

Sorting a match query with ElasticSearch

I'm trying to use ElasticSearch to find all records containing a particular string. I'm using a match query for this, and it's working fine.
Now, I'm trying to sort the results based on a particular field. When I try this, I get some very unexpected output, and none of the records even contain my initial search query.
My request is structured as follows:
{
  "query": {
    "match": {"_all": "some_search_string"}
  },
  "sort": [
    {
      "some_field": {
        "order": "asc"
      }
    }
  ]
}
Am I doing something wrong here?
In order to sort on a string field, your mapping must contain a non-analyzed version of this field. Here's a simple blog post I found that describes how you can do this using the multi_field mapping type.
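On current Elasticsearch versions the same idea is expressed with a keyword sub-field rather than the old multi_field type. A sketch, with the field name taken from the question and a hypothetical index name:
PUT my_index
{
  "mappings": {
    "properties": {
      "some_field": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
You would then sort on some_field.raw while still running the match query against some_field.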
