Elasticsearch query with nested aggregations causing out of memory - elasticsearch

I have Elasticsearch installed with 16gb of memory. I started using aggregations, but ran into a "java.lang.OutOfMemoryError: Java heap space" error when I attempted to issue the following query:
POST /test-index-syslog3/type-syslog/_search
{
"query": {
"query_string": {
"default_field": "DstCountry",
"query": "CN"
}
},
"aggs": {
"whatever": {
"terms": {
"field" : "SrcIP"
},
"aggs": {
"destination_ip": {
"terms": {
"field" : "DstIP"
},
"aggs": {
"port" : {
"terms": {
"field" : "DstPort"
}
}
}
}
}
}
}
}
The query_string itself only returns 1266 hits so I'm a bit confused by the OOM error.
Am I using aggregations incorrectly? If not, what can I do to troubleshoot this issue?
Thanks!

You are loading the entire SrcIP-, DstIP-, and DstPort-fields into memory in order to aggregate on them. This is because Elasticsearch un-inverts the entire field to be able to rapidly look up a document's value for a field given its ID.
If you're going to largely be aggregating on a very small set of data, you should look into using docvalues. Then a document's value is stored in a way that makes it easy to look up given the document's ID. There's a bit more overhead to it, but that way you'll leave it to the operating system's field cache to have the relevant pages in memory, instead of having to load the entire field.

Not sure about the mapping of course, but looking at the value the field DstCountry can be non_analyzed. Than you could replace the query by a filter within the aggregate. Maybe that helps.
Also check if the fields you use in your aggregation are of type non_analyzed.

Related

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good, however, by default the maximum number of hits you can return in a top_hits aggregation is 100, if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
So, if your parent documents have less than 100 children, why not, otherwise I'd seriously consider going another approach.

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception if inserting data to a field which does not fit the existing type. For example, if a field has been created as number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling then ignore_malformed setting, which means such fields are silently ignored for indexing purposes, but retained in the _source document - meaning that the invalid values cannot be searched or aggregated, but are still included in the returned document.
This is preferable behavior in our use case, but we would wish to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find document where this field does not exist, see this example
PUT foo
{
"mappings": {
"bar": {
"properties": {
"baz": {
"type": "integer",
"ignore_malformed": true
}
}
}
}
}
PUT foo/bar/1
{
"baz": "field"
}
GET foo/bar/_search
{
"query": {
"bool": {
"filter": {
"bool": {
"must_not": [
{
"exists": {
"field": "baz"
}
}
]
}
}
}
}
}
There is no dedicated mechanism though, so this search finds also documents where the field is not set intentionally
You cannot, when you search on elasticsearch, you don't search on document source but on the inverted index, which contains the analyzed data.
ignore_malformed flag is saying "always store document, analyze if possible".
You can try, create a mal-formed document, and use _termvectors API to see how the document is analyzed and stored in the inverted index, in a case of a string field, you can see an "Array" is stored as an empty string etc.. but the field will exists.
So forget the inverted index, let's use the source!
Scroll all your data until you find the anomaly, I use a small python script that search scroll, unserialize and I test field type for every documents (very long) but I can have a list of wrong document IDs.
Use a script query can be very long and crash your cluster, use with caution, maybe as a post_filter:
Here I want to retrieve the document where country_name is not a string:
{
"_source": false,
"timeout" : "30s",
"query" : {
"query_string" : {
"query" : "locale:de_ch"
}
},
"post_filter": {
"script": {
"script": "!(_source.country_name instanceof String)"
}
}
}
"_source:false" => I want only document ID
"timeout" => prevent crash
As you notice, this is a missing feature, I know logstash will tag
document that fail, so elasticsearch could implement the same thing.

Scope Elasticsearch Results to Specific Ids

I have a question about the Elasticsearch DSL.
I would like to do a full text search, but scope the searchable records to a specific array of database ids.
In SQL world, it would be the functional equivalent of WHERE id IN(1, 2, 3, 4).
I've been researching, but I find the Elasticsearch query DSL documentation a little cryptic and devoid of useful examples. Can anyone point me in the right direction?
Here is an example query which might work for you. This assumes that the _all field is enabled on your index (which is the default). It will do a full text search across all the fields in your index. Additionally, with the added ids filter, the query will exclude any document whose id is not in the given array.
{
"bool": {
"must": {
"match": {
"_all": "your search text"
}
},
"filter": {
"ids": {
"values": ["1","2","3","4"]
}
}
}
}
Hope this helps!
As discussed by Ali Beyad, ids field in the query can do that for you. Just to complement his answer, I am giving an working example. In case anyone in the future needs it.
GET index_name/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"field": "your query"
}
},
{
"ids" : {
"values" : ["0aRM6ngBFlDmSSLpu_J4", "0qRM6ngBFlDmSSLpu_J4"]
}
}
]
}
}
}
You can create a bool query that contains an Ids query in a MUST clause:
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-ids-query.html
By using a MUST clause in a bool query, your search will be further limited by the Ids you specify. I'm assuming here by Ids you mean the _id value for your documents.
According to es doc, you can
Returns documents based on their IDs.
GET /_search
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
}
}
With elasticaBundle symfony 5.2
$query = new Query();
$IdsQuery = new Query\Ids();
$IdsQuery->setIds($id);
$query->setQuery($IdsQuery);
$this->finder->find($query, $limit);
You have two options.
The ids query:
GET index/_search
{
"query": {
"ids": {
"values": ["1, 2, 3"]
}
}
}
or
The terms query:
GET index/_search
{
"query": {
"terms": {
"yourNonPrimaryIdField": ["1", "2","3"]
}
}
}
The ids query targets the document's internal _id field (= the primary ID). But it often happens that documents contain secondary (and more) IDs which you'd target thru the terms query.
Note that if your secondary IDs contain uppercase chars and you don't set their field's mapping to keyword, they'll be normalized (and lowercased) and the terms query will appear broken because it only works with exact matches. More on this here: Only getting results when elasticsearch is case sensitive

Filter facet returns count of all documents and not range

I'm using Elasticsearch and Nest to create a query for documents within a specific time range as well as doing some filter facets. The query looks like this:
{
"facets": {
"notfound": {
"query": {
"term": {
"statusCode": {
"value": 404
}
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
In the specific case, the total hits of the search is 21 documents, which fits the documents within that time range in Elasticsearch. But the "notfound" facet returns 38, which fits the total number of ErrorDocuments with a StatusCode value of 404.
As I understand the documentation, facets collects data from withing the search. In this case, the "notfound" facet should never be able to return a count higher that 21.
What am I doing wrong here?
There's a distinct difference between filter/query/filtered_query/facet filter which is good to know.
Top level filter
{
filter: {}
}
This acts as a post-filter, meaning it will filter the results after the query phase has ended. Since facets are part of the query phase filters do not influence the documents that are facetted over. Filters do not alter score and are therefor very cacheable.
Top level query
{
query: {}
}
Queries influence the score of a document and are therefor less cacheable than filters. Queries run in the query phase and thus also influence the documents that are facetted over.
Filtered query
{
query: {
filtered: {
filter: {}
query: {}
}
}
}
This allows you to run filters in the query phase taking advantage of their better cacheability and have them influence the documents that are facetted over.
Facet filter
"facets" : {
"<FACET NAME>" : {
"<FACET TYPE>" : {
...
},
"facet_filter" : {
"term" : { "user" : "kimchy"}
}
}
}
this allows you to apply a filter to the documents that the facet is run over. Remember that the it'll be a combination of the queryphase/facetfilter unless you also specify global:true on the facet as well.
Query Facet/Filter Facet
{
"facets" : {
"wow_facet" : {
"query" : {
"term" : { "tag" : "wow" }
}
}
}
}
Which is the one that #thomasardal is using in this case which is perfectly fine, it's a facet type which returns a single value: the query hit count.
The fact that your Query Facet returns 38 and not 21 is because you use a filter for your time range.
You can fix this by either doing the filter in a filtered_query in the query phase or apply a facet filter(not a filter_facet) to your query_facet although because filters are cached better you better use facet filter inside you filter facet.
Confusingly Filter Facets are specified using .FacetFilter() on the search object. I will change this in 1.0 to avoid future confusion.
Sadly: .FacetFilter() and .FacetQuery() in NEST do not allow you to specify a facet filter like you can with other facets:
var results = typedClient.Search<object>(s => s
.FacetTerm(ft=>ft
.OnField("myfield")
.FacetFilter(f=>f.Term("filter_facet_on_this_field", "value"))
)
);
You issue here is that you are performing a Filter Facet and not a normal facet on your query (which will follow the restrictions applied via the query filter). In the JSON, the issue is because of the "query" between the facet name "notfound" and the "terms" entry. This is telling Elasticsearch to run this as a separate query and facet on the results of this separate query and not your main query with the date range filter. So your JSON should look like the following:
{
"facets": {
"notfound": {
"term": {
"statusCode": {
"value": 404
}
}
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"time": {
"from": "2014-04-05T05:25:37",
"to": "2014-04-07T05:25:37"
}
}
}
]
}
}
}
Since I see you have this tagged with NEST as well, in your call using NEST, you are probably using FacetFilter on your search request, switch this to just Facet to get the desired result.

Elasticsearch doesn't return results

I am facing a strange issue in elasticsearch query. I don't know much about elasticsearch. My query is:
{
"query":
{
"bool":
{
"must":
[
{
"text":
{
"countryCode2":"DE"
}
}
],
"must_not":[],
"should":[]
}
},"from":0,"size":1,"sort":[],"facets":{}
}
The issues is for "DE". It is giving me results but for "BE" or "IN" it returns empty result.
You are indexing using the default mapping, which by default removes english stopwords. The country codes "IN", "BE", and many more are stopwords which don't even get indexed, therefore it's not possible to have matching documents, nor get back those country codes when faceting on that field.
The solution is to reindex after having submitted your own mapping for the country code field:
{
"your_type_name" : {
"country" : {
"type" : "string", "index" : "not_analyzed"
}
}
}
If you already tried to do this but nothing changed, the mapping didn't get submitted properly. I would suggest to double check that its json structure is correct and that you can actually get it back using the get mapping api.
As this is a common problem the defaults are probably going to change in the future to be less intrusive and avoid applying any language dependent text analysis.

Resources