Checking "never seen" values in Elasticsearch

I'm using ES 5.X to index syslog messages with a timestamp.
At the end of each day I need to run a query that tells me, for a given field, which values have never been seen before in my index history.
Any ideas how to achieve this efficiently?
As an example, suppose that on 2017/06/19 the following document was indexed:
{
"text": "hello",
"date": "2017/06/19"
}
Now, on 2017/06/20, the following documents were indexed:
{
"text": "hello",
"date": "2017/06/20"
}
{
"text": "world",
"date": "2017/06/20"
}
{
"text": "from",
"date": "2017/06/20"
}
{
"text": "Europe",
"date": "2017/06/20"
}
At 23:59 on 2017/06/20 I want to know which new values of the text field were discovered today. I'm wondering whether there is a better solution than taking each single value and querying the text field with a range filter.
The query should return "world", "from", "Europe".
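One way to avoid a query per value is to let a single aggregation do the work: group by the field with a terms aggregation, compute the earliest date per value with a min sub-aggregation, and keep only the buckets whose earliest date falls on the current day with a bucket_selector. The sketch below only builds the request body as a Python dict. It assumes that date is mapped as a date field and that a keyword sub-field named text.keyword exists; both names, the function name, and the max_values limit are illustrative and not taken from the question.

```python
from datetime import datetime, timezone


def first_seen_since_query(field, date_field, day_start, max_values=10000):
    """Build a search body that keeps only values first seen at or after day_start."""
    # The min aggregation on a date field yields epoch milliseconds.
    day_start_ms = int(day_start.timestamp() * 1000)
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "values": {
                # max_values must be at least the number of distinct values,
                # because bucket_selector only filters buckets that were returned.
                "terms": {"field": field, "size": max_values},
                "aggs": {
                    "first_seen": {"min": {"field": date_field}},
                    "only_new": {
                        "bucket_selector": {
                            "buckets_path": {"first_seen": "first_seen"},
                            "script": "params.first_seen >= %dL" % day_start_ms,
                        }
                    },
                },
            }
        },
    }


body = first_seen_since_query(
    "text.keyword", "date", datetime(2017, 6, 20, tzinfo=timezone.utc)
)
```

With the example documents, "hello" has an earliest date of 2017/06/19 and would be dropped, while "world", "from" and "Europe" would remain. On a field with very many distinct values this aggregation can be expensive, so it is worth testing against realistic data before relying on it.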

Related

terms for each field vs values for each field in _all elasticsearch

I just started learning Elasticsearch and would like to know the difference between terms and values in the following sentence, which I copied from the Elasticsearch website:
"It is important to note that the _all field combines the original values from each field as a string. It does not combine the terms from each field."
While I understand what a value is, I have been scratching my head over the terms for each field.
Can someone help me understand what it means, please?
The paragraph preceding the one you have pasted gives some explanation:
The date_of_birth field in the above example is recognised as a date field and so will index a single term representing 1970-10-24 00:00:00 UTC. The _all field, however, treats all values as strings, so the date value is indexed as the three string terms: "1970", "24", "10".
In other words, the _all field takes the original values from the indexed document and runs them through its own analyzer, producing its own terms which are then stored in the index. It does not use the terms produced by analyzers of other fields.
One example is given in the paragraph I've pasted above. It explains that the date_of_birth field will be recognized as a date type, so its value is analyzed and stored as a single term, 1970-10-24 00:00:00 UTC. So if you try to match the date_of_birth field with a match query like this:
{ "query": { "match": { "date_of_birth": "24 10" } } }
You won't find that document because the parser won't be able to parse the provided value as a date.
On the other hand, if you run the same query on the _all field, you will find that document:
{ "query": { "match": { "_all": "24 10" } } }
That is because, as the documentation states, the _all field includes the following text terms: ["1970", "10", "24"].
Let's look at another example. Assume you have the following mapping for the user type:
"user": {
"properties": {
"nickname": { "type": "keyword" },
"name": { "type": "text" },
"age": { "type": "integer" }
}
}
And you index the following document:
{
"nickname": "Super-Man",
"name": "John",
"age": 25
}
Elasticsearch will analyze the fields of this document according to their types, eventually storing the following terms for each of these fields:
_all: ["super", "man", "john", "25"] - all strings
nickname: ["Super-Man"]
name: ["john"]
age: [25] - integer
Therefore, if you try to find this document using a match (or a term) query where nickname equals super, you won't find it. Because the nickname field was indexed as a keyword, you must use the exact string to find it: "Super-Man".
But if you try to find this document using a match query where _all equals super, you will find it.
On the other hand, if you try to find this document using a term query on the _all field with the integer value 25, you won't find it. Again, this is because the _all field is just a text field:
{ "query": { "term": { "_all": 25 } } }
But running the same query on the age field will return the document:
{ "query": { "term": { "age": 25 } } }
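To make the difference between values and terms concrete, here is a toy model of the indexing step for the user document above. It is not Elasticsearch's analyzer: toy_text_terms is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, which happens to reproduce the term lists shown in this answer.

```python
import re


def toy_text_terms(value):
    """Rough stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", str(value).lower())


def toy_keyword_terms(value):
    """A keyword field keeps the whole value as one unmodified term."""
    return [str(value)]


doc = {"nickname": "Super-Man", "name": "John", "age": 25}

terms = {
    "nickname": toy_keyword_terms(doc["nickname"]),
    "name": toy_text_terms(doc["name"]),
    "age": [doc["age"]],  # stays numeric
    # _all re-analyzes the original values as text; it does not reuse
    # the terms produced for the other fields.
    "_all": [t for value in doc.values() for t in toy_text_terms(value)],
}

print(terms)
```

In this model "super" is a term of _all but not of nickname, whose only term is "Super-Man", which mirrors why the match query on nickname finds nothing while the same query on _all succeeds.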

no matches in array in elasticsearch

I am using Elasticsearch 5.2.2.
In my index I have data that looks like this:
{
"_index": "index",
"_type": "273caf76-ec03-478c-b980-9743180bc863",
"_id": "eee46e24-f383-4ae7-8930-dc3836e030a5",
"_score": 3.41408,
"_source": {
"Father Name": [
{
"id": "some id",
"value": "Some value test test"
}
],
"Mother Name": [
{
"id": "some id",
"value": "Another value haha"
}
],
"Other values": [{ "id": "", "value": "" }]
}
}
When I search with _all, everything works fine and I can find all the results with reasonable scores:
{"query":{"match":{"_all":"value"}},"from":0,"size":20}
But that query is searching in all the fields. If I want for instance just to find results in Father Name or in Father Name and Mother Name, then I find nothing.
{"query":{"match":{"Father Name":"value"}},"from":0,"size":20}
My goal is to have a search like the one with _all, but limited to a few fields.
Your fields Father Name and Mother Name are arrays of inner objects.
To search within the value field within Father Name, for example, do
curl -XGET localhost:9200/myindex/_search?pretty -d '
{
"query": {
"match": {
"Father Name.value": "first"
}
},
"from": 0,
"size": 20
}'
I'm not sure, however, how to query all fields within Father Name.
Reference: Arrays of Inner Objects
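To search several of these inner fields in one request, a multi_match query accepts a list of field names. The sketch below builds such a body as a Python dict; the helper name match_in_fields is made up for illustration, and the field paths are taken from the document shown in the question.

```python
def match_in_fields(text, fields, offset=0, size=20):
    """Build a search body that matches text against the listed fields only."""
    return {
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
        "from": offset,
        "size": size,
    }


body = match_in_fields("value", ["Father Name.value", "Mother Name.value"])
```

The fields list of multi_match also accepts wildcard patterns, so a pattern such as "Father Name.*" may cover the case of querying every sub-field of Father Name; that variant is untested here.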
Sam Shen's answer is the way to go if you need to configure which properties to use on a per-query basis.
One alternative is to configure the fields to not be included in the _all query.
For example, this would cause only Father Name to be included in the _all query, by disabling all the fields at the type level, then enabling all of the subfields on Father Name.
PUT index
{
  "mappings": {
    "type": {
      "include_in_all": false,
      "properties": {
        "Father Name": {
          "include_in_all": true
        }
      }
    }
  }
}
You can set the include_in_all property on any level in the mapping, including subfield properties.
The big drawback here is that this isn't configured on a query-by-query basis; it applies to every query that uses the _all field.

Elasticsearch stats aggregation group by date on timeseries

I'm having some trouble getting a query working. I want to aggregate a weather station's timeseries data in Elasticsearch. I have a value (double) for each day of the year. I would like a query that gives me the sum, min and max of my value field, grouped by month.
My document has a stationid field and a timeseries object array:
PUT /stations/rainfall/2
{
"stationid":"5678",
"timeseries": [
{
"value": 91.3,
"date": "2016-05-01"
},
{
"value": 82.2,
"date": "2016-05-02"
},
{
"value": 74.3,
"date": "2016-06-01"
},
{
"value": 34.3,
"date": "2016-06-02"
}
]
}
So I am hoping to be able to query for stationid "5678" (or for the document with ID 2)
and see something like: stationid: 5678, monthlystats: [ month:5, avg:x, sum:y, max:z ]
Many thanks in advance for any help. Also happy to take any advice on my document structure too.
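One possible shape for such a request is a nested aggregation over timeseries, a monthly date_histogram inside it, and a stats aggregation per month (stats returns count, min, max, avg and sum together). The sketch below only builds the body as a Python dict. It assumes timeseries is mapped with type nested, which the question does not show; with the default object mapping, the values of one document would not stay paired with their own dates. The function and aggregation names are illustrative.

```python
def monthly_stats_query(station_id):
    """Build a search body returning per-month stats for one station."""
    return {
        "size": 0,  # only aggregation results are needed
        "query": {"term": {"stationid": station_id}},
        "aggs": {
            "readings": {
                # Assumes "timeseries" is mapped as a nested type.
                "nested": {"path": "timeseries"},
                "aggs": {
                    "by_month": {
                        "date_histogram": {
                            "field": "timeseries.date",
                            "interval": "month",
                            "format": "yyyy-MM",
                        },
                        "aggs": {
                            "monthly_stats": {
                                "stats": {"field": "timeseries.value"}
                            }
                        },
                    }
                },
            }
        },
    }


body = monthly_stats_query("5678")
```

For the sample document this should produce one bucket for 2016-05 and one for 2016-06, each carrying the stats of its two readings. An alternative document structure is one document per reading (stationid, date, value), which removes the need for the nested mapping.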

Umlaut in Elastic Suggesters

I am currently trying to set up a suggester similar to the google misspelling correction. I am using the Elastic Suggesters with the following query:
{
"query": {
"match": {
"name": "iphone hüle"
}
},
"suggest": {
"suggest_name": {
"text": "iphone hüle",
"term": {
"field": "name"
}
}
}
}
It returns the following suggestions:
"suggest": {
"suggest_name": [
{
"text": "iphone",
"offset": 0,
"length": 6,
"options": []
},
{
"text": "hule",
"offset": 7,
"length": 4,
"options": [
{
"text": "hulle",
"score": 0.75,
"freq": 162
},
...
{
"text": "hulk",
"score": 0.75,
"freq": 38
}
]
}
]
}
Now the problem I have is with the returned text, both in the suggest entry and in its options. The text I submitted and the returned text should be "hüle", not "hule". Furthermore, the returned option text should actually be "hülle" and not "hulle". As I use the same field for the query and the suggester, I wonder why the umlauts are missing only in the suggester and not in the regular query results.
See a query result here:
"_source": {
...
"name": "Ladegerät für iPhone",
"manufacturer": "Apple",
}
The data you get back in your query result, i.e.
"name": "Ladegerät für iPhone"
is the stored content of the field. It is exactly your source data. Search, and evidently the suggester too, works on the inverted index, which contains tokens normalized by the analyzer. You are most likely using an analyzer that folds umlauts.
Strangely enough, I discussed this with a colleague yesterday. We came to the conclusion that we may need a separate field, indexed and not stored, into which we index the non-normalized tokens. We want to use it to fetch suggestion terms. In addition, it would let us perform exact searches, i.e. searches that distinguish between Müller and Mueller, Foto and Photo, Rene and René.
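A separate field of that kind can be sketched as a multi-field: the name field keeps its current analyzer, and a sub-field indexes the same text with an analyzer that lowercases but does not fold umlauts. The bodies below are Python dicts for illustration only. The type name product, the sub-field name exact and the analyzer name lowercase_only are all hypothetical, and the analyzer currently used by name is unknown, so it is left out.

```python
def index_body_with_exact_subfield():
    """Build index settings and mappings with an unfolded sub-field for suggestions."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Lowercases tokens but leaves umlauts and accents intact.
                    "lowercase_only": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "product": {
                "properties": {
                    "name": {
                        "type": "text",
                        # The existing folding analyzer of "name" would be set here.
                        "fields": {
                            "exact": {"type": "text", "analyzer": "lowercase_only"}
                        },
                    }
                }
            }
        },
    }


def suggest_body(text):
    """Build a term suggester request that reads from the unfolded sub-field."""
    return {
        "suggest": {"suggest_name": {"text": text, "term": {"field": "name.exact"}}}
    }


index_body = index_body_with_exact_subfield()
request = suggest_body("iphone hüle")
```

Because the suggester then reads tokens from name.exact, the suggested terms would keep their umlauts, while regular searches on name continue to benefit from folding. Existing documents need to be reindexed before the sub-field contains any terms.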

Elasticsearch filtering multiple documents with same term

I'm not sure how to describe this query so don't know what to look for in the documentation. I will try and demonstrate with a made up example.
You have an inventory of electronic devices with serial numbers
"serial": "xyz"
They all have a status, e.g.
"status": "faulty"
or
"status": "repaired"
There can be multiple documents with the same serial number, e.g.
{
  "serial": "xyz",
  "status": "faulty",
  "date": "01-01-2015"
}
and then another doc at a later date
{
  "serial": "xyz",
  "status": "repaired",
  "date": "01-02-2015"
}
So I want to search my index to show me all serial numbers for which there exists a document with status "faulty" AND a document with status "repaired". What type of query does that?
Just curious to check whether this query satisfies your need:
{
"query": {
"query_string": {
"query": "status:faulty OR status:repaired"
}
}
}
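Note that the query above returns every document whose status is either value; it does not by itself restrict the result to serial numbers that have both. One way to express the "both" condition is to group by serial and keep only the groups in which each status occurs at least once, using filter sub-aggregations and a bucket_selector. The sketch below builds that body as a Python dict. It assumes serial and status are keyword (not analyzed) fields; the function name, aggregation names and the max_serials limit are illustrative.

```python
def faulty_and_repaired_query(max_serials=10000):
    """Build a search body listing serials that have both statuses."""
    return {
        "size": 0,  # only the buckets are needed
        "aggs": {
            "by_serial": {
                # max_serials must cover all distinct serial numbers,
                # since bucket_selector only filters returned buckets.
                "terms": {"field": "serial", "size": max_serials},
                "aggs": {
                    "faulty": {"filter": {"term": {"status": "faulty"}}},
                    "repaired": {"filter": {"term": {"status": "repaired"}}},
                    "has_both": {
                        "bucket_selector": {
                            "buckets_path": {
                                "faulty": "faulty>_count",
                                "repaired": "repaired>_count",
                            },
                            "script": "params.faulty > 0 && params.repaired > 0",
                        }
                    },
                },
            }
        },
    }


body = faulty_and_repaired_query()
```

Each remaining bucket key is a serial number with at least one "faulty" and at least one "repaired" document; for the example data that would be "xyz".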
