How to specify which JSON fields are of time type when importing? - rethinkdb

I'm using the following command to import data into RethinkDB:
rethinkdb import --force -f ${folder}/json/data.json --table test.data -c localhost:28015
It imports the data perfectly, but some of the fields in my JSON represent times:
{
  "id": "1",
  "date": "2015-09-19",
  "time": {
    "begin": "09:00",
    "end": "10:30"
  }
}
When I try to query these fields (like date, or time.begin and time.end) treating them as times, RethinkDB doesn't understand and throws an exception:
r.db('test').table('data').filter(function(t) {
  return t("date").date()
})
RqlRuntimeError: Not a TIME pseudotype: `"2015-09-19"` in:
r.db("test").table("data").filter(function(var_43) { return var_43("date").date(); })
^^^^^^^^^^^^^^
Is there any way to tell RethinkDB which fields in the JSON are of time type?

JSON doesn't provide a standard way of specifying a time field, but there are a couple of ways you can do this with RethinkDB: modify the data either before or after inserting it. RethinkDB time objects are more than just the strings you have shown here, and contain millisecond time resolution along with timezone data.
Time objects can be constructed using r.now(), r.time(), r.epoch_time(), and r.ISO8601(). Because of the format of your time strings, I would use r.ISO8601(). It is important to note that your data doesn't appear to contain timezone information, so you should be sure that your data won't return incorrect results if they are all put in the same timezone.
Another thing to keep in mind when using times in RethinkDB is that the data will be converted into an appropriate time object in your client. Since it appears that you are using JavaScript, you will get back a Date object; for Python, you would get a datetime.datetime object, etc. If you would rather get the raw time pseudotype format (see below), you can specify timeFormat: "raw" as a global optarg to your query (see the documentation for run() for details).
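For example, such a query might look like this in the JavaScript driver (a minimal sketch; conn is assumed to be an open connection):
r.db('test').table('data').run(conn, { timeFormat: 'raw' }, function(err, cursor) {
  // with timeFormat: 'raw', time fields come back as plain objects like
  // { "$reql_type$": "TIME", "epoch_time": ..., "timezone": "+00:00" }
  cursor.toArray(function(err, rows) {
    console.log(rows);
  });
});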
Post-process the data inside RethinkDB
This is probably the easiest option, and what I would recommend. After importing your data, you can run a query to modify each row to convert the strings into time objects. Based on the format of your data, this should work:
r.db('test').table('data').replace(function(row) {
  return row.merge({
    'begin_time': r.ISO8601(row('date').add('T').add(row('time')('begin')), { defaultTimezone: '+00:00' }),
    'end_time': r.ISO8601(row('date').add('T').add(row('time')('end')), { defaultTimezone: '+00:00' })
  }).without('date', 'time');
}).run(conn, callback)
This replaces the date and time fields in all the rows of your test.data table with begin_time and end_time time objects that can be used as you expect. The defaultTimezone field is required because the time strings don't contain timezone information, but you should change these values to whatever is appropriate.
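Once converted, these fields behave as real time objects; for example, a filter like the following should then work (a sketch using the begin_time field created above):
r.db('test').table('data').filter(function(row) {
  // hours() is only valid on time objects, which begin_time now is
  return row('begin_time').hours().lt(12);
}).run(conn, callback);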
Modify the JSON data
This is a bit lower-level and can be tricky, but if you don't mind getting your hands dirty, this could be more suited to your needs.
RethinkDB time objects are communicated in JSON using a particular format to represent a 'pseudotype'. These are types not standardized in JSON that still exist in RethinkDB. The format for a time pseudotype looks like this:
{
  "$reql_type$": "TIME",
  "epoch_time": 1413843783.195,
  "timezone": "+00:00"
}
Where epoch_time is the number of seconds since the UNIX epoch (Jan 1, 1970). If the data you are importing follows this format, you can insert this directly and it will be interpreted by the database as a valid time object. It would be up to you to modify the data you are importing, but your example row would look something like this:
{
  "id": "1",
  "begin_time": {
    "$reql_type$": "TIME",
    "epoch_time": 1442653200,
    "timezone": "+00:00"
  },
  "end_time": {
    "$reql_type$": "TIME",
    "epoch_time": 1442658600,
    "timezone": "+00:00"
  }
}
My same caveat for timezones applies here as well.
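If you go this route, a small script can rewrite the file before running rethinkdb import. A sketch in Node.js, assuming the input file is an array of rows in the original format and that UTC is the correct timezone:
// convert.js - rewrite rows into RethinkDB's TIME pseudotype format before import
var fs = require('fs');

var rows = JSON.parse(fs.readFileSync('data.json', 'utf8'));

function toTime(dateStr, timeStr) {
  // e.g. "2015-09-19" + "09:00" -> seconds since the UNIX epoch, assuming UTC
  return {
    "$reql_type$": "TIME",
    "epoch_time": Date.parse(dateStr + 'T' + timeStr + ':00Z') / 1000,
    "timezone": "+00:00"
  };
}

var converted = rows.map(function(row) {
  return {
    id: row.id,
    begin_time: toTime(row.date, row.time.begin),
    end_time: toTime(row.date, row.time.end)
  };
});

fs.writeFileSync('data_converted.json', JSON.stringify(converted));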

Related

Get date value in update query elasticsearch painless

I'm trying to get the millis values of two dates and subtract one from the other.
When I use ctx._source.begin_time.toInstant().toEpochMilli() (like doc['begin_time'].value.toInstant().toEpochMilli()) it gives me a runtime error.
And ctx._source.begin_time.date.getYear() (like in Update all documents of Elastic Search using existing column value) gives me a runtime error with the message
"ctx._source.work_time = ctx._source.begin_time.date.getYear()",
" ^---- HERE"
What type do I get from ctx._source, given that this code works correctly: doc['begin_time'].value.toInstant().toEpochMilli()?
I can't find in the Painless documentation how to get the values correctly. begin_time is definitely a date.
So, how can I write a script to get the difference between two dates and write it to another integer field?
If you look closely, the scripting language in the linked question is Groovy, but it's not supported anymore. What we use nowadays (2021) is called Painless.
The main point here is that the ctx._source attributes are the original JSON -- meaning the dates will be strings or integers (depending on the format) and not java.util.Date or any other data type that you could call .getDate() on. This means we'll have to parse the value first.
So, assuming your begin_time is of the format yyyy/MM/dd, you can do the following:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      ctx._source.work_time = begin_date.getYear()
    """
  }
}
BTW the _update_by_query script context (what's accessible and what's not) is documented here and working with datetime in painless is nicely documented here.
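The original goal was the difference between two dates written to an integer field. Building on the same approach, here's a sketch that computes it, assuming there is also an end_time field and that both fields are stored as strings in a yyyy/MM/dd HH:mm:ss pattern (the exact format is a guess you'd adjust to your data):
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      // assumes begin_time and end_time look like "2021/01/15 09:30:00"
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
      LocalDateTime begin = LocalDateTime.parse(ctx._source.begin_time, dtf);
      LocalDateTime end = LocalDateTime.parse(ctx._source.end_time, dtf);
      // difference in milliseconds, written to a numeric field
      ctx._source.work_time = ChronoUnit.MILLIS.between(begin, end);
    """
  }
}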

Kibana scripted field which loops through an array

I am trying to use the metricbeat http module to monitor F5 pools.
I make a request to the f5 api and bring back json, which is saved to kibana. But the json contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array, e.g.
doc['http.f5pools.items.monitor'].value.length()
returns the following in the preview results, with the same 'Additional Field' added for comparison:
[
  {
    "_id": "rT7wdGsBXQSGm_pQoH6Y",
    "http": {
      "f5pools": {
        "items": [
          {
            "monitor": "default"
          },
          {
            "monitor": "default"
          }
        ]
      }
    },
    "pool.MemberCount": [
      7
    ]
  },
If I try
doc['http.f5pools.items']
or similar, I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e. is my code wrong, or the way I'm indexing the data?
If not, is there an alternative approach within Metricbeat? I don't want to have to make a whole new API to do the calculation and add a separate field.
-- update.
Weirdly, it seems that the number values in the array do return the expected results, i.e.
doc['http.f5pools.items.ratio']
returns
{
  "_id": "BT6WdWsBXQSGm_pQBbCa",
  "pool.MemberCount": [
    1,
    1
  ]
},
-- update 2
OK, so if the strings in the field have different values then you get all the values; if they are the same you just get one. What's going on?
I'm adding another answer instead of deleting my previous one, which does not address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting; my hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-diving into doc values revealed it is a compression technique which de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior, you can disable doc values.
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are pre-filtered to only return distinct values (except in the case of ints, apparently?).
The solution is to use params._source instead of doc[].
The answer for why doc doesn't work
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo-points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, important to add a null check as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
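Putting this together, a sketch of what a counting script along those lines might look like with params._source, assuming the script runs in a context where _source is available and that each pool member has some field indicating whether it is up (the 'state' field and the 'up' value below are hypothetical placeholders):
int up = 0;
// _source is loaded as a map-of-maps, so the array comes back as a List of Maps
def items = params._source['http']['f5pools']['items'];
if (items != null) {
  for (def item : items) {
    // 'state' is a hypothetical per-member field marking whether it is up
    if ("up".equals(item['state'])) {
      up++;
    }
  }
}
return up;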
Responding to your comment with an example:
The keyword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running below and see the mapping it creates:
PUT t5/doc/2
{
  "items": [
    {
      "monitor": "default"
    },
    {
      "monitor": "default"
    }
  ]
}
GET t5/_mapping
{
  "t5" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "items" : {
            "properties" : {
              "monitor" : {          <-- monitor is a property of the items property (an Object)
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Protocol buffers Fieldmask on Collections within resource

If I want to update the "amount" field within a particular element inside the "f_units" collection in the below resource (a protocol buffer), what will the FieldMask look like? Does the FieldMask operate on array indices for collections?
{
  "f_sel": {
    "f_units": [
      {
        "id": "1",
        "amount": {
          "coefficient": 1000,
          "exponent": -2
        }
      },
      {
        "id": "2",
        "amount": {
          "coefficient": 2000,
          "exponent": -2
        }
      }
    ]
  }
}
Will it be "f_sel.f_units.0.amount" ? How can I update the amount using FieldMask?
As far as I know, there is no way to replace individual elements of a repeated field with an index in a FieldMask.
Instead, you'd update the amount field for the element within f_units you wish to change and set the FieldMask to
"f_sel.f_units"
It would be slightly more efficient to only have to send a delta to the original list, but it would be hard to prevent bugs. For example, what if the proto was modified in the meantime and the specified index (presuming there was a way to specify one) for the repeated field was no longer in range?
As an aside, Google does propose the concept of MergeOptions which defines semantics for how repeated fields are to be handled when merging. Currently, it appears they intend for you either to replace the repeated field in its entirety or append to the end of the destination field. Both of these merging strategies avoid the aforementioned bug that could be caused by specifying an invalid index.

how to change elasticsearch query result type?

I saved a datetime field to ES; in the search results this field was converted into a timestamp (integer). Is there any way to turn it back into a string (just by modifying the query parameters)?
You can specify fields in the query, and Elasticsearch returns the fields in the format in which you originally stored them.
You have two options:
1. You can specify the date format at index time and return the same.
2. You can use scripts to format the date in the format you need.
curl -XGET http://localhost:9200/myindex/test/_search?pretty -d '
{
  "query": {
    "match_all": { }
  },
  "script_fields": {
    "aDate": {
      "script": "if (!_source.myDate?.equals('null')) new java.text.SimpleDateFormat('yyyy-MM-dd\\'T\\'HH:mm:ss').format(new java.util.Date(_source.myDate));"
    }
  }
}'
I would choose the first one, as scripts are generally a lot more expensive.
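For the first option, a minimal sketch of a mapping that fixes the date format at index time; the index and field names are placeholders, and the syntax here is for recent Elasticsearch versions (older versions nest properties under the type name):
PUT myindex
{
  "mappings": {
    "properties": {
      "myDate": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      }
    }
  }
}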

Kibana 4.1 - use JSON input to create an Hour Of Day field from #timestamp for histogram

Edit: I found the answer, see below for Logstash <= 2.0 ===>
Plugin created for Logstash 2.0
Whoever is interested in this with Logstash 2.0 or above: I created a plugin that makes this dead simple.
The GEM is here:
https://rubygems.org/gems/logstash-filter-dateparts
Here is the documentation and source code:
https://github.com/mikebski/logstash-datepart-plugin
I've got a bunch of data in Logstash with a #Timestamp for a range of a couple of weeks. I have a duration field that is a number field, and I can do a date histogram. I would like to do a histogram over hour of day, rather than a linear histogram from x -> y dates. I would like the x axis to be 0 -> 23 instead of date x -> date y.
I think I can use the JSON Input advanced text input to add a field to the result set which is the hour of day of the #timestamp. The help text says:
"Any JSON formatted properties you add here will be merged with the elasticsearch aggregation definition for this section. For example shard_size on a terms aggregation", which leads me to believe it can be done, but it does not give any examples.
Edited to add:
I have tried setting up an entry in the scripted fields based on the link below, but it will not work like the examples on their blog with 4.1. The following script gives an error when trying to add a field with format number and name test_day_of_week: Integer.parseInt("1234")
The problem looks like the scripting is not very robust. Oddly enough, I want to do exactly what they are doing in the examples (add fields for day of month, day of week, etc...). I can get the field to work if the script is doc['#timestamp'], but I cannot manipulate the timestamp.
The docs say Lucene expressions are allowed and show some trig and GCD examples for GIS type stuff, but nothing for date...
There is this update to the BLOG:
UPDATE: As a security precaution, starting with version 4.0.0-RC1,
Kibana scripted fields default to Lucene Expressions, not Groovy, as
the scripting language. Since Lucene Expressions only support
operations on numerical fields, the example below dealing with date
math does not work in Kibana 4.0.0-RC1+ versions.
There is no suggestion for how to actually do this now. I guess I could go off and enable the Groovy plugin...
Any ideas?
EDIT - THE SOLUTION:
I added a filter using Ruby to do this, and it was pretty simple:
Basically, in a ruby script you can access event['field'] and you can create new ones. I use the Ruby time bits to create new fields based on the #timestamp for the event.
ruby {
code => "ts = event['#timestamp']; event['weekday'] = ts.wday; event['hour'] = ts.hour; event['minute'] = ts.min; event['second'] = ts.sec; event['mday'] = ts.day; event['yday'] = ts.yday; event['month'] = ts.month;"
}
This no longer appears to work in Logstash 1.5.4 - the Ruby date elements appear to be unavailable, and this then throws a "rubyexception" and does not add the fields to the logstash events.
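For what it's worth, a sketch of the same idea against the newer Logstash event API (event.get / event.set), converting the timestamp to a plain Ruby Time first so the date methods are available; this is untested, and note that Logstash's built-in field is normally @timestamp:
ruby {
  code => "
    # @timestamp is a LogStash::Timestamp; .time converts it to a plain Ruby Time
    ts = event.get('@timestamp').time
    event.set('weekday', ts.wday)
    event.set('hour', ts.hour)
    event.set('minute', ts.min)
    event.set('second', ts.sec)
    event.set('mday', ts.day)
    event.set('yday', ts.yday)
    event.set('month', ts.month)
  "
}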
I've spent some time searching for a way to recover the functionality we had in the Groovy scripted fields, which are unavailable for scripting dynamically, to provide me with fields such as "hourofday", "dayofweek", et cetera. What I've done is to add these as groovy script files directly on the Elasticsearch nodes themselves, like so:
/etc/elasticsearch/scripts/
hourofday.groovy
dayofweek.groovy
weekofyear.groovy
... and so on.
Those script files contain a single line of Groovy, like so:
Integer.parseInt(new Date(doc["#timestamp"].value).format("d")) (dayofmonth)
Integer.parseInt(new Date(doc["#timestamp"].value).format("u")) (dayofweek)
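Following the same pattern, an hourofday.groovy would presumably be the single line below; SimpleDateFormat's H pattern gives the hour of day (0-23), but this is an untested sketch following the convention above:
Integer.parseInt(new Date(doc["#timestamp"].value).format("H")) (hourofday)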
To reference these in Kibana, firstly create a new search and save it, or choose one of your existing saved searches (Please take a copy of the existing JSON before you change it, just in case) in the "Settings -> Saved Objects -> Searches" page. You then modify the query to add "Script Fields" in, so you get something like this:
{
  "query" : {
    ...
  },
  "script_fields": {
    "minuteofhour": {
      "script_file": "minuteofhour"
    },
    "hourofday": {
      "script_file": "hourofday"
    },
    "dayofweek": {
      "script_file": "dayofweek"
    },
    "dayofmonth": {
      "script_file": "dayofmonth"
    },
    "dayofyear": {
      "script_file": "dayofyear"
    },
    "weekofmonth": {
      "script_file": "weekofmonth"
    },
    "weekofyear": {
      "script_file": "weekofyear"
    },
    "monthofyear": {
      "script_file": "monthofyear"
    }
  }
}
As shown, the "script_fields" line should fall outside the "query" itself, or you will get an error. Also ensure the script files are available to all your Elasticsearch nodes.
