I've had little luck searching for this over a couple of days.
If my Avro schema for data in a Hive table is:
{
"type" : "record",
"name" : "messages",
"namespace" : "com.company.messages",
"fields" : [ {
"name" : "timeStamp",
"type" : "long",
"logicalType" : "timestamp-millis"
}, {
…
and I use Presto to query it, I do not get formatted timestamps.
select "timestamp", typeof("timestamp") as type,
current_timestamp as "current_timestamp", typeof(current_timestamp) as current_type
from db.messages limit 1
timestamp        type     current_timestamp                      current_type
1497210701839    bigint   2017-06-14 09:32:43.098 Asia/Seoul     timestamp with time zone
I thought it would then be a non-issue to convert these to timestamps with millisecond precision, but I'm finding there is no clear way to do that.
select cast("timestamp" as timestamp) from db.messages limit 1
line 1:16: Cannot cast bigint to timestamp
Also, they've changed Presto's timestamp conversion to always assume the source is in seconds (see this related Hive issue):
https://issues.apache.org/jira/browse/HIVE-3454
So if I use from_unixtime(), I have to chop off the milliseconds, or else it gives me a very distant date:
select from_unixtime("timestamp") as "timestamp" from db.messages limit 1
timestamp
+49414-08-06 07:15:35.000
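And if I chop them off first with integer division, the far-future date goes away but the fractional part is simply lost; for example, something like:
select from_unixtime("timestamp" / 1000) as "timestamp" from db.messages limit 1
comes back with a .000 fraction, so the 839 milliseconds are dropped.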
Surely someone who works with Presto more often knows how to express the conversion properly. (By the way, I can't restart the Presto or Hive servers to force the time zone to UTC either.)
I didn't find a direct conversion from a Java timestamp (number of milliseconds since 1970) to a timestamp, but one can be done with from_unixtime and adding the milliseconds as an interval:
presto> with t as (select cast('1497435766032' as bigint) a)
-> select from_unixtime(a / 1000) + parse_duration(cast((a % 1000) as varchar) || 'ms') from t;
_col0
-------------------------
2017-06-14 12:22:46.032
(1 row)
(admittedly cumbersome, but works)
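Applied to the column from the question, the same pattern would look something like this (a sketch, assuming "timestamp" holds epoch milliseconds as a bigint):
select from_unixtime("timestamp" / 1000) + parse_duration(cast(("timestamp" % 1000) as varchar) || 'ms') as ts_millis
from db.messages limit 1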
The same pattern for a column stored as epoch microseconds (divide by 1,000,000 for the seconds part and reduce the remainder to milliseconds):
select from_unixtime(cast(event_time as bigint) / 1000000) + parse_duration(cast((cast(event_time as bigint) % 1000000) / 1000 as varchar) || 'ms') from TableName limit 10;
I want to do a query in a NoSQL database using the BETWEEN operation. I want to do something like this:
SELECT * FROM table
WHERE column_timestamp BETWEEN '2021-01-01'
AND '2021-12-31'
How can I do this query in a NoSQL database? Is BETWEEN supported?
In your case, because your column seems to be a timestamp, you need to cast the comparison values to the TIMESTAMP type. Then you just use the <= and >= operators.
Be careful when running queries with dates only, without providing the time. Here is a test case using <, <=, >= and >.
I've inserted 3 rows:
2021-12-01T00:00:00Z
2021-12-31T00:00:00Z
2021-12-31T23:59:59Z
SELECT * FROM TEST
where date > CAST("2021-12-01" as timestamp) and date < CAST("2021-12-31" as timestamp)
no rows selected
SELECT * FROM TEST
where date >= CAST("2021-12-01" as timestamp) and date <= CAST("2021-12-31" as timestamp)
2 rows: 2021-12-01T00:00:00Z and 2021-12-31T00:00:00Z
SELECT * FROM TEST
where date >= CAST("2021-12-01T00:00:00" as timestamp)and date <= CAST("2021-12-31T23:59:59" as timestamp)
3 rows : 2021-12-01T00:00:00Z , 2021-12-31T00:00:00Z and 2021-12-31T23:59:59Z
SELECT * FROM TEST
where date >= CAST("2021-12-01" as timestamp) and date < CAST("2022-01-01" as timestamp)
3 rows: 2021-12-01T00:00:00Z, 2021-12-31T00:00:00Z and 2021-12-31T23:59:59Z
I have an SQLite DB with a single table and 60,000,000 records, and a simple query takes more than 100 seconds to run.
I tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MS SQL.
Should I split the table, say a different table for each pointID (there are a few hundred of them), or a different table for each month (then each table would hold at most 10,000,000 records)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from elsewhere suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: per @RaymondNijland's suggestion, the execution plan is:
"0" "0" "0" "SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him, and using this information, I changed the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, the problem is solved).
After @RaymondNijland suggested checking the execution plan, I changed the query to:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This query gives the same results as the other one, but it's 120 times faster (it reduces the number of records before sorting).
I have a string 2013-01-01 12:00:01.546 which represents a timestamp with milliseconds that I need to convert to a bigint without losing the milliseconds.
I tried unix_timestamp, but I lose the milliseconds:
unix_timestamp('2013-01-01 12:00:01.546','yyyy-MM-dd HH:mm:ss') ==> 1357059601
unix_timestamp('2013-01-01 12:00:01.786','yyyy-MM-dd HH:mm:ss') ==> 1357059601
I tried the format with milliseconds as well, but there was no difference:
unix_timestamp('2013-01-01 12:00:01.786','yyyy-MM-dd HH:mm:ss:SSS') ==> 1357059601
Is there any way to get the milliseconds difference in Hive?
This is what I came up with so far.
If all your timestamps have a 3-digit fraction, it can be simplified.
with t as (select timestamp '2013-01-01 12:00:01.546' as ts)
select cast ((to_unix_timestamp(ts) + coalesce(cast(regexp_extract(ts,'\\.\\d*',0) as decimal(3,3)),0)) * 1000 as bigint)
from t
1357070401546
Verification of the result:
select from_utc_timestamp (1357070401546,'UTC')
2013-01-01 12:00:01.546000
So apparently unix_timestamp doesn't convert milliseconds. You can use the following approach.
hive> select unix_timestamp(cast(regexp_replace('2013-01-01 12:00:01.546', '(\\d{4})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})', '$1-$2-$3 $4:$5:$6.$7' ) as timestamp));
OK
1357063201
The Hive function unix_timestamp() doesn't convert the millisecond part, so you may want to use the following:
unix_timestamp('2013-01-01 12:00:01.546') + cast(split('2013-01-01 12:00:01.546','\\\.')[1] as int) => 1357067347
unix_timestamp('2013-01-01 12:00:01.786') + cast(split('2013-01-01 12:00:01.786','\\\.')[1] as int) => 1357067587
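A different sketch that keeps true millisecond precision is to cast the string through Hive's timestamp and double types (cast(... as double) yields seconds since the epoch with a fractional part; the integer portion depends on the session time zone):
select cast(round(cast(cast('2013-01-01 12:00:01.546' as timestamp) as double) * 1000) as bigint);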
When querying a table that has a column "timestamp" (an epoch timestamp in UTC, in seconds, stored as an integer in the BigQuery table),
I want to be able to say:
timestamp between one_week_ago and now
without specifying the exact timestamps in each query.
I should add that I know the following working query:
WITH timerange AS
(SELECT *,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR) AS one_week_ago,
CURRENT_TIMESTAMP() AS now,
TIMESTAMP_SECONDS(timestamp) AS measurement_time
FROM table_name),
grouped AS
(SELECT field1, field2, count(*) count
FROM timerange
WHERE measurement_time BETWEEN one_week_ago AND now
GROUP BY field1, field2
)
SELECT * FROM timerange
WHERE field2 = "example"
But why am I not simply able to say:
timestamp between function_call1 and function_call2?
These are examples of the timestamps: 1491544587, 1491422047, 1491882866, 1491881903, 1491436515, 1491436771, 1491436593, 1491436621, 1491436390, 1491436334
https://cloud.google.com/bigquery/docs/reference/legacy-sql
https://cloud.google.com/bigquery/docs/reference/standard-sql/
In standard SQL, you can certainly write it the way you want:
SELECT *
FROM table
WHERE TIMESTAMP_SECONDS(timestamp) BETWEEN
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR)
AND CURRENT_TIMESTAMP()
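If the column held epoch milliseconds rather than seconds, TIMESTAMP_MILLIS would be the analogous conversion; a sketch under that assumption:
SELECT *
FROM table
WHERE TIMESTAMP_MILLIS(timestamp) BETWEEN
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR)
AND CURRENT_TIMESTAMP()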
The documentation says that timestamps support the following conversion:
• Floating point numeric types: Interpreted as UNIX timestamp in seconds with decimal precision
First of all, I'm not sure how to interpret this. If I have a timestamp 2013-01-01 12:00:00.423, can I convert this to a numeric type that retains the milliseconds? Because that is what I want.
More generally, I need to do comparisons between timestamps such as
select maxts - mints as latency from mytable
where maxts and mints are timestamp columns. Currently, this gives me a NullPointerException on Hive 0.11.0. I am able to perform queries if I do something like
select unix_timestamp(maxts) - unix_timestamp(mints) as latency from mytable
but this only works for seconds, not millisecond precision.
Any help appreciated. Tell me if you need additional information.
If you want to work with milliseconds, don't use the unix timestamp functions, because they treat the date as seconds since the epoch.
hive> describe function extended unix_timestamp;
unix_timestamp([date[, pattern]]) - Returns the UNIX timestamp
Converts the current or specified time to number of seconds since 1970-01-01.
Instead, convert the JDBC-compliant timestamp to a double.
E.g:
Given a tab delimited data:
cat /user/hive/ts/data.txt :
a 2013-01-01 12:00:00.423 2013-01-01 12:00:00.433
b 2013-01-01 12:00:00.423 2013-01-01 12:00:00.733
CREATE EXTERNAL TABLE ts (txt string, st Timestamp, et Timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/ts';
Then you may query the difference between startTime(st) and endTime(et) in milliseconds as follows:
select
txt,
cast(
round(
cast((e-s) as double) * 1000
) as int
) latency
from (select txt, cast(st as double) s, cast(et as double) e from ts) q;
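With the sample rows above, this should return (floating-point rounding aside):
a 10
b 310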