Cassandra COPY FROM Time and Timestamp columns

I am trying to COPY FROM a csv file; I have both a timestamp and a time column.
Trying to test with a couple of rows to begin with:
cqlsh:tests> CREATE TABLE testts (
... ID int PRIMARY KEY,
... mdate timestamp,
... ttime time);
cqlsh:tests> INSERT INTO testts (ID , mdate, ttime )
... VALUES (1, '2015-10-12', '1055') ;
cqlsh:tests> INSERT INTO testts (ID , mdate, ttime )
... VALUES (2, '2014-06-25', '920') ;
cqlsh:tests> select * from testts;
id | mdate | ttime
----+--------------------------+--------------------
1 | 2015-10-12 07:00:00+0000 | 00:00:00.000001055
2 | 2014-06-25 07:00:00+0000 | 00:00:00.000000920
(2 rows)
The above works; now I try the import file:
cqlsh:tests> COPY testts ( ID,
... mdate,
... ttime)
... FROM 'c:\cassandra228\testtime.csv' WITH HEADER = FALSE AND DELIMITER = ',' AND DATETIMEFORMAT='%Y/%m/%d';
Using 3 child processes
Starting copy of tests.testts with columns [id, mdate, ttime].
Failed to import 1 rows: ParseError - Failed to parse 1130 : can't interpret '1130' as a time, given up without retries
Failed to import 1 rows: ParseError - Failed to parse 1230 : can't interpret '1230' as a time, given up without retries
Failed to import 1 rows: ParseError - Failed to parse 930 : can't interpret '930' as a time, given up without retries
Failed to process 3 rows; failed rows written to import_tests_testts.err
Processed: 3 rows; Rate: 0 rows/s; Avg. rate: 1 rows/s
3 rows imported from 1 files in 3.269 seconds (0 skipped).
My timestamp column is formatted YYYY/MM/DD. Until I added DATETIMEFORMAT='%Y/%m/%d' I would get an error on the timestamp column as well, but after that the error stopped.
CSV file:
3,2010/02/08,930
4,2015/05/20,1130
5,2016/08/15,1230
How do I fix this?
Thanks much

I have checked this with the same schema and data using cassandra-2.2.4's cqlsh:
all the values are inserted without any error.
But with cassandra-2.2.8's cqlsh, it gives me the same error as yours.
You can fix this with a small change in the cqlsh code.
1. Open the copyutil.py file. In my case it was /opt/apache-cassandra-2.2.8/pylib/cqlshlib/copyutil.py
2. Find the method convert_time() and change it to this:
def convert_time(v, **_):
    try:
        # interpret a bare integer (e.g. '1130') as a nanosecond count, as INSERT does
        return Time(int(v))
    except ValueError:
        pass
    return Time(v)
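Note that even with this patch, a bare integer like '1055' is stored as 1,055 nanoseconds after midnight (exactly what the 00:00:00.000001055 in your SELECT output shows), not as 10:55. If you actually want clock times, a workaround that needs no patching is to write the time column in the CSV as HH:MM:SS, a format the time parser accepts, for example:

3,2010/02/08,09:30:00
4,2015/05/20,11:30:00
5,2016/08/15,12:30:00

COPY testts (ID, mdate, ttime) FROM 'c:\cassandra228\testtime.csv' WITH HEADER = FALSE AND DELIMITER = ',' AND DATETIMEFORMAT='%Y/%m/%d';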

Related

Oracle SQL*Loader WHEN Clause Raising Error 2

I am trying to exclude the last line of a data file using SQL*Loader, using the WHEN clause, but when it gets to that line it populates both the bad and discard file, and raises an Error 2.
The line to ignore is the last line and starts with "NOL".
After some reading, Error 2 is a warning about the syntax of the CTL file, but I cannot find out where it is wrong. Note: if I remove the last line and then run the SAME CTL file, no error is raised, so the issue cannot be the syntax of the CTL file.
To resolve the issue, I am removing the last line BEFORE loading the data, but I would like to find out what the issue is for any future use of the WHEN clause.
I have tried:
file_dt != 'NOL'
(1:1) != 'N'
. . .
But I get the same Error 2.
Has anybody else come across this issue? Or have something that I can try?
Oracle Docs
SQL*Loader Command-Line Reference
For UNIX, the exit codes are as follows:
EX_SUCC 0
EX_FAIL 1
EX_WARN 2
EX_FTL 3
Data:
File-Date,Number
2021-05-04,24
2021-05-04,24
2021-05-04,24
2021-05-04,24
NOL: 4
CTL File:
OPTIONS (READSIZE=51200001, BINDSIZE=51200000, ROWS=5000, ERRORS=0, SKIP=1)
load data
append
into table SOME_SCHEMA.SOME_TABLE
WHEN (01) != 'NOL'
fields terminated by ';'
OPTIONALLY ENCLOSED BY '"' AND '"'
trailing nullcols
(
file_dt DATE "YYYY-MM-DD",
a_number
)
Result:
Path used: Conventional
Commit point reached - logical record count 5
Table SOME_SCHEMA.SOME_TABLE:
4 Rows successfully loaded.
Check the log file:
loading-file.log
for more information about the load.
2021-05-06 09:42:12: Finished Loading Data into Table
2021-05-06 09:42:12: Status: 2
I'm not on Unix. Nonetheless, loading should work the same.
Table:
SQL> desc test
Name Null? Type
----------------------------------------------------- -------- --------------------------
FILE_DT DATE
A_NUMBER NUMBER
SQL>
Control file (I included sample data into it, for simplicity):
OPTIONS (READSIZE=51200001, BINDSIZE=51200000, ROWS=5000, ERRORS=0, SKIP=1)
load data
infile *
replace
into table test
WHEN (01) <> 'NOL'
fields terminated by ','
trailing nullcols
(
file_dt DATE "YYYY-MM-DD",
a_number
)
begindata
File-Date,Number
2021-05-04,24
2021-05-04,24
2021-05-04,24
2021-05-04,24
NOL: 4
Loading session and result:
SQL> $sqlldr scott/tiger@orcl control=test38.ctl log=test38.log
SQL*Loader: Release 11.2.0.1.0 - Production on Čet Svi 6 11:38:00 2021
Copyright (c) 1982, 2009, Oracle and/or its affiliates. All rights reserved.
Commit point reached - logical record count 4
SQL> select * from test;
FILE_DT A_NUMBER
------------------- ----------
04.05.2021 00:00:00 24
04.05.2021 00:00:00 24
04.05.2021 00:00:00 24
04.05.2021 00:00:00 24
Seems to be OK.
So, what did I do differently?
modified the WHEN clause
fields are terminated by comma, not semi-colon (which is what the sample data actually uses)
removed superfluous information
As for the Error 2 itself: judging by the exit codes you quoted, SQL*Loader returns EX_WARN (2) whenever some rows are rejected or discarded, so a load whose WHEN clause deliberately discards the trailing "NOL" record can be expected to finish with status 2 even though every good row loaded.
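Applied back to your original control file, the same fixes would look roughly like this (a sketch only: it keeps your OPTIONS and APPEND, and swaps the field terminator to the comma your data sample uses):

OPTIONS (READSIZE=51200001, BINDSIZE=51200000, ROWS=5000, ERRORS=0, SKIP=1)
load data
append
into table SOME_SCHEMA.SOME_TABLE
WHEN (01) <> 'NOL'
fields terminated by ','
OPTIONALLY ENCLOSED BY '"' AND '"'
trailing nullcols
(
file_dt DATE "YYYY-MM-DD",
a_number
)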

PostgreSQL: the sum of the column values for the period depending on the step (day, month, year)

I want to create a stored procedure or function that returns the sum of the column values for a period, depending on the step (day, month, year). For example, I have a table with consumption data that saves a reading every 15 minutes. I would like to get a report for the period from 2019-05-01 to 2019-05-10 with step '1 day': define a daily dataset for each day in this interval and get the sum of the values for each day.
Then the procedure returns data to Laravel. Based on this data charts are built.
My code for this moment:
CREATE OR REPLACE FUNCTION "public"."test"("meterid" int4, "started" text, "ended" text, "preiod" text)
RETURNS TABLE("_kwh" numeric, "datetime" timestamp) AS $BODY$BEGIN
RETURN QUERY
SELECT kwh, a_datetime
FROM "public"."consumption"
WHERE meter_id = meterid
AND a_datetime
BETWEEN to_timestamp(started, 'YYYY-MM-DD HH24:MI:SS')
AND to_timestamp(ended, 'YYYY-MM-DD HH24:MI:SS');
END$BODY$
LANGUAGE plpgsql VOLATILE
COST 100
ROWS 1000
I'm using PostgreSQL 10.7.
You can use generate_series(start, end, interval).
More information in: set returning functions
To simulate your situation I created a simple table:
postgres=# create table consumption (kwh int, datetime date);
CREATE TABLE
postgres=# insert into consumption values (10, 2019-01-01);
ERROR: column "datetime" is of type date but expression is of type integer
postgres=# insert into consumption values (10, '2019-01-01');
INSERT 0 1
postgres=# insert into consumption values (2, '2019-01-03');
INSERT 0 1
postgres=# insert into consumption values (24, '2019-03-06');
INSERT 0 1
postgres=# insert into consumption values (30, '2019-03-22');
INSERT 0 1
And made the select with generate_series()
postgres=# SELECT COALESCE(SUM(kwh), 0) AS kwh,
period::DATE
FROM GENERATE_SERIES('2019-01-01','2019-12-31', '1 day'::interval) AS period
LEFT JOIN consumption ON period::DATE=datetime::DATE
GROUP BY 2;
kwh | period
-----+------------
0 | 2019-04-17
0 | 2019-05-29
....
0 | 2019-04-06
0 | 2019-04-26
2 | 2019-01-03
0 | 2019-03-15
...
0 | 2019-11-21
0 | 2019-07-24
30 | 2019-03-22
0 | 2019-05-22
0 | 2019-11-19
...
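If you want to fold this back into your function with the step as a parameter, here is one possible sketch (the function name consumption_report is made up; meter_id and a_datetime are the column names from your original function, and the step is passed as an interval such as '1 day', '1 month' or '1 year'):

CREATE OR REPLACE FUNCTION public.consumption_report(meterid int4, started timestamp, ended timestamp, step interval)
RETURNS TABLE(_kwh numeric, datetime timestamp) AS $BODY$
BEGIN
  RETURN QUERY
  SELECT COALESCE(SUM(c.kwh), 0)::numeric,   -- 0 for buckets with no readings
         p.period::timestamp
  FROM generate_series(started, ended, step) AS p(period)
  LEFT JOIN public.consumption c
         ON c.meter_id = meterid
        AND c.a_datetime >= p.period
        AND c.a_datetime <  p.period + step  -- half-open bucket [period, period + step)
  GROUP BY p.period
  ORDER BY p.period;
END
$BODY$ LANGUAGE plpgsql STABLE;

-- daily report for meter 1:
-- SELECT * FROM consumption_report(1, '2019-05-01', '2019-05-10', '1 day');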

Hive: Exception when using LAG with window function

I'm trying to calculate a time difference between 2 rows and applied the solution from this SO question. However I get an exception:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling
> statement: FAILED: SemanticException Failed to breakup Windowing
> invocations into Groups. At least 1 group must only depend on input
> columns. Also check for circular dependencies. Underlying error:
> Expecting left window frame boundary for function
> LAG((tok_table_or_col time), 1, 0) Window
> Spec=[PartitioningSpec=[partitionColumns=[(tok_table_or_col
> client_id)]orderColumns=[(tok_table_or_col time) ASC
> NULLS_FIRST]]window(type=ROWS, start=1 PRECEDING, end=currentRow)] as
> LAG_window_0 to be unbounded. Found : 1
HiveQL:
SELECT id, loc, LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time ROWS 1 PRECEDING) - time AS response_time FROM mytable
How do I fix this? What is the issue?
EDIT:
Sample data:
id loc time
0 1 1414250523591
0 1 1414250523655
1 2 1414250523655
1 2 1414250523661
1 3 1414250523661
1 3 1414250523662
And what I want is the difference in time between rows with the same id and loc (always pairs of 2).
EDIT2: I should also mention I'm new to hadoop/hive ecosystem.
So, as the error said, the window should be unbounded. I removed the ROWS clause, and now at least it is doing something, but the result is still wrong. To narrow it down, I wanted to check what the LAG value actually is:
SELECT id, loc, LAG(time, 1) OVER (PARTITION BY id, loc ORDER BY time) AS lag_col FROM mytable
And I get this as output:
id loc lag_col
1 2 null
1 2 -1
1 3 null
1 3 -1
The null is clear because I removed the default value, but why -1? Are the large values in the time column leading to some kind of overflow? The column is defined as bigint, so it should fit without problem, but maybe there is a conversion to int during the query?
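For completeness, the frame-free version of the original response-time query (which is what the error message demands: LAG's window must be unbounded) would presumably be, with the subtraction flipped so the difference comes out positive:

SELECT id, loc,
       time - LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time) AS response_time
FROM mytable;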

HIVE returning wrong date

I'm getting some odd results from HIVE when working with dates.
For starters, I'm using Hive 1.2.1000.2.4.0.0-169
I have a table defined (snipped) of the sort:
hive> DESCRIBE proto_hourly;
OK
elem string
protocol string
count bigint
date_val date
hour_id tinyint
# Partition Information
# col_name data_type comment
date_val date
hour_id tinyint
Time taken: 0.336 seconds, Fetched: xx row(s)
hive>
Ok, so I have data loaded for the current year. I started noticing some "weirdness" in queries with specific dates, but for a pointed example, here's a pretty simple query where I'm just asking for '2016-06-01' but I get back '2016-05-31'... why?
hive> SET i="2016-06-01";
hive> with uniq_dates AS (
> SELECT DISTINCT date_val as date_val
> FROM proto_hourly
> WHERE date_val = date(${hiveconf:i}) )
> select * from uniq_dates;
Query ID = hive_20160616154318_a75b3343-a2fe-41a5-b02a-d9cda8695c91
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1465936275203_0023)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.63 s
--------------------------------------------------------------------------------
OK
2016-05-31
Time taken: 6.738 seconds, Fetched: 1 row(s)
hive>
Testing this a bit more, I found that there was one server in the cluster configured in a different timezone. Two of the three nodes were UTC, but one node was still on America/Denver.
I believe what was happening was that the Map/Reduce jobs were executing on the server in the different timezone, thus giving me the weird date offset.
2016-06-01 00:00 UTC is indeed 2016-05-31 17:00 America/Denver.
Silent TZ conversion...
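A quick way to spot such a mismatch from inside Hive is to make the nodes render the epoch: from_unixtime formats the value in the local timezone of the JVM that evaluates it (caveat: a trivial query may be answered by a local fetch task on HiveServer2 rather than a cluster container, so run it against real data a few times):

SELECT from_unixtime(0) FROM proto_hourly LIMIT 1;

A UTC node prints 1970-01-01 00:00:00, while an America/Denver node prints 1969-12-31 17:00:00.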

H2-Database CommandCentre: CSVREAD skips loading the first(!) csv-Line of Data

Folks,
H2 skips/drops the FIRST line of the following csv dataset,
and I couldn't find a solution or workaround.
I have already looked through the various H2 tutorials and of course skimmed
the internet ...
Am I the only one (a newbie; my "home" is the IBM mainframe)
who has such a problem inserting into an H2 database using CSVREAD?
In this example I expected the CSVREAD utility to insert 5 (five!) lines
into the created table "VL01T098".
!!! There is no "column header line" in the csv dataset; I get the data this way only !!!
AJ52B1;999;2013-01-04;2014-03-01;03Z;A
AJ52C1;777;2012-09-03;2012-08-19;03Z;
AJ52B1;;2013-01-04;2014-03-01;;X
AJ52B1;321;2014-05-12;;03Z;Y
AJ52B1;999;;2014-03-01;03Z;Z
And here is my SQL (from the H2-joboutput):
DROP TABLE IF EXISTS VL01T098;
Update count: 0
(0 ms)
CREATE TABLE VL01T098 (
MODELL CHAR(6)
, FZG_STAT CHAR(3)
, ABGABE_DATUM DATE
, VERSAND_DATUM DATE
, FZG_GRUPPE CHAR(3)
, AV_KZ CHAR(1))
AS SELECT * FROM
CSVREAD
('D:\VL01D_Test\LOAD-csv\T098.csv',
null,
'charset=UTF-8 fieldSeparator=; lineComment=#');
COMMIT;
select count(*) from VL01T098;
select * from VL01T098;
MODELL FZG_STAT ABGABE_DATUM VERSAND_DATUM FZG_GRUPPE AV_KZ
AJ52C1 777 2012-09-03 2012-08-19 03Z null
AJ52B1 null 2013-01-04 2014-03-01 null X
AJ52B1 321 2014-05-12 null 03Z Y
AJ52B1 999 null 2014-03-01 03Z Z
(4 rows, 0 ms)
Where has just the first csv line gone ... and why is it lost?
Could you please help an H2 newbie ... with some IBM DB2 experience?
Many thanks in advance
Achim
You didn't specify a column list in the CSVREAD function. That means the column list is read from the file, as documented:
"If the column names are specified (a list of column names separated with the fieldSeparator), those are used, otherwise (or if they are set to NULL) the first line of the file is interpreted as the column names."
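So, assuming the file's columns map one-to-one onto the table, pass the column list explicitly; note that, per the documentation above, it has to be separated by your fieldSeparator, i.e. by semicolons here. A sketch of the corrected statement:

CREATE TABLE VL01T098 (
MODELL CHAR(6)
, FZG_STAT CHAR(3)
, ABGABE_DATUM DATE
, VERSAND_DATUM DATE
, FZG_GRUPPE CHAR(3)
, AV_KZ CHAR(1))
AS SELECT * FROM
CSVREAD
('D:\VL01D_Test\LOAD-csv\T098.csv',
'MODELL;FZG_STAT;ABGABE_DATUM;VERSAND_DATUM;FZG_GRUPPE;AV_KZ',
'charset=UTF-8 fieldSeparator=; lineComment=#');

With the column names supplied, the first line (AJ52B1;999;...) is loaded as data and you should get all five rows.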
