I have a CSV file containing data as described below:
- Data1|data2|data3....
- Data4|data5|data6....
- Ctr|1|2
- Lst|1|30
- Lst|1|40
- Lst|1|50
- Data7|data8....
- Ctr|2|3
- Lst|2|60
- Lst|2|70
I have a table controle (
data_type varchar,
Id_control varchar,
Type_liste varchar,
Id_subcontrol varchar
)
I am using SQL*Loader to fill the table; the result I expect is:
- Data_type | Id_control | Type_liste | Id_subcontrol
- Ctr | 1 | NULL | 2
- Lst | 1 | 30 | NULL
- Lst | 1 | 40 | NULL
- Lst | 1 | 50 | NULL
- Ctr | 2 | NULL | 3
- Lst | 2 | 60 | NULL
- Lst | 2 | 70 | NULL
I've tried this, but the second part returns 0 rows loaded:
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE controle
WHEN (1:4) = 'Ctr|'
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(
Data_type CHAR,
Id_control CHAR,
Id_subcontrol CHAR
)
INTO TABLE controle
WHEN (1:3) = 'Lst'
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(
Data_type CHAR,
Id_control CHAR,
Type_liste CHAR
)
Any idea, please?
Thanks in advance.
As the docs say:
A key point when using multiple INTO TABLE clauses is that field
scanning continues from where it left off when a new INTO TABLE clause
is processed. The remainder of this section details important ways to
make use of that behavior. It also describes alternative ways of using
fixed field locations or the POSITION parameter.
So when processing the Lst condition, SQL*Loader continues looking for fields on the current row, after those already consumed by the first INTO TABLE clause. You can reset this by defining the first field with POSITION so that scanning restarts at the start of the line:
LOAD DATA
INFILE *
TRUNCATE
INTO TABLE controle
WHEN Data_type = 'Ctr'
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(
Data_type CHAR,
Id_control CHAR,
Id_subcontrol CHAR
)
INTO TABLE controle
WHEN Data_type = 'Lst'
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(
Data_type POSITION(1:3) CHAR,
Id_control CHAR,
Type_liste CHAR
)
BEGINDATA
Data1|data2|data3
Data4|data5|data6
Ctr|1|2
Lst|1|30
Lst|1|40
Lst|1|50
Data7|data8|data9
Ctr|2|3
Lst|2|60
Lst|2|70
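As a quick sanity check (a sketch; output formatting will vary by client), a plain query after the load should return the rows the question expected:
SELECT data_type, id_control, type_liste, id_subcontrol FROM controle;
-- Ctr | 1 | NULL | 2
-- Lst | 1 | 30   | NULL
-- Lst | 1 | 40   | NULL
-- Lst | 1 | 50   | NULL
-- Ctr | 2 | NULL | 3
-- Lst | 2 | 60   | NULL
-- Lst | 2 | 70   | NULL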
I have this function:
CREATE OR REPLACE FUNCTION array_drop_null(anyarray)
RETURNS anyarray
AS $$
BEGIN
RETURN array(SELECT x FROM UNNEST($1) x WHERE x IS NOT NULL);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
I get correct behavior with a non-empty array:
asset=# select 1 FROM array_drop_null(ARRAY[1,2,NULL]::int[] ) x WHERE ARRAY_LENGTH(x, 1) = 2;
?column?
----------
1
(1 row)
However, when I pass an empty array or NULL, I get this:
asset=# select array_drop_null('{}'::int[] )
asset-# ;
ERROR: plpgsql functions cannot take type anyarray
CONTEXT: compile of PL/pgSQL function "array_drop_null" near line 0
asset=# SELECT '{}'::INT[];
int4
------
{}
(1 row)
When I pass a column with TEXT[] to the function, it returns {} if the column entry is NULL.
Another case:
asset=# select 1 FROM array_drop_null(ARRAY[NULL]::int[] ) x WHERE x IS NULL;
?column?
----------
1
(1 row)
It doesn't return {}; instead it returns NULL.
I am confused about this behavior. Can someone explain what is going on? What is the proper way to pass an empty array or NULL?
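For reference, the function body is a single SQL expression, so the same logic can be written as an SQL-language function (a minimal sketch for comparison, not from the original post; array_drop_null_sql is just an illustrative name):
CREATE OR REPLACE FUNCTION array_drop_null_sql(anyarray)
RETURNS anyarray
AS $$
  -- Same expression as the plpgsql body: keep only the non-NULL elements.
  SELECT array(SELECT x FROM unnest($1) x WHERE x IS NOT NULL);
$$ LANGUAGE sql IMMUTABLE;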
I got an error with the message "ORA-01722: invalid number" when I executed my query. As I dug in, I found out that I get that error when I use a numeric value for a column that expects a varchar.
In my case, my query had a CASE statement like:
CASE WHEN CHAR_COLUMN = 1 THEN 'SOME VALUE' END
But the behavior was different in different instances of the same Oracle version. The same query worked on one of our dev Oracle servers, but not on the other. I am curious whether there is any configuration that allows Oracle to accept a numeric value in a comparison against a character column.
PS: The Oracle version we are using is Oracle Database 11g Express Edition Release 11.2.0.2.0 - 64bit Production.
I never rely on implicit data type conversion but always make it explicit:
CASE WHEN CHAR_COLUMN = TO_CHAR(1) THEN 'SOME VALUE' END
Or avoid the conversion entirely:
CASE WHEN CHAR_COLUMN = '1' THEN 'SOME VALUE' END
The reason is that Oracle tends to convert the character string to a number, not the other way round. See the example from the linked manual page:
Text Literal Example
The text literal '10' has data type CHAR. Oracle implicitly converts it to the NUMBER data type if it appears in a numeric expression as in the following statement:
SELECT salary + '10'
FROM employees;
To reproduce the issue:
create table foo (
CHAR_COLUMN varchar2(10)
);
create table bar (
CHAR_COLUMN varchar2(10)
);
insert into foo (CHAR_COLUMN) values ('1');
insert into foo (CHAR_COLUMN) values ('2');
insert into bar (CHAR_COLUMN) values ('1');
insert into bar (CHAR_COLUMN) values ('yellow');
Then, when querying against the numeric literal 1, the query on foo works and the query on bar fails with ORA-01722:
select * from foo where CHAR_COLUMN = 1;
select * from bar where CHAR_COLUMN = 1;
This also explains why the same query worked on one of your dev servers but not on the other: the error is data-dependent, not a matter of configuration. It only occurs when a row whose value cannot be converted to a number is actually scanned.
When you ask Oracle to resolve this comparison:
WHEN CHAR_COLUMN = 1
... Oracle converts the query internally to:
WHEN TO_NUMBER(CHAR_COLUMN) = 1
In my example, this can be spotted in the explain plan:
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 3 | 00:00:01 |
| * 1 | TABLE ACCESS FULL | FOO | 1 | 7 | 3 | 00:00:01 |
---------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 1 - filter(TO_NUMBER("CHAR_COLUMN")=1)
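Applied to the repro tables above, the explicit form avoids the implicit TO_NUMBER() and works against both tables (a small sketch):
select * from foo where CHAR_COLUMN = '1';
select * from bar where CHAR_COLUMN = '1';
-- Both return their row with CHAR_COLUMN = '1'; 'yellow' is never passed to TO_NUMBER.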
I am trying to COPY FROM a CSV file; I have both a timestamp and a time column.
Trying to test with a couple of rows to begin with:
cqlsh:tests> CREATE TABLE testts (
... ID int PRIMARY KEY,
... mdate timestamp,
... ttime time);
cqlsh:tests> INSERT INTO testts (ID , mdate, ttime )
... VALUES (1, '2015-10-12', '1055') ;
cqlsh:tests> INSERT INTO testts (ID , mdate, ttime )
... VALUES (2, '2014-06-25', '920') ;
cqlsh:tests> select * from testts;
id | mdate | ttime
----+--------------------------+--------------------
1 | 2015-10-12 07:00:00+0000 | 00:00:00.000001055
2 | 2014-06-25 07:00:00+0000 | 00:00:00.000000920
(2 rows)
The above works. Now I try the import file:
cqlsh:tests> COPY testts ( ID,
... mdate,
... ttime)
... FROM 'c:\cassandra228\testtime.csv' WITH HEADER = FALSE AND DELIMITER = ',' AND DATETIMEFORMAT='%Y/%m/%d';
Using 3 child processes
Starting copy of tests.testts with columns [id, mdate, ttime].
Failed to import 1 rows: ParseError - Failed to parse 1130 : can't interpret '1130' as a time, given up without retries
Failed to import 1 rows: ParseError - Failed to parse 1230 : can't interpret '1230' as a time, given up without retries
Failed to import 1 rows: ParseError - Failed to parse 930 : can't interpret '930' as a time, given up without retries
Failed to process 3 rows; failed rows written to import_tests_testts.err
Processed: 3 rows; Rate: 0 rows/s; Avg. rate: 1 rows/s
3 rows imported from 1 files in 3.269 seconds (0 skipped).
My timestamp column is formatted YYYY/MM/DD. Until I added DATETIMEFORMAT='%Y/%m/%d' I would get an error on the timestamp column, but after that the error stopped.
CSV file:
3,2010/02/08,930
4,2015/05/20,1130
5,2016/08/15,1230
How do I fix this?
Thanks much.
I have checked this with the same schema and data using cassandra-2.2.4's cqlsh, and all the values are inserted without any error. But with cassandra-2.2.8's cqlsh, it gives me the same error as yours.
You can fix this with a small change in the cqlsh code.
1. Open the copyutil.py file. In my case it was /opt/apache-cassandra-2.2.8/pylib/cqlshlib/copyutil.py
2. Find the method convert_time() and change it to this:
def convert_time(v, **_):
    # Try to interpret a bare value like 930 as an integer first;
    # fall back to the original string parsing for anything else.
    try:
        return Time(int(v))
    except ValueError:
        pass
    return Time(v)
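With this change, a bare value such as 930 is parsed as an integer and stored as nanoseconds past midnight (00:00:00.000000930), matching the behavior of the INSERT statements shown in the question; values that are not plain integers still fall through to the original Time(v) parsing.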
Kindest,
I am trying to extend some working Hive queries, but seem to fall short. I just want to test the GROUP BY function, which is common to a number of queries that I need to complete. Here is the query that I am trying to execute:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp)) AS hourly
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hourly;
This yields the following error:
ERROR: Error while executing Hive script.Query returned non-zero code:
10, cause: FAILED: Error in semantic analysis: Line 1:320 Invalid
table alias or column reference 'hourly': (possible column names are:
messagerowid, payload_sensor, messagetimestamp, payload_temp,
payload_timestamp, payload_timestampmysql, payload_watt,
payload_wattseconds)
The intention is to end up with a time-bound query (say, the last 24 hours) broken down by SUM() on payload_wattseconds etc. To get started on breaking out the creation of the summary tables, I began building a GROUP BY query that would derive the hourly anchor for the select query.
The problem, though, is the error above. I would greatly appreciate any pointers to what is wrong here; I can't seem to find it myself, but then again I am a newbie with Hive.
Thanks in advance.
UPDATE: Here is the reworked query that I just tried to run:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp))
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hour(from_unixtime(payload_timestamp));
That, however, gives another error:
ERROR: Error while executing Hive script.Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:7 Expression not in GROUP BY key 'messageRowID'
Thoughts?
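An aside on the two errors, assuming stock HiveQL semantics: the first query failed because Hive of this vintage does not let GROUP BY reference a SELECT-list alias such as hourly, and the second failed because every non-aggregated column in the SELECT list must appear in the GROUP BY clause. A minimal sketch that satisfies both rules:
SELECT hour(from_unixtime(payload_timestamp)) AS hourly,
       SUM(payload_wattseconds) AS total_watt_seconds
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hour(from_unixtime(payload_timestamp));
This is essentially the shape the query in UPDATE 3 below settles on.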
UPDATE #2) The following is a quick dump of a few samples that end up in the EVENT_KS CF in WSO2BAM. The last column is the number of watt-seconds, calculated in the perl daemon, which will be used in a query to compute the aggregate sum totalled into kWh; that will then be dumped into MySQL tables for syncing to the application that holds the UI/UX layer.
[12:03:00] [jskogsta#enterprise ../Product Centric Opco Modelling]$ ~/local/apache-cassandra-2.0.8/bin/cqlsh localhost 9160 -u admin -p admin --cqlversion="3.0.5"
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 1.2.13 | CQL spec 3.0.5 | Thrift protocol 19.36.2]
Use HELP for help.
cqlsh> use "EVENT_KS";
cqlsh:EVENT_KS> select * from currentcost_stream limit 5;
key | Description | Name | Nick_Name | StreamId | Timestamp | Version | payload_sensor | payload_temp | payload_timestamp | payload_timestampmysql | payload_watt | payload_wattseconds
-------------------------------------------+---------------------------+--------------------+----------------------+---------------------------+---------------+---------+----------------+--------------+-------------------+------------------------+--------------+---------------------
1403365575174::10.11.205.218::9443::9919 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403365575174 | 1.0.18 | 1 | 13.16 | 1403365575 | 2014-06-21 23:46:15 | 6631 | 19893
1403354553932::10.11.205.218::9443::2663 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354553932 | 1.0.18 | 1 | 14.1 | 1403354553 | 2014-06-21 20:42:33 | 28475 | 0
1403374113341::10.11.205.218::9443::11852 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403374113341 | 1.0.18 | 1 | 10.18 | 1403374113 | 2014-06-22 02:08:33 | 17188 | 154692
1403354501924::10.11.205.218::9443::1894 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354501924 | 1.0.18 | 1 | 10.17 | 1403354501 | 2014-06-21 20:41:41 | 26266 | 0
1403407054092::10.11.205.218::9443::15527 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403407054092 | 1.0.18 | 1 | 17.16 | 1403407054 | 2014-06-22 11:17:34 | 6332 | 6332
(5 rows)
cqlsh:EVENT_KS>
What I will be trying to do is issue a query against this table (actually several, depending on the various presentation aggregations that are required) and present a view based on hourly sums, 10-minute sums, daily sums, monthly sums, and so on. Depending on the query, the GROUP BY was intended to provide this 'index', so to speak. Right now I am just testing this, so we will see how it ends up. Hope this makes sense?!
So I am not trying to remove duplicates...
UPDATE 3) I was going about this all wrong, and thought a bit more about the tip that was given below. Just simplifying the whole query gave the right results. The following query gives the total amount of kWh on an hourly basis for the WHOLE dataset. With this, I can create the various iterations of kWh spent over various time periods, like:
- Hourly over the last 24 hours
- Daily over the last year
- Minute over the last hour
.. etc. etc.
Here is the query:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select hour(from_unixtime(payload_timestamp)) AS hourly, (sum(payload_wattseconds)/(60*60)/1000)
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
GROUP BY hour(from_unixtime(payload_timestamp));
This query yields the following based on the sample data:
hourly _c1
0 16.91570472222222
1 16.363228888888887
2 15.446414166666667
3 11.151388055555556
4 18.10564666666667
5 2.2734924999999997
6 17.370668055555555
7 17.991484444444446
8 38.632728888888884
9 16.001440555555554
10 15.887023888888889
11 12.709341944444445
12 23.052629722222225
13 14.986092222222222
14 16.182284722222224
15 5.881564999999999
18 2.8149172222222223
19 17.484405
20 15.888274166666665
21 15.387210833333333
22 16.088641666666668
23 16.49990916666667
This is the aggregate kWh per hourly timeframe over the entire dataset.
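A note on the arithmetic for anyone reading along: sum(payload_wattseconds) is in watt-seconds, dividing by 60*60 converts it to watt-hours, and the further division by 1000 converts watt-hours to kWh, so the unaliased _c1 column is kWh per hour-of-day bucket.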
So, now on to the next problem. ;-)