Hive 0.13.0, partition and buckets - hadoop

I am getting very different results from two test runs of what should be the same data: one query fails with an error while the other succeeds.
I verified the data types and the data itself - the values are doubles - but there is still a type-cast issue.
Why does this occur, and how can I fix it?
DROP TABLE XXSCM_SRC_SHIPMENTS;
CREATE TABLE IF NOT EXISTS XXSCM_SRC_SHIPMENTS(
INVENTORY_ITEM_ID DOUBLE
,ORDERED_ITEM STRING
,SHIP_FROM_ORG_ID DOUBLE
,QTR_START_DATE STRING
,QTR_END_DATE STRING
,SEQ DOUBLE
,EXTERNAL_SHIPMENTS DOUBLE
-- ,PREV_EXTERNAL_SHIPMENTS DOUBLE
,INTERNAL_SHIPMENTS DOUBLE
--,PREV_INTERNAL_SHIPMENTS DOUBLE
,AVG_SELL_PRICE DOUBLE)
--,PREV_AVG_SELL_PRICE DOUBLE)
COMMENT 'DIMENSION FOR THE SHIPMENTS LOCAL AND GLOBAL'
PARTITIONED BY (ORGANIZATION_CODE STRING, FISCAL_PERIOD STRING)
CLUSTERED BY (INVENTORY_ITEM_ID, ORDERED_ITEM, SHIP_FROM_ORG_ID, QTR_START_DATE, QTR_END_DATE, SEQ)
SORTED BY (INVENTORY_ITEM_ID ASC, ORDERED_ITEM ASC, SHIP_FROM_ORG_ID ASC, QTR_START_DATE ASC, QTR_END_DATE ASC, SEQ ASC)
INTO 256 BUCKETS
STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY");
1) This query fails with an error:
SELECT inventory_item_id,ordered_item,ship_from_org_id,qtr_start_date,qtr_end_date,seq,external_shipments FROM supply_chain_pcam.XXSCM_SRC_SHIPMENTS limit 100
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable
2) This query returns the result successfully:
hive -e "set hive.cli.print.header=true;select * from supply_chain_pcam.xxscm_src_shipments limit 100"

There is a problem with the SHIP_FROM_ORG_ID field: Hive is not able to read it as the type declared in the table, so you will need to re-create the table XXSCM_SRC_SHIPMENTS with the correct data type. You almost had the answer - if
select *
fetches results while selecting named columns fails, try the fields one at a time. The exception is a cast from IntWritable to DoubleWritable, so go through the DOUBLE columns to find the one whose underlying data is actually stored as integers.
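A minimal probing sketch, assuming the table above (the columns shown and the LIMIT size are arbitrary, and the INT type at the end is only an illustration of a possible correct type):
-- Probe the DOUBLE columns one at a time; the one that raises the
-- IntWritable -> DoubleWritable ClassCastException is the one whose
-- underlying ORC data was written with a different type.
SELECT inventory_item_id FROM supply_chain_pcam.xxscm_src_shipments LIMIT 100;
SELECT ship_from_org_id FROM supply_chain_pcam.xxscm_src_shipments LIMIT 100;
SELECT seq FROM supply_chain_pcam.xxscm_src_shipments LIMIT 100;
-- Then re-create the table with a type that matches the data
-- (for example SHIP_FROM_ORG_ID INT instead of DOUBLE) and reload it.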

Related

LIMIT with a select count subquery works in version 21.4.5.46 but does not work in 21.10.2.15

The SQL is:
select * from mytable order by sid limit (select toInt64(count(cid)*0.01) from mytable);
The SQL works very well in version 21.4.5.46 but does not work in 21.10.2.15.
Exception is : [1002] ClickHouse exception, message: Code: 440. DB::Exception: Illegal type Nullable(Int32) of LIMIT expression, must be numeric type. (INVALID_LIMIT_EXPRESSION) (version 21.10.2.15 (official build))
How to reproduce:
1. Create the table:
create table mytable(cid String,create_time String,sid String)engine = MergeTree PARTITION BY sid ORDER BY cid SETTINGS index_granularity = 8192;
2. Execute the query:
select * from mytable order by sid limit (select toInt64(count(cid)*0.01) from mytable);
ClickHouse release v21.9, 2021-09-09
Backward Incompatible Change
Now, a scalar subquery always returns a Nullable result if its type can be Nullable. It is needed because in case of an empty subquery its result should be Null. Previously, it was possible to get an error about incompatible types (type deduction does not execute the scalar subquery, and it could use a non-nullable type). A scalar subquery with an empty result which can't be converted to Nullable (like Array or Tuple) now throws an error. Fixes #25411. #26423 (Nikolai Kochetov).
Now you should use
SELECT *
FROM mytable
ORDER BY sid ASC
LIMIT assumeNotNull((
SELECT toUInt64(count(cid) * 0.01)
FROM mytable
))
Query id: e3ab56af-96e4-4e01-812d-39af945d7878
Ok.
0 rows in set. Elapsed: 0.004 sec.
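If you prefer to spell out the fallback for a NULL scalar instead of using assumeNotNull (whose result is undefined for a NULL input), ifNull with a default of 0 should also give LIMIT the non-Nullable number it requires; a sketch, not verified on 21.10.2.15:
SELECT *
FROM mytable
ORDER BY sid ASC
LIMIT ifNull((
    SELECT toUInt64(count(cid) * 0.01)
    FROM mytable
), 0)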

Retrieving data from oracle table using scala jdbc giving wrong results

I am using scala jdbc to check whether a partition exists for an oracle table. It is returning wrong results when an aggregate function like count(*) is used.
I have checked the DB connectivity and other queries are working fine. I have tried to extract the value of count(*) using an alias, but it failed. I also tried using getString, but that failed too.
import java.sql.DriverManager

Class.forName(jdbcDriver)
val connection = DriverManager.getConnection(jdbcUrl, dbUser, pswd)
val statement = connection.createStatement()
try {
  val sqlQuery = s"""SELECT COUNT(*) FROM USER_TAB_PARTITIONS WHERE
    TABLE_NAME = '$tableName' AND PARTITION_NAME = '$partitionName'"""
  val resultSet1 = statement.executeQuery(sqlQuery)
  while (resultSet1.next()) {
    val cnt = resultSet1.getInt(1)   // COUNT(*) is the first (and only) column
    println("Count=" + cnt)
    if (cnt == 0) {
      // Code to add partition and insert data
    } else {
      // Code to insert data in existing partition
    }
  }
} catch { case e: Exception => ... }
The value of cnt always prints as 0 even though the Oracle partition already exists. Can you please tell me what the error in the code is? Is it giving wrong results because I am using Scala JDBC to get the result of an aggregate function like count(*)? If so, what would be the correct code? I need to use Scala JDBC to check whether the partition already exists in Oracle and then insert data accordingly.
This is just a suggestion, but it might be the solution in your case.
Whenever you query Oracle's metadata tables, always use UPPER or LOWER on both sides of the equals sign.
Oracle converts every object name to upper case and stores it in the metadata, unless you specifically provided a lower-case object name in double quotes when creating it.
Take the following example:
-- 1
CREATE TABLE "My_table_name1" ... -- CASE SENSITIVE
-- 2
CREATE TABLE My_table_name2 ... -- CASE INSENSITIVE
In the first statement we used double quotes, so the name is stored in Oracle's metadata as a case-sensitive name.
In the second statement we did not use double quotes, so the table name is converted to upper case and stored in Oracle's metadata.
So if you want to write a query against Oracle's metadata that covers both of the above cases, apply UPPER or LOWER to both the column name and the value, as follows:
SELECT * FROM USER_TABLES WHERE UPPER(TABLE_NAME) = UPPER('<YOUR TABLE NAME>');
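Applied to the partition check from the question, it would look something like this (a sketch using the same placeholder style; USER_TAB_PARTITIONS stores both the table name and the partition name in upper case unless they were created as quoted lower-case identifiers):
SELECT COUNT(*)
FROM USER_TAB_PARTITIONS
WHERE UPPER(TABLE_NAME) = UPPER('<YOUR TABLE NAME>')
AND UPPER(PARTITION_NAME) = UPPER('<YOUR PARTITION NAME>');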
Hope this helps you solve the issue.
Cheers!!

Is there any replacement for ROWNUM in Oracle?

I have JPA native queries against an Oracle database. The only way I know to limit results in Oracle is 'rownum', but for some reason the query parser of the driver jar I have to use does not recognize it.
Caused by: java.sql.SQLException: An exception occurred when executing the following query: "/* dynamic native SQL query */ SELECT * from SFDC_ACCOUNT A where SBSC_TYP = ? and rownum <= ?". Cause: Invalid column name 'rownum'. On line 1, column 90. [parser-2900650]
com.compositesw.cdms.services.parser.ParserException: Invalid column name 'rownum'. On line 1, column 90. [parser-2900650]
How can I get rid of that?
The ANSI-standard approach would be something like the following:
SELECT *
FROM (
SELECT
T.*,
ROW_NUMBER() OVER (ORDER BY T.COLUMN) AS ROWNUM_REPLACE
FROM TABLE T
)
WHERE
1=1
AND ROWNUM_REPLACE < 100
Or, if you are on Oracle 12c or later, you could also use the row-limiting clause:
SELECT * FROM TABLE T
ORDER BY T.COLUMN
OFFSET 0 ROWS
FETCH NEXT 100 ROWS ONLY;
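Applied to the query from the stack trace, a sketch with a bind parameter (assuming the driver's parser accepts the standard row-limiting syntax; SFDC_ACCOUNT and SBSC_TYP come from the question):
SELECT *
FROM SFDC_ACCOUNT A
WHERE SBSC_TYP = ?
FETCH FIRST ? ROWS ONLY;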

Unable to get only first occurrence of each job

I am trying to query some jobs from a repository, but I only need the job with the latest start time for each job name. I have tried using ROW_NUMBER and selecting only row number 1 for each job, however it doesn't seem to work:
SELECT a.jobname||','||a.projectname||','||a.startdate||','||a.enddate||','||
ROW_NUMBER() OVER ( PARTITION BY a.jobname ORDER BY a.startdate DESC ) AS "rowID"
FROM taskhistory a
WHERE a.jobname IS NOT NULL AND a.startdate >= (SYSDATE-1))LIMIT 1 AND rowID = 1;
ERROR at line 7:
ORA-00932: inconsistent datatypes: expected ROWID got NUMBER
Can I please ask for some assistance?
You have aliased your concatenated string "rowID", which is a mistake because it clashes with the Oracle keyword ROWID. ROWID is a special datatype which identifies table rows by their physical location.
When you referenced the column alias you omitted the double quotes, so Oracle interpreted it as the keyword ROWID and expected an expression which can be converted to the ROWID datatype.
Double-quoted identifiers are always a bad idea. Avoid them unless truly necessary.
Fixing the column alias will reveal the logic bug in your code. You are concatenating a whole slew of columns together, including the ROW_NUMBER() function, and calling that string "rowID". Clearly that string is never going to equal one, so this will filter out all rows:
and "rowID" = 1
Also LIMIT is not valid in Oracle.
What you need to do is use a sub-query, like this
SELECT a.jobname||','
       ||a.projectname||','
       ||a.startdate||','
       ||a.enddate||','
       ||to_char(a.rn) as "rowID"
FROM (
      SELECT jobname
           , projectname
           , startdate
           , enddate
           , ROW_NUMBER() OVER ( PARTITION BY jobname
                                 ORDER BY startdate DESC ) AS RN
      FROM taskhistory
      WHERE jobname IS NOT NULL
      AND startdate >= (SYSDATE-1)
     ) a
where a.RN = 1;
Concatenating the projection like that seems an odd thing to do but I don't understand your business requirements.

Group By in Hive on partitioned table gives duplicate result rows

Using release 0.11.0, I get incorrect results when trying to execute this query:
select t1.symbol, max(t1.maxts - t1.orderts) as diff from
(select catid, symbol, max(cast(timestamp as double)*1000) as maxts, min(cast(timestamp as double)*1000) as orderts, count(*) as cnt
from cat where recordtype in (0,1) and customerid=srcrepid group by symbol, catid) t1
where t1.cnt > 1
group by t1.symbol;
As you can see, there is a subquery with a group by statement. This subquery calculates the maximum and minimum of a timestamp value per CATID and SYMBOL.
Now, I have 24 symbols. In the outer query, I want to find the max difference per SYMBOL and so I group by SYMBOL.
The problem is that this currently returns 864 result rows. Hive seems to fail to reduce the final result to the 24 rows (one per symbol) that I would expect.
Is this a bug? Can anyone reproduce this? I have 6 nodes running with 4 symbols per node.
Table used:
create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
Edit: the query above was edited because it was inconsistent with the actual table used, making it hard to provide any help.
As explained by Yin on the Hive mailing list, this is caused by a related Hive bug.
When Hive uses only a single MapReduce job, both partitioning columns are used for the final grouping, whereas my query only wants to group by symbol.
Evidently this bug has been fixed in trunk.
There is another bug report that demonstrates the problem more clearly.
I think it might work if, in the outer query, you structure it this way:
SELECT t1.symbol, max(t1.maxts) - min(t1.orderts) AS diff, ....
I have seen that introducing an ORDER BY clause after the first GROUP BY forces Hive into two MR jobs and thereby gives the correct results.
As requested, here is the modified query as an example:
select t1.symbol, max(t1.maxts - t1.orderts) as diff from
(select catid, symbol, max(cast(timestamp as double)*1000) as maxts, min(cast(timestamp as double)*1000) as orderts, count(*) as cnt
from cat where recordtype in (0,1) and customerid=srcrepid group by symbol, catid ORDER BY symbol, catid) t1
where t1.cnt > 1
group by t1.symbol;
But yes, this is still only a workaround. The real problem is that Hive uses the wrong partitioning fields in that query: it should have used just symbol, but if you look at the explain plan it uses both symbol and catid, which causes it to produce multiple result rows per symbol.
Adding the ORDER BY forces Hive to do the second group by in a separate MR job, giving us the right results.
