HQL throws ArrayList cannot be cast to org.apache.hadoop.io.Text - hadoop

I have a query which fails when reducing, the error which is thrown is:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
However, when going deeper into the YARN logs, I was able to find this:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:265) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"reducesinkkey0":"2020-05-05","reducesinkkey1":10039,"reducesinkkey2":103,"reducesinkkey3":"2020-05-05","reducesinkkey4":10039,"reducesinkkey5":103},"value":{"_col0":103,"_col1":["1","2"]}} at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:253) ... 7 more Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
The most relevant part being:
java.util.ArrayList cannot be cast to org.apache.hadoop.io.Text
This is the query which I'm executing (FYI: this is a subquery within a bigger query):
SELECT
yyyy_mm_dd,
h_id,
MAX(CASE WHEN rn=1 THEN prov_id ELSE NULL END) OVER (partition by yyyy_mm_dd, h_id) as primary_prov,
collect_set(api) OVER (partition by yyyy_mm_dd, h_id, p_id) prov_id_api, --re-assemple array to include all elements from multiple initial arrays if there are different arrays per prov_id
prov_id
FROM(
SELECT --get "primary prov" (first element in ascending array))
yyyy_mm_dd,
h_id,
prov_id,
api,
ROW_NUMBER() OVER(PARTITION BY yyyy_mm_dd, h_id ORDER BY api) rn
FROM(
SELECT --explode array to get data at row level
t.yyyy_mm_dd,
t.h_id,
prov_id,
collect_set(--array of integers, use set to remove duplicates
CASE
WHEN e.apis_xml_element = 'res' THEN 1
WHEN e.apis_xml_element = 'av' THEN 2
...
...
ELSE e.apis_xml_element
END) as api
FROM
mytable t
LATERAL VIEW EXPLODE(apis_xml) e AS apis_xml_element
WHERE
yyyy_mm_dd = "2020-05-05"
AND t.apis_xml IS NOT NULL
GROUP BY
1,2,3
)s
)s
I have further narrowed the issue down to the top level select, as the inner select works fine by itself, which makes me believe the issue is happening here specifically:
collect_set(api) OVER (partition by yyyy_mm_dd, h_id, prov_id) prov_id_api
However, I'm unsure how to solve it. At the most inner select, apis_xml is an array<string> which holds strings such as 'res' and 'av' up until a point. Then integers are used. Hence the case statement to align these.
Strangely, if I run this via Spark i.e. spark.sql=(above_query), it works. However, on beeline via HQL, the job gets killed.

Remove collect_set in the inner query, because it already produces array, upper collect_set should receive scalars. Also remove group by in the inner query, because without collect_set there is no aggregation any more. You can use DISTINCT if you need to remove duplicates

Related

ORACLE APEX-ORA 01722 Invalid Number ERROR

I don't understand , why I am taking this error.
SELECT
to_char(view_date, 'Month') MONAT ,
COUNT(*) AS countx
FROM
AXY_TABLE
GROUP BY
to_char(view_date,'Month')
ORDER BY
to_char(view_date, 'Month'),
COUNT(*) desc;
When I execute this Query for a Interactive Report, it throws ORA-01722 Error. This Query run not only correctly in SQL developer but also as Classic Report correctly. When I changed the type to Interactive Report, throws it again the same error.
What should I do ?
Thanks a lot in advance.
Is the problem the ORDER BY clause? Try removing the aggregation from it:
SELECT
to_char(view_date, 'Month') MONAT ,
COUNT(*) AS countx
FROM
AXY_TABLE
GROUP BY
to_char(view_date,'Month')
ORDER BY
to_char(view_date, 'Month'),
2 desc;
This orders by the position of the second column of the projection. However, it is strictly unnecessary, as your result set will conatin only one row per MONTH, so you only need to sort by that.

Is there any replacement for ROWNUM in Oracle?

I have JPA Native queries to an Oracle database. The only way I know to limit results is using 'rownum' in Oracle, but for some reason, query parser of a jar driver I have to use does not recognize it.
Caused by: java.sql.SQLException: An exception occurred when executing the following query: "/* dynamic native SQL query */ SELECT * from SFDC_ACCOUNT A where SBSC_TYP = ? and rownum <= ?". Cause: Invalid column name 'rownum'. On line 1, column 90. [parser-2900650]
com.compositesw.cdms.services.parser.ParserException: Invalid column name 'rownum'. On line 1, column 90. [parser-2900650]
How can I get rid of that?
ANSI Standard would be something like the following
SELECT *
FROM (
SELECT
T.*,
ROW_NUMBER() OVER (PARTITION BY T.COLUMN ORDER BY T.COLUMN) ROWNUM_REPLACE
FROM TABLE T
)
WHERE
1=1
AND ROWNUM_REPLACE < 100
or you could also use the following:
SELECT * FROM TABLE T
ORDER BY T.COLUMN
OFFSET 0 ROWS
FETCH NEXT 100 ROWS ONLY;

Hive Getting error on group by column while using case statements and aggregations

I am working on a query in hive. In that I am using aggregations like sum and case statements and group by clause. I have changed the column names and table names but my logic is same which I was using in my project
select
empname,
empsal,
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when tot_sal > 1000 then exp(tot_hike)
else 0
end as manager
from employee
group by
empname,
empsal,
emphike
For the above query I was getting error as "Expression not in group by key '1000'".
So I have slightly modified the query and tried again My other query is
select
empname,
empsal,
emphike,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
case when sum(empsal) > 1000 then exp(sum(emphike))
else 0
end as manager
from employee
group by
empname,
empsal,
emphike
For above query its putting me error as "Expression not in group by key 'Manager'".
When I add manager in the group by its showing invalid alias.
Please help me out here
I see three issues in your query:
1.) Hive cannot group by a variable you defined in the select block by the name you gave it right away. You will probably need a subquery for that.
2.) Hive tends to show errors when sum or count operations are not at the end of the query.
3.) Although I do not know what your goal is, I think that your query will not deliver the desired result. If you group by empsal there would be no difference between empsal and sum(empsal) by design. Same goes for emphike and sum(emphike).
I think the following query might solve these issues:
select
a.empname,
a.tot_sal,
a.tot_hike,
if(a.tot_sal > 1000, exp(a.tot_hike), 0) as manager
from
(select
empname,
sum(empsal) as tot_sal,
sum(emphike) as tot_hike,
from employee
group by
empname
)a
The if statement is equivalent to your case statement, however I find it a bit easier to read.
In this example you wouldn't need to group by after the subquery because the grouping is done in the subquery a.

Hive 0.13.0, partition and buckets

I am getting an error message, which is very different from 2 test runs.
I verified data types data and exactly - double value but there is an issue with type cast.
Why does this occur?, please help me to fix
DROP TABLE XXSCM_SRC_SHIPMENTS;
CREATE TABLE IF NOT EXISTS XXSCM_SRC_SHIPMENTS(
INVENTORY_ITEM_ID DOUBLE
,ORDERED_ITEM STRING
,SHIP_FROM_ORG_ID DOUBLE
,QTR_START_DATE STRING
,QTR_END_DATE STRING
,SEQ DOUBLE
,EXTERNAL_SHIPMENTS DOUBLE
-- ,PREV_EXTERNAL_SHIPMENTS DOUBLE
,INTERNAL_SHIPMENTS DOUBLE
--,PREV_INTERNAL_SHIPMENTS DOUBLE
,AVG_SELL_PRICE DOUBLE)
--,PREV_AVG_SELL_PRICE DOUBLE)
COMMENT 'DIMENTION FOR THE SHIPMENTS LOCAL AND GLOBAL'
PARTITIONED BY (ORGANIZATION_CODE STRING, FISCAL_PERIOD STRING)
CLUSTERED BY (INVENTORY_ITEM_ID, ORDERED_ITEM, SHIP_FROM_ORG_ID, QTR_START_DATE, QTR_END_DATE, SEQ)
SORTED BY (INVENTORY_ITEM_ID ASC, ORDERED_ITEM ASC, SHIP_FROM_ORG_ID ASC, QTR_START_DATE ASC, QTR_END_DATE ASC, SEQ ASC)
INTO 256 BUCKETS
STORED AS ORC TBLPROPERTIES("orc.compress"="SNAPPY");
1) Error Fails
SELECT inventory_item_id,ordered_item,ship_from_org_id,qtr_start_date,qtr_end_date,seq,external_shipments FROM supply_chain_pcam.XXSCM_SRC_SHIPMENTS limit 100
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DoubleWritable
2) Got the result successfully
hive -e "set hive.cli.print.header=true;select * from supply_chain_pcam.xxscm_src_shipments limit 100"
There is a problem in SHIP_FROM_ORG_ID field. You will need to re-create the table XXSCM_SRC_SHIPMENTS with the correct data type. Hive is not able to parse that field. You almost had the answer - if
select *
is fetching the result then try-out individual fields - Here it says a cast exception to double so you can take only the double fields.

Unable to get only first occurrence of each job

I am trying to query some jobs from a repo, however I only need the job with the latest start time. I have tried using ROW_NUMBER for this and select only row number 1 for each job, however it doesn't seem to fall through:
SELECT a.jobname||','||a.projectname||','||a.startdate||','||a.enddate||','||
ROW_NUMBER() OVER ( PARTITION BY a.jobname ORDER BY a.startdate DESC ) AS "rowID"
FROM taskhistory a
WHERE a.jobname IS NOT NULL AND a.startdate >= (SYSDATE-1))LIMIT 1 AND rowID = 1;
ERROR at line 7:
ORA-00932: inconsistent datatypes: expected ROWID got NUMBER
Can I please ask for some assistance?
You have aliased your concatenated string "rowID" which is a mistake because it clashes with the Oracle keyword rowid. This is a special datatype, which allows us to identify table rows by their physical location. Find out more.
When you reference the column alias you omitted the fouble quotes. Oracle therefore interprets it as the keyword, rowid, and expects an expression which can be converted to the ROWID datatype.
Double-quoted identifiers are always a bad idea. Avoid them unless truly necessary.
Fixing the column alias will reveal the logic bug in your code. You are concatenating a whole slew of columns together, including the ROW_NUMBER() function, and calling that string "rowID". Clearly that string is never going to equal one, so this will filter out all rows:
and "rowID" = 1
Also LIMIT is not valid in Oracle.
What you need to do is use a sub-query, like this
SELECT a.jobname||','
||a.projectname||','
||a.startdate||','
||a.enddate||','
||to_char(a.rn) as "rowID"
FROM (
SELECT jobname
, projectname
, startdatem
, enddate,
, ROW_NUMBER() OVER ( PARTITION BY jobname
ORDER BY startdate DESC ) AS RN
FROM taskhistory
WHERE jobname IS NOT NULL
AND a.startdate >= (SYSDATE-1)
) a
where a.RN = 1;
Concatenating the projection like that seems an odd thing to do but I don't understand your business requirements.

Resources