Getting GC overhead limit exceeded with Hive query - hadoop

I have a Hive table, tableA, which is partitioned on adv_id and dt:
col_name data_type
adv_id int
url string
dt int
I am running this query:
SELECT adv_id, url FROM tableA
WHERE
dt = '20200510'
AND adv_id IN (5039) limit 100;
Why do I get a GC overhead limit exceeded error, and how do I fix it?
Note that using adv_id = 5039 works here, but I need to check for multiple values, hence the IN clause.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.SingleByteCharsetConverter.toString(SingleByteCharsetConverter.java:335)
at com.mysql.jdbc.ResultSetRow.getString(ResultSetRow.java:819)
at com.mysql.jdbc.ByteArrayRow.getString(ByteArrayRow.java:70)
at com.mysql.jdbc.ResultSetImpl.getStringInternal(ResultSetImpl.java:5816)
at com.mysql.jdbc.ResultSetImpl.getString(ResultSetImpl.java:5693)
at org.datanucleus.store.rdbms.mapping.datastore.CharRDBMSMapping.getObject(CharRDBMSMapping.java:473)
at org.datanucleus.store.rdbms.mapping.java.SingleFieldMapping.getObject(SingleFieldMapping.java:220)
at org.datanucleus.store.rdbms.query.ResultClassROF.getObject(ResultClassROF.java:267)
at org.datanucleus.store.rdbms.query.ForwardQueryResult.nextResultSetElement(ForwardQueryResult.java:181)
at

Related

Limit select count subquery work in 21.4.5.46 version but can not work in 21.10.2.15

Sql is
select * from mytable order by sid limit (select toInt64(count(cid)*0.01) from mytable);
The SQL works very well in version 21.4.5.46 but does not work in 21.10.2.15.
Exception is : [1002] ClickHouse exception, message: Code: 440. DB::Exception: Illegal type Nullable(Int32) of LIMIT expression, must be numeric type. (INVALID_LIMIT_EXPRESSION) (version 21.10.2.15 (official build))
How to reproduce
1 create table sql:
create table mytable(cid String,create_time String,sid String)engine = MergeTree PARTITION BY sid ORDER BY cid SETTINGS index_granularity = 8192;
2 execute sql
select * from mytable order by sid limit (select toInt64(count(cid)*0.01) from mytable);
ClickHouse release v21.9, 2021-09-09
Backward Incompatible Change
Now, scalar subquery always returns Nullable result if it's type can be Nullable. It is needed because in case of empty subquery it's result should be Null. Previously, it was possible to get error about incompatible types (type deduction does not execute scalar subquery, and it could use not-nullable type). Scalar subquery with empty result which can't be converted to Nullable (like Array or Tuple) now throws error. Fixes #25411. #26423 (Nikolai Kochetov).
Now you should use
SELECT *
FROM mytable
ORDER BY sid ASC
LIMIT assumeNotNull((
SELECT toUInt64(count(cid) * 0.01)
FROM mytable
))
Query id: e3ab56af-96e4-4e01-812d-39af945d7878
Ok.
0 rows in set. Elapsed: 0.004 sec.

ORA-01652: Unable to extend temp space

I have a procedure, and one particular query in it generates about 50GB of temp space, causing the exception below after a few executions:
SQL state [72000]; error code [1652]; ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
DBAs are pointing to the below query in the stored procedure that needs to be rewritten. The tables used in the query below are smaller than 0.1 GB but the query generates 50GB of temp space!
SELECT tab1.ORDID ID1, tab2.ORDID ID2
FROM (
SELECT
OT.ORDID,
CONNECT_BY_ROOT OT.UNIQ_ORIG_KEY ORIG_UID
FROM order_tab OT, status_tab ST
WHERE OT.otype IN ('A','B')
AND OT.order_uid IS NULL
AND OT.BATCH_ID = ST.BATCH_ID
AND ST.CT_DATE = :A1
AND ST.BSTATUS = 1
CONNECT BY PRIOR OT.UNIQ_KEY = OT.UNIQ_ORIG_KEY
) tab1 , order_tab tab2
WHERE tab2.ORD_VERID = 1
AND tab1.ORIG_UID = tab2.UNIQ_KEY
ORDER BY ID1;
Could someone please help rewrite the query more efficiently so that temp space utilization is reduced? The database is Oracle 12c.

How to set range for limit clause in hive

How do I set a range for the LIMIT clause in Hive? I tried the query below, but it failed with a syntax error. Can someone please help?
select * from table limit 1000,2000;
You can use the ROW_NUMBER window function and set the range limit.
The query below returns only the first 20 records from the table:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 0 and rowid <=20;
Using the BETWEEN operator to specify the range:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid between 0 and 20;
To fetch rows 21 to 40, increase the lower and upper bound values:
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 20 and rowid <=40;
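The lower and upper bounds in the queries above follow directly from a page number and a page size. Here is a minimal Python sketch that builds the same HiveQL text. The helper name `row_number_page_query` and the 1-based page numbering are assumptions made for illustration and are not part of the original answer.

```python
def row_number_page_query(table, order_col, page, page_size):
    """Build a HiveQL string that selects one page of rows via ROW_NUMBER().

    Hypothetical helper for illustration only: `page` is 1-based, and the
    table/column names are interpolated as-is, so pass trusted identifiers.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    lower = (page - 1) * page_size  # exclusive lower bound on rowid
    upper = page * page_size        # inclusive upper bound on rowid
    return (
        "select * from\n"
        "(\n"
        f"SELECT *, ROW_NUMBER() over (Order by {order_col}) as rowid FROM {table}\n"
        ")t\n"
        f"where rowid > {lower} and rowid <= {upper};"
    )


if __name__ == "__main__":
    # Page 2 with 20 rows per page covers rowid 21 through 40.
    print(row_number_page_query("tab_name", "id", page=2, page_size=20))
```

The sketch only generates the query text. Running it against Hive still sorts the whole table to assign row numbers, so it does not make deep pages any cheaper.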
The LIMIT clause is used to set a ceiling on the number of rows in the result set. You are getting a syntax error because of an incorrect usage of this HQL clause.
The query could be written as the following to return no more than 2000 rows:
SELECT * FROM table LIMIT 2000;
You could also write it like so to return no more than 1000 rows:
SELECT * FROM table LIMIT 1000;
However you cannot combine both into the same argument for LIMIT. The LIMIT argument must evaluate to a constant value.
I will try and expand on this information a bit to try and help solve your problem. If you are attempting to "paginate" your results the following may be of use.
First, I would recommend against leaning on HQL for pagination. In most situations it is more efficiently implemented in the application logic (query a large result set, cache what you need, paginate in the application). If you have no choice but to pull out ranges of rows, you can get the desired effect through a combination of the LIMIT, ORDER BY, and OFFSET clauses.
LIMIT: This will limit your result set to a maximum number of rows.
ORDER BY: This will sort/order your result set based on one or more columns
OFFSET: This will start your result set at a certain row after the logical first entry in the table.
You may combine these three clauses to effectively query "pages" of your table. For example the following three queries show how to get the first 3 blocks of data from a table where each block contains 1000 rows and the target table's 'column1' is used to determine logical order.
SELECT title as "Page 1", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 0;
SELECT title as "Page 2", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 1000;
SELECT title as "Page 3", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 2000;
Each query declares 'column1' as the sorting value with ORDER BY. The queries will return no more than 1000 rows due to the LIMIT clause. Each result set will start at a different row due to the OFFSET being incremented by the "page size" for each query.
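The relationship between page number, LIMIT, and OFFSET can be checked on a small in-memory table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so it only illustrates the paging arithmetic and says nothing about which Hive versions accept this syntax. The table name `demo`, the helper names, and the 0-based page index are assumptions for illustration.

```python
import sqlite3


def make_demo_db(n_rows):
    """Create an in-memory table `demo` with n_rows rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo (column1 INTEGER, column2 TEXT)")
    conn.executemany(
        "INSERT INTO demo VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def fetch_page(conn, page, page_size):
    """Return one page of rows ordered by column1; `page` is 0-based."""
    offset = page * page_size  # OFFSET grows by one page size per page
    cur = conn.execute(
        "SELECT column1, column2 FROM demo ORDER BY column1 LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = make_demo_db(2500)
    for page in range(3):
        rows = fetch_page(conn, page, 1000)
        # Print page number, row count, and the first and last column1 values.
        print(page + 1, len(rows), rows[0][0], rows[-1][0])
```

With 2500 rows and a page size of 1000, the last page is short because LIMIT is only a ceiling on the number of rows returned.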
I am not sure what you are trying to achieve, but ...
That will return the 1001 and the 2001 record in the query result set only if you are using a Hive version greater than 2.0.0
hive --version
(https://issues.apache.org/jira/browse/HIVE-11531)
LIMIT in Hive returns 'n' records chosen arbitrarily. It is not meant to return a range of records.
You may use ORDER BY in conjunction with LIMIT to get what you want.

java.sql.SQLException: ORA-01000: maximum open cursors exceeded. While generating jasper report

While generating a report through Jasper we are getting the following exception ...
Error filling print... net.sf.jasperreports.engine.JRException: Error executing SQL statement for : risk
net.sf.jasperreports.engine.JRRuntimeException: net.sf.jasperreports.engine.JRException: Error executing SQL statement for : risk
at net.sf.jasperreports.engine.fill.JRFillSubreport.prepare(JRFillSubreport.java:729)
Caused by: java.sql.SQLException: ORA-01000: maximum open cursors exceeded
How to resolve this?
Connect to the database and check the open_cursors limit:
select value from v$parameter where name='open_cursors'
Then list the top 20 sessions that currently have the most open cursors:
select * from ( select ss.value, sn.name, ss.sid from v$sesstat ss, v$statname sn where ss.statistic# = sn.statistic# and sn.name like '%opened cursors current%' order by value desc) where rownum < 21;
The solution is to increase the value of the open_cursors parameter:
alter system set open_cursors=400 scope=both

Hive substr with group by cause slow performance

I have a table which has about 300,000 records in each partition (ds).
When I run the following query in Hive (a select with substr), it hangs at the step: map = 0%
select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
select substr(stat_time,1,8) as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
from db.table
where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id
However, if I change the inner query from select substr(stat_time,1,8) as stat_date to select stat_time as stat_date, it executes normally.
stat_time is formatted as YYYYMMDDHHmm, e.g. 201702210900.
select t0.stat_date,t0.plat,t0.soh,t0.page_name,t0.component_name ,count(*) as num,user_id,t0.other_info
from (
select stat_time as stat_date,user_id,city_id,soh,dept_id,dph,plat,page_name,component_name,cookie_id,other_info
from db.table
where ds=20170221 and plat='abc'
)t0
group by stat_date,plat,soh,page_name,component_name,other_info,t0.user_id
So why does substr lead to slow performance?
--
Edit:
I changed mapred.child.java.opts in mapred-site.xml and HADOOP_HEAPSIZE in hadoop-env.sh to 4G, and the query succeeded.
I guess that either storing or computing all the substr values uses a lot of heap.
If anybody knows why using a plain field value consumes less memory than substr, please leave a comment or answer.
