Group By in Hive on partitioned table gives duplicate result rows - hadoop

Using release 0.11.0. I get incorrect results when trying to execute this query
select t1.symbol, max(t1.maxts - t1.orderts) as diff from
(select catid, symbol, max(cast(timestamp as double)*1000) as maxts, min(cast(timestamp as double)*1000) as orderts, count(*) as cnt
from cat where recordtype in (0,1) and customerid=srcrepid group by symbol, catid) t1
where t1.cnt > 1
group by t1.symbol;
As you can see, there is a subquery with a GROUP BY. This subquery calculates the maximum and minimum timestamp value per CATID and SYMBOL.
Now, I have 24 symbols. In the outer query, I want to find the max difference per SYMBOL and so I group by SYMBOL.
The problem is that this returns 864 result rows right now. Hive seems to fail to reduce the last result into what I would expect to see.
Is this a bug? Can anyone reproduce this? I have 6 nodes running with 4 symbols per node.
Table used:
create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
Edited: Edited because the query was inconsistent with the actual table used, making it hard to provide any help...

As explained by Yin on the Hive mailing list, this is related to a known Hive bug.
When Hive uses only a single MapReduce job, both partitioning columns are used, whereas my outer query should group by symbol only.
Evidently this bug has been fixed in trunk.
And here's another bug report that demonstrates the problem more clearly

I think it might work if, in the outer query, you structure it this way:
SELECT t1.symbol, max(t1.maxts) - min(t1.orderts) AS diff, ....

I have seen that introducing an ORDER BY clause after the first GROUP BY forces Hive into two MR jobs and thereby gives the correct results.
As requested, here is the modified query as an example.
select t1.symbol, max(t1.maxts - t1.orderts) as diff from
(select catid, symbol, max(cast(timestamp as double)*1000) as maxts, min(cast(timestamp as double)*1000) as orderts, count(*) as cnt
from cat where recordtype in (0,1) and customerid=srcrepid group by symbol, catid ORDER BY symbol, catid) t1
where t1.cnt > 1
group by t1.symbol;
This is still only a workaround. The real problem is that Hive uses the wrong partitioning fields for this query: it should use only symbol, but the EXPLAIN output shows that it uses both symbol and catid, which produces multiple result rows per symbol.
Adding the ORDER BY forces Hive to run the second GROUP BY in a separate MR job, which gives the right results.
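To make the expected result concrete, here is a minimal Python sketch of the two aggregation stages the query describes. It is not Hive code, and the sample rows and the function name max_diff_per_symbol are made up for illustration. It only shows that grouping the inner result by symbol alone should yield exactly one row per symbol.

```python
# Hypothetical sample rows: (symbol, catid, timestamp_ms)
rows = [
    ("AAA", 1, 1000.0),
    ("AAA", 1, 1500.0),
    ("AAA", 2, 2000.0),
    ("AAA", 2, 2900.0),
    ("BBB", 3, 100.0),
    ("BBB", 3, 350.0),
    ("BBB", 4, 700.0),  # single record for catid 4, so cnt = 1
]

def max_diff_per_symbol(rows):
    # Stage 1: inner GROUP BY symbol, catid -> [max ts, min ts, count]
    inner = {}
    for symbol, catid, ts in rows:
        key = (symbol, catid)
        if key not in inner:
            inner[key] = [ts, ts, 1]
        else:
            agg = inner[key]
            agg[0] = max(agg[0], ts)
            agg[1] = min(agg[1], ts)
            agg[2] += 1
    # Stage 2: outer GROUP BY symbol only, keeping inner groups with cnt > 1
    outer = {}
    for (symbol, _catid), (maxts, mints, cnt) in inner.items():
        if cnt > 1:
            diff = maxts - mints
            outer[symbol] = max(outer.get(symbol, diff), diff)
    return outer

result = max_diff_per_symbol(rows)
print(result)
```

If the second stage were keyed on (symbol, catid) instead of symbol, the output would contain one row per inner group, which matches the duplicated rows described in the question.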

Related

Can I use FOR ALL ENTRIES with GROUP BY?

Currently the code looks something like this:
LOOP AT lt_orders ASSIGNING <fs_order>.
SELECT COUNT(*) AS cnt
FROM order_items
INTO <fs_order>-cnt
WHERE order_id = <fs_order>-order_id.
ENDLOOP.
It is the slowest part of the report. I want to speed it up.
How can I use FOR ALL ENTRIES with GROUP BY?
Check the documentation. You can't use GROUP BY. Maybe in this case, you could try selecting your items with FAE outside of the loop, then count them using a parallel cursor:
REPORT.
TYPES: BEGIN OF ty_result,
vbeln TYPE vbeln,
cnt TYPE i.
TYPES: END OF ty_result.
DATA: lt_headers TYPE SORTED TABLE OF ty_result WITH UNIQUE KEY vbeln,
lv_tabix TYPE sy-tabix VALUE 1.
"get the headers
SELECT vbeln FROM vbak UP TO 100 ROWS INTO CORRESPONDING FIELDS OF TABLE lt_headers.
"get corresponding items
SELECT vbeln, posnr FROM vbap FOR ALL ENTRIES IN @lt_headers
WHERE vbeln EQ @lt_headers-vbeln
ORDER BY vbeln, posnr
INTO TABLE @DATA(lt_items).
LOOP AT lt_headers ASSIGNING FIELD-SYMBOL(<h>).
LOOP AT lt_items FROM lv_tabix ASSIGNING FIELD-SYMBOL(<i>).
IF <i>-vbeln NE <h>-vbeln.
lv_tabix = sy-tabix.
EXIT.
ELSE.
<h>-cnt = <h>-cnt + 1.
ENDIF.
ENDLOOP.
ENDLOOP.
BREAK-POINT.
Or join header/item with a distinct count on the item id (whichever column that would be in your table).
You should be able to do something like
SELECT COUNT(order_item_id) AS cnt, order_id
FROM order_items
INTO CORRESPONDING FIELDS OF TABLE lt_count
GROUP BY order_id.
Assuming that order_item_id is a key in the order_items table. And assuming that lt_count has two fields: cnt of type int8 and order_id of same type as your other order_id fields
PS: then you can loop over lt_count and move the counts to lt_orders. Or the other way around. To speed up the loop, sort one of the tables and use READ ... BINARY SEARCH
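The idea of counting once with a grouped query and then moving the counts to the orders can be sketched in Python as follows. This is only an illustration of the approach, not ABAP. The sample data and variable names are hypothetical, and a dictionary lookup stands in for the sorted table with READ ... BINARY SEARCH.

```python
from collections import Counter

# Hypothetical data: order_items as (order_item_id, order_id) pairs
order_items = [(1, "A"), (2, "A"), (3, "B"), (4, "C"), (5, "C"), (6, "C")]
# Hypothetical orders table; each entry will receive its item count
orders = [{"order_id": "A"}, {"order_id": "C"}, {"order_id": "D"}]

# One pass over the items (the GROUP BY) instead of one query per order
counts = Counter(order_id for _item_id, order_id in order_items)

# Move the counts to the orders; orders without items get 0
for order in orders:
    order["cnt"] = counts.get(order["order_id"], 0)

print(orders)
```

The per-order work becomes a cheap lookup, so the cost is dominated by the single grouped read.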
I did it with table KNB1 (customer master in company code), where we have customers that are created in several company codes.
Please note, because of FOR ALL ENTRIES you have to SELECT the full key.
TYPES: BEGIN OF ty_knb1,
kunnr TYPE knb1-kunnr,
count TYPE i,
END OF ty_knb1.
TYPES: BEGIN OF ty_knb1_fae,
kunnr TYPE knb1-kunnr,
END OF ty_knb1_fae.
DATA: lt_knb1_fae TYPE STANDARD TABLE OF ty_knb1_fae.
DATA: lt_knb1 TYPE HASHED TABLE OF ty_knb1
WITH UNIQUE KEY kunnr.
DATA: ls_knb1 TYPE ty_knb1.
DATA: ls_knb1_db TYPE knb1.
START-OF-SELECTION.
lt_knb1_fae = VALUE #( ( kunnr = ... ) ). "add at least one customer which is created in several company codes
ls_knb1-count = 1.
SELECT kunnr bukrs
INTO CORRESPONDING FIELDS OF ls_knb1_db
FROM knb1
FOR ALL ENTRIES IN lt_knb1_fae
WHERE kunnr EQ lt_knb1_fae-kunnr.
ls_knb1-kunnr = ls_knb1_db-kunnr.
COLLECT ls_knb1 INTO lt_knb1.
ENDSELECT.
Create a range table for your lt_orders, like lt_orders_range.
Then do a SELECT order_id, COUNT( * ) ... WHERE order_id IN lt_orders_range GROUP BY order_id.
Even if creating a range table seems like extra work, you will save a lot of time by running just one SELECT for all orders instead of a single SELECT for each order ID.
Not directly, only through a CDS view
While all of the answers provide a faster solution than the one in the question, the fastest way is not mentioned.
If you have at least Netweaver 7.4, EHP 5 (and you should, it was released in 2014), you can use CDS views, even if you are not on HANA.
It still cannot be done directly, as OpenSQL does not allow FOR ALL ENTRIES with GROUP BY, and CDS views cannot handle FOR ALL ENTRIES. However, you can create one of each.
CDS:
@AbapCatalog.sqlViewName: 'zorder_i_fae'
DEFINE VIEW zorder_items_fae AS SELECT FROM order_items {
order_id,
count( * ) AS cnt
}
GROUP BY order_id
OpenSQL:
SELECT *
FROM zorder_items_fae
INTO TABLE @DATA(lt_order_cnt)
FOR ALL ENTRIES IN @lt_orders
WHERE order_id = @lt_orders-order_id.
Speed
If lt_orders contains more than about 30% of all possible order_id values from table ORDER_ITEMS, the answer from iPirat is faster. (While using more memory, obviously)
However, if you need only a couple hundred order_id values out of millions, this solution is about 10 times faster than any other answer, and 100 times faster than the original.

How to order by field mixed with string and number in Oracle?

These are the field (crane_no) values to be sorted
QC11
QC10
QC9
I tried the following query:
select * from table order by crane_no DESC
but the results do not come back in the expected order because the field mixes a string and a number (example: QC12).
I get following results for above query:
QC9
QC11
QC10
I want the results to be in order (QC9, QC10, QC11). Thanks
If the data isn't huge, I'd use a regex order by clause:
select
crane_no
from your_table
order by
regexp_substr(crane_no, '^\D*') nulls first,
to_number(regexp_substr(crane_no, '\d+'))
This looks for the numbers in the string, so rows like 'QCC20', 'DCDS90' are ordered properly; it also takes care of nulls.
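The same split-then-sort idea can be tried outside the database. The following is a minimal Python sketch, assuming values consist of a non-digit prefix followed by a number. The function name natural_key and the sample list are made up for illustration.

```python
import re

def natural_key(value):
    # Split into a non-digit prefix and the run of digits that follows it
    match = re.match(r"^(\D*)(\d*)", value)
    prefix, digits = match.group(1), match.group(2)
    # Assumption: values without digits sort first within the same prefix
    return (prefix, int(digits) if digits else -1)

cranes = ["QC11", "QC10", "QC9", "QCC20", "DCDS90"]
ordered = sorted(cranes, key=natural_key)
print(ordered)
```

Sorting on the tuple (prefix, number) compares the text part first and the numeric part as an integer, so QC9 comes before QC10.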
One approach is to extract the numeric portion of the crane_no column using SUBSTR(), cast it to an integer, and order descending by this value.
SELECT *
FROM yourTable
ORDER BY CAST(SUBSTR(crane_no, 3) AS INT) DESC
Note that I assume in my answer that every entry in crane_no is prefixed with the fixed width QC. If not, then we would have to do more work to identify the numerical component.
select ...
order by to_number( substr( crane_no,3 )) desc

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
delete from Accidents.CleanedFilledCombined
where Fixed_Accident_Index
in(select Fixed_Accident_Index from Accidents.CleanedFilledCombined
group by Fixed_Accident_Index
having count(Fixed_Accident_Index) >1);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when having too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
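The keep-one-row-per-key logic behind the ARRAY_AGG queries above can be illustrated with a short Python sketch. It is not BigQuery code. The rows, the field names, and the function dedupe_keep_newest are hypothetical, and it mirrors the variant that keeps the newest row per id.

```python
# Hypothetical rows with a duplicated id; created_at is an ISO date string
rows = [
    {"id": 1, "created_at": "2017-06-01", "payload": "old"},
    {"id": 1, "created_at": "2017-06-03", "payload": "new"},
    {"id": 2, "created_at": "2017-06-02", "payload": "only"},
]

def dedupe_keep_newest(rows, key="id", order_by="created_at"):
    # Keep exactly one row per key: the one with the largest order_by value
    newest = {}
    for row in rows:
        current = newest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            newest[row[key]] = row
    return list(newest.values())

deduped = dedupe_keep_newest(rows)
print(deduped)
```

Only one candidate row per key is held at any time, which is roughly why the aggregate form is said to scale better than numbering every row.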
Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn’t have any records - below variation of Jordan’s answer will work well enough with writing over same table or new one, etc.
SELECT <list of original fields>
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos,
FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In the more generic case, with a complex schema that has records or nested fields, the above approach can be a challenge.
I would propose to try using Tabledata: insertAll API with rows[].insertId set to respective Fixed_Accident_Index for each row.
In this case duplicate rows will be eliminated by BigQuery
Of course, this will involve some client side coding - so might be not relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table and only have duplicates in a certain partition range, you don't want to overscan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partitioned table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY Fixed_Accident_Index)
row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY needs a WHERE, GROUP BY, or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this.
I don't recommend the ROW_NUMBER() OVER() approach, since you may run into BigQuery memory limits and get unexpected errors.
Update the BigQuery schema with a new column named bq_uuid, making it NULLABLE with type STRING.

Create duplicate rows, for example by running the same command 5 times:
insert into `beginner-290513.917834811114.messages` (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check whether duplicate entries exist:
select * from `beginner-290513.917834811114.messages` where id = 19999
Use the GENERATE_UUID function to generate a UUID for each message:

UPDATE `beginner-290513.917834811114.messages`
SET bq_uuid = GENERATE_UUID()
where id>0
Clean duplicate entries
DELETE FROM `beginner-290513.917834811114.messages`
WHERE bq_uuid IN
(SELECT bq_uuid
FROM
(SELECT bq_uuid,
ROW_NUMBER() OVER( PARTITION BY updated_at
ORDER BY bq_uuid ) AS row_num
FROM `beginner-290513.917834811114.messages` ) t
WHERE t.row_num > 1 );

HIVE equivalent of FIRST and LAST

I have a table with 5 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1;
set table1;
BY ID CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependant on the order of rows in table1 in SAS, we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a BY statement (BY ID CODE;), which tells SAS that the SET dataset is sorted at those levels. So FIRST. and LAST. are not a random selection.
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code)lcode
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then fetch the rows from the table where op_flag = 1.
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code)fcode,
last_value(code) over (partition by id order by code)lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1), BY group processing requires the input data to be sorted or indexed on the BY variables, so although the code contains no ordering, the source data is processed in order. If the input data is not indexed or sorted, SAS will throw an error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However the random aspect in SAS code/storage is based on physical order of rows, while the ROW_NUMBERs randomness within a group will be influenced by the implementation of the function in the engine.
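To check what the SAS step should return for a small input, the FIRST./LAST. logic can be emulated in Python. This is only a sketch with hypothetical sample rows. It assumes that rows sharing the same ID and CODE keep their input order, since sorted() is stable.

```python
from itertools import groupby

# Hypothetical rows: (ID, CODE, RESULT)
table1 = [
    (1, "X", "A"), (1, "X", "B"),
    (1, "Y", "B"), (1, "Y", "A"),
    (2, "X", "B"), (2, "X", "C"),
    (2, "Z", "A"),
]

def first_last_filter(rows):
    # Stable sort by the BY variables keeps input order within each group
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    output = []
    for _key, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        members = list(group)
        last_index = len(members) - 1
        for i, row in enumerate(members):
            is_first, is_last = i == 0, i == last_index
            # Same IF / ELSE IF chain as the SAS data step
            if is_first and row[2] == "A":
                output.append(row)
            elif is_last and row[2] != "A":
                output.append(row)
    return output

selected = first_last_filter(table1)
print(selected)
```

With this sample, the group (1, X) contributes both its first and its last row, while the group (1, Y) contributes none, so a group can yield zero, one, or two rows.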

Unable to get only first occurrence of each job

I am trying to query some jobs from a repo, however I only need the job with the latest start time. I have tried using ROW_NUMBER for this and select only row number 1 for each job, however it doesn't seem to fall through:
SELECT a.jobname||','||a.projectname||','||a.startdate||','||a.enddate||','||
ROW_NUMBER() OVER ( PARTITION BY a.jobname ORDER BY a.startdate DESC ) AS "rowID"
FROM taskhistory a
WHERE a.jobname IS NOT NULL AND a.startdate >= (SYSDATE-1))LIMIT 1 AND rowID = 1;
ERROR at line 7:
ORA-00932: inconsistent datatypes: expected ROWID got NUMBER
Can I please ask for some assistance?
You have aliased your concatenated string "rowID", which is a mistake because it clashes with the Oracle keyword rowid. This is a special datatype that allows us to identify table rows by their physical location.
When you referenced the column alias, you omitted the double quotes. Oracle therefore interprets it as the keyword, rowid, and expects an expression which can be converted to the ROWID datatype.
Double-quoted identifiers are always a bad idea. Avoid them unless truly necessary.
Fixing the column alias will reveal the logic bug in your code. You are concatenating a whole slew of columns together, including the ROW_NUMBER() function, and calling that string "rowID". Clearly that string is never going to equal one, so this will filter out all rows:
and "rowID" = 1
Also LIMIT is not valid in Oracle.
What you need to do is use a sub-query, like this
SELECT a.jobname||','
||a.projectname||','
||a.startdate||','
||a.enddate||','
||to_char(a.rn) as "rowID"
FROM (
SELECT jobname
, projectname
, startdate
, enddate
, ROW_NUMBER() OVER ( PARTITION BY jobname
ORDER BY startdate DESC ) AS RN
FROM taskhistory
WHERE jobname IS NOT NULL
AND startdate >= (SYSDATE-1)
) a
where a.RN = 1;
Concatenating the projection like that seems an odd thing to do but I don't understand your business requirements.

Resources