How to get latest inserted data in hive - hadoop

I have a table defined as:
create table names
(name string,insert_time timestamp)
row format delimited
fields terminated by ','
stored as textfile;
select * from names;
OK
abc 2017-05-06 10:11:30
abc 2017-05-07 11:15:40
pqr 2017-05-06 12:11:10
I want to fetch only the latest inserted row for each name.
The output should be as follows:
abc 2017-05-07 11:15:40
pqr 2017-05-06 12:11:10
Please guide me on how to get this.

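Use ORDER BY and LIMIT:
SELECT * FROM names ORDER BY insert_time DESC LIMIT 2;
ORDER BY ... DESC sorts the records by timestamp in decreasing order, and LIMIT n returns only the first n records. Note that this returns the n most recent rows overall, not the latest row per name; it matches the desired output here only because the two most recent rows happen to belong to different names.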
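If the goal is the latest row for each name regardless of how many names exist, a per-name query is a safer fit. The following is a minimal sketch against the names table from the question; the aliases t and rn are only illustrative.

```sql
-- Number the rows of each name from newest to oldest and keep the newest one.
SELECT name, insert_time
FROM (
  SELECT name,
         insert_time,
         row_number() OVER (PARTITION BY name ORDER BY insert_time DESC) AS rn
  FROM names
) t
WHERE rn = 1;

-- Because the table has only these two columns, a plain aggregate
-- returns the same rows.
SELECT name, max(insert_time) AS insert_time
FROM names
GROUP BY name;
```

The row_number() form is the one to extend if the table later gains more columns that should come along with the latest row.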

Related

Combining CLOB columns in Query

I have a table with a CLOB column. What I need to do is query the table, and combine the CLOB column of each row into a single CLOB column.
So, say I have something like:
ABC CLOB_VALUE1
ABC CLOB_VALUE2
ABC CLOB_VALUE3
What I need at output is:
ABC Combined Value (CLOB_VALUE1, CLOB_VALUE2, CLOB_VALUE3)
LISTAGG will not work due to the length, and I'm not having any luck with XMLAGG (unless I am doing it wrong).
I tried this, but it is not retrieving all the records:
SELECT id,
       XMLAGG(XMLELEMENT(E, price_string || ',') ORDER BY price_date).EXTRACT('//text()').getclobval() AS daily_7d_prices
FROM daily_price_coll
WHERE price_date >= TRUNC(SYSDATE) - 7
GROUP BY id;
I'm only getting the most recent row, when there are actually 3 rows in the table.
Any ideas?

How to retrieve workflow attribute values from workflow table?

I need to retrieve values from one column of a table based on the value of another column in the same table.
Two such values are required, and they need to be compared with columns in another table.
Scenario:
Column 1 query:
SELECT text_value
FROM WF_ITEM_ATTRIBUTE_VALUES
WHERE name LIKE 'ORDER_ID' --AND number_value IS NOT NULL
AND Item_type LIKE 'ABC'
This query returns 14 unique records.
Column 2 query:
SELECT number_value
FROM WF_ITEM_ATTRIBUTE_VALUES
WHERE name LIKE 'Source_ID' --AND number_value IS NOT NULL
AND Item_type LIKE 'ABC'
This also returns 14 records.
The order_id from the column 1 query is associated with the source_id from the column 2 query. Using these two values, I want to compare the 14 combined (order_id, source_id) records with the columns sal_order_id and sal_source_id of another table, Sales_tbl.
Sample Data from WF_ITEM_ATTRIBUTE_VALUES:
Note: sales_tbl holds the same data, but its columns are named sal_order_id and sal_source_id.
Order_id
204994 205000 205348 198517 198176 196856 204225 205348 203510 206528 196886 198971 194076 197940
Source_id
92262138 92261783 92262005 92262615 92374992 92375051 92374948 92375000 92375011 92336793 92374960 92691360 92695445 92695880
Desired O/p based on comparison:
Please help me write the query.

How to delete duplicate records from Hive table?

I am trying to learn about deleting duplicate records from a Hive table.
My Hive table: 'dynpart' with columns: Id, Name, Technology
Id Name Technology
1 Abcd Hadoop
2 Efgh Java
3 Ijkl MainFrames
2 Efgh Java
We have options like 'Distinct' to use in a select query, but a select query just retrieves data from the table. Could anyone tell me how to use a delete query to remove the duplicate rows from a Hive table?
I understand that deleting or updating records is not the recommended or standard practice in Hive, but I want to learn how it is done.
You can use an INSERT OVERWRITE statement to rewrite the table with only its distinct rows:
insert overwrite table dynpart select distinct * from dynpart;
Your table may instead have rows that are duplicated only on selected columns. Suppose you have a table like the one below:
id Name Technology
1 Abcd Hadoop
2 Efgh Java --> Duplicate
3 Ijkl Mainframe
2 Efgh Python --> Duplicate
Here the Id and Name columns have duplicate values.
You can use an analytic function to find the duplicate rows (ordering by Technology within each partition makes the numbering deterministic):
select * from
(select Id, Name, Technology,
row_number() over (partition by Id, Name order by Technology) as row_num
from yourtable) tab
where row_num > 1;
This gives the following output:
id Name Technology row_num
2 Efgh Python 2
To get every row of each duplicate group, count the rows over the same partition (no ORDER BY is needed for a count):
select * from
(select Id, Name, Technology,
count(*) over (partition by Id, Name) as duplicate_count
from yourtable) tab
where duplicate_count > 1;
Output:
id Name Technology duplicate_count
2 Efgh Java 2
2 Efgh Python 2
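The queries above only report the duplicates. To actually remove them, the same row_number() idea can be combined with INSERT OVERWRITE. This is a rough sketch, assuming yourtable is a plain (non-partitioned) table and that keeping the row with the alphabetically first Technology is acceptable; that choice of which row to keep is an assumption, not something stated in the question.

```sql
-- Rewrite the table so that only one row per (Id, Name) remains.
-- The ORDER BY decides which row of each group is kept (row_num = 1).
INSERT OVERWRITE TABLE yourtable
SELECT Id, Name, Technology
FROM (
  SELECT Id, Name, Technology,
         row_number() OVER (PARTITION BY Id, Name ORDER BY Technology) AS row_num
  FROM yourtable
) tab
WHERE row_num = 1;
```

With the sample data, this would keep the (2, Efgh, Java) row and drop (2, Efgh, Python). It may be worth running the inner SELECT on its own first to check which rows would be kept.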
Alternatively, you can insert the distinct records into another table:
create table temp as select distinct * from dynpart;

My Hadoop interview scenario based query -solution can be in HIVE/PIG/MapReduce

I have comma-separated data in a file like the one below.
ID,Name,Sal
101,Ramesh,M,1000
102,Prasad,K,500
I want the output table to be like below
101, Ramesh M, 1000
102, Prasad K, 500
i.e. the name and surname should be in a single column in the output.
In Hive, if I specify row format delimited fields terminated by ',', it will not work. Do we need to write a SerDe?
The solution can also be in MapReduce or Pig.
Why not use the concat function? If you don't want to process the data and only want to query the raw data, consider creating a view on it:
select ID, concat(Name, ' ', Surname), Sal from yourtable;
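The view mentioned above could look something like the following sketch. The names raw_emp, emp_view and full_name are hypothetical; raw_emp stands for whatever table holds the raw four-column data.

```sql
-- raw_emp is a hypothetical table over the raw file with columns
-- ID, Name, Surname and Sal.
CREATE VIEW emp_view AS
SELECT ID,
       concat(Name, ' ', Surname) AS full_name,
       Sal
FROM raw_emp;

-- The view is then queried like a three-column table.
SELECT * FROM emp_view;
```

A view stores no data of its own, so the concatenation is evaluated each time the view is queried.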
You can use the concat function.
First, create a table (e.g. table1) over the raw data with four comma-delimited columns:
ID, first_name, last_name, salary
Then concatenate first_name and last_name in a SELECT query and store the results in another table using CTAS (CREATE TABLE AS SELECT):
CREATE TABLE EMP_TABLE AS SELECT ID, CONCAT(first_name, ' ', last_name) AS NAME, salary FROM table1;
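The first step might be sketched as below. The column types are assumptions, since the question does not state them. The skip.header.line.count property is included on the assumption that the file really starts with the ID,Name,Sal header line shown in the question.

```sql
-- Staging table over the raw comma-separated file: one column per field.
-- Column types are guesses based on the sample rows.
CREATE TABLE table1 (
  ID         INT,
  first_name STRING,
  last_name  STRING,
  salary     INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Ignore the header line, assuming the file contains one.
TBLPROPERTIES ('skip.header.line.count'='1');
```

Because every data row has four comma-separated fields, the plain delimited row format is enough here and no custom SerDe is needed.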

Excluding the partition field from select queries in Hive

Suppose I have the following table definition in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date listed as one of the table's columns. Running select * from s.test also returns the extract_date values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Set this property so that backquoted names in the SELECT list are interpreted as regular expressions:
set hive.support.quoted.identifiers=none;
and then run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested this and it works.
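Applied to the table from the question, this might look like the sketch below. The partition value '2017-05-06' is a made-up example, and the second query, which also excludes col1, is only an illustration of listing more than one column in the pattern.

```sql
-- Let backquoted names in the SELECT list be treated as regular expressions.
SET hive.support.quoted.identifiers=none;

-- All columns except extract_date. The partition column can still be used
-- as a filter; the date value is a hypothetical example.
SELECT `(extract_date)?+.+`
FROM s.test
WHERE extract_date = '2017-05-06';

-- Exclude more than one column by listing alternatives inside the group.
SELECT `(col1|extract_date)?+.+`
FROM s.test;

-- Restore the default so that backticks quote identifiers literally again.
SET hive.support.quoted.identifiers=column;
```

Resetting the property at the end matters mainly when later statements in the same session rely on backticks to quote ordinary column names.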
