My Hadoop interview scenario-based query (solution can be in Hive/Pig/MapReduce) - hadoop

I have data in a file like below (comma-separated).
ID,Name,Sal
101,Ramesh,M,1000
102,Prasad,K,500
I want the output table to look like below:
101, Ramesh M, 1000
102, Prasad K, 500
i.e. the name and surname in a single column in the output.
In Hive, if I specify ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' it will not work. Do we need to write a SerDe?
The solution can be in MapReduce or Pig also.

Why don't you use the concat function? If you don't want to process the data and just want to query the raw data, think about creating a view on it:
select ID, concat(Name, ' ', Surname), Sal from table;
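For example, a minimal view sketch, assuming the raw file is loaded into a table named emp_raw with columns id, name, surname and sal (all names here are illustrative):
CREATE VIEW emp_view AS
SELECT id, concat(name, ' ', surname) AS name, sal
FROM emp_raw;
Querying emp_view then returns the three requested columns without materializing a second copy of the data.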

You can use the concat function.
First, create a table (e.g. table1) over the raw data with 4 columns delimited by commas:
ID, first_name, last_name, salary
Then concat first_name and last_name in a select query and store the result in another table using the CTAS (CREATE TABLE AS SELECT) feature:
CREATE TABLE EMP_TABLE AS SELECT ID, CONCAT(first_name, ' ', last_name) AS NAME, salary FROM table1;
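Putting both steps together, a minimal sketch (the LOCATION path is a placeholder, and the skip.header.line.count property is an assumption that the header line "ID,Name,Sal" is actually present in the file):
CREATE EXTERNAL TABLE table1 (
  id INT,
  first_name STRING,
  last_name STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/raw/data'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE TABLE emp_table AS
SELECT id, CONCAT(first_name, ' ', last_name) AS name, salary
FROM table1;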

Related

Combining CLOB columns in Query

I have a table with a CLOB column. What I need to do is query the table, and combine the CLOB column of each row into a single CLOB column.
So, say I have something like:
ABC CLOB_VALUE1
ABC CLOB_VALUE2
ABC CLOB_VALUE3
What I need as output is:
ABC Combined Value (CLOB_VALUE1, CLOB_VALUE2, CLOB_VALUE3)
LISTAGG will not work due to the length, and I'm not having any luck with XMLAGG (unless I am doing it wrong).
I tried this, but it is not retrieving all the records:
SELECT id,
       XMLAGG(XMLELEMENT(E, price_string || ',') ORDER BY price_date)
         .EXTRACT('//text()').getclobval() AS daily_7d_prices
FROM daily_price_coll
WHERE price_date >= TRUNC(SYSDATE) - 7
GROUP BY id;
I'm only getting the most recent row, when there are actually 3 rows in the table.
Any ideas?
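For reference, the usual XMLAGG-to-CLOB pattern over a two-column table like the example rows above is a sketch along these lines (t, id and clob_value are illustrative names; note that EXTRACT('//text()') XML-escapes characters such as & and <):
SELECT id,
       RTRIM(XMLAGG(XMLELEMENT(e, clob_value || ', ') ORDER BY id)
               .EXTRACT('//text()')
               .GETCLOBVAL(), ', ') AS combined_value
FROM t
GROUP BY id;
GETCLOBVAL() keeps the result as a CLOB, which avoids LISTAGG's VARCHAR2 length limit.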

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement that extracts the area code inside an INSERT INTO TABLE command, to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct, so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on the data, so I'm not sure why you want to create an empty table with partitions: they will be created automatically while new data is inserted.
Still, I think you can create the empty partitions like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL AS INT), STRLEFT(phone_number, 3) AS areacode FROM accounts;"
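For the stated goal of copying all the columns with dynamic partitioning, the SELECT has to return one expression per inserted column plus the partition column; the AnalysisException above is complaining that only 1 of the 5 was returned. A sketch of the full insert:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode PARTITION (areacode)
SELECT acct_num, first_name, last_name, phone_number, STRLEFT(phone_number, 3) AS areacode FROM accounts;"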

How to INSERT OVERWRITE a table in Hive with different WHERE clauses?

I want to read a .tsv file from HBase into Hive. The file has a column family with 3 columns inside: news, social and all. The aim is to store these columns in a table which has the columns news, social and all.
CREATE EXTERNAL TABLE IF NOT EXISTS topwords_logs (
  key STRING,
  columnfamily STRING,
  wort STRING,
  col STRING,
  occurance INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/hfu/Testdaten';

LOAD DATA LOCAL INPATH '/home/hfu/Testdaten/part-r-00000.tsv' INTO TABLE topwords_logs;

CREATE TABLE newtopwords (columnall INT, columnsocial INT, columnnews INT)
PARTITIONED BY (wort STRING) STORED AS SEQUENCEFILE;
Here I created an external table, which contains the data from HBase. Then I created a table with the 3 columns.
What I have tried so far is this:
insert overwrite table newtopwords partition(wort)
select occurance, '1', '1', wort from topwords_logs;
This code works fine, but I need a different WHERE clause for each column. How can I insert data like this?
insert overwrite table newtopwords partition(wort)
values(columnall,(select occurance from topwords_logs where col =' all')),(columnnews,( select occurance from topwords_logs where col =' news')) ,(columnsocial,( select occurance from topwords_logs where col =' social')),(wort,(select wort from topwords_log));
This code isn't working -> NoViableAltException.
In every example I see only code where data is inserted without a WHERE clause. How can I insert data with a WHERE clause?
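One way to collapse those per-column WHERE clauses into a single pass is conditional aggregation; a sketch, assuming at most one row per (wort, col) pair and that the literals match the actual values stored in col (the attempt above shows them with a leading space):
insert overwrite table newtopwords partition(wort)
select max(case when col = 'all' then occurance end) as columnall,
       max(case when col = 'social' then occurance end) as columnsocial,
       max(case when col = 'news' then occurance end) as columnnews,
       wort
from topwords_logs
group by wort;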

Excluding the partition field from select queries in Hive

Suppose I have a table definition as follows in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date as one of the columns on the table. Doing a select * from s.test also returns extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Set this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
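The same quoted-identifier regex also lets you exclude several columns at once; for instance, with the table from the question (col2 here standing in for any other column you want to drop):
set hive.support.quoted.identifiers=none;
SELECT `(extract_date|col2)?+.+` FROM s.test;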

Range Partitioning in Hive

Does Hive support range partitioning?
I mean, does Hive support something like the below:
insert overwrite table table2 PARTITION (employeeId BETWEEN 2001 and 3000)
select employeeName FROM emp10 where employeeId BETWEEN 2001 and 3000;
where both table2 and emp10 have two columns: employeeName and employeeId.
When I run the above query, I get the following error:
FAILED: ParseException line 1:56 mismatched input 'BETWEEN' expecting ) near 'employeeId' in destination specification
It is not possible. Here is a quote from the Hive documentation:
A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns
No, it's not possible. But you can use a separate calculated partition column, like:
insert overwrite table table2 PARTITION (employeeId_range)
select employeeName, employeeId, floor(employeeId/1000) as employeeId_range
from emp10 where employeeId BETWEEN 2000 and 2999;
which makes sure all of those values fall into the same partition.
When querying the table, since we already know how the range was calculated, we can run:
select employeeName, employeeId from table2 where employeeId_range = 2;
This way we can also parallelise queries over the given ranges.
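For completeness, a sketch of the DDL and session settings this approach assumes (floor() returns BIGINT, hence the partition column type; dynamic partitioning must be enabled for the insert above):
CREATE TABLE table2 (
  employeeName STRING,
  employeeId INT
)
PARTITIONED BY (employeeId_range BIGINT);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;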
Hope it helps.
