Redshift SqlActivity : How to reference input and output in script - amazon-data-pipeline

I have a Data Pipeline in which I'm using a Redshift SqlActivity that reads from one Redshift table and writes to another Redshift table.
I would like to know if it is possible to reference the input and output data nodes from the SqlActivity script, e.g.:
INSERT INTO #{output1} (field1, field2)
SELECT field1, SUM(field2)
FROM #{input1}
GROUP BY field1;
Thanks

Yes, you can. The input and output of the SqlActivity are the Redshift data nodes attached to it, and tableName is a field on those nodes, so the expressions resolve to their table names. Your query will look like:
INSERT INTO #{output.tableName} (field1, field2)
SELECT field1, SUM(field2)
FROM #{input.tableName}
GROUP BY field1;

Related

Using multiple select statements inside an insert statement in Hive

I'm new to Hive. I have two tables like this:
table1:
id;value
1;val1
2;val2
3;val3
table2
num;desc;refVal
1;desc;0
2;descd;0
3;desc;0
I want to create a new table3 that contains:
num;desc;refVal
1;desc;3
2;descd;3
3;desc;3
where num and desc are the columns from table2, and refVal is the max value of column id in table1.
Can someone guide me to solve this?
First, you have to create a table to hold the result, with the columns you need:
CREATE TABLE my_new_table (num INT, `desc` STRING, refVal INT);
After that, you have to insert into this table, as shown here:
INSERT INTO TABLE my_new_table
[PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1;
In select_statement1 you can use the same SELECT you would normally use to join the tables and pick the columns you need.
For more information, you can check here
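Putting the two steps together for the tables in the question, a minimal sketch (the cross join attaches the single MAX(id) row from table1 to every row of table2; `desc` is backticked because it can clash with a Hive keyword):
-- refVal is the max id from table1, repeated on every row of table2.
INSERT INTO TABLE my_new_table
SELECT t2.num,
       t2.`desc`,
       t1.refVal
FROM table2 t2
CROSS JOIN (SELECT MAX(id) AS refVal FROM table1) t1;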

Is there a way to prevent insertion of duplicate rows in Hive?

I have an ORC Table. I populate it using the data from some other table as follows:
INSERT INTO TABLE orc_table_name SELECT * FROM other_table_name
Is there any way I can prevent inserting of duplicate entries into the ORC Table?
You can use NOT IN. See the general example below: it inserts records into orc_table_name only when Value1 from TABLE_1 has not been inserted before.
INSERT INTO orc_table_name
(Value1, Value2)
SELECT t1.Value1,
t1.Value2
FROM TABLE_1 t1
WHERE t1.Value1 NOT IN (SELECT Value1 FROM orc_table_name)
Another option is to collapse duplicates with GROUP BY while inserting, aggregating the last column:
INSERT INTO orc_table_name (field1, field2, ..., fieldn)
SELECT field1, field2, ..., field(n-1), MIN(fieldn) AS fieldn
FROM other_table_name
GROUP BY field1, field2, ..., field(n-1);
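If the goal is simply to avoid exact duplicate rows, a common alternative (a sketch, assuming the ORC table can be rebuilt rather than appended to) is to rewrite it with DISTINCT:
-- Rebuilds the ORC table without exact duplicate rows.
INSERT OVERWRITE TABLE orc_table_name
SELECT DISTINCT * FROM other_table_name;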

My Hadoop interview scenario-based query - solution can be in HIVE/PIG/MapReduce

I have data in a file like below (comma-separated).
ID,Name,Sal
101,Ramesh,M,1000
102,Prasad,K,500
I want the output table to be like below
101, Ramesh M, 1000
102, Prasad K, 500
i.e. Name and Surname in a single column in the output.
In Hive, if I give ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' it will not work. Do we need to write a SerDe?
The solution can be in MR or PIG also.
Why don't you use the concat function? If you don't want to process the data and just want to query the raw data, think about creating a view on it:
select ID, concat(Name, ' ', Surname), Sal from table;
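A minimal sketch of such a view (the view name emp_view and the alias full_name are placeholders; the source table name follows the question and is backticked because it collides with a keyword):
-- Exposes the concatenated name without rewriting the raw data.
CREATE VIEW emp_view AS
SELECT ID,
       concat(Name, ' ', Surname) AS full_name,
       Sal
FROM `table`;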
You can use the concat function.
First, you can create a table (i.e. table1) over the raw data, with 4 columns delimited by commas:
ID, first_name, last_name, salary
Then concat first_name and last_name in a SELECT query and store the result in another table using the CTAS (CREATE TABLE AS SELECT) feature:
CREATE TABLE EMP_TABLE AS SELECT ID, CONCAT(first_name, ' ', last_name) AS NAME, salary FROM table1;
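For completeness, a sketch of the raw-data table described above (the column names and file path are placeholders):
-- Raw table: 4 comma-delimited columns matching the input file.
CREATE TABLE table1 (
  ID INT,
  first_name STRING,
  last_name STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw file (path is hypothetical).
LOAD DATA LOCAL INPATH '/path/to/emp.csv' INTO TABLE table1;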

how to manage Date interval in hive

I'm new to Hive/Hadoop. I have some problems with date interval management.
In PostgreSQL, I can get the date "6 days" before a given date:
select max(datejour) + INTERVAL '-6 day' as maxdate from table
e.g.: if max(datejour) = 2015-08-22, my query returns 2015-08-15.
Can somebody help me with how I could do it in Hive?
Thanks.
You can use Hive INTERVAL to achieve this.
select (max(datejour) - INTERVAL '6' DAY) as maxdate from table
Above query should return 2015-08-15
You can find more details -
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
You can use a Hive built-in date function to achieve this:
select date_sub('2015-08-22', 6) from table
Above query should return 2015-08-15
You can find more Hive built-in function here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
Hope this helps.
You can use the DATE_SUB function to get your requirement.
The query may look like this (in your case):
select DATE_SUB(from_unixtime(unix_timestamp(cast(MAX(t1.max_date) AS string), 'yyyy-MM-dd'), 'yyyy-MM-dd'), 6)
from (select MAX(datejour) as max_date from table) t1
group by t1.max_date;
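If datejour is already stored as a DATE or as a 'yyyy-MM-dd' string, the casting detour above may be unnecessary; a shorter sketch (the table name from the question is backticked because it collides with a keyword):
-- date_sub accepts a DATE or a 'yyyy-MM-dd' string directly.
SELECT date_sub(MAX(datejour), 6) AS maxdate
FROM `table`;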
Since updating records with the UPDATE command is not possible in Hive, and adding a column through ALTER TABLE is not recommended (you would still have to populate its values from the same table), create a second table with the extra column and repopulate it. Starting from the existing table:
create external table table1 (
  fields1 string,
  field2 string
);
create the new table with the extra h01 column:
create external table table2 (
  fields1 string,
  field2 string,
  h01 string
);
and repopulate it from the original:
Insert overwrite table table2
select
  fields1,
  field2,
  case when fields1 = '' then 'OK' else 'KO' end as h01
from table1
where your_condition;
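A quick sanity check on the rebuilt table (a sketch):
-- Row counts per flag value after the rewrite.
select h01, count(*) as cnt
from table2
group by h01;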

Hive multiple insert goes wrong with the DISTINCT select statement

I read this code from "Hadoop the Definitive Guide":
FROM (
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
but in my test, using several DISTINCTs does not give the right results.
My HiveQL is as below:
CREATE TABLE IF NOT EXISTS a (logindate int, id int);
then load a local file into this table...
CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Then, if inserting into each partition separately:
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;
the results are correct;
but if choosing the next multiple insert hql:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;
the results are not correct: both partitions have the same number of records, as if it had selected DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121.
So is it a bug, or did I write some wrong syntax?
DISTINCT has a bit of an odd history in the code as an alias for GROUP BY.
If there is a bug, then the version of Hive you are using would be important to know, since bugs are addressed in each release.
This might work:
FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120 GROUP BY id
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121 GROUP BY id;
If that doesn't work, this will definitely work... even though it isn't the approach you were attempting to use:
FROM (select distinct id, logindate from a where logindate in ('20130120','20130121')) subq_a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121;
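To verify that both partitions were filled as expected, a quick check (a sketch) could be:
-- Row count per partition after the multi-insert.
SELECT logindate, COUNT(*) AS cnt
FROM user
GROUP BY logindate;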
