Hive select columns to do a case statement on - hadoop

This will export the data from DynamoDB dynamically to S3.
-- Load S3 Table with data from DynamoDB
INSERT OVERWRITE TABLE s3_table SELECT * FROM dynamodb_table;
The problem is that it leaves in a bunch of \N values. If I write it by hand, it will look something like:
-- Load S3 Table with data from DynamoDB
INSERT OVERWRITE TABLE s3_table
SELECT DCS_ID,
CASE WHEN MAKE IS NULL THEN "" ELSE MAKE END,
CASE WHEN MODEL IS NULL THEN "" ELSE MODEL END
FROM dynamodb_table;
The problem is selecting the columns so as to say "CASE WHEN column IS NULL THEN "" ELSE column END" for each one.
The current output looks like this
PORTAL 1.5.1.25.2 2013-08-09 13:45:20.126 2013-08-09 13:45:20.282 \N \N \N \N \N \N
The desired output looks like this
PORTAL 1.5.1.25.2 2013-08-13 18:18:24.667 2013-08-13 18:18:24.832

The Hive output contains the string "\N" for null values (to distinguish them from blanks), so you either have to prepare each column or process the output afterwards (a streaming job would work for large amounts of data).
I often use the coalesce function for this: coalesce takes multiple arguments and returns the first non-null one (or null if all are null). In your example, to avoid the nulls in the output, you could do the following:
INSERT OVERWRITE TABLE s3_table
SELECT coalesce(DCS_ID,''), coalesce(MAKE,''), coalesce(MODEL,'')
FROM dynamodb_table;
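Depending on the table's storage format, another option may be to change how NULLs are serialized rather than wrapping every column. For a plain text table using the default LazySimpleSerDe, the serialization.null.format property controls the string written for NULL; this is a sketch to verify against your Hive version and the table's SerDe:
-- Write NULLs as empty strings instead of \N (text tables with LazySimpleSerDe)
ALTER TABLE s3_table SET SERDEPROPERTIES ('serialization.null.format' = '');
-- After that, the simple dynamic insert no longer emits \N
INSERT OVERWRITE TABLE s3_table SELECT * FROM dynamodb_table;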

Related

How to replace NULL values in one column to 0 (of a very large table) without creating a new column of the desired results added to the table in HIVE?

I am trying to replace all of the NULL values with 0 in a column of a big table in HIVE.
However, every time I try some code I end up generating a new column in the result. The column I am trying to modify still exists and still has the NULL values, while the automatically generated column (e.g. _c1) is what I want the column I am trying to modify to look like.
I tried to run a COALESCE but that also ended up generating a new column. I also tried to implement a CASE WHEN, but the same results ensued.
Select *,
CASE WHEN columnname IS NULL THEN 0
ELSE columnname
END
from tablename;
Also tried
SELECT coalesce(columnname, CAST(0 AS BIGINT)) FROM tablename
I would just like to update the table so that the other columns stay as they are, and the column I want to modify keeps its original name but has 0's in place of the NULL values.
I don't want to generate a new column but modify an existing one.
How should I do that?
Use the insert overwrite option:
insert overwrite table tablename
select c1,c2,...,coalesce(columnname,0) as columnname
from tablename
Note that you have to specify all the other column names required in select.
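For example, with a hypothetical table sales(id, name, amount) where amount is the column holding the NULLs:
-- Rewrite the table in place; amount keeps its original name, NULLs become 0
insert overwrite table sales
select id, name, coalesce(amount, 0) as amount
from sales;
The rewritten column keeps its original name; only the NULL values are replaced.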

Replace data of one column with a substring of another column in SQL*Loader

I am loading data from a CSV file into a table using sqlldr. There is one column which is not present in every row of the CSV file. The data needed to populate this column is present in one of the other columns of the row: I need to split that column's data on '.' and use one of the parts to populate the column.
For example:
column1: abc.xyz.n
So the unknown column (column2) should be:
column2: xyz
Also, there is another column which is present in the row but is not what I want to insert into the table. It also needs to be populated from column1, but there are around 50 if-else cases. Is DECODE preferable for this?
column1: abc.xyz.n
Then:
column2: 'hi' if column1 contains 'abc'
column2: 'hello' if column1 contains 'abd'
...and so on, for around 50 if-else cases.
Thanks for the help.
For the first part of your question, define the column1 data in the control file as a BOUNDFILLER with a name that does not match any table column name; this tells sqlldr to remember the value but not load it directly. If you need to load it into a column as well, reference the remembered name in an expression for that column. For column2, use the remembered BOUNDFILLER name in an expression that returns the part you need (in this case the 2nd field, allowing for NULLs):
x boundfiller,
column1 EXPRESSION ":x",
column2 EXPRESSION "REGEXP_SUBSTR(:x, '(.*?)(\\.|$)', 1, 2, NULL, 1)"
Note the double backslash is needed; otherwise one backslash is stripped as the expression is passed from sqlldr to the regex engine, and the pattern is altered incorrectly. A quirk, I guess.
Anyway after this column1 ends up with "abc.xyz.n" and column2 gets "xyz".
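For context, a minimal complete control file built around those field definitions might look like the following; the data file name, table name, and comma field terminator are assumptions:
load data
infile 'data.csv'
into table mytable
truncate
fields terminated by ','
trailing nullcols
(
x boundfiller,
column1 EXPRESSION ":x",
column2 EXPRESSION "REGEXP_SUBSTR(:x, '(.*?)(\\.|$)', 1, 2, NULL, 1)"
)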
For the second part of your question, you could use an expression as already shown, but call a custom function you create, passing it the extracted value; the function would return the matching value from a lookup table. You certainly don't want to hardcode your 50 lookup values. You could do the same thing in a table-level trigger too. Note I show a select statement for an example only; this should be encapsulated in a function for reusability and maintainability:
Just to show you can do it:
col2 EXPRESSION "(select 'hello' from dual where REGEXP_SUBSTR(:x, '(.*?)(\\.|$)', 1, 2, NULL, 1) = 'xyz')"
The right way:
col2 EXPRESSION "(myschema.mylookupfunc(REGEXP_SUBSTR(:x, '(.*?)(\\.|$)', 1, 2, NULL, 1)))"
mylookupfunc returns the result of looking up 'xyz' in the lookup table, i.e. 'hello' as per your example.
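A minimal sketch of such a function, assuming a hypothetical lookup table col2_lookup(key_val, result_val); adjust names and types to your schema:
CREATE OR REPLACE FUNCTION myschema.mylookupfunc(p_key IN VARCHAR2)
RETURN VARCHAR2
IS
  v_result VARCHAR2(100);
BEGIN
  -- Return the mapped value for the extracted key
  SELECT result_val INTO v_result
  FROM col2_lookup
  WHERE key_val = p_key;
  RETURN v_result;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    RETURN NULL; -- no match: load NULL rather than failing the row
END;
/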

Import a CSV in which every cell is terminated by a newline

I have a CSV file. The data looks like this:
PRICE_a
123
PRICE_b
500
PRICE_c
1000
PRICE_d
506
My XYZ table is:
CREATE TABLE XYZ (
DESCRIPTION_1 VARCHAR2(25),
VALUE NUMBER
)
Can a CSV like the above be imported into Oracle?
How do I create a control.ctl file?
Here's how to do it without having to do any pre-processing. Use the CONCATENATE 2 clause to tell SQL*Loader to join every 2 lines together. This builds logical records, but with no separator between the 2 fields. No problem, but first understand how the data file is read and processed.
SQL*Loader reads the data file a record at a time and tries to map each field, in order from left to right, to the fields listed in the control file (see the control file below). Since the concatenated record it reads maps to TEMP in the control file, and TEMP does not match a column in the table, it will not try to insert it. Instead, since TEMP is defined as a BOUNDFILLER, the value is saved for later use and nothing else is done with it. There are no more data file fields to match, but the control file next lists a field name that matches a column name, DESCRIPTION_1, so the expression is applied and the result inserted.
The expression applies the regexp_substr function to the saved string :TEMP (which we know is the entire record from the file) and returns the substring consisting of zero or more non-numeric characters at the start of the string, followed by zero or more numeric characters to the end of the string, inserting that into the DESCRIPTION_1 column.
The same is then done for the VALUE column, only returning the numeric part at the end of the string and skipping the non-numeric part at the beginning.
load data
infile 'xyz.dat'
CONCATENATE 2
into table XYZ
truncate
TRAILING NULLCOLS
(
TEMP BOUNDFILLER CHAR(30),
DESCRIPTION_1 EXPRESSION "REGEXP_SUBSTR(:TEMP, '^([^0-9]*)[0-9]*$', 1, 1, NULL, 1)",
VALUE EXPRESSION "REGEXP_SUBSTR(:TEMP, '^[^0-9]*([0-9]*)$', 1, 1, NULL, 1)"
)
Bada-boom, bada-bing:
SQL> select *
from XYZ
/
DESCRIPTION_1 VALUE
------------------------- ----------
PRICE_a 123
PRICE_b 500
PRICE_c 1000
PRICE_d 506
SQL>
Note that this is pretty dependent on the data following your example, and you should do some analysis of the data to make sure the regular expressions will work before putting this into production. Some tweaking will be required if the descriptions could contain numbers. If you can get the data to be properly formatted with a separator in a true CSV format, that would be much better.

How to pass parameters to an Oracle update statement from a CSV file, excluding null values

I have a situation where I have the following csv file (say file.csv) with the following data:
AcctId,Name,OpenBal,closingbal
1,abc,1000,
2,,0,
3,xyz,,
4,,,
How can I loop through this file using a unix shell and, for example for column $2 (Name), get all occurrences of the Name column except null values and pass them to the following Oracle query in quoted '','' format?
select * from account
where name in (collection of values from csv column Name, excluding null values)
and openbal in (collection of values from csv column OpenBal, excluding null values)
and closingbal in (collection of values from csv column closingbal, excluding null values)
In short, what I want is to pass the csv column values as input parameters to an Oracle select query (and an update query too), but without including the null values. If a column is entirely null for all rows, I want to exclude that column as well.
Not sure why you'd want to loop through this file in a unix shell script: perhaps because you can't think of any better approach? Anyway, I'm going to skip that and offer a pure Oracle solution.
We can expose data in CSV files to the database using external tables. These are like regular tables, except their data comes from files in OS directories on the database server rather than the database's storage.
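A minimal external table sketch, assuming a directory object called data_dir that points at the folder holding file.csv (the object name and column types are assumptions):
-- External table over file.csv; rows are read from the file at query time
create table your_external_tab (
acctid     number,
name       varchar2(50),
openbal    number,
closingbal number
)
organization external (
type oracle_loader
default directory data_dir
access parameters (
records delimited by newline
skip 1
fields terminated by ','
missing field values are null
)
location ('file.csv')
);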
Given this approach it is easy to write the query you want. I suggest using sub-query factoring to select from the external table once.
with cte as ( select name, openbal, closingbal
from your_external_tab )
select *
from account a
where a.name in ( select cte.name from cte )
and a.openbal in ( select cte.openbal from cte )
and a.closingbal in ( select cte.closingbal from cte )
The behaviour of the IN clause is to exclude NULL from consideration.
Incidentally, that will return a different (larger) result set from this:
select a.*
from account a
, your_external_table e
where a.name = e.name
and a.openbal= e.openbal
and a.closingbal = e.closingbal

Loading data into Hive table

CREATE TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '-'
collection items terminated by ','
map keys terminated by ':'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/rkost/output2';
Below is my data, which is a single row that I need to load into the above table.
1015826235-[{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"},{"product_id":140002997245,"timestamps":"1339694926000"},{"product_id":200002448035,"timestamps":"1339172659000"},{"product_id":260003553381,"timestamps":"1339072514000"}]-
After loading the data, when I run a select query I do not see the data correctly. I should be getting only one row, as below, but this is not the result I get from the table:
USER_ID    PURCHASED_ITEM
1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"}, {"product_id":300003861266,"timestamps":"1340271857000"}, {"product_id":140002997245,"timestamps":"1339694926000"}, {"product_id":200002448035,"timestamps":"1339172659000"}, {"product_id":260003553381,"timestamps":"1339072514000"}]
Instead of the above, I am getting something like this in my table after the select query. Is anything wrong with the delimiter?
1015826235 [{"product_id":null,"timestamps":" 220003038067"},{"product_id":null,"timestamps":" \"1340321132000\"}"},{"product_id":null,"timestamps":"
300003861266"},{"product_id":null,"timestamps":" \"1340271857000\"}"},{"product_id":null,"timestamps":" 140002997245"},
{"product_id":null,"timestamps":" \"1339694926000\"}"},{"product_id":null,"timestamps":" 200002448035"},
{"product_id":null,"timestamps":" \"1339172659000\"}"},{"product_id":null,"timestamps":" 260003553381"},
{"product_id":null,"timestamps":" \"1339072514000\"}]"}]
Can anyone point out what I am doing wrong?
Add double quotes to the product id:
1015826235-[{"product_id":"220003038067","timestamps":"1340321132000"},{"product_id":"300003861266","timestamps":"1340271857000"},{"product_id":"140002997245","timestamps":"1339694926000"},{"product_id":"200002448035","timestamps":"1339172659000"},{"product_id":"260003553381","timestamps":"1339072514000"}]-
I have figured it out on my own. The data that needs to be loaded should look something like this:
1015826235-220003038067:1340321132000,300003861266:1340271857000,140002997245:1339694926000,200002448035:1339172659000,260003553381:1339072514000
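This works because, for an ARRAY<STRUCT<...>>, the collection items terminator (',') separates array elements while the map keys terminator (':') separates the struct fields inside each element; the file should contain the raw delimited values, not JSON. As a quick sanity check after loading (a sketch against the table defined above):
-- Inspect the struct fields of the first array element
SELECT user_id,
       purchased_item[0].product_id,
       purchased_item[0].timestamps
FROM TestingTable2;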
