Hive solution to select/treat null string as NULL - hadoop

I have a Hive external table with CSV data in it. Some of the string fields have the value 'null'. Now I want to select the data and insert it into another table in ORC format with a query like 'select * from first insert into second'.
I want to replace the string 'null' with an actual NULL value.
One solution could be to replace 'null' with a blank and design my table to treat blanks as NULL. That may work, but if there are any genuinely blank values in the data, those will also be treated as NULL.
The other point that comes to mind is that the table has a large number of columns with such strings. So if the solution requires selecting each column and performing some operation on it, I will have to write a very long query. But if there is no other option, that can be done.
Please suggest a solution.

All you need to do is alter your external table so that it treats the string 'null' as NULL:
alter table my_external_table set tblproperties('serialization.null.format'='null');
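To illustrate why this avoids the blank-value problem raised in the question, here is a simplified Python model of reading one delimited row with a null marker. It is only a sketch of the idea, not Hive's actual SerDe code; the function name read_row and the sample line are made up for illustration.

```python
NULL_FORMAT = "null"  # the value given to serialization.null.format


def read_row(line, delimiter=","):
    """Split one delimited line and map the null marker to None.

    Only a field that is exactly equal to the marker becomes None;
    an empty field stays an empty string.
    """
    return [None if field == NULL_FORMAT else field
            for field in line.split(delimiter)]


# Hypothetical CSV line: a value, the text 'null', a blank, another value.
parsed = read_row("1,null,,abc")
print(parsed)
```

In this model the 'null' field comes back as None while the blank field is kept as an empty string, which is the distinction the question asks for in its string columns.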

The more recent versions of Hive support the standard NULLIF() function. If you are using insert, then you should be listing the columns anyway:
insert into second(col1, col2, col3, . . .)
select col1, nullif(col2, 'null'), col3, . . .
from first;
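For a table with a large number of columns, the statement above does not have to be typed by hand. A minimal Python sketch that generates it, assuming you already have the column names and types (for example copied from the table definition); the helper name build_nullif_insert and the sample columns are hypothetical:

```python
def build_nullif_insert(source, target, columns, null_token="null"):
    """Build an INSERT ... SELECT that wraps every string column in NULLIF.

    columns is a list of (name, type) pairs. Only columns whose type is
    'string' are wrapped; all other columns are selected unchanged.
    """
    select_items = []
    for name, col_type in columns:
        if col_type.lower() == "string":
            select_items.append(f"nullif({name}, '{null_token}')")
        else:
            select_items.append(name)
    col_list = ", ".join(name for name, _ in columns)
    return (
        f"insert into {target} ({col_list})\n"
        f"select {', '.join(select_items)}\n"
        f"from {source};"
    )


# Made-up column list for illustration.
columns = [("col1", "int"), ("col2", "string"), ("col3", "string")]
query = build_nullif_insert("first", "second", columns)
print(query)
```

The generated text can then be pasted into the Hive client. Whether other character types should be wrapped as well depends on the table, so the type check is an assumption to adjust.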

Related

Does an insert statement without a column list improve performance in Oracle

I'm using Oracle 12.
When you define an insert statement, there is an option not to state the column list:
If you omit the column list altogether, then the values_clause or query must specify values for all columns in the table.
It is also described in Ask TOM, when suggesting the best-performing solution for bulk inserts:
insert into insert_into_table values ( data(i) );
My question: does not stating the columns really give better, or at least equal, performance compared with stating the columns in the statement, as in
insert into A (col1, col2, col3) values (?, ?, ?);
In my experience there is no gain in omitting column names. It is also bad practice: if the column order changes (people sometimes reorder columns for clarity, although they don't really need to) and the column definitions still accept the data, you will get the wrong data in the wrong columns.
As a rule of thumb, it's not worth the trouble. Always specify the list of columns you're putting values into. The database has to check that anyway.
Related: SQL INSERT performance omitting field names?
Best practice is to ALWAYS define the columns in insert statements, NOT for the sake of performance (there is no difference), but for situations like this:
You create table test1 with columns col1, col2;
You insert data into that table in your procedures etc. without naming the columns;
You add new columns, col3, col4;
The current logic will fail on insert, and errors will be raised.
So, to avoid the failure, always name the columns; then your code doesn't break when you modify the table structure.
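The scenario in the steps above can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in, purely so that it runs anywhere; it is not Oracle, and the exact error raised there will differ, but the failure mode is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test1 (col1 text, col2 text)")

# Both forms work while the table has exactly two columns.
conn.execute("insert into test1 values ('a', 'b')")
conn.execute("insert into test1 (col1, col2) values ('c', 'd')")

# Later the table structure changes.
conn.execute("alter table test1 add column col3 text")

# The insert that names its columns keeps working; col3 is left NULL.
conn.execute("insert into test1 (col1, col2) values ('e', 'f')")

# The insert without a column list now supplies too few values.
try:
    conn.execute("insert into test1 values ('g', 'h')")
    unnamed_insert_failed = False
except sqlite3.OperationalError as exc:
    unnamed_insert_failed = True
    print("insert without column list failed:", exc)

row_count = conn.execute("select count(*) from test1").fetchone()[0]
print(row_count)
```

Only the three rows inserted before the failing statement end up in the table.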

Concat_ws not working in insert statement in hive

Using hive, I'm trying to concatenate columns from one table and insert them in another table using the query
insert into table temp_error
select * from (Select 'temp_test','abcd','abcd','abcd',
from_unixtime(unix_timestamp()),concat_ws('|',sno,name,age)
from temp_test_string)c;
I get the required output as long as I only run the Select *. But as soon as I try to insert it into the table, it does not give the concatenated output; it gives only the value of sno instead of the whole concatenated string.
Thanks guys.
I found why it was behaving that way. It's because when creating the table I declared the fields as terminated by '|'. So the string I was trying to insert into the table was being interpreted by Hive as several different columns.
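The effect can be pictured with a simplified Python model of a text table whose field delimiter is '|'. This is only a sketch of the idea, not how Hive is implemented, and the sample values (timestamp, sno, name, age) are invented:

```python
FIELD_DELIMITER = "|"

# Hypothetical row produced by the select: five plain values plus the
# concat_ws('|', sno, name, age) result as the sixth column.
row = ["temp_test", "abcd", "abcd", "abcd", "2020-01-01 00:00:00", "1|john|30"]

# Writing a delimited text row joins the columns with the field delimiter.
stored_line = FIELD_DELIMITER.join(row)

# Reading splits on the same delimiter, so the concatenated value is
# broken apart again into separate fields.
fields_read = stored_line.split(FIELD_DELIMITER)

# A six-column table only looks at the first six fields.
sixth_column = fields_read[5]
print(len(fields_read), sixth_column)
```

In this model the sixth column holds only the sno part, matching what the question describes. Using a different separator in concat_ws, or a different field delimiter for the table, avoids the collision.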

Hive - How to query a table to get its own name?

I want to write a query such that it returns the table name (of the table I am querying) and some other values. Something like:
select table_name, col1, col2 from table_name;
I need to do this in Hive. Any idea how I can get the table name of the table I am querying?
Basically, I am creating a lookup table that stores the table name and some other information on a daily basis in Hive. Since Hive does not (at least the version we are using) support full-fledged INSERTs, I am trying to use the workaround where we can INSERT into a table with a SELECT query that queries another table. Part of this involves actually storing the table name as well. How can this be achieved?
For the purposes of my use case, this will suffice:
select 'table_name', col1, col2 from table_name;
It returns the table name with the other columns that I will require.

Hive: Create New Table from Existing Partitioned Table

I'm using Amazon's Elastic MapReduce and I have a hive table created based on a series of log files stored in Amazon S3 and split in folders by day like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the LIKE option, so thanks for showing me that. Will it actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create external table if not exists table2 like table1;
insert overwrite table table2 partition(day) select col1, col2, day from table1;
Might not be the best solution, as I think you have to list your columns in the select clause, with the partition column last (and name the partition column in the partition clause).
And you must turn on dynamic partitioning, which is what the two set statements do.
I hope this helps.

Difference Between Insert and Append statement in SQL Loader?

Can anyone tell me the difference between the INSERT and APPEND statements in SQL*Loader? Consider the example below:
Here is my control file
load_1.ctl
load data
infile 'load_1.dat' "str '\r\n'"
insert*/+append/* into table sql_loader_1
(
load_time sysdate,
field_2 position( 1:10),
field_1 position(11:20)
)
Here is my data file
load_1.dat
0123456789abcdefghij
**********##########
foo bar
here comes a very long line
and the next is
short
The documentation is fairly clear: use INSERT when you're loading into an empty table, and APPEND when adding rows to a table that (might) contain data (that you want to keep).
APPEND will still work if your table is empty. INSERT might be safer if you're expecting the table to be empty, as it will error if that isn't true, possibly avoiding unexpected results (particularly if you don't notice and don't get other errors like unique index constraint violations) and/or a post-load data cleanse.
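The behaviour described in this answer can be summed up in a toy Python model. It is only an illustration of the rule, not SQL*Loader itself; the function name load_rows and the row values are hypothetical.

```python
def load_rows(table, new_rows, method):
    """Model how the INSERT and APPEND load methods treat existing rows."""
    if method == "insert":
        # INSERT expects an empty table and refuses to load otherwise.
        if table:
            raise ValueError("INSERT requires an empty table")
        return list(new_rows)
    if method == "append":
        # APPEND keeps whatever is there and adds the new rows after it.
        return table + list(new_rows)
    raise ValueError(f"unknown method: {method}")


empty_table = []
loaded = load_rows(empty_table, ["row1", "row2"], "insert")
appended = load_rows(loaded, ["row3"], "append")

# A second INSERT into the now non-empty table is rejected.
try:
    load_rows(loaded, ["row3"], "insert")
    insert_rejected = False
except ValueError:
    insert_rejected = True
print(appended, insert_rejected)
```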
The difference comes down to two points:
APPEND only adds records after the rows already in the table.
INSERT will insert anywhere you want, i.e. if your table has 10 columns you can insert into only 5 of them, but with APPEND you can't.
With APPEND, both your data and the table should have the same columns, meaning data is inserted at row level rather than at column level.
It's also true that you cannot use INSERT if your table has data; you can use INSERT only if it is empty.
Hope it helps.
