Hive: inserting into multiple tables based on query result

I am trying to run a Hive query to filter out invalid records. Here is what I am doing:
1. Load the CSV file into a single-column table.
2. Define a UDF my_validation to validate each record.
3. Execute the query:
from pgstg INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
select * where my_validation(record) IS NOT NULL
INSERT OVERWRITE TABLE PGERR
select record where my_validation(record) IS NULL;
Here are my questions:
a. Is there a better way to filter invalid records?
b. Does the my_validation UDF run twice on the whole table?
c. What is the best way to split a single column into multiple columns?
Thanks much for your help.

To answer your questions:
1) If you have custom validation criteria, a UDF is probably the way to go. If I were doing it, I would create an is_valid UDF that returns a boolean (instead of returning NULL vs. not NULL).
2) Yes, the UDF does get run twice.
3) Glad you asked. Look at the explode function available in Hive.
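For illustration, a minimal sketch of what the multi-insert could look like with a boolean is_valid UDF, and of splitting the single column afterwards (is_valid, the comma delimiter, and the column names are assumptions, not from the original post):

FROM pgstg
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validrecords.out'
  SELECT record WHERE is_valid(record)
INSERT OVERWRITE TABLE PGERR
  SELECT record WHERE NOT is_valid(record);

-- Splitting the single column into fields, assuming a comma-delimited record:
SELECT split(record, ',')[0] AS col1,
       split(record, ',')[1] AS col2,
       split(record, ',')[2] AS col3
FROM pgstg;

Note that each SELECT in the multi-insert still evaluates the UDF, so it runs twice either way; explode is useful when you want the split array back as rows rather than columns.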

Related

PL/SQL: Looping through a list string

Please forgive me if I open a new thread about looping in PL/SQL, but after reading dozens of existing ones I'm still not able to do what I'd like to.
I need to run a complex query on a view of a table, and the only way to shorten the running time is to filter through a WHERE clause based on a variable on which the table is indexed (otherwise the system ends up doing a full scan of the table, which runs endlessly).
The variable the table is indexed on is store_id (string).
I can retrieve all the store_id values I want to query from a separate table:
e.g. select distinct store_id from store_anagraphy
Then I'd like to make a loop that iterates queries with the store_id values identified above
e.g. select *complex query from view_of_sales where store_id = 'xxxxxx'
and appends (unions) all the results returned by each of these queries.
Thank you very much in advance.
Gianluca
In theory, you could write a pipelined table function that ran multiple queries in a loop and made a series of PIPE ROW calls to return the results. That would be pretty unusual, but it could be done.
It would be far, far more common, however, to simply combine the two queries and run a single query that returns all the rows you want:
select something
from your_view
where store_id in (select distinct store_id
                   from store_anagraphy)
If you are saying that you have tried this query and Oracle is choosing to do a table scan rather than using the index, then what you really have is a tuning problem. Most likely, statistics on one or more objects are inaccurate, which leads Oracle to expect that this query will return more rows than it really will, thus favoring the table scan. You should be able to fix that by fixing the statistics on the objects. In a pinch, you could also use hints to force an index to be used.
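If the statistics are indeed the problem, a sketch of both options (the owner, table, and index names below are placeholders, not from the question):

-- Refresh optimizer statistics on the table behind the view:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SALES_OWNER', tabname => 'SALES');
END;
/

-- Or, as a last resort, force the index with a hint on the base table:
select /*+ INDEX(s store_id_idx) */ something
from sales s
where store_id in (select distinct store_id from store_anagraphy);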

Concat_ws not working in insert statement in hive

Using Hive, I'm trying to concatenate columns from one table and insert them into another table using the query:
insert into table temp_error
select * from (Select 'temp_test','abcd','abcd','abcd',
from_unixtime(unix_timestamp()),concat_ws('|',sno,name,age)
from temp_test_string)c;
I get the required output as long as I only run the SELECT. But as soon as I try to insert it into the table, it does not give the concatenated output; it gives only the value of sno instead of the whole concatenated string.
Thanks guys.
I found why it was behaving that way. It's because while creating the table I specified that fields are separated by '|'. So the string I was trying to insert into the table was being interpreted by Hive as different columns.
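For illustration, one way to avoid the clash is to create the target table with a field delimiter that cannot appear in the concatenated string, so the '|' characters produced by concat_ws are stored as part of a single column (the column names below are assumptions):

CREATE TABLE temp_error (
  src_name   STRING,
  col_a      STRING,
  col_b      STRING,
  col_c      STRING,
  load_time  STRING,
  concat_val STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';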

Why does Phoenix always add an extra column (named _0) to HBase when I execute an UPSERT command?

When I execute the UPSERT command on Apache Phoenix, I always see that Phoenix adds an extra column (named _0) with an empty value in HBase. This column (_0) is auto-generated by Phoenix, but I don't need it, like this:
ROW    COLUMN+CELL
abc    column=F:A,  timestamp=1451305685300, value=123
abc    column=F:_0, timestamp=1451305685300, value=     # I want to avoid generating this cell
Could you tell me how to avoid that? Thank you very much!
"At create time, to improve query performance, an empty key value is
added to the first column family of any existing rows or the default
column family if no column families are explicitly defined. Upserts will also add this empty key value. This improves query performance by having a key value column we can guarantee always being there and thus minimizing the amount of data that must be projected and subsequently returned back to the client."
Apache Phoenix Documentation
Regarding your question of whether that is avoidable:
You could work around the problem by adding the following statements at the end of your sql:
ALTER TABLE "<your-table>" ADD "<your-cf>"."_0" VARCHAR(1);
ALTER TABLE "<your-table>" DROP COLUMN "<your-cf>"."_0";
You should only do this if you query the table with Phoenix but then access it with another system that is not aware of this Phoenix-specific dummy value.
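Applied to the column family F shown in the scan above, the workaround would look like this (MY_TABLE is a placeholder table name):

ALTER TABLE "MY_TABLE" ADD "F"."_0" VARCHAR(1);
ALTER TABLE "MY_TABLE" DROP COLUMN "F"."_0";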

Hive count(*) query is not invoking mapreduce

I have external tables in Hive. I am trying to run a select count(*) from table_name query, but the query returns instantaneously and gives a result which, I think, is already stored. The result returned by the query is not correct. Is there a way to force a MapReduce job and make the query execute each time?
Note: This behavior is not seen for all external tables, only for some of them.
Versions used: Hive 0.14.0.2.2.6.0-2800, Hadoop 2.6.0.2.2.6.0-2800 (Hortonworks)
After some investigation, I found a method that kicks off MR for counting the number of records on an ORC table.
ANALYZE TABLE <table_name> PARTITION (<partition_columns>) COMPUTE STATISTICS;
--OR
ANALYZE TABLE <table_name> COMPUTE STATISTICS;
This is not a direct alternative for count(*) but it provides the latest count of records in the table.
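Once the statistics have been computed, the row count can be read back from the table metadata without a scan; for instance (sales_orc is a placeholder table name):

ANALYZE TABLE sales_orc COMPUTE STATISTICS;
DESCRIBE FORMATTED sales_orc;  -- numRows appears under Table Parameters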
Doing a wc -l on ORC data won't give you an accurate result, since the data is encoded. This would work if the data was stored in a simple text file format with one row per line.
Hive does not need to launch a MapReduce for count(*) of an ORC file since it can use the ORC metadata to determine the total count.
Use the orcfiledump command to analyse ORC data from the command line
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileDumpUtility
From personal experience, COUNT(*) on an ORC table usually returns wrong figures -- i.e. it returns the number of rows on the first data file only. If the table was fed by multiple INSERTs then you are stuck.
With V0.13 you could fool the optimizer into running a dummy M/R job by adding a dummy "where 1=1" clause -- takes much longer, but actually counts the rows.
With 0.14 the optimizer got smarter, so you must add a non-deterministic clause e.g. "where MYKEY is null". This assumes that MYKEY is a String; otherwise the "is null" clause may crash your query -- another ugly ORC bug.
By the way, a SELECT DISTINCT on partition key(s) will also return wrong results -- all existing partitions will be shown, even the empty ones. Not specific to ORC this time.
Please try the below:
Set hive.fetch.task.conversion=none in your Hive session and then trigger the select count(*) operation in the same session to mandate MapReduce.
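A minimal sketch of that session (my_orc_table is a placeholder table name):

hive> set hive.fetch.task.conversion=none;
hive> select count(*) from my_orc_table;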

Oracle Trigger UPDATE instead of INSERT

I am trying to find a way to write an Oracle trigger that would check, before an insert, whether a match already exists on the primary column and, if so, update the row information instead of inserting a new row.
I've looked at BEFORE INSERT triggers. Is there a way to cancel the insert based on criteria inside that block?
I've also looked at using the INSTEAD OF clause, but it requires working on a view.
What is the best way to go about this?
Use a MERGE statement instead of an INSERT.
Use a merge statement.
MERGE INTO <<your table>> t
USING (<<your list of records - can be the result of a SELECT>>)
ON (<<join between table and list of records>>)
WHEN MATCHED THEN
  UPDATE SET <<the rows you want to set>>
WHEN NOT MATCHED THEN
  INSERT (<<columns of table>>)
  VALUES (<<values>>)
https://oracle-base.com/articles/9i/merge-statement
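For illustration, a concrete instance of that template, assuming a hypothetical accounts table keyed on account_id:

MERGE INTO accounts t
USING (SELECT 42 AS account_id, 'Alice' AS name FROM dual) s
ON (t.account_id = s.account_id)
WHEN MATCHED THEN
  UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN
  INSERT (account_id, name)
  VALUES (s.account_id, s.name);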
