Hive not creating separate directories for skewed table - hadoop

My hive version is 1.2.1. I am trying to create a skewed table but it clearly doesn't seem to be working. Here is my table creation script:-
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable
(
country string,
payload string
)
PARTITIONED BY (year int,month int,day int,hour int)
SKEWED BY (country) on ('USA','Brazil') STORED AS DIRECTORIES
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE mydb.mytable PARTITION(year = 2019, month = 10, day=05, hour=18)
SELECT country,payload FROM mydb.mysource;
The select query returns names of countries and some associated string data (payload). So, based on the way I have specified skewing on the column 'country' I was expecting the insert statement to cause creation of separate directories for USA & Brazil (the select query returns enough rows with country as USA & Brazil), but this clearly didn't happen. I see that hive created directory called 'HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME' and all the values went into a single file in that directory. Skewed table is only supposed to send rows with default values (those not specified in table creation statement) to common directory (which is what HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME seems to be) and should create dedicated directories for the rows with skew values. But instead all is going to the default directory and the other directory isn't even created. Do I have to toggle any hive options to make this thing work?

This looks like an old bug that does not appear to be fixed yet: https://issues.apache.org/jira/browse/HIVE-13697. When Hive stores the skew values specified at table creation, it converts them to lower case before writing them to the metastore, so mixed-case values in the data such as 'USA' never match the stored skew values. The workaround for now is to convert the case in the SELECT statement (for example, by selecting the lower-cased country) so that rows go to the right directory. I tested this and it works.

Related

Deduplication in Oracle

Situation:-
Table 'A' receives data from an Oracle GoldenGate feed. Each incoming record is flagged as New, Updated or Duplicate (N/U/D) and either creates a new record or overwrites the old one accordingly. Every entry in the table has an UpdatedTimeStamp column containing the insertion timestamp.
Scope:-
To write a stored procedure in Oracle that pulls the data for a time period based on the UpdatedTimeStamp column and publishes XML using DBMS_XMLGEN.
How can I ensure that a duplicate entered in the table is not processed again?
FYI: I am currently filtering via a new table that I created, named 'A-stg', into which old data is inserted incrementally.
As far as I understood the question, there are a few ways to avoid duplicates.
The most obvious is to use DISTINCT, e.g.
select distinct data_column from your_table
Another one is to use a timestamp column and keep only the last (or the first?) value, e.g.
select data_column, max(timestamp_column)
from your_table
group by data_column
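The "latest timestamp per value" query above can be tried out quickly with a small sketch. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for Oracle, and the helper name latest_per_value is made up for the example. The table and column names are the placeholders from the answer.

```python
import sqlite3

def latest_per_value(rows):
    """Return (data_column, latest timestamp) pairs for the given rows.

    rows is an iterable of (data_column, timestamp_column) tuples.
    SQLite is used here only as a stand-in for Oracle (assumption).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table (data_column TEXT, timestamp_column TEXT)"
    )
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)
    # Same shape as the query in the answer: one row per distinct value,
    # carrying the most recent timestamp seen for that value.
    result = conn.execute(
        "SELECT data_column, MAX(timestamp_column) "
        "FROM your_table GROUP BY data_column ORDER BY data_column"
    ).fetchall()
    conn.close()
    return result
```

ISO-formatted timestamp strings are assumed so that MAX orders them chronologically.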

How to compare table data structure

I need to compare table structures over time and detect:
1. Any table added or deleted.
2. Any column in the tables added or deleted.
So my job is to verify, on the 1st of every month, whether any tables or columns have been added or deleted.
My plan is to run a SQL query that captures the entire list of tables with their columns and data types only (NO DATA), save it in a text file or similar as a baseline, then run the same query the next month and compare the two files. Is this possible? Please help with a SQL query that can do this job.
This query will give you a list of all tables and their columns for a given user. Just replace ABCD with the user you have to audit; provided you have access to all of that user's tables, this will work.
SELECT table_name,
column_name
FROM all_tab_columns
WHERE owner = 'ABCD'
ORDER BY table_name,
column_id;
This answers your question, but I have to agree with a_horse_with_no_name that this is not a good way to implement change control, most notably because the changes have already happened.
This query is very basic and doesn't give you all the information you'd need to see whether a column has changed (or any information about other object types etc.), but you only asked about additions and deletions of tables and columns, and you can compare the output of this script with previous outputs to complete your allotted task.
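The comparison step itself could be sketched roughly as follows. This assumes each month's query output is saved as plain "table_name,column_name" lines; the function names load_snapshot and diff_snapshots are made up for the illustration and are not part of any Oracle tool.

```python
import csv
import io

def load_snapshot(text):
    """Parse 'table_name,column_name' lines into a set of (table, column) pairs."""
    reader = csv.reader(io.StringIO(text))
    return {(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2}

def diff_snapshots(old, new):
    """Report tables and columns added or removed between two snapshots."""
    old_tables = {table for table, _ in old}
    new_tables = {table for table, _ in new}
    return {
        "tables_added": sorted(new_tables - old_tables),
        "tables_removed": sorted(old_tables - new_tables),
        # Only report column changes for tables present in both snapshots;
        # columns of brand-new or dropped tables are covered by the lists above.
        "columns_added": sorted(c for c in new - old if c[0] in old_tables),
        "columns_removed": sorted(c for c in old - new if c[0] in new_tables),
    }
```

In practice the two text arguments would be read from last month's and this month's files.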

hive: fixed log structure and daily analysis

I'm new to Hive. I have logs stored in folders by date: logs/2016/02/15/log-xxx.json. I want to do a daily analysis on the logs from the last day, so I want to run a HiveQL query on the last 2-3 folders (because of timezone differences). How can I do this efficiently?
I cannot tell Hive to automatically discover new logs and add them as new partitions, right?
Do I have to create an external table before each query and delete it afterwards?
Is there any way to tell Hive to just run the query on specified folders without creating any table?
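You can create an external table partitioned by date:
-- date is declared only as the partition column; Hive rejects a column that appears in both lists
-- user and date are reserved words in recent Hive versions, hence the backticks
CREATE EXTERNAL TABLE `user` (
userId BIGINT,
type INT,
level TINYINT
)
PARTITIONED BY (`date` STRING);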
Hive uses schema-on-read: if you later add files manually to HDFS under a location Hive already knows about, they are automatically taken into account when a SELECT statement runs.
Just put them into the proper location according to the date.
However, there is one thing to take into account. When an external table is declared, the table path stored in the Hive metastore is set to the specified location, but nothing is registered for its partitions, so the partition metadata must be added manually:
ALTER TABLE `user` ADD PARTITION (`date`='2010-02-22');
Read more here: http://blog.zhengdong.me/2012/02/22/hive-external-table-with-partitions/
The author of that post also provides a script to automate adding partitions.
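For the logs/2016/02/15 layout from the question, registering the last few days could be scripted along these lines. This is a rough sketch, not the script from the linked post: the table name logs, the partition column dt and the base path /logs are assumptions for illustration, and the generated statements would still need to be run through Hive.

```python
from datetime import date, timedelta

def add_partition_statements(table, base_path, last_day, days=3):
    """Build ALTER TABLE statements registering the most recent daily folders.

    table, base_path and the partition column name 'dt' are hypothetical;
    adjust them to the actual table definition.
    """
    statements = []
    for offset in range(days):
        day = last_day - timedelta(days=offset)
        # Folders follow the year/month/day layout described in the question.
        location = f"{base_path}/{day:%Y/%m/%d}"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{day.isoformat()}') LOCATION '{location}';"
        )
    return statements
```

IF NOT EXISTS makes the statements safe to re-run every day, so the daily job can always register the last 2-3 folders before querying them.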

Set ORC file name

I'm currently implementing ETL (Talend) of monitoring data to HDFS, and Hive table.
I am now facing concerns about duplicates. In more detail: if we need to run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
The solution in an RDBMS would have been to store the input file name and to "DELETE WHERE file name=..." before sending the data. But Hive is not an RDBMS and does not support deletes.
I would like your advice on how to handle this. I envisage two solutions:
1. Currently, the ETL puts CSV files on HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that this operation loses the file name, and the ORC file is named 00000. Is it possible to specify the file name of the created ORC file? If so, I would be able to look up the data by its file name and delete it before launching the ETL.
2. I'm not used to Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID in Hive? Would I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Bests,
Orlando
If the data volume in target table is not too large, I would advise
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
(SELECT 1
FROM trg x
WHERE x.key =src.key
AND <<additional filter on target to reduce data volume>>
)
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys in the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap fits in the RAM available to the Mappers (check the heap size in your default conf files, and increase it with a set command in the Hive script if necessary), performance will not be optimal, but you can be pretty sure that you will not have any duplicates.
And in your actual use case you don't have to check each key but only a "batch ID", more precisely the original file name; the way I've done it in my previous job was
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
(SELECT DISTINCT 1
FROM trg x
WHERE x.original_file_name = src.INPUT__FILE__NAME
AND <<additional filter on target to reduce data volume>>
)
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters, so the overhead would stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would automatically do it at execution time, but Hive does not (not yet) so you have to force it. Note also the "1" is just a dummy value required because of "SELECT" semantics; again, a mature DBMS would allow a dummy "null" but some versions of Hive would crash (e.g. with Tez in V0.14) so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering my own question. I found a solution:
I partitioned my table by (date, input_file_name). (Note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive.)
Once I did this, before running the ETL, I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (input_file_name=...) so that the folder containing the input data is deleted if this input file has already been sent to the ORC table.
Thank you everyone for your help.
Cheers,
Orlando

How to alter Hive partition column name

I have to change the partition column name (not the partition spec). I looked for the commands in the Hive wiki and on some Google results, but I can only find options for altering the partition spec.
For example, in /table/country='US' I can change US to USA, but I want to change country to continent.
It seems the only option available for changing the partition column name is dropping and re-creating the table. Is there any other option available? Please help me.
Thanks in advance.
You can change the column name in the metadata by following:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment
But as the document says, this only changes the metadata. Hive partitions are implemented as directories with the naming pattern columnName=spec, so you also need to rename those directories on HDFS using the "hadoop fs" command.
You can alter the partition column using a simple swap method:
1. Create a new temp table with the same schema as the current table.
2. Move all files from the old table to the newly created table's location.
hadoop fs -mv <current_table_name> <temp_table_name>
3. Alter the schema of the original table (rename or drop the partitions).
4. Recopy/load the temp table data into the original table with the appropriate partition values.
hadoop fs -mv <temp_table_name> <current_table_name>
5. Run MSCK REPAIR on the original table and drop the temp table.
NOTE: the mv command moves the files from one location to another without the cost of copying them. Alternatively, LOAD DATA INPATH can be used to load the data into the original table.
You cannot change the partition column in Hive; in fact, Hive does not support altering partitioning columns.
You can think of it this way: Hive stores the data by creating a folder in HDFS for each partition column value. Altering the partition column would mean changing the whole directory structure and data of the Hive table, which is not possible. For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below.
Create another Hive table with the required changes to the partition column:
Create table new_table ( A int, B String.....)
Load the data from the previous table:
Insert into new_table partition ( B ) select A,B from Prev_table
As you said, renaming the value of the partition is very straightforward:
hive> ALTER TABLE test.usage PARTITION (country ='US') RENAME TO PARTITION (country='USA');
I know that this is not what you are looking for. Unfortunately, given that your data is already partitioned by country, the only option you have is to drop the table, remove the data (supposing your table is external) from the HDFS and reinsert the data using continent as partition.
What I would do in your case is use multiple partition levels, so that your folder structure looks like this:
/path/to/the/data/continent='america'/country='usa'
/path/to/the/data/continent='america'/country='mexico'
/path/to/the/data/continent='europe'/country='spain'
/path/to/the/data/continent='europe'/country='italy'
...
That way you can query the data for different levels of granularity (in this case continent and country).
Adding a solution here for later.
Use case: change the partition column type from STRING to INT
set hive.mapred.mode=nonstrict;
alter table {table_name} partition column ({column_name} {column_type});
e.g. ALTER TABLE employee PARTITION COLUMN (dept INT);
