how to get parquet file name based on data in table - hadoop

I'm trying to figure out which of the many Parquet files the data in a table is stored in, for a particular date condition.
For example:
select filenames from table where dateCol = '1-1-2010';
I remember reading somewhere that this is possible, but I can't recall the details and couldn't find it elsewhere. Anybody got any ideas?

Got it.
select distinct(INPUT__FILE__NAME) from table where conditions;
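For example, applied to the question above (assuming the table is actually named my_table and dateCol is stored as a string), a minimal sketch would be:
SELECT DISTINCT INPUT__FILE__NAME
FROM my_table
WHERE dateCol = '1-1-2010';
INPUT__FILE__NAME is a Hive virtual column, so this works regardless of whether the underlying files are Parquet, ORC, or text.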

Related

Hive not creating separate directories for skewed table

My Hive version is 1.2.1. I am trying to create a skewed table, but it clearly doesn't seem to be working. Here is my table creation script:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable
(
country string,
payload string
)
PARTITIONED BY (year int,month int,day int,hour int)
SKEWED BY (country) on ('USA','Brazil') STORED AS DIRECTORIES
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE mydb.mytable PARTITION(year = 2019, month = 10, day=05, hour=18)
SELECT country,payload FROM mydb.mysource;
The select query returns names of countries and some associated string data (payload). So, based on the way I have specified skewing on the column 'country', I was expecting the insert statement to create separate directories for USA and Brazil (the select query returns enough rows with country as USA and Brazil), but this clearly didn't happen.
I see that Hive created a directory called 'HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME' and all the values went into a single file in that directory. A skewed table is only supposed to send rows with default values (those not specified in the table creation statement) to the common directory (which is what HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME seems to be) and should create dedicated directories for the rows with skew values. Instead, everything is going to the default directory and the other directories aren't even created. Do I have to toggle any Hive options to make this work?
It looks like an old bug that doesn't appear to be fixed yet: https://issues.apache.org/jira/browse/HIVE-13697. Internally, when Hive stores the skew values specified during table creation, it converts them to lower case before writing them to the metastore. The workaround for now is to convert the case in the select statement so the rows go to the right bucket. I tested this and it works that way.
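A minimal sketch of that workaround, assuming you are willing to store the country values in lower case so that they match the lower-cased skew values in the metastore (the table names are the ones from the question):
INSERT OVERWRITE TABLE mydb.mytable PARTITION(year = 2019, month = 10, day = 05, hour = 18)
SELECT lower(country), payload FROM mydb.mysource;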

How to compare table data structure

How to compare table data structure.
1. Any table added or deleted.
2. Any column in the tables added or deleted.
So my job is to verify whether any tables or columns were added or deleted on the 1st of every month.
My plan is to run a SQL query, take a copy of the entire list of tables and their data types only (NO DATA), save it in a txt file or something, and use it as a baseline; next month, run the same SQL query, get the results, and compare them against the file. Is that possible? Please help with a SQL query which can do this job.
This query will give you a list of all tables and their columns for a given user (just replace ABCD in the query with the user you have to audit; provided you have access to all of that user's tables, this will work).
SELECT table_name,
       column_name
FROM   all_tab_columns
WHERE  owner = 'ABCD'
ORDER  BY table_name,
       column_id;
This answers your question, but I have to agree with a_horse_with_no_name that this is not a good way to implement change control, most notably because the changes have already happened by the time you detect them.
This query is very basic and doesn't give you all the information you'd need to see whether a column has changed (or any information about other object types, etc.), but then you only asked about additions and deletions of tables and columns, and you can compare the output of this script against previous outputs to answer that.
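As a sketch of the monthly comparison itself, you could keep last month's output in a snapshot table and diff it with MINUS (the table name schema_baseline is an assumption, not something from your schema):
-- Baseline snapshot, taken once on the 1st of the month (schema_baseline is a hypothetical name).
CREATE TABLE schema_baseline AS
SELECT table_name, column_name, data_type
FROM   all_tab_columns
WHERE  owner = 'ABCD';

-- Tables/columns added since the baseline:
SELECT table_name, column_name FROM all_tab_columns WHERE owner = 'ABCD'
MINUS
SELECT table_name, column_name FROM schema_baseline;

-- Tables/columns dropped since the baseline:
SELECT table_name, column_name FROM schema_baseline
MINUS
SELECT table_name, column_name FROM all_tab_columns WHERE owner = 'ABCD';
After comparing, drop and recreate schema_baseline so the next run compares against the current state.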

Set ORC file name

I'm currently implementing an ETL (Talend) of monitoring data into HDFS and a Hive table.
I am now facing a concern about duplicates. In more detail: if we run one ETL job twice with the same input, we will end up with duplicates in our Hive table.
In an RDBMS, the solution would have been to store the input file name and to "DELETE WHERE file_name = ..." before loading the data. But Hive is not an RDBMS and does not support deletes.
I would like your advice on how to handle this. I envisage two solutions:
Actually, the ETL puts CSV files onto HDFS, which are used to feed an ORC table with an "INSERT INTO TABLE ... SELECT ...". The problem is that, with this operation, I'm losing the file name, and the ORC file is named 00000. Is it possible to specify the file name of the created ORC file? If yes, I would be able to search the data by its file name and delete it before launching the ETL.
I haven't used Hive's ACID capability (a feature of Hive 0.14+). Would you recommend enabling ACID in Hive? Would I be able to "DELETE WHERE" with it?
Feel free to propose any other solution you may have.
Best regards,
Orlando
If the data volume in the target table is not too large, I would advise:
INSERT INTO TABLE trg
SELECT ... FROM src
WHERE NOT EXISTS
  (SELECT 1
   FROM trg x
   WHERE x.key = src.key
     AND <<additional filter on target to reduce data volume>>
  )
Hive will automatically rewrite the correlated sub-query into a MapJoin, extracting all candidate keys from the target table into a Java HashMap and filtering source rows on the fly. As long as the HashMap fits in the RAM available for the mappers' heap (check your default conf files, and increase it with a set command in the Hive script if necessary), the performance will be sub-optimal, but you can be pretty sure that you will not end up with any duplicates.
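For reference, the kind of set commands meant here would look like the following on MapReduce v2; the values are placeholders and the property names may differ on other execution engines or distributions:
-- Raise the mapper container size and its JVM heap from within the Hive script.
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3600m;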
And in your actual use case you don't have to check each key, only a "batch ID", more precisely the original file name; the way I did it at a previous job was:
INSERT INTO TABLE trg
SELECT ..., INPUT__FILE__NAME as original_file_name
FROM src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM trg x
   WHERE x.original_file_name = src.INPUT__FILE__NAME
     AND <<additional filter on target to reduce data volume>>
  )
That implies an extra column in your target table, but since ORC is a columnar format, it's the number of distinct values that matters, so the overhead should stay low.
Note the explicit "DISTINCT" in the sub-query; a mature DBMS optimizer would automatically do it at execution time, but Hive does not (not yet) so you have to force it. Note also the "1" is just a dummy value required because of "SELECT" semantics; again, a mature DBMS would allow a dummy "null" but some versions of Hive would crash (e.g. with Tez in V0.14) so "1" or "'A'" are safer.
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries#LanguageManualSubQueries-SubqueriesintheWHEREClause
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
I'm answering my own question. I found a solution:
I partitioned my table by (date, input_file_name) (note: I can get the input file name with SELECT INPUT__FILE__NAME in Hive).
Once I did this, before running the ETL I can send Hive an ALTER TABLE ... DROP IF EXISTS PARTITION (input_file_name=...), so that the folder containing the input data is deleted if this input file has already been loaded into the ORC table.
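A minimal sketch of that pattern, with hypothetical table and column names (monitoring_orc, monitoring_csv, event_time, payload, load_date, input_file) that are not taken from the original post:
CREATE TABLE IF NOT EXISTS monitoring_orc (
  event_time STRING,
  payload    STRING
)
PARTITIONED BY (load_date STRING, input_file STRING)
STORED AS ORC;

-- Drop the partition first, so re-running the ETL for the same file cannot create duplicates.
ALTER TABLE monitoring_orc
  DROP IF EXISTS PARTITION (load_date = '2015-06-01', input_file = 'metrics_20150601.csv');

-- Reload from the CSV staging table, keeping only the rows that came from that file.
INSERT INTO TABLE monitoring_orc
  PARTITION (load_date = '2015-06-01', input_file = 'metrics_20150601.csv')
SELECT event_time, payload
FROM   monitoring_csv
WHERE  INPUT__FILE__NAME LIKE '%metrics_20150601.csv';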
Thank you everyone for your help.
Cheers,
Orlando

In Hive, how do I load only part of the raw data to a table?

I've got a typical CREATE TABLE statement as follows:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/20140811/';
Where /20140811/ is a directory with gigabytes worth of data inside.
Loading the data is not a problem. Querying anything on it, however, chokes Hive up and simply gives me a number of MapReduce errors.
So instead, I'd like to ask if there's a way to load only part of the data in /20140811/. I know I can select a few files from inside the folder, dump them into another folder, and use that, but it seems tedious, especially when I've got 20 or so of these /20140811/ directories.
Is there something like this:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/Half_of_20140811/';
I'm also open to non-Hive answers. Perhaps there's a way in s3cmd to quickly take a certain amount of the data inside /20140811/ and dump it into /20140811_halved/ or something.
Thanks.
I would suggest the following as a workaround:
Create a temp table with the same structure (using CREATE TABLE ... LIKE).
insert into NEW_TABLE select * from OLD_TABLE limit 1000;
You can add as many filter conditions as you need to cut the data down before loading; see the sketch below.
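A sketch of that workaround against the table from the question (temp_url_sample is a hypothetical name):
-- Copy the schema only, then load a small slice of the data.
CREATE TABLE temp_url_sample LIKE temp_url;

INSERT INTO TABLE temp_url_sample
SELECT * FROM temp_url
LIMIT 1000;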
Hope this helps you.
Since you say you have "20 or so of these /20140811/ directories", why don't you try creating an external table with partitions pointing at those directories and run your queries against a single partition?
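A minimal sketch of that suggestion, where the partition column dt and the table name temp_url_part are assumptions, while the column list and S3 path come from the question:
CREATE EXTERNAL TABLE temp_url_part (
  MSISDN STRING,
  TIMESTAMP STRING,
  URL STRING,
  TIER1 STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

-- Map each dated directory to its own partition, then query only the one you need.
ALTER TABLE temp_url_part ADD IF NOT EXISTS
  PARTITION (dt = '20140811')
  LOCATION 's3://mybucket/input/project_blah/20140811/';

SELECT count(*) FROM temp_url_part WHERE dt = '20140811';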

Replace quotes using lazy simple serde hive

Hi, I am dealing with many files which have quotes in the data, as shown below.
"ID"|"STUDENT"|"GRADE"
"123"|"John"|"9.7"
"132"|"Johny"|"8.7"
"143"|"Ronny"|"8.17"
I would like to remove the quotes from the data; can you please let me know how it can be done? If any built-in SerDe would help, that would be great, since I am dealing with many such files.
Load this data as-is into a temp Hive table. Then use the regexp_replace() function while inserting into your final table.
Steps:
Load the data into a temp table with a similar schema.
Insert overwrite into the final table with regexp_replace():
insert overwrite table final_table
select regexp_replace(COLUMN_NAME_1,"\"",""),
       regexp_replace(COLUMN_NAME_2,"\"","")
from temp_hive_table;
Updated: for many files.
Define the temp table as an external table.
Copy all your source files to its HDFS path.
Do an insert overwrite with regexp_replace() into the desired table, as in the sketch below.
Hope this approach helps.
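As a sketch of those steps using the sample data from the question (the names students_staging, students, and the HDFS path are assumptions, and the final table students is assumed to exist already):
CREATE EXTERNAL TABLE students_staging (
  id      STRING,
  student STRING,
  grade   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/staging/students';

-- Strip the quotes while loading into the final table.
INSERT OVERWRITE TABLE students
SELECT regexp_replace(id,      '"', ''),
       regexp_replace(student, '"', ''),
       regexp_replace(grade,   '"', '')
FROM students_staging;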
