Hive - How to query a table to get its own name? - hadoop

I want to write a query such that it returns the table name (of the table I am querying) and some other values. Something like:
select table_name, col1, col2 from table_name;
I need to do this in Hive. Any idea how I can get the table name of the table I am querying?
Basically, I am creating a lookup table in Hive that stores the table name and some other information on a daily basis. Since Hive (at least the version we are using) does not support full-fledged INSERTs, I am trying to use the workaround of INSERTing into a table via a SELECT query against another table. Part of this involves actually storing the table name as well. How can this be achieved?

For the purposes of my use case, this will suffice:
select 'table_name', col1, col2 from table_name;
It returns the table name with the other columns that I will require.
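Put together with the INSERT-from-SELECT workaround, a minimal sketch (lookup_table and its columns are placeholder names, not from the original question):
-- assumes lookup_table has columns (table_name, col1, col2); all names here are hypothetical
INSERT INTO TABLE lookup_table
SELECT 'some_table', col1, col2
FROM some_table;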

Related

How to create table in Hive with specific column values from another table

I am new to Hive and have some problems. I tried to find an answer here and on other sites, but with no luck... I also tried many different queries that came to mind, also without success.
I have my source table and I want to create a new table like this.
Where:
id would be an auto-incremented number for each distinct county, serving as the primary key
counties would be the distinct names of the counties (from the source table)
You could follow this approach: a CTAS (Create Table As Select).
With your example, this CTAS could work:
CREATE TABLE t_county
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE AS
WITH t AS (
SELECT DISTINCT county
FROM counties)
SELECT ROW_NUMBER() OVER () AS id, county
FROM t;
You cannot have primary or foreign keys in Hive the way you do in RDBMSs like Oracle or MySQL. Hive is schema-on-read rather than schema-on-write, so you cannot enforce constraints of any kind in Hive.
I cannot give you the exact answer, because you are supposed to try it by yourself first and then come back here if you have a problem or a doubt. What I can tell you is that you can use the INSERT statement to fill a new table with data from another table, i.e.:
create table CARS (name string);
insert into table CARS select x from TABLE_2;
You can also use the OVERWRITE keyword if you want to delete all the existing data inside that table (CARS).
So, the operation will be
CREATE TABLE ==> INSERT OPERATION (OVERWRITE?) + QUERY OPERATION
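A sketch of the overwrite variant, using the same hypothetical tables as above:
-- replaces whatever rows CARS currently holds
insert overwrite table CARS select x from TABLE_2;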
Hive is not an RDBMS, so there is no concept of a primary key or foreign key.
But you can generate a unique id column in Hive as a stand-in for auto-increment. Please try:
Create table new_table as
select reflect("java.util.UUID", "randomUUID") id, countries from my_source_table;

How to populate columns of a new hive table from multiple existing tables?

I have created a new table in Hive (T1) with columns c1, c2, c3, c4. I want to populate data into this table by querying other existing tables (T2, T3).
E.g. c1 and c2 come from a query run on T2, and the other columns c3 and c4 come from a query run on T3.
Is this possible in Hive? I have done immense research but am still unable to find a solution to this.
Didn't something like this work?
create table T1 as
select t2.c1, t2.c2, t3.c3, t3.c4 from (some query against T2) t2 JOIN (some query against T3) t3
Obviously replace JOIN with whatever is needed. I assume some join between T2 and T3 is possible or else you wouldn't be putting their columns alongside each other in T1.
According to the hive documentation, you can use the following syntax to insert data:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Be careful that:
Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
So, I would make a JOIN between the two existing tables, and then insert only the needed values into the target table, playing around with the SELECT. Or maybe creating a temporary table would allow you to have more control over the data. Just remember to handle the problem with NULLs, as stated in the official documentation. This is just an idea; I guess there are other ways to achieve what you need, but it could be a good place to start from.
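A minimal sketch of that idea, assuming (purely for illustration) that T2 and T3 share a common join key named id:
INSERT INTO TABLE T1
SELECT t2.c1, t2.c2, t3.c3, t3.c4
FROM T2 t2
JOIN T3 t3 ON t2.id = t3.id;  -- id is an assumed key; adjust to your actual schema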

Hive load specific columns

I am interested in loading specific columns into a table created in Hive.
Is it possible to load the specific columns directly, or should I load all the data and create a second table to SELECT the specific columns?
Thanks
Yes, you have to load all the data, like this:
LOAD DATA [LOCAL] INPATH '/Your/Path' [OVERWRITE] INTO TABLE yourTable;
LOCAL means that your file is on your local file system and not in HDFS; OVERWRITE means that the current data in the table will be deleted.
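For instance, a concrete invocation might look like this (the path is hypothetical):
LOAD DATA LOCAL INPATH '/home/user/data.csv' OVERWRITE INTO TABLE yourTable;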
So you create a second table with only the fields you need, and you execute this query:
INSERT OVERWRITE TABLE yourNewTable
SELECT the_columns_you_need
FROM yourOldTable;
It is suggested to create an external table in Hive mapped onto the data you have, and then create a new table with the specific columns using the CREATE TABLE AS command:
create table new_table_name as select <columns> from existing_table_name;
For example, the statement looks like this:
create table employee as select id as id,emp_name as name from emp;
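For completeness, a minimal sketch of what the external table behind that example might look like (the column names, delimiter, and location are assumptions, not from the original answer):
CREATE EXTERNAL TABLE emp (
id INT,
emp_name STRING,
salary DOUBLE)  -- columns you do not need can still be declared; they are simply not selected later
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/emp';  -- hypothetical HDFS directory containing the raw files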
Try this:
Insert into table_name
(
-- columns you want to insert values into, in lowercase
)
select columns_you_need from source_table;

Hive: Creating smaller table from big table

I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?
As climbage suggested earlier, you could probably best use Hive's built-in sampling methods.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (1000000 ROWS) t;
This syntax was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to using the PERCENT syntax like so.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (1 PERCENT) t;
You can change the percentage to match your specific sample size requirements.
You can define a new table with the same schema as your original table.
Then use INSERT OVERWRITE TABLE <tablename> <select statement>
The SELECT statement will need to query your original table, use LIMIT to only get 1M results.
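Put together, a minimal sketch of that approach (my_table and my_table_small are placeholder names):
CREATE TABLE my_table_small LIKE my_table;  -- same schema as the original

INSERT OVERWRITE TABLE my_table_small
SELECT * FROM my_table
LIMIT 1000000;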
This query will pull out the top 1M rows and write them into a new table.
CREATE TABLE new_table_name AS
SELECT col1, col2, col3, ...
FROM original_table
-- WHERE <any condition you want to apply>
LIMIT 1000000;

Hive: Create New Table from Existing Partitioned Table

I'm using Amazon's Elastic MapReduce and I have a Hive table created based on a series of log files stored in Amazon S3 and split into folders by day, like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the LIKE option, so thanks for showing me that. Will that actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
create external table if not exists table2 like table1;
insert overwrite table table2 partition(day) select col1, col2, day from table1;
Might not be the best solution, as I think you have to specify your columns in the select clause (as well as the partition column in the partition clause).
And, you must turn on dynamic partitioning.
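For reference, dynamic partitioning is enabled with these session settings before running the insert (standard Hive configuration properties):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;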
I hope this helps.
