Creating view in HIVE - hadoop

I want to create a view on a partitioned Hive table. My view definition is as follows:
create view schema.V1 as
select t1.*
from schema.tab1 as t1
inner join (
  select record_key, max(last_update) as last_update
  from schema.tab1
  group by record_key
) as t2
on t1.record_key = t2.record_key and t1.last_update = t2.last_update
The table tab1 is partitioned on quarter_id.
When I run any query on the view, it fails with this error:
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "V1:t2:tab1" Table "tab1"
Regards
Jayanta Layak

Your Hive session appears to be running in strict mode (the default in Hive 2.x). Strict mode rejects queries on partitioned tables that have no WHERE clause filtering on a partition column.
If you need to run a query across all partitions (a full table scan), you can set the mode to 'nonstrict'. Use this property with care, as it can trigger enormous MapReduce jobs.
set hive.mapred.mode=nonstrict;
If you don't need an entire table scan, you can simply specify the partition value in your query's WHERE clause.
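The error in the question names the alias of the inner subquery (t2), so each scan of tab1 needs its own partition predicate. As a minimal sketch of that idea, the Python helper below builds the query text against the base table with the partition filter applied to both scans. The function name and the quarter value '2017Q1' are hypothetical; only the table and column names come from the question.

```python
def latest_rows_query(table: str, partition_col: str, partition_value: str) -> str:
    """Build HiveQL text that selects the latest row per record_key
    within one partition, filtering BOTH scans of the table."""
    # Keep the sketch simple: refuse values that would need escaping.
    if "'" in partition_value or "\\" in partition_value:
        raise ValueError("partition value must not contain quotes or backslashes")
    literal = f"'{partition_value}'"
    return (
        f"SELECT t1.*\n"
        f"FROM {table} t1\n"
        f"INNER JOIN (\n"
        f"  SELECT record_key, MAX(last_update) AS last_update\n"
        f"  FROM {table}\n"
        f"  WHERE {partition_col} = {literal}\n"      # predicate for the inner scan
        f"  GROUP BY record_key\n"
        f") t2\n"
        f"  ON t1.record_key = t2.record_key AND t1.last_update = t2.last_update\n"
        f"WHERE t1.{partition_col} = {literal}"       # predicate for the outer scan
    )


if __name__ == "__main__":
    # '2017Q1' is an assumed example value for quarter_id.
    print(latest_rows_query("schema.tab1", "quarter_id", "2017Q1"))
```

The generated statement can be pasted into the Hive CLI; whether it fits depends on whether one partition at a time is enough for your use case.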

Related

Hive partitioned view not showing partitions info

I have created a partitioned view in Hive as below
create view if not exists view_name
PARTITIONED ON(date)
as
select col1,col2,date
from table1
union all
select col1,col2,date
from table2
The underlying tables are partitioned on the 'date' column. When I run DESCRIBE FORMATTED VIEW_NAME, the partition information shows as null.
If I use SHOW CREATE TABLE View_Name, I get view definition without partitions as below
create view if not exists view_name
as
select col1,col2,date
from table1
union all
select col1,col2,date
from table2
Please let me know what I am missing.
From the Hive documentation:
Although there is currently no connection between the view partition
and underlying table partitions, Hive does provide dependency
information as part of the hook invocation for ALTER VIEW ADD
PARTITION. It does this by compiling an internal query of the form
In other words, a view holds no partition information about its underlying tables. A workaround (depending on how complex your view query is) is to add the partitions explicitly, as follows:
ALTER VIEW view_name ADD [IF NOT EXISTS] partition_spec partition_spec
At least from the user perspective, it will provide information about the available partitions in the underlying tables.
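If there are many partition values, the ALTER VIEW statements could be generated rather than typed. The following is a rough sketch in Python, assuming a single string-valued partition column; the function name and the sample dates are hypothetical, and the column name is wrapped in backticks because date may be a reserved word depending on the Hive version.

```python
from typing import List


def add_view_partition_statements(view: str, column: str, values: List[str]) -> List[str]:
    """Return one ALTER VIEW ... ADD PARTITION statement per partition value."""
    statements = []
    for value in values:
        # Keep the sketch simple: refuse values that would need escaping.
        if "'" in value or "\\" in value:
            raise ValueError("partition value must not contain quotes or backslashes")
        statements.append(
            f"ALTER VIEW {view} ADD IF NOT EXISTS PARTITION (`{column}`='{value}');"
        )
    return statements


if __name__ == "__main__":
    # The dates below are made-up sample values.
    for stmt in add_view_partition_statements("view_name", "date", ["2018-01-01", "2018-01-02"]):
        print(stmt)
```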

select all but few columns in impala

Is there a way to replicate the below in impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal') SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically, I have a text-format table in Hive with 1000 fields, and I need a SELECT that drops field A. The above works in Hive but not in Impala. How can I do this in Impala without listing the other 999 fields explicitly?

Import selected data from oracle db to S3 using sqoop and create hive table script on AWS EMR with selected data

I am new to big data technologies. I am working on the requirement below and would like help simplifying my work.
Suppose I have 2 tables in an Oracle DB, each with 500 columns. My task is to move selected columns from both tables (via a join query) to AWS S3 and populate a Hive table on AWS EMR with that data.
Currently, I follow these steps to fulfill the requirement:
Create an external Hive table on AWS EMR with the selected columns. I know the column names, but to determine each column's Hive data type I look up its type in the Oracle tables and then write the Hive script by hand.
Once the table is created, write a sqoop import command with the selected query data and point the target directory at S3.
Repair the table from the S3 data.
To explain in detail:
Suppose T1 and T2 are two tables. T1 has 500 columns, T1_C1 to T1_C500, with various data types (Number, Varchar, Date, etc.). Similarly, T2 has 500 columns, T2_C1 to T2_C500.
Now suppose I want to move some columns, for example T1_C23, T1_C230, T1_C239, T2_C236, T1_C234, T2_C223, to S3 and create the Hive table for those columns. To know the data types, I need to look into the T1 and T2 table schemas.
Is there a simpler way to achieve this?
In the steps above, the first step takes a lot of manual time because I need to look at the table schema, get the data types of the selected columns, and then create the Hive table.
A brief description of the work environment:
Services running on Data Center:
Oracle DB
Sqoop on linux machine.
Sqoop talks to the Oracle DB and is configured to push the data to S3.
Services running on AWS:
S3
AWS EMR hive
hive talks to S3 and uses S3 data to repair the table.
1)
To ease your Hive table generation, you can use the Oracle data dictionary:
-- Builds one "name type COMMENT '...'," line per column
SELECT t.column_name || ' ' ||
decode(t.data_type, 'VARCHAR2', 'VARCHAR', 'NUMBER', 'DOUBLE') ||
' COMMENT ''' || cc.comments || ''',',
t.*
FROM user_tab_columns t
LEFT JOIN user_col_comments cc
ON cc.table_name = t.table_name
AND cc.column_name = t.column_name
WHERE t.table_name in ('T1','T2')
ORDER BY t.table_name, t.COLUMN_id;
The first column of this result set will be your column list for the CREATE TABLE command.
You need to extend the DECODE so that it correctly translates Oracle types to Hive types.
2)
As I recall, Sqoop can easily import a whole table, so you could create a view in Oracle that hides the join query and import that view with Sqoop:
CREATE OR REPLACE VIEW V_T1_T2 AS
SELECT * FROM T1 JOIN T2 ON ...;
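The dictionary approach in point 1 can also be scripted end to end. Below is a minimal sketch in Python that turns (column name, Oracle type) pairs, such as the ones returned by the dictionary query, into a CREATE EXTERNAL TABLE statement. The type mapping, the function name, the table name and the S3 location are all assumptions for illustration; adjust the mapping to your data (for example, NUMBER columns with a scale may be better served by DECIMAL).

```python
from typing import Iterable, Tuple

# Assumed, deliberately small mapping; unknown types fall back to STRING.
ORACLE_TO_HIVE = {
    "VARCHAR2": "STRING",
    "CHAR": "STRING",
    "NUMBER": "DOUBLE",
    "DATE": "TIMESTAMP",  # Oracle DATE carries a time component
}


def hive_external_table_ddl(table: str,
                            columns: Iterable[Tuple[str, str]],
                            location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement from (name, oracle_type) pairs."""
    lines = []
    for name, oracle_type in columns:
        hive_type = ORACLE_TO_HIVE.get(oracle_type.upper(), "STRING")
        lines.append(f"  {name} {hive_type}")
    body = ",\n".join(lines)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{body}\n"
        f")\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Column names are from the question; types and the bucket are made up.
    selected = [("T1_C23", "NUMBER"), ("T1_C230", "VARCHAR2"), ("T2_C236", "DATE")]
    print(hive_external_table_ddl("t1_t2_selected", selected, "s3://example-bucket/t1_t2/"))
```

A row format clause matching the files Sqoop writes would still need to be added by hand.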

How to delete and update a record in Hive

I have installed Hadoop, Hive, and Hive JDBC, and they are running fine. But I still have a problem: how do I delete or update a single record using Hive? The DELETE and UPDATE commands from MySQL do not work in Hive.
Thanks
hive> delete from student where id=1;
Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
Query returned non-zero code: 1, cause: null
As of Hive version 0.14.0: INSERT...VALUES, UPDATE, and DELETE are now available with full ACID support.
INSERT ... VALUES Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
UPDATE Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE Syntax:
DELETE FROM tablename [WHERE expression]
Additionally, from the Hive Transactions doc:
If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional" must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.
Hive DML reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Hive Transactions reference:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
You should not think of Hive as a regular RDBMS; Hive is better suited to batch processing over very large sets of immutable data.
The following applies to versions prior to Hive 0.14; see the answer by ashtonium for later versions.
There is no operation supported for deletion or update of a particular record or particular set of records, and to me this is more a sign of a poor schema.
Here is what you can find in the official documentation:
Hadoop is a batch processing system and Hadoop jobs tend to have high latency and
incur substantial overheads in job submission and scheduling. As a result -
latency for Hive queries is generally very high (minutes) even when data sets
involved are very small (say a few hundred megabytes). As a result it cannot be
compared with systems such as Oracle where analyses are conducted on a
significantly smaller amount of data but the analyses proceed much more
iteratively with the response times between iterations being less than a few
minutes. Hive aims to provide acceptable (but not optimal) latency for
interactive data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not offer
real-time queries and row level updates. It is best used for batch jobs over
large sets of immutable data (like web logs).
A way to work around this limitation is to use partitions. I don't know what your id corresponds to, but if you receive different batches of ids separately, you could redesign your table so that it is partitioned by id. You could then simply drop the partitions for the ids you want to get rid of.
Yes, rightly said. Hive does not support UPDATE option.
But the following alternative could be used to achieve the result:
Update records in a partitioned Hive table:
The main table is assumed to be partitioned by some key.
Load the incremental data (the data to be updated) to a staging table partitioned with the same keys as the main table.
Join the two tables (main & staging tables) using a LEFT OUTER JOIN operation as below:
insert overwrite table main_table partition (c,d)
select t2.a, t2.b, t2.c,t2.d from staging_table t2 left outer join main_table t1 on t1.a=t2.a;
In the above example, the main_table & the staging_table are partitioned using the (c,d) keys. The tables are joined via a LEFT OUTER JOIN and the result is used to OVERWRITE the partitions in the main_table.
A similar approach could be used in the case of un-partitioned Hive table UPDATE operations too.
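One thing to watch with the query above: as far as I can tell, its SELECT returns only staging rows, so rows of main_table in an overwritten partition that have no counterpart in staging would not be carried over. The sketch below illustrates the merge result that is usually intended (all staging rows, plus the main rows whose key is absent from staging). It uses Python's built-in sqlite3 purely as a stand-in to show the join logic, not Hive itself; the function name and sample rows are made up, and the partition columns are left out for brevity.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str]


def merged_rows(main_rows: List[Row], staging_rows: List[Row]) -> List[Row]:
    """Return staging rows plus the main rows whose key 'a' is not in staging."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE main_table (a INTEGER, b TEXT)")
    conn.execute("CREATE TABLE staging_table (a INTEGER, b TEXT)")
    conn.executemany("INSERT INTO main_table VALUES (?, ?)", main_rows)
    conn.executemany("INSERT INTO staging_table VALUES (?, ?)", staging_rows)
    rows = conn.execute(
        """
        SELECT a, b FROM staging_table          -- new and updated rows
        UNION ALL
        SELECT m.a, m.b                         -- untouched rows from main
        FROM main_table m
        LEFT OUTER JOIN staging_table s ON m.a = s.a
        WHERE s.a IS NULL
        ORDER BY a
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    main = [(1, "old1"), (2, "old2"), (3, "old3")]
    staging = [(2, "new2"), (4, "new4")]
    print(merged_rows(main, staging))
```

In Hive, the same SELECT shape could feed the INSERT OVERWRITE, with the partition columns added at the end of each select list.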
You can delete rows from a table using a workaround, in which you overwrite the table by the dataset you want left into the table as a result of your operation.
insert overwrite table your_table
select * from your_table
where id <> 1
;
The workaround is mostly useful for bulk deletions of easily identifiable rows. It can also obviously corrupt your data, so backing up the table first is advised, as is care when planning the "deletion" rule.
Once you have installed and configured Hive, create a simple table:
hive>create table testTable(id int,name string)row format delimited fields terminated by ',';
Then try to insert a few rows into the test table:
hive>insert into table testTable values (1,'row1'),(2,'row2');
Now try to delete the records you just inserted:
hive>delete from testTable where id = 1;
Error!
FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.
By default, transactions are turned off. The error says that the transaction manager currently in use does not support update or delete. To support update/delete, you must change the following configuration.
cd $HIVE_HOME
vi conf/hive-site.xml
Add the properties below to the file:
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>2</value>
</property>
Restart the service and then try the delete command again:
Error!
FAILED: LockException [Error 10280]: Error communicating with the metastore.
There is a problem with the metastore. To use insert/update/delete operations, you need to add the following setting to conf/hive-site.xml, as the feature is currently in development.
<property>
<name>hive.in.test</name>
<value>true</value>
</property>
Restart the service and then run the delete command again:
hive>delete from testTable where id = 1;
Error!
FAILED: SemanticException [Error 10297]: Attempt to do update or delete on table default.testTable that does not use an AcidOutputFormat or is not bucketed.
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
Tables must be bucketed to make use of these features. Tables in the same system not using transactions and ACID do not need to be bucketed.
See the example below of a table created with the ORC file format, bucketing enabled, and ('transactional'='true').
hive>create table testTableNew(id int ,name string ) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
Insert :
hive>insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');
Update :
hive>update testTableNew set name = 'updateRow2' where id = 2;
Delete :
hive>delete from testTableNew where id = 1;
Test :
hive>select * from testTableNew ;
Configuration Values to Set for INSERT, UPDATE, DELETE
In addition to the new parameters listed above, some existing parameters need to be set to support INSERT ... VALUES, UPDATE, and DELETE.
Configuration key | Must be set to
hive.support.concurrency | true (default is false)
hive.enforce.bucketing | true (default is false) (Not required as of Hive 2.0)
hive.exec.dynamic.partition.mode | nonstrict (default is strict)
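For anyone maintaining these values in hive-site.xml, the property fragments can be rendered from a plain mapping instead of being typed by hand. This is only a sketch using Python's standard xml.etree module; the function name is hypothetical and the dictionary simply repeats the three keys listed above.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# The three keys from the list above, with the values they must be set to.
DML_SETTINGS = {
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
}


def render_properties(settings: Dict[str, str]) -> str:
    """Render each key/value pair as a hive-site.xml <property> element."""
    chunks = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


if __name__ == "__main__":
    print(render_properties(DML_SETTINGS))
```

The output is meant to be pasted inside the existing configuration element of the file.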
Configuration Values to Set for Compaction
If the data in your system is not owned by the Hive user (i.e., the user that the Hive metastore runs as), then Hive will need permission to run as the user who owns the data in order to perform compactions. If you have already set up HiveServer2 to impersonate users, then the only additional work to do is assure that Hive has the right to impersonate users from the host running the Hive metastore. This is done by adding the hostname to hadoop.proxyuser.hive.hosts in Hadoop's core-site.xml file. If you have not already done this, then you will need to configure Hive to act as a proxy user. This requires you to set up keytabs for the user running the Hive metastore and add hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to Hadoop's core-site.xml file. See the Hadoop documentation on secure mode for your version of Hadoop (e.g., for Hadoop 2.5.1 it is at Hadoop in Secure Mode).
The UPDATE statement has the following limitations:
The expression in the WHERE clause must be an expression supported by a Hive SELECT clause.
Partition and bucket columns cannot be updated.
Query vectorization is automatically disabled for UPDATE statements. However, updated tables can still be queried using vectorization.
Subqueries are not allowed on the right side of the SET statement.
The following example demonstrates the correct usage of this statement:
UPDATE students SET name = null WHERE gpa <= 1.0;
DELETE Statement
Use the DELETE statement to delete data already written to Apache Hive.
DELETE FROM tablename [WHERE expression];
The DELETE statement has the following limitation:
query vectorization is automatically disabled for the DELETE operation.
However, tables with deleted data can still be queried using vectorization.
The following example demonstrates the correct usage of this statement:
DELETE FROM students WHERE gpa <= 1.0;
The CLI told you where your mistake is: delete WHAT? from student ...
Delete : How to delete/truncate tables from Hadoop-Hive?
Update : Update , SET option in Hive
If you want to delete all records, then as a workaround you can load an empty file into the table in OVERWRITE mode:
hive> LOAD DATA LOCAL INPATH '/root/hadoop/textfiles/empty.txt' OVERWRITE INTO TABLE employee;
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.19 seconds
hive> SELECT * FROM employee;
OK
Time taken: 0.052 seconds
An upcoming version of Hive is going to allow SET-based update/delete handling, which is of utmost importance when performing CRUD operations on a batch of rows instead of one row at a time.
In the interim, I have tried a dynamic-partition-based approach, documented here: http://linkd.in/1Fq3wdb
Please see if it suits your needs.
UPDATE or DELETE a record isn't allowed in Hive, but INSERT INTO is acceptable.
A snippet from Hadoop: The Definitive Guide(3rd edition):
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table.
To achieve what you need, run the query below:
> insert overwrite table student
> select * from student
> where id <> 1;
This overwrites the contents of the table with all rows except the ones you want to exclude/delete.
I tried this on Hive 1.2.1
There are a few properties to set to make a Hive table support ACID properties and to support UPDATE, INSERT, and DELETE as in SQL.
Conditions to create an ACID table in Hive:
1. The table should be stored as an ORC file. Only the ORC format supports ACID properties for now.
2. The table must be bucketed.
Properties to set to create an ACID table:
set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Set the property hive.in.test to true in hive-site.xml.
After setting all these properties, the table should be created with tblproperty 'transactional'='true'. The table should be bucketed and stored as ORC.
CREATE TABLE table_name (col1 int, col2 string, col3 int) CLUSTERED BY (col1) INTO 4
BUCKETS STORED AS orc tblproperties('transactional'='true');
Now the Hive table can support UPDATE and DELETE queries
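The two conditions above (ORC storage and bucketing) plus the 'transactional' property are easy to forget when writing DDL by hand. As a small sketch, the Python helper below assembles such a statement and checks that the bucketing column actually exists in the column list. The function name is hypothetical; the sample table is the one from this answer.

```python
from typing import List, Tuple


def acid_table_ddl(table: str,
                   columns: List[Tuple[str, str]],
                   bucket_col: str,
                   num_buckets: int) -> str:
    """Build a CREATE TABLE statement for a bucketed, transactional ORC table."""
    names = [name for name, _ in columns]
    if bucket_col not in names:
        raise ValueError(f"bucket column {bucket_col!r} is not in the column list")
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f"CLUSTERED BY ({bucket_col}) INTO {num_buckets} BUCKETS "
        f"STORED AS ORC TBLPROPERTIES ('transactional'='true');"
    )


if __name__ == "__main__":
    print(acid_table_ddl("table_name",
                         [("col1", "int"), ("col2", "string"), ("col3", "int")],
                         "col1", 4))
```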
Delete has been recently added in Hive version 0.14
Deletes can only be performed on tables that support ACID
Below is the link from Apache .
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
Good news: inserts, updates, and deletes are now possible on Hive/Impala using Kudu.
You need to use Impala/Kudu to maintain the tables and to insert, update, or delete records.
Details with examples can be found here:
insert-update-delete-on-hadoop
Please share the news if you are excited.
-MIK
Recently I was looking to resolve a similar issue: Apache Hive and Hadoop do not support update/delete operations. So what can you do?
You have two options:
Use a backup table: save the whole table in a backup_table, then truncate your input table, then rewrite only the data you are interested in keeping.
Use Uber Hudi: it's a framework created by Uber to address HDFS limitations, including deletion and update. You can take a look at this link:
https://eng.uber.com/hoodie/
An example for point 1:
Create table bck_table like input_table;
Insert overwrite table bck_table
select * from input_table;
Truncate table input_table;
Insert overwrite table input_table
select * from bck_table where id <> 1;
NB: If the input_table is an external table you must follow the following link:
How to truncate a partitioned external table in hive?
If you want to perform Hive CRUD using ACID operations, first check whether you have Hive 0.14 or later.
To perform CREATE, SELECT, UPDATE, and DELETE, the table must be created with the following conditions:
The file format should be ORC, with TBLPROPERTIES('transactional'='true').
The table should be CLUSTERED BY some column into a number of buckets; see the CREATE TABLE statement below.
You can use the query below to create a table with these properties:
CREATE TABLE STUDENT
(
STD_ID INT,
STD_NAME STRING,
AGE INT,
ADDRESS STRING
)
CLUSTERED BY (ADDRESS) into 3 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED as orc tblproperties('transactional'='true');

Hive: Create New Table from Existing Partitioned Table

I'm using Amazon's Elastic MapReduce and I have a hive table created based on a series of log files stored in Amazon S3 and split in folders by day like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the LIKE option, so thanks for showing me that. Will it actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
create external table if not exists table2 like table1;
insert overwrite table table2 partition(part) select col1, col2 from table1;
This might not be the best solution, as I think you have to list your columns in the SELECT clause (as well as the partition column in the PARTITION clause).
You must also turn on dynamic partitioning.
I hope this helps.
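To make the dynamic-partition suggestion concrete, here is a minimal sketch in Python that assembles the session settings and the INSERT statement, placing the partition column last in the select list as dynamic partitioning expects. The function name and the sample filter are hypothetical; table1, table2 and day come from the question.

```python
from typing import List, Optional


def dynamic_partition_insert(source: str,
                             target: str,
                             columns: List[str],
                             partition_col: str,
                             where: Optional[str] = None) -> str:
    """Build HiveQL that copies rows into a partitioned table using dynamic partitions."""
    # The dynamic partition column must come last in the select list.
    select_list = ", ".join(columns + [partition_col])
    insert = (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_col})\n"
        f"SELECT {select_list}\n"
        f"FROM {source}"
    )
    if where:
        insert += f"\nWHERE {where}"
    return "\n".join([
        "SET hive.exec.dynamic.partition=true;",
        "SET hive.exec.dynamic.partition.mode=nonstrict;",
        insert + ";",
    ])


if __name__ == "__main__":
    # The filter below is a made-up placeholder for the real criteria.
    print(dynamic_partition_insert("table1", "table2", ["col1", "col2"], "day",
                                   where="col1 LIKE '%somecriteria%'"))
```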
