Can we insert into a view in Hive?
I have done this in the past with Oracle and Teradata.
But it doesn't seem to work in Hive.
create table t2 (id int, key string, value string, ds string, hr string);
create view v2 as select id, key, value, ds, hr from t2;
insert into v2 values (1,'key1','value1','ds1','hr1')
***Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: Unable to determine if null is encrypted: java.lang.NullPointerException***
There seems to be some sort of update support for views. But I can't see anything about inserting into a view.
https://cwiki.apache.org/confluence/display/Hive/UpdatableViews
Thanks for the feedback. Makes sense. The reason behind needing this functionality is that we use an ETL tool that has problems handling high-precision decimals (>15 digits). If the object (a table column in this case) is represented as a string within the tool, we don't have a problem. So I thought I'd define a bunch of views with string datatypes and use those in the tool instead. But Hive can't insert into a view, so maybe I need to think of something else. I have done it this way before with Oracle and Teradata.
Can we have two tables with different structures point to the same underlying HDFS content? It probably wouldn't work because of the Parquet storage, which stores the schema. Sorry, not a Hadoop expert.
Thanks a lot for your time.
It is not possible to insert data into a Hive view. A Hive view is just a projection over a Hive table (you can think of it as a pre-saved query). From the Hive documentation:
Note that a view is a purely logical object with no associated storage. (No support for materialized views is currently available in Hive.) When a query references a view, the view's definition is evaluated in order to produce a set of rows for further processing by the query. (This is a conceptual description; in fact, as part of query optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into the view.)
The link (https://cwiki.apache.org/confluence/display/Hive/UpdatableViews) seems to be for a proposed feature.
Per the official documentation:
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER.
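For illustration, with the t2/v2 definitions from the question, the insert has to target the base table; the view then shows the new row. A minimal sketch, assuming Hive 0.14+ where INSERT ... VALUES is supported:

insert into t2 values (1, 'key1', 'value1', 'ds1', 'hr1');  -- works: t2 is the real table
select * from v2 where id = 1;                              -- the view is re-evaluated against t2, so the row is visible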
We have a structure where there is a base table base, then a materialized view base_mv that aggregates and sends the result TO an AggregatingMergeTree table base_agg_by_id. Then we have a view, base_unique, over this final table. Similar to [this blog post](https://www.altinity.com/blog/clickhouse-continues-to-crush-time-series).
However, if I delete from base, I would expect base_mv to pick up the mutation and act on it, so that the deletion is reflected in base_agg_by_id, but it isn't.
Is this the expected behaviour? How to DELETE in such a schema?
I've seen here that in MVs that keep data you can act on the .inner tables. However, in this case, since the table uses the AggregatingMergeTree engine and its fields are defined as aggregate function states (e.g. AggregateFunction(argMax, String, DateTime)), I cannot apply a deletion by value such as ALTER TABLE base_agg_by_id DELETE WHERE field = 'myval'.
Note. For the record, we have these tables in a replicated environment using Replicated* engine: base_d, base_agg_by_id_d, base_unique_d
Mutations are not propagated to materialized views.
The reason is very simple: it is not possible in the common case, and even in cases where it is theoretically possible, it can be a very expensive operation.
For example, let's say you're deleting one record from the table which references some userid, and your materialized view contains uniqState(userid). The data structures used for calculating uniqState don't support a 'remove' operation; and even if they did, there is no way to decide whether that userid should be removed without rereading the whole partition again, because that userid could appear in other records too.
So in the general case, you need to refill the whole partition of your AggregatingMergeTree table.
I.e. something like this (daily partitioning case):
ALTER TABLE amt_table DROP PARTITION '2019-03-01';
-- use same select as in your materialized view
INSERT INTO amt_table SELECT ... WHERE date = '2019-03-01';
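Filling in the placeholders with the names from the question, a hedged sketch: it assumes base_agg_by_id is partitioned by day and that the SELECT below mirrors exactly the SELECT in base_mv's definition; the column names are hypothetical.

ALTER TABLE base_agg_by_id DROP PARTITION '2019-03-01';
INSERT INTO base_agg_by_id
SELECT
    toDate(updated_at) AS date,
    id,
    argMaxState(field, updated_at) AS field   -- same aggregate states as in base_mv
FROM base
WHERE toDate(updated_at) = '2019-03-01'
GROUP BY date, id;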
Can anyone please explain why and where we use external tables in Hive?
Please explain a scenario so it is easy to understand.
We use an external table when the underlying dataset pointed to by the Hive table is shared by many consumers, i.e. MapReduce jobs, Pig, etc., and a managed table when the dataset pointed to by the Hive table is used only by Hive.
Actually, in Hive a managed table has full control over its dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS, but in the case of an external table, when you drop the table the dataset is not deleted from HDFS.
For example, suppose you have a 50 GB dataset. If you create multiple copies of it for different purposes, it simply takes more space, so the better option is an external table: when you drop the table the dataset is not deleted, and it can still be used by any other application, such as Pig.
As a rule of thumb: use an external table if you plan to work with that data not only from Hive but from other frameworks as well. Otherwise make it internal.
The only difference between External and Managed tables in Hive is the DROP TABLE / DROP PARTITION behavior: for a Managed table it drops the data as well; for an External table the data remains untouched in the table/partition location.
Use External in most cases. An external table lets you change the table definition easily, and you can create several tables on top of the same location.
Use a Managed table if the table is temporary/intermediate and the data should be deleted to free space.
A managed table can be converted to external and vice versa using:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
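The reverse direction is the same statement with the value flipped (in some Hive versions the property value is case-sensitive, so the uppercase form is the safe one):

alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external -> managed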
I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the below paragraph regarding external tables in Hive.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody briefly explain what the above phrase means?
Usually the data in the initial dataset is not structured in the optimal way for queries.
You may want to modify the data (modify some columns, add columns, aggregate, etc.) and store it in a specific way (partitioned / bucketed / sorted, etc.) so that queries benefit from these optimizations.
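For example, the optimized managed copy could be declared along these lines (a sketch only; the column names, the ORC format and the bucket count are illustrative choices, not anything prescribed by Hive):

CREATE TABLE page_views_opt (
  user_id BIGINT,
  url     STRING,
  views   BIGINT
)
PARTITIONED BY (ds STRING)                                    -- lets queries prune whole date partitions
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS    -- bucketing/sorting helps joins and sampling
STORED AS ORC;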
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table, and Hive simply "looks" into it and can read data from it, but Hive can't delete or change data in that folder. When you drop an external table, Hive only deletes the metadata from its metastore and the data in HDFS remains unchanged.
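A minimal sketch of that behaviour, with a hypothetical directory and schema:

CREATE EXTERNAL TABLE raw_events (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_events';    -- Hive only reads this directory; it does not own the files

DROP TABLE raw_events;          -- removes only the metastore entry; the files stay in HDFS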
A managed table is basically a directory in HDFS that is created and managed by Hive. Even more, all operations that remove or change partitions, raw data, or the table itself MUST be done through Hive, otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the metastore still says the partition exists).
In Hadoop: The Definitive Guide, I think the author meant that it is a common practice to write an MR job that produces some raw data and keeps it in some folder. Then you create a Hive external table which looks into that folder, and you can safely run queries without the risk of dropping the data, etc.
In other words, you can run an MR job that produces some generic data and then use a Hive external table as a source of data for inserts into managed tables. It helps you avoid writing many similar MR jobs and delegate this task to Hive queries: you create a query that takes data from the external table, aggregates/processes it how you want, and puts the result into managed tables.
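A sketch of that transform step, assuming a hypothetical external staging table over the MR job's output and a managed, partitioned target table:

INSERT OVERWRITE TABLE sales PARTITION (ds = '2019-03-01')
SELECT sale_id, CAST(amount AS DECIMAL(20,4))   -- clean/convert while copying
FROM staging_sales                              -- external table over the raw files
WHERE ds = '2019-03-01';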
Another use of external tables is as a source for data from remote servers, e.g. in CSV format.
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in the future, so it is better to check the current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables, while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
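A sketch of that: two external tables with different column types pointing at the same directory (hypothetical names; this is straightforward for delimited text files, while self-describing formats such as Parquet limit how far the two schemas can diverge):

CREATE EXTERNAL TABLE amounts_dec (id INT, amount DECIMAL(25,6))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/amounts';

CREATE EXTERNAL TABLE amounts_str (id INT, amount STRING)    -- same files, read as strings
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/amounts';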
Data structure has nothing in common with the external/managed table type. If you want to change the structure, you do not necessarily need to change the table's managed/external type.
It is also mentioned in the book.
When your table is an external table,
you can use other technologies like Pig, Cascading or MapReduce to process it.
You can also use multiple schemas for that dataset,
and you can also create the data lazily if it is an external table.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.
It is known that Oracle regular tables support all DML statements, but the same is not true for external tables. I tried the below:
SQL> INSERT INTO xtern_empl_rpt VALUES ('70','Rakshit','Nantu','4587966214','natu.rakshit#ge.com','55');
INSERT INTO xtern_empl_rpt VALUES ('70','Rakshit','Nantu','4587966214','natu.rakshit#ge.com','55')
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table
SQL> update xtern_empl_rpt set FIRST_NAME='Arup' where SSN='896743856';
update xtern_empl_rpt set FIRST_NAME='Arup' where SSN='896743856'
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table
SQL>
So it seems external tables do not support this. But my question is: what is the logical reason behind this design?
There is no mechanism in Oracle for locking rows in external tables, and none of the concurrency controls which we get with regular heap tables. So updating is not allowed.
External tables created with the Oracle Loader driver are read-only; the Datapump driver allows us to write to external table files, but only in CTAS mode.
The problem is that external tables are basically windows on OS files, without the layer of abstraction and control that internal tables offer. Basically, there is no way for the database to lock a record in an OS file, because the notion of a "record" is a database thang, not an OS file thang.
External tables are designed for only one thing: data loading and unloading. They are simply not meant to be used with normal DML, and they're not really meant for normal selects either; that works, but if you need to do a lot of selecting from an external table, you're "doing it wrong": load the data into proper tables, calculate statistics and add indexes as necessary.
Having external tables behave like normal tables would require all the transactional machinery to be implemented for them, which is very complex and not worth it, since that is not what they are meant for.
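A sketch of that workflow against the table from the question (the index, the SSN column and the DBMS_STATS call are illustrative assumptions):

-- copy the external data into a regular heap table
CREATE TABLE empl_rpt AS SELECT * FROM xtern_empl_rpt;

-- normal DML, indexing and statistics now work
CREATE INDEX empl_rpt_ssn_ix ON empl_rpt (ssn);
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EMPL_RPT');
UPDATE empl_rpt SET first_name = 'Arup' WHERE ssn = '896743856';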
If you need normal tables and want to transplant them from one Oracle database to another, you should evaluate using transportable tablespaces too.
The limitations of external tables are an obvious consequence of their being read-only; they are an adapter for involving in SQL queries either arbitrary record-organized files (ORACLE_LOADER type) or exported copies of tables from another database (ORACLE_DATAPUMP type).
As already mentioned, external tables are only good for full-table-scan queries; if one needs to use indexes in heavy-duty queries or to modify foreign data sets that have been imported from files, regular tables can be populated using the SQL*Loader tool.
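For completeness, a minimal SQL*Loader control file for that kind of load might look like this (a hedged sketch; the file name and column list are hypothetical):

LOAD DATA
INFILE 'empl_rpt.csv'
APPEND
INTO TABLE empl_rpt
FIELDS TERMINATED BY ','
(empid, first_name, last_name, phone, email, ssn)

It would then be run from the command line with something like sqlldr control=empl_rpt.ctl.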
I have the following query, whose SELECT returns data in 5 seconds. But when I add the CREATE MATERIALIZED VIEW command in front of it, it takes forever to create the materialized view.
When you create a materialized view, you actually create a copy of the data that Oracle takes care to keep synchronized (which makes those views somewhat like indexes). If your view operates over a large amount of data or over data from other servers, it's natural that creating the view can take time.
From docs.oracle.com:
A materialized view is a replica of a target master from a single point in time.
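To see where the cost goes, a typical definition looks roughly like this (a sketch; all object names are hypothetical, and BUILD IMMEDIATE, the default, is what forces the full SELECT to run and its result to be stored at creation time):

CREATE MATERIALIZED VIEW sales_mv
BUILD IMMEDIATE                 -- populate now: runs the whole SELECT and stores the rows
REFRESH COMPLETE ON DEMAND
AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;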
Just for "yuks", try
create table temp_tab nologging as select ...
I've seen cases where MV creation is slow for some reason, probably logging.
Also, query development tools sometimes begin returning the data to the screen right away, but if you "paged" to the last row, you would find out how long it really takes to get all the data.
You should profile the SELECT statement with EXPLAIN PLAN and understand the table cardinality, indexes, and wait states while it runs, in order to see if the query needs tuning.
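A minimal way to do that from SQL*Plus (the DBMS_XPLAN call shows the optimizer's plan for the statement just explained):

EXPLAIN PLAN FOR
SELECT ...;                     -- the SELECT part of your materialized view goes here

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);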