Why External table concept has been established in Oracle? - oracle

SQL*Loader: Oracle uses this functionality, through the ORACLE_LOADER access driver to move data from a flat file into the database;
Data Pump: It uses a Data Pump access driver to move data out of the database into a file in an proprietary Oracle format, and back into the database from files of that format.
When a data load can be done by either the SQL*Loader or Data Pump utilities, and data unload can also be done by the Data Pump utility:
Are there any extra benefits that can be achieved by using external tables, that none of the previously mentioned utilities can do by themselves?
The below Oracle table creation command creates a table which looks like an Oracle table.Why are then Oracle telling us to call it as an external table?
create table export_empl_info organization external
( type oracle_datapump
default directory xtern_data_dir
location ('empl_info_rpt.dmp')
) as select * from empl_info;

"Are there any extra benefits that can be achieved by using external
tables, that none of the previously mentioned utilities can do by
themselves?"
SQL*loader and Datapump both require us to load the data into tables before we can access it with the database. Whereas we only access external tables through SELECT statements. It's a much more flexible mechanism.
"Why are then Oracle telling us to call it as an external table?"
umm, because it is external. The data resides in an file (or files) which is controlled by the OS. We can change the data in an external table by running an OS command like
$> cp wnatever.csv external_table_data.csv
There's no redo, rollback, flashback query or any of the other appurtenances of an internal database table.

I think that the primary benefits of external tables for me have been:
i) Not having to execute a host command to import data, which supports a trend in Oracle to control the entire code bade from inside the database. Preprocessing in 11g allows access to remote files through ftp, use of compressed files, combining multiple files into one, etc
ii) More efficient loads, by means of applying complex data transformations during the load process. Aggregations, merges, multitable inserts ... etc
I've used it for data warehouse loads, but any scenario requiring loading of or access to standard data files is a candidate for use of external tables. SQL*Loader still has its place as a tool for loading to an Oracle database from a client or other host system. Data pump is for transfer of data between Oracle databases, so it's rather different.
One limitation of external tables is that they won't process stream data -- records have to be delimited. This was true in 10.2, not sure if it's been permitted since then.
Use the system catalog views ALL/DBA/USER_EXTERNAL_TABLES for information on them

RE: Why external table vs sqlldr for loading data? Mainly to have server managed parallelism vs client managed parallelism.

Related

what's a processing storage for HIVE?

I'm new to hive and read about it online too. But still having doubts which are not cleared.
for hive external tables, hive keep table's metadata within HDFS, but not in its warehouse which is also in HDFS. correct ?
whether its internal or external table, in both cases data of table will be available in HDFS only but NOWHERE else. Mean to say, data can taken from anywhere but has to be loaded in HDFS, because HIVE uses hadoop's processing engine to process data. Correct ?
internal table, table's metadata and table's data both will be available in HIVE's data warehouse, and this data warehouse will be at nowhere else but in HDFS only. correct ?
in external table, table's metadata and table's data both will be NOT available in HIVE's data warehouse but in HDFS. But hive must be keeping some info with itself that where is table's metadata located and where is its data located in HDFS, correct ?
Can anyone share feedback to above understanding ?
THanks
Hive uses relational database like MySQL, MariaDB, PostgreSQL, Oracle, DerbyDB(for embedded deployment only) for storing metadata (databases, tables definitions, statistics, grants, etc). See deployment modes and database requirements. Does not matter Internal or external table, the metadata are stored in the relational database.
Yes, the data is stored in HDFS, but also Hive supports integration with external databases using JDBC storage handler. Such table looks like normal Hive table, but the data is stored in some database, your queries executed in the database, predicate push-down works, you can use hive native tables with storage handler tables in single query. Also HBase storage handler is available, Kafka storage handler, etc, you can write your own storage handler.
Depending on your Hive version/vendor It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS. Though Cloudera prefers to have managed tables in dedicated HDFS location for them, see https://stackoverflow.com/a/67073849/2700344 and does not allow to specify location for managed tables outside the warehouse root by default. Read abot the difference between managed and external tables here.
Everything seems correct except last one. When you create external table table metadata will be stored in the Hive otherwise you can not query through hive. HDFS itself keeps control of your data when you create external table. While when you create internal table Hive will be responsible. Dropping internal table drops your data and metadata but dropping external table only drops metadata from Hive. But your data will be remain inside of your file system. Thats why we are changing table types a lot as a workaround when some of our external connection is not compatible with our hive version.

External table limitations

Now i am working on external tables... While i do like its flexibility. I would like to know these things about external table -
Like in SQL Loader we can append data to the table . Can we do that in External table ?
In external table , we cannot create indexes neither can we perform DML operations. Is this kind of virtual table or this acquires space in the data base ?
Also in SQL loader we can access the data from any server in external table we define the default directory. Can we in turn do the same in external table that is access the data from any server ?
External tables allow Oracle to query data that is stored outside the database in flat files as though the file were an Oracle table.
The ORACLE_LOADER driver can be used to access any data stored in any format that can be loaded by SQL*Loader. No DML can be performed on external tables but they can be used for query, join and sort operations. Views and synonyms can be created against external tables. They are useful in the ETL process of data warehouses since the data doesn't need to be staged and can be queried in parallel. They should not be used for frequently queried tables.
You asked:
like in SQL Loader we can append data to the table . Can we do that in External table ?
Yes.
In external table , we cannot create indexes neither can we perform DML operations. Is this kind of virtual table or this acquires space in the data basE ?
As the name suggests, it is external to the database. You use ORGANIZATION EXTERNAL syntax. The directory is created at OS level.
Also in SQL loader we can access teh data from any server in external table we define the DEfault directory. Can we in turn do the same in external table that is access the data from any server ?
This is wrong. SQL*Loader is a client-side tool, while external table is a server-side tool. External Table can load file which is accessible from database server. You can't load External Table from file residing on your client. You need to save the files to a filesystem available to the Oracle server.
Prior to version 10g, external tables were READ ONLY. DML could not be performed. Starting with version Oracle Database 10g, external tables can be written to as well as read from.
From documentation, also read Behavior Differences Between SQL*Loader and External Tables

Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external table and internal tables.
I know the difference comes when dropping the table. I don't understand what you mean by the data and metadata is deleted in internal and only metadata is deleted in external tables.
Can anyone explain me in terms of nodes please.
Hive has a relational database on the master node it uses to keep track of state.
For instance, when you CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, this table schema is stored in the database.
If you have a partitioned table, the partitions are stored in the database(this allows hive to use lists of partitions without going to the file-system and finding them, etc). These sorts of things are the 'metadata'.
When you drop an internal table, it drops the data, and it also drops the metadata.
When you drop an external table, it only drops the meta data. That means hive is ignorant of that data now. It does not touch the data itself.
Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
You want to use a custom location such as ASV.
Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
To answer you Question :
For External Tables, Hive stores the data in the LOCATION specified during creation of the table(generally not in warehouse directory). If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables, Hive stores data into its warehouse directory. If the table is dropped then both the table metadata and the data will be deleted.
For your reference,
Difference between Internal & External tables :
For External Tables -
External table stores files on the HDFS server but tables are not linked to the source file completely.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS
file/folder level.
Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.
For Internal Tables-
Stored in a directory based on settings in hive.metastore.warehouse.dir,
by default internal tables are stored in the following directory “/user/hive/warehouse” you can change it by updating the location in the config file .
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends
on organization).
Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source :
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
An internal table data is stored in the warehouse folder, whereas an external table data is stored at the location you mentioned in table creation.
So when you delete an internal table, it deletes the schema as well as the data under the warehouse folder, but for an external table it's only the schema that you will loose.
So when you want an external table back you again after deleting it, can create a table with the same schema again and point it to the original data location. Hope it is clear now.
The only difference in behaviour (not the intended usage) based on my limited research and testing so far (using Hive 1.1.0 -cdh5.12.0) seems to be that when a table is dropped
the data of the Internal (Managed) tables gets deleted from the HDFS file system
while the data of the External tables does NOT get deleted from the HDFS file system.
(NOTE: See Section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL which list some other difference which I did not completely understand)
I believe Hive chooses the location where it needs to create the table based on the following precedence from top to bottom
Location defined during the Table Creation
Location defined in the Database/Schema Creation in which the table is created.
Default Hive Warehouse Directory (Property hive.metastore.warehouse.dir in hive.site.xml)
When the "Location" option is not used during the "creation of a hive table", the above precedence rule is used. This is applicable for both Internal and External tables. This means an Internal table does not necessarily have to reside in the Warehouse directory and can reside anywhere else.
Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of both Internal and Extenal table seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both Internal and External tables.
Creating table with and without Location option
Creating table with and without Partition Option
Adding new data using the Hive Load and Insert Statements
Adding data files to the Table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE command
Dropping the tables
In external tables, if you drop it, it deletes only schema of the table, table data exists in physical location. So to deleted the data use hadoop fs - rmr tablename .
Managed table hive will have full control on tables. In external tables users will have control on it.
INTERNAL : Table is created First and Data is loaded later
EXTERNAL : Data is present and Table is created on top of it.
Internal tables are useful if you want Hive to manage the complete lifecycle of your data including the deletion, whereas external tables are useful when the files are being used outside of Hive.
External hive table has advantages that it does not remove files when we drop tables,we can set row formats with different settings , like serde....delimited
Also Keep in mind that Hive is a big data warehouse. When you want to drop a table you dont want to lose Gigabytes or Terabytes of data. Generating, moving and copying data at that scale can be time consuming.
When you drop a 'Managed' table hive will also trash its data.
When you drop a 'External' table only the schema definition from hive meta-store is removed. The data on the hdfs still remains.
Consider this scenario which best suits for External Table:
A MapReduce (MR) job filters a huge log file to spit out n sub log files (e.g. each sub log file contains a specific message type log) and the output i.e n sub log files are stored in hdfs.
These log files are to be loaded into Hive tables for performing further analytic, in this scenario I would recommend an External Table(s), because the actual log files are generated and owned by an external process i.e. a MR job besides you can avoid an additional step of loading each generated log file into respective Hive table as well.
The best use case for an external table in the hive is when you want to create the table from a file either CSV or text
Both Internal and External tables are owned by HIVE. The only difference being the ownership of data. The commands for creating both tables are shown below. Only an additional EXTERNAL keyword comes in case of external table creation. Both tables can be created/deleted/modified using SQL Statements.
In case of Internal Tables, both the table and the data contained in the tables are managed by HIVE. That is, we can add/delete/modify any data using HIVE. When we DROP the table, along with the table, the data will also get deleted.
Eg: CREATE TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
In case of External Tables, only the table is managed by HIVE. The data present in these tables can be from any storage locations like HDFS. We cant add/delete/modify the data in these tables. We can only use the data in these tables using SELECT statements. When we DROP the table, only the table gets deleted and not the data contained in it. This is why its said that only meta-data gets deleted. When we create EXTERNAL tables, we need to mention the location of the data.
Eg: CREATE EXTERNAL TABLE tweets (text STRING, words INT, length INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/tweets';
hive stores only the meta data in metastore and original data in out side of hive when we use external table we can give location' ' by these our original data wont effect when we drop the table
When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified in the LOCATION properties instead of the default warehouse directory.
When keeping data in the internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake.
For managed tables, Hive controls the lifecycle of their data. Hive stores the data for managed tables in a sub-directory under the directory defined by hive.metastore.warehouse.dir by default.
When we drop a managed table, Hive deletes the data in the table.But managed tables are less convenient for sharing with other tools. For example, lets say we have data that is created and used primarily by Pig , but we want to run some queries against it, but not give Hive ownership of the data.
At that time, external table is defined that points to that data, but doesn’t take ownership of it.
In Hive We can also create an external table. It tells Hive to refer to the data that is at an existing location outside the warehouse directory.
Dropping External tables will delete metadata but not the data.
I would like to add that
Internal tables are used when the data needs to be updated or some rows need to be deleted because ACID properties can be supported on the Internal tables but ACID properties cannot be supported on the external tables.
Please ensure that there is a backup of the data in the Internal table because if a internal table is dropped then the data will also be lost.
In simple words, there are two things:
Hive can manage things in warehouse i.e. it will not delete data out of warehouse.
When we delete table:
1) For internal tables the data is managed internally in warehouse. So will be deleted.
2) For external tables the data is managed eternal from warehouse. So can't be deleted and clients other then hive can also use it.

Why regular oracle table support DML statements,but not the same for External table?

This is known to us that all DML statement has been supported by Oracle Regular Table but not the same for External Table? I tried below :
SQL> INSERT INTO xtern_empl_rpt VALUES ('70','Rakshit','Nantu','4587966214','na
tu.rakshit#ge.com','55');
INSERT INTO xtern_empl_rpt VALUES ('70','Rakshit','Nantu','4587966214','natu.ra
kshit#ge.com','55')
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table
SQL> update xtern_empl_rpt set FIRST_NAME='Arup' where SSN='896743856';
update xtern_empl_rpt set FIRST_NAME='Arup' where SSN='896743856'
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table
SQL>
So it seems External table not support this. But my question is - what the logical reason behind this design?
There is no mechanism in Oracle for locking rows in external tables, none of the concurrency controls which we get with regular heap tables. So updating is not allowed.
External tables created with the Oracle Loader driver are read only; the Datapump driver allows us to write to external table files but only in an CTAS mode.
The problem is that eternal tables are basically windows on OS files, without the layer of abstraction and control that internal tables offer. Basically, there is no way for the database to lock a record in an OS file, because the notion of a "record" is a databse thang, not an OS file thang.
External tables are designed for only one thing: data loading and unloading. They are simply not meant to be used with normal DML, and they're not really meant for normal selects either - that works, but if you need to do a lot of selections on an external table, you're "doing it wrong": load the data into proper tables, calculate statistics & add indexes as necessary.
Having external tables behave like normal tables would need that all the transactional machinery be implemented for them, which is very complex, and not worth it since that's not what they are meant for.
If you need normal tables and want to transplant them from one Oracle database to another, you should evaluate using transportable tablespaces too.
Limitations of external table are an obvious consequence of their being read-only; they are an adapter to involve in SQL queries either arbitrary record-organized files (ORACLE_LOADER type) or exported copies of tables in another database (ORACLE_DATAPUMP type).
As already mentioned, external tables are only good for full table scan queries; if one needs to use indexes in heavy duty queries or to modify foreign data sets that have been imported from files, regular tables can be populated using the SQL Loader tool.

Oracle sql result to DBF file export

I would like to export data from a Oracle table into *.dbf file (like excel) through PL/SQL scripts. Are there any codes available?
There are a number of different ways to do this. The easiest way is to use an IDE like SQL Developer or TOAD, which offer this functionality.
If you want to call it from PL/SQL, then then are no built-in Oracle functions. However, it is relatively straightforward to build something using UTL_FILE which can write out value separated records. These can be picked up in Excel.
Note that the default separator - , (comma being the "C" in .CSV) - will cause problems if your exported data contains commas. So you will need to use the Data Import wizard rather than a right-click Open With ...
Incidentally, it is probably a bad idea to use the .dbf suffix. In an Oracle file system the presumed meaning is database file - i.e. part of the database's infrastructure. This is just a convention, but there is no point in needlessly confusing people. Perferred alternatives include .csv, .dmp or .exp.
edit
If your interest is just to export data for transferring to another Oracle database then you should look at using the Data Pump utility. This comes with an API so it can be used from PL/SQL. Alternatively we unload data through external tables declared with a DataPump driver.
You could also consider using the External Tables feature of Oracle. This essentially allows you to map a CSV file to a 'virtual' table and the you can insert into it (and therefore the file.)
Oracle External Tables Concept Guide

Resources