Why does the user need write permission on the location of external hive table? - hadoop

In Hive, you can create two kinds of tables: managed and external.
In the case of a managed table, you own the data, and hence when you drop the table the data is deleted.
In the case of an external table, you don't have ownership of the data, and hence when you drop such a table the underlying data is not deleted; only the metadata is deleted.
Now, I have recently observed that you cannot create an external table over a location on which you don't have write (modification) permission in HDFS. I completely fail to understand this.
Use case: it is quite common that the data you are churning is huge and read-only. So, to churn such data via Hive, would you have to copy it to a location on which you have write permissions?
Please help.

Though it is true that dropping an external table does not drop the underlying data, this does not mean that external tables are read-only. For instance, you should be able to do an INSERT OVERWRITE on an external table.
That being said, it is definitely possible to use (internal) tables when you only have read access, so I suspect the same holds for external tables. Try creating the table with an account that has write access, and then using it with your regular account.
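For illustration, a minimal sketch of such a table, assuming a made-up HDFS path /data/shared/events and a tab-delimited layout (all names here are hypothetical):
-- Points at data that already exists; dropping the table later leaves the files in place.
CREATE EXTERNAL TABLE events_ext (
  event_id BIGINT,
  payload  STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/shared/events';
Once the table exists, an account with only read access to that location should still be able to run plain SELECTs against it.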

Related

Can a Fivetran destination table be used for writing after canceling synchronisation on this table?

I have multiple Fivetran destination tables in Snowflake. Those tables were created by Fivetran itself, and Fivetran currently writes data into them. Now I would like to stop syncing data into one of the tables and start writing to it from a different source. Would I run into any trouble with this? Should I do anything else to make it possible?
What you mention is not possible because of how Fivetran works. Connector sources write to one destination schema and one only. Swapping destination tables between connectors is not a feature as of now.

Loading csv and writing bad records with individual errors

I am loading a CSV file into my database using SQL*Loader. My requirement is to create an error file combining the error records from the .bad file with their individual errors from the log file. Meaning, if a record failed because the date is invalid, "Invalid date" should be written against that record in a separate error-description column. Is there any way that SQL*Loader provides to combine the two? I am a newbie to SQL*Loader.
Database being used: Oracle 19c.
You might be expecting a little bit too much of SQL*Loader.
How about switching to an external table? In the background it still uses SQL*Loader, but the source data (which resides in a CSV file) is accessible to you by means of a table.
What does it mean to you? You'd write some (PL/)SQL code to fetch data from it. Therefore, if you wrote a stored procedure, there are numerous options you can use - perform various validations, store valid data into one table and invalid data into another, decide what to do with invalid values (discard? Modify to something else? ...), handle exceptions - basically, everything PL/SQL offers.
Note that this option (generally speaking) requires the file to reside on the database server, in a directory which is the target of an Oracle directory object. The user who will be manipulating the CSV data (i.e. the external table) will have to acquire privileges on that directory from its owner, the SYS user.
SQL*Loader, on the other hand, runs on a local PC so you don't have to have access to the server itself but - as I said - doesn't provide that much flexibility.
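As a rough sketch of that idea (every object name, column, and format below is made up, and the directory object is assumed to already exist and be granted):
-- External table exposing the CSV; every column is read as plain text so
-- nothing is rejected for datatype reasons at this stage.
CREATE TABLE ext_input_csv (
  id_txt   VARCHAR2(50),
  load_dt  VARCHAR2(50),
  amount   VARCHAR2(50))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL)
  LOCATION ('input.csv'))
REJECT LIMIT UNLIMITED;

-- PL/SQL pass that keeps good rows and stores each bad row together with its
-- individual error message, which is what the requirement asks for.
BEGIN
  FOR r IN (SELECT * FROM ext_input_csv) LOOP
    BEGIN
      INSERT INTO target_tab (id, load_dt, amount)
      VALUES (TO_NUMBER(r.id_txt), TO_DATE(r.load_dt, 'YYYY-MM-DD'), TO_NUMBER(r.amount));
    EXCEPTION
      WHEN OTHERS THEN
        INSERT INTO error_tab (id_txt, load_dt, amount, err_msg)
        VALUES (r.id_txt, r.load_dt, r.amount, SQLERRM);
    END;
  END LOOP;
  COMMIT;
END;
/
Row-by-row processing is not the fastest option, but it is the simplest way to capture a per-record error description.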
It is hard to give you a code answer without an example.
If you want to do this task, I can suggest two ways.
From Linux: if you loaded the data and skipped the errors, you must do two executions. That is not an easy way and it is not effective.
From Oracle: create a table with VARCHAR2 columns of the same length as in the original. Load the data from the bad file: adapt your CTL so it accepts everything and loads into this second table. Finally, MERGE the columns back into the original.
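A rough sketch of that second approach (all names are invented, and it assumes the rejected values are corrected in the staging table before the merge):
-- Staging table with the same columns as the target but all VARCHAR2,
-- so every record from the .bad file loads without conversion errors.
CREATE TABLE stage_bad_rows (
  id_txt   VARCHAR2(50),
  load_dt  VARCHAR2(50),
  amount   VARCHAR2(50));

-- After the values are fixed, fold the rows back into the original table.
MERGE INTO target_tab t
USING (SELECT TO_NUMBER(id_txt)              AS id,
              TO_DATE(load_dt, 'YYYY-MM-DD') AS load_dt,
              TO_NUMBER(amount)              AS amount
       FROM   stage_bad_rows) s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.load_dt = s.load_dt, t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (id, load_dt, amount) VALUES (s.id, s.load_dt, s.amount);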

Hive Managed vs External tables maintainability

Which one is better (performance-wise and operationally in the long run) for maintaining loaded data: managed or external?
And by maintaining, I mean that these tables will frequently have the following operations on a daily basis:
Select using partitions most of the time, but for some queries partitions are not used.
Delete specific records, not the whole partition (for example, a problem is found in some columns and the rows need to be deleted and inserted again). I am not sure this is supported for normal tables unless they are transactional.
Most important, the need to merge files frequently, maybe twice a day, to merge small files and get fewer mappers. I know CONCATENATE is available on managed tables and INSERT OVERWRITE on external ones; which one is less costly? (See the sketch below.)
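For context, the two statements I am comparing look roughly like this (table, partition, and column names are made up):
-- Managed table: merge the small files of one partition in place.
ALTER TABLE sales_managed PARTITION (dt='2024-01-01') CONCATENATE;

-- External table: rewrite the partition over itself to compact its files.
INSERT OVERWRITE TABLE sales_external PARTITION (dt='2024-01-01')
SELECT order_id, amount FROM sales_external WHERE dt='2024-01-01';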
It depends on your use case. External tables are recommended when the data is shared across multiple applications, for example when Pig or some other tool processes the data alongside Hive; in that kind of scenario external tables are mainly recommended. They are used when you are mainly reading data.
With managed tables, on the other hand, Hive has complete control over the data. You can convert any external table to managed and vice versa:
ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL'='TRUE');
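For the opposite direction, assuming the table should come back under Hive's control, the same property is set to FALSE (note that the value is case-sensitive in some Hive versions):
ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL'='FALSE');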
As in your case you are doing frequent modifications to the data, it is better that Hive has total control over it. In this scenario it is recommended to use managed tables.
Apart from that, managed tables are more secure than external tables, because external table data can be accessed by anyone with access to the underlying location. With managed tables you can implement Hive-level security, which provides better control, whereas with external tables you have to implement HDFS-level security.
You can refer to the link below, which gives a few pointers to consider:
External Vs Managed tables comparison

accomplishing 'sqlldr' through a stand-alone procedure

I am new to interfaces and right now I am going through an assignment.
I have one question:
I am well acquainted with the method of loading data from a .dat file (and .ctl) into a staging table using PuTTY (via sqlldr), but I have a requirement to accomplish the same task (i.e. loading data from a .dat file into a staging table) through a PL/SQL procedure, so please suggest the logic.
Normally, you would use external tables. The syntax is going to be very similar to a SQL*Loader control file but it is an object defined in the database that allows you to expose the file as if it were a relational table. You can then do your load simply by querying the external table.
This does require, though, that the data file is present on the database server's file system.
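A minimal sketch of that approach, assuming a directory object DATA_DIR already points at the folder holding the .dat file; the names and the two-column, pipe-delimited layout are invented:
-- External table describing the .dat file, much like a control file would.
CREATE TABLE ext_stage_src (
  col1  VARCHAR2(100),
  col2  VARCHAR2(100))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|')
  LOCATION ('input.dat'));

-- The "load" then becomes a simple query wrapped in a procedure.
CREATE OR REPLACE PROCEDURE load_staging IS
BEGIN
  INSERT INTO staging_table (col1, col2)
  SELECT col1, col2 FROM ext_stage_src;
  COMMIT;
END;
/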

How to get the Oracle external table "dump file" without doing "CREATE TABLE"

I have to develop a PL/SQL procedure that dumps the content of a table when an error occurs during an application transaction; the content of the dump must match the content of the table before the ROLLBACK of the transaction.
I thought about using an external table as the dump format of the table (TYPE ORACLE_DATAPUMP). After going through the Oracle documentation, I found that the only way to do that is by executing:
CREATE TABLE tabtest_test (
  F1 NUMBER,
  F2 CHAR(10))
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY USER_DUMP_DEST
  LOCATION ('tabtest.dmp'));
The problem is that by executing the CREATE TABLE, Oracle performs an implicit commit within our failed transaction, which needs to be rolled back after the dump of the table.
I thought about using PRAGMA AUTONOMOUS_TRANSACTION to execute the CREATE TABLE, but it doesn't really fit our need, as it dumps the content of the table outside our application transaction.
My question: is there a way to get the 'tabtest.dmp' without doing a CREATE TABLE? For example, by directly accessing the Oracle API responsible for this.
Regards.
How about creating the external table once, as part of your application setup process?
Failing that, you could create it at the beginning of the transaction that might need it. If there is an error, populate it; if the transaction finishes successfully, drop it.
If (and it's a big IF) you can use AUTONOMOUS_TRANSACTIONS to create the table in a separate transaction, I think this is what you need to do. If you manage to create the table within the scope of your current transaction, and write your data to that newly-created table, that data should, by all rights, disappear as soon as you do your ROLLBACK.
The problems you're experiencing here are a subset of the large class of issues known as "Problems Which Occur When Trying To Treat A Relational Database As A Flat File". Relational databases are great when used AS DATABASES, but are really bad at being flat files. It's kind of like animals on the farm - sheep are great AS SHEEP, but make lousy cows. Cows make lousy goats. Goats - great animals - intelligent, affectionate (yep), low-maintenance, won't hear a word spoken against 'em - but NOT what you want in a draft animal - use a horse, ox, or mule for that. Basically, you should pick horses for courses (pardon the expression). A database makes a crappy flat file, and vice versa. Use what's appropriate.
IMO you'd be better off writing your data to a flat file, and perhaps this file could be mapped in as an external table. You might want to write the file in something like CSV format that lots of other tools can process. YMMV.
Share and enjoy.
Why do you need to use external tables? You could just read the file using UTL_FILE.
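If the goal is only to capture the pre-ROLLBACK content in a flat file, a minimal UTL_FILE sketch could look like this (it assumes a directory object DUMP_DIR exists with write access granted; the table and columns come from the question). The query runs inside the still-open transaction, and UTL_FILE does not commit, so the uncommitted data is captured and the subsequent ROLLBACK is unaffected:
DECLARE
  l_file UTL_FILE.FILE_TYPE;
BEGIN
  l_file := UTL_FILE.FOPEN('DUMP_DIR', 'tabtest.csv', 'w');
  FOR r IN (SELECT f1, f2 FROM tabtest_test) LOOP
    -- One CSV line per row; this session sees its own uncommitted changes.
    UTL_FILE.PUT_LINE(l_file, r.f1 || ',' || r.f2);
  END LOOP;
  UTL_FILE.FCLOSE(l_file);
EXCEPTION
  WHEN OTHERS THEN
    IF UTL_FILE.IS_OPEN(l_file) THEN
      UTL_FILE.FCLOSE(l_file);
    END IF;
    RAISE;
END;
/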
