Dynamically Load Multiple CSV files to Oracle External Table - oracle

I am trying to load my oracle external table dynamically with multiple .csv files.
I am able to load one .csv file, but as soon as I alter the table with a new .csv file name, the previous location is replaced.
I have multiple .csv files in a folder which change every day, with the date in the file name.
E.g. file names FileName1_20200607.csv, FileName2_20200607.csv
I don't think there is a way to write 'FileName*20200607.csv' to pick up all the files for that date?
My code:
......
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "DATA_DIR_PATH"
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE
    BADFILE CRRENG_ORA_APPS_OUT_DIR:'Filebad'
    DISCARDFILE DATA_OUT_PATH:'Filedesc.dsc'
    LOGFILE DATA_OUT_PATH:'Filelog.log'
    SKIP 0
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' AND '"'
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS )
  LOCATION
  ( 'FileName1_20200607.csv',
    'FileName2_20200607.csv' )
);
But I want to populate these file names dynamically. It should pick up all the file names from the DATA_DIR; there are about 50 other file names.
I can add Unix script if need be.
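Since the asker is open to a Unix script, one common approach is to list the matching files for the date, build the quoted LOCATION list, and emit an ALTER TABLE statement to run through sqlplus. The sketch below assumes a shell can run where the data directory is visible; the table name my_ext_table and the sqlplus invocation are placeholders, not from the question:

```shell
#!/bin/sh
# Sketch only: builds an ALTER TABLE ... LOCATION statement from whatever
# CSV files exist for a given date. "my_ext_table" is a hypothetical
# table name -- substitute your own external table.
build_location() {
  dir=$1    # directory the CSV files live in
  date=$2   # e.g. 20200607
  locs=""
  # The glob expands in sorted order, so the file list is deterministic.
  for f in "$dir"/*_"$date".csv; do
    [ -e "$f" ] || continue                 # glob matched nothing
    base=$(basename "$f")
    if [ -z "$locs" ]; then
      locs="'$base'"
    else
      locs="$locs, '$base'"
    fi
  done
  echo "ALTER TABLE my_ext_table LOCATION ($locs);"
}

# Typical use: pipe the generated statement into sqlplus, e.g.
#   build_location /path/to/DATA_DIR 20200607 | sqlplus -s user/pass@db
```

Scheduling this (cron or your existing ETL driver) before the queries run keeps the external table pointed at that day's files without touching the rest of the DDL.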

Related

Failed make hive table on desired path and insert the values

I want to make a table in Hive containing only one column and 2 values: 'Y' and 'N'.
I already tried this:
create external table if not exists tx_test_table (FLAG string)
row format delimited fields terminated by ','
stored as textfile location "/user/hdd/data/";
My question is: why is it created in the default location?
How do I make it use the path I desire?
Also, when I query the table I just made (using select * from), it fails to show the field:
Bad status for request TFetchResultsReq(fetchType=0,
operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None,
operationType=0,
operationId=THandleIdentifier(secret='pE\xff\xfdu\xf6B\xd4\xb3\xb7\x1c\xdd\x16\x95\xb85',
guid="\n\x05\x16\xe7'\xe4G \xb6R\xe06\x0b\xb9\x04\x87")),
orientation=4, maxRows=100):
TFetchResultsResp(status=TStatus(errorCode=0,
errorMessage='java.io.IOException: java.io.IOException: Not a file:
hdfs://nameservice1/user/hdd/data/AC22', sqlState=None,
infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22:14:13',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:496',
'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:297',
'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:869', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:507',
'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:708',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717',
'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:605',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748',
'*java.io.IOException:java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22:18:4',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521'
, 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428',
'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:146',
'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2227',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:491',
'*java.io.IOException:Not a file: hdfs://nameservice1/user/hdd/data/AC22:21:3',
'org.apache.hadoop.mapred.FileInputFormat:getSplits:FileInputFormat.java:329',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextSplits:FetchOperator.java:372',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getRecordReader:FetchOperator.java:304',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:459'], statusCode=3),
results=None, hasMoreRows=None)
Each table in HDFS has its own location, and the location you specified for your table seems to be used as a common location in which other tables' folders reside.
According to the exception, java.io.IOException: Not a file: hdfs://nameservice1/user/hdd/data/AC22, at least one folder (not a file) was found in the /user/hdd/data/ location. It probably belongs to some other table.
You should specify a table location in which only the files belonging to this table will be stored, not the common data warehouse location that contains other tables' locations.
Usually the table location is named after the table: /user/hdd/data/tx_test_table
Fixed create table statement:
create external table if not exists tx_test_table (FLAG string)
row format delimited fields terminated by ','
stored as textfile location "/user/hdd/data/tx_test_table";
Now the table will have its own location containing its own files, not mixed with other tables' folders or files.
You can put files into the /user/hdd/data/tx_test_table location, or load data into the table using INSERT; the files will be created in that location.

How to store multiple files under the same directory in hive?

I'm using Hive to process my CSV files. I've stored the CSV files in HDFS and want to create tables from those files.
I use the following command:
create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;
So the file under /CsvData will be moved into /user/hive. It makes sense.
But what if I want to create another table?
create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;
It will raise an exception complaining that the directory is not empty.
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Directory hdfs://localhost:9000/user/hive could not be cleaned up.
So it is hard for me to understand: does it mean I can store only one file under one directory? To store multiple files, do I have to create one directory for every file?
Is it possible to store all the files together?
The CREATE TABLE statement will NOT raise an exception complaining that the directory is not empty, because creating a table on top of an existing directory is a perfectly normal scenario.
You can store as many files in the directory as necessary, and all of them will be accessible to the table built on top of that folder.
A table location is a directory, not a file. If you need to create a new table and keep its files separate from other tables' files, create a separate folder.
Read also this answer for clear understanding: https://stackoverflow.com/a/54038932/2700344

Insert part of data from csv into oracle table

I have a CSV (pipe-delimited) file as below
ID|NAME|DES
1|A|B
2|C|D
3|E|F
I need to insert the data into a temp table where I already have SQL*Loader in place, but my table has only one column. Below is the control file configuration for loading from the csv.
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
NAME
)
How do I select the data from only 2nd column from the csv and insert into only one column in the table EMPLOYEE?
Please let me know if you have any questions.
If you use a FILLER field you don't need to have a matching column in the database table (that's the point, really), and as long as you know the field you're interested in is always the second one, you don't need to modify the control file when there are extra fields in the file; you just never specify them.
So this works, with just a filler ID field added and the three-field data file you showed:
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
ID FILLER,
NAME
)
Demoed with:
SQL> create table employee (name varchar2(30));
$ sqlldr ...
Commit point reached - logical record count 3
SQL> select * from employee;
NAME
------------------------------
A
C
E
Adding more fields to the data file makes no difference, as long as they are after the field you are actually interested in. The same thing works for external tables, which can be more convenient for temporary/staging tables, as long as the CSV file is available on the database server.
Columns in the data file which need to be excluded from the load can be declared as FILLER.
In the given example, list all incoming fields and add FILLER to the columns that should be ignored from the load, e.g.
(
ID FILLER,
NAME,
DES FILLER
)
Another issue here is ignoring the header line of the CSV; just use the OPTIONS clause, e.g.
OPTIONS(SKIP=1)
LOAD DATA ...
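Putting the two answers together, a complete control file might look like this (table and field names are taken from the question; treat it as a sketch, not something tested against your data):

```
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
  ID   FILLER,
  NAME,
  DES  FILLER
)
```

SKIP=1 drops the header row, and the two FILLER fields consume the first and third pipe-delimited values without needing columns in EMPLOYEE.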

Hive - How to load full html file content to a single hive row?

I have 1,000 *.html files in an HDFS path and I want to create a Hive table from these files.
But the query below gives me '\n'-delimited rows rather than the full content of each html file.
> create external table if not exists mydb.myhtmltable (
> body STRING )
> STORED AS TEXTFILE
> LOCATION '/user/hadoop/dataset/refhtml';
How can I get the full html content into the body field?
I want 1,000 rows from 1,000 html file.
Is it possible?
Add this:
LINES TERMINATED BY '\nnn'
where nnn is the octal representation of the character you want to use (octal digits run 0-7, so for example '\001').
so:
create external table if not exists mydb.myhtmltable (
body STRING )
ROW FORMAT DELIMITED
LINES TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/user/hadoop/dataset/refhtml';

Oracle SQL save file name from LOCATION as a column in external table

I have several input files being read into an external table in Oracle. I want to run some queries across the content from all the files; however, for some queries I would like to filter the data based on the input file it came from. Is there a way to access the name of the source file in a select statement against an external table, or somehow create a column in the external table that includes the location source?
Here is an example:
CREATE TABLE MY_TABLE (
  first_name CHAR(100 BYTES),
  last_name CHAR(100 BYTES)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY TMP
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    BADFILE 'my_table.bad'
    DISCARDFILE 'my_table.dsc'
    LOGFILE 'my_table.log'
    FIELDS TERMINATED BY 0x'09' OPTIONALLY ENCLOSED BY '"' LRTRIM MISSING FIELD VALUES ARE NULL
    (
      first_name CHAR(100),
      last_name
    )
  )
  LOCATION ( TMP:'file1.txt', 'file2.txt' )
)
REJECT LIMIT 100;
select distinct last_name
from MY_TABLE
where location like 'file2.txt' -- This is the part I don't know how to code
Any suggestions?
There is always the option to add the file name to the input file itself as an additional column. Ideally, I would like to avoid this work around.
The ALL_EXTERNAL_LOCATIONS data dictionary view contains information about external table locations. Also DBA_* and USER_* versions.
Edit: (It would help if I read the question thoroughly.)
You don't just want to read the location for the external table, you want to know which row came from which file. Basically, you need to:
Create a shell script that prepends the file name to the file contents and writes the result to stdout.
Add the PREPROCESSOR directive to your external table definition to execute that script.
Alter the external table definition to include a column for the file name added in the first step.
Here is an AskTom article explaining it in detail.
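The preprocessor step can be sketched like this, assuming awk is available on the database server. Oracle invokes the PREPROCESSOR script with the data file path as its only argument, and whatever the script writes to stdout is what the access driver loads. The function name add_filename and the tab separator are illustrative choices, not from the article:

```shell
#!/bin/sh
# Sketch of a PREPROCESSOR script: prepend each line of the data file
# with the file's own name, so the name becomes an extra leading field
# that the external table can expose as a "filename" column.
add_filename() {
  file=$1                       # Oracle passes the data file path as $1
  name=$(basename "$file")
  # awk is assumed to be available on the database server
  awk -v fn="$name" '{ print fn "\t" $0 }' "$file"
}
```

The external table definition then needs a leading column for the file name, and a PREPROCESSOR directive in its ACCESS PARAMETERS pointing at a script whose body is the awk call above.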
