Suppose on HDFS I have files with the following names: data1-2018-01-01.txt, data1-2018-01-02.txt, data1-2018-01-03.txt, data1-2018-01-04.txt, data1-2018-01-06.txt
Now I want to query files based on date:
select * from mytable where date > '2018-01-03' and date < '2018-01-06';
And my question: is it possible to create an external table over just the files that satisfy my query? Or do you have any workaround?
I know I could use partitions, but they require me to load the data manually whenever a new data set arrives.
Put those files into a directory and create a new table on top of it.
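For illustration, a minimal sketch of such a table definition, assuming the files are plain delimited text; the column names and HDFS path below are my own, not from the question:

-- Sketch only: columns and the location path are illustrative assumptions.
create external table mytable (
  col1 string,
  col2 string
)
row format delimited
fields terminated by ','
location '/user/hive/warehouse_ext/selected_files';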
Hive also has the INPUT__FILE__NAME virtual column, which you can use for filtering:
where INPUT__FILE__NAME like '%2018-01-03%'
It is also possible to use substr or regexp_extract to get the date from the filename, then use IN or >, < to filter on it.
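For example, a hedged sketch of that filter (mytable and its columns are illustrative; it assumes the data1-YYYY-MM-DD.txt naming from the question):

-- Sketch: pull the YYYY-MM-DD part out of the file name and compare it as a string.
select *
from mytable
where regexp_extract(INPUT__FILE__NAME, '([0-9]{4}-[0-9]{2}-[0-9]{2})', 1) > '2018-01-03'
  and regexp_extract(INPUT__FILE__NAME, '([0-9]{4}-[0-9]{2}-[0-9]{2})', 1) < '2018-01-06';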
I want to use an external table to load a CSV file as it's very convenient, but the problem is: how do I make sure I don't load the same file twice in a row? I can't validate the loaded data because it can be the same information as before; I need to find a way to make sure the user doesn't load the same file as, for example, 2 hours ago.
I thought about uploading the file with a different name each time and issuing an ALTER TABLE command to change the file name in the external table definition, but it sounds kind of risky.
I also thought about marking each row in the file with a sequence to help differentiate files, but I doubt the client would accept it, as they would need to do this manually (the file is exported from somewhere).
Is there any better way to make sure I don't load the same file into the external table, other than changing the file's name and executing an ALTER on the table?
Thank you
When you bring the data from the external table into your database, you can use the MERGE command instead of INSERT; it means you don't have to worry about duplicate data.
See the blog post about the Oracle MERGE command:
What's more, we can wrap up the whole transformation process into this
one Oracle MERGE command, referencing the external table and the table
function in the one command as the source for the MERGED Oracle data.
alter session enable parallel dml;
merge /*+ parallel(contract_dim,10) append */
into contract_dim d
using TABLE(trx.go(
CURSOR(select /*+ parallel(contracts_file,10) full (contracts_file) */ *
from contracts_file ))) f
on d.contract_id = f.contract_id
when matched then
update set desc = f.desc,
init_val_loc_curr = f.init_val_loc_curr,
init_val_adj_amt = f.init_val_adj_amt
when not matched then
insert values ( f.contract_id,
f.desc,
f.init_val_loc_curr,
f.init_val_adj_amt);
So there we have it - our complex ETL function all contained within a
single Oracle MERGE statement. No separate SQL*Loader phase, no
staging tables, and all piped through and loaded in parallel
I can only think of a solution somewhat like this:
Have a timestamp encoded in the data file name (like YYYYMMDDHHMISS-file.csv, where YYYYMMDDHHMISS is the timestamp).
Create a table with a timestamp field (as above).
Create a shell script that:
extracts the timestamp from the data file name;
calls an SQL script (sketched after this list) with the timestamp as the parameter; it returns 0 if that timestamp does not exist and <>0 if the timestamp already exists, in which case the script exits with the error: File: YYYYMMDDHHMISS-file.csv already loaded;
copies the YYYYMMDDHHMISS-file.csv to input-file.csv;
runs the SQL*Loader script that loads the input-file.csv file;
on success: runs a second SQL script, with the timestamp as the parameter, that inserts a record in the database to indicate that the file is loaded, and moves the original file to a backup folder;
on failure: reports the failure of the load script.
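A minimal sketch of the SQL side of this approach; the bookkeeping table and column names (loaded_files, file_ts) are assumptions of mine, not from the original setup:

-- Assumed bookkeeping table: one row per file timestamp already loaded.
create table loaded_files (
  file_ts   varchar2(14) primary key,   -- the YYYYMMDDHHMISS part of the file name
  loaded_at date default sysdate
);

-- Run by the "check" script: 0 means the file has not been loaded yet.
select count(*) from loaded_files where file_ts = :ts;

-- Run by the "success" script after a clean load.
insert into loaded_files (file_ts) values (:ts);
commit;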
I have a dozen web servers each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded to hive using a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases, the command fails and exits with a code other than 0, in which case our script awaits and tries again. The problem is, in some cases of failure, the data loading does not fail, even though it shows a failure message. How can I know for sure whether or not the data has been loaded?
Example for such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
but in your case, you copy the data into a "managed" table, so there
is no way to retrieve the data lineage (i.e. which log file was used
to create each managed datafile)
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed).
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  )
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
# Capture the row count for the partition; pass hive, -e and the query as separate list items
x = subprocess.check_output(["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print(type(x))
print(x)
But you have to spend some time working with backslashes to get hive -e to work using python. It can be very difficult. It may be easier to write a file with that simple query in it first, and then use hive -f filename. Then, print the output of subprocess.check_output in order to see how the output is stored. You may need to do some regex or type conversions, but I think it should just come back as a string. Then simply use an if statement:
# check_output returns the count as a string, so strip and convert it before comparing
if int(x.strip()) > 0:
    pass  # the partition already has data; skip the load
else:
    subprocess.call(["hive", "-e", "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])
I have a table with the following structure:
create table my_table (
id integer,
point Point -- UDT made of two integers (x, y)
)
and I have a CSV file with the following data:
#id, point
1|(3, 5)
2|(7, 2)
3|(6, 2)
Now I want to bulk load this CSV into my table, but I can't find any information about how to handle the UDT in the Oracle sqlldr utility. Is it possible to use the bulk load utility with UDT columns?
I don't know if sqlldr can do this, but personally I would use an external table.
Attach the file as an external table (the file must be on the database server), and then insert the contents of the external table into the destination table transforming the UDT into two values as you go. The following select from dual should help you with the translation:
select
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 1) x_point,
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 2) y_point
from dual;
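And a hedged sketch of the insert itself, assuming the external table is called my_table_ext with columns id and point_txt, and that the Point type has the default Point(x, y) constructor:

-- Sketch only: my_table_ext and point_txt are assumed names.
insert into my_table (id, point)
select ext.id,
       Point(to_number(regexp_substr(ext.point_txt, '[[:digit:]]+', 1, 1)),
             to_number(regexp_substr(ext.point_txt, '[[:digit:]]+', 1, 2)))
from   my_table_ext ext;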
UPDATE
In sqlldr, you can transform fields using standard SQL expressions:
LOAD DATA
INFILE 'data.dat'
BADFILE 'bad_orders.txt'
APPEND
INTO TABLE test_tab
FIELDS TERMINATED BY "|"
( info,
  x_cord "regexp_substr(:x_cord, '[[:digit:]]+', 1, 1)"
)
The control file above will extract the first number from fields like (3, 4), but I cannot find a way to extract the second number - i.e. I am not sure whether it is possible to have the same field in the input file inserted into two columns.
If external tables are not an option for you, I would suggest either (1) transforming the file before loading, using sed, awk, Perl etc., or (2) loading the file with SQL*Loader into a temporary table and then having a second process transform the data and insert it into your final table. Another option is to look at how the file is generated - could you generate it so that the field you need to transform is repeated in two fields in the file, e.g.:
data|(1, 2)|(1, 2)
Maybe someone else will chip in with a way to get sqlldr to do what you want.
Solved the problem after more research: Oracle SQL*Loader does have this feature, and it is used by specifying a column object. The following was the solution:
LOAD DATA
INFILE *
INTO TABLE my_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
id,
point column object
(
x,
y
)
)
BEGINDATA
1,3,5
2,7,2
3,6,2
I've been searching for a while now but can't seem to find answers so here goes...
I've got a CSV file that I want to import into a table in Oracle (9i/10i).
Later on I plan to use this table as a lookup for another use.
This is actually a workaround I'm working on, since querying with an IN clause of more than 1000 values is not possible.
How is this done using SQLPLUS?
Thanks for your time! :)
SQL*Loader helps load csv files into tables.
If you want sqlplus only, then it gets a bit complicated. You need to locate your SQL*Loader control file and csv file, then run the sqlldr command.
Another solution you can use is SQL Developer.
With it, you have the ability to import from a csv file (other delimited files are available).
Just open the table view, then:
choose actions
import data
find your file
choose your options.
You have the option to have SQL Developer do the inserts for you, create an sql insert script, or create the data for a SQL Loader script (have not tried this option myself).
Of course all that is moot if you can only use the command line, but if you are able to test it with SQL Developer locally, you can always deploy the generated insert scripts (for example).
Just adding another option to the 2 already very good answers.
An alternative solution is using an external table: http://www.orafaq.com/node/848
Use this when you have to do this import very often and very fast.
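Not taken from the linked article, but roughly what it boils down to; the directory object, file name, and columns below are illustrative assumptions:

-- Sketch: a directory object pointing at the folder on the database server,
-- and an external table reading the csv through ORACLE_LOADER.
create or replace directory ext_csv_dir as '/path/on/db/server';

create table my_lookup_ext (
  id   number,
  name varchar2(100)
)
organization external (
  type oracle_loader
  default directory ext_csv_dir
  access parameters (
    records delimited by newline
    fields terminated by ',' optionally enclosed by '"'
  )
  location ('my_data.csv')
)
reject limit unlimited;

-- The external table can then be queried or joined like any other table.
select * from my_lookup_ext;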
SQL Loader is the way to go.
I recently loaded a table from a csv file; being new to this concept, I would like to share an example.
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION
)
Place the control file and csv at the same location on the server.
Locate the sqlldr exe and invoke it.
sqlldr userid/passwd@DBname control=
Ex: sqlldr abc/xyz@ora control=load.ctl
Hope it helps.
Somebody asked me to post a link to the framework that I presented at Open World 2012. This is the full blog post that demonstrates how to architect a solution with external tables.
I would like to share 2 tips: (tip 1) create a csv file, and (tip 2) load rows from a csv file into a table.
====[ (tip 1) SQLPLUS to create a csv file from an Oracle table ]====
I use SQLPLUS with the following commands:
set markup csv on
set lines 1000
set pagesize 100000 linesize 1000
set feedback off
set trimspool on
spool /MyFolderAndFilename.csv
Select * from MYschema.MYTABLE where MyWhereConditions ;
spool off
exit
====[ (tip 2) SQLLDR to load a csv file into a table ]====
I use SQLLDR and a csv (comma separated) file to add (APPEND) rows from the csv file to a table.
The file has commas between fields; text fields have a " before and after the text.
CRITICAL: if the last column is null, there is still a , at the end of the line.
Example of data lines in the csv file:
11,"aa",1001
22,"bb',2002
33,"cc",
44,"dd",4004
55,"ee',
This is the control file:
LOAD DATA
APPEND
INTO TABLE MYSCHEMA.MYTABLE
fields terminated by ',' optionally enclosed by '"'
TRAILING NULLCOLS
(
ColumnName1,
ColumnName2,
ColumnName3
)
This is the command to execute sqlldr in Linux. If you run it in Windows, use \ instead of / in the file paths.
sqlldr userid=MyOracleUser/MyOraclePassword@MyOracleServerIPaddress:port/MyOracleSIDorService DATA=datafile.csv CONTROL=controlfile.ctl LOG=logfile.log BAD=notloadedrows.bad
Good luck !
From Oracle 18c you could use Inline External Tables:
Inline external tables enable the runtime definition of an external table as part of a SQL statement, without creating the external table as a persistent object in the data dictionary.
With inline external tables, the same syntax that is used to create an external table with a CREATE TABLE statement can be used in a SELECT statement at runtime. Specify inline external tables in the FROM clause of a query block. Queries that include inline external tables can also include regular tables for joins, aggregation, and so on.
INSERT INTO target_table(time_id, prod_id, quantity_sold, amount_sold)
SELECT time_id, prod_id, quantity_sold, amount_sold
FROM EXTERNAL (
(time_id DATE NOT NULL,
prod_id INTEGER NOT NULL,
quantity_sold NUMBER(10,2),
amount_sold NUMBER(10,2))
TYPE ORACLE_LOADER
DEFAULT DIRECTORY data_dir1
ACCESS PARAMETERS (
RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY '|')
LOCATION ('sales_9.csv') REJECT LIMIT UNLIMITED) sales_external;
If I have a CSV file that is in the following format:
"fd!","sdf","dsfds","dsfd"
"fd!","asdf","dsfds","dsfd"
"fd","sdf","rdsfds","dsfd"
"fdd!","sdf","dsfds","fdsfd"
"fd!","sdf","dsfds","dsfd"
"fd","sdf","tdsfds","dsfd"
"fd!","sdf","dsfds","dsfd"
Is it possible to exclude any row where the first column has an exclamation mark at the end of the string?
i.e. it should only load the following rows:
"fd","sdf","rdsfds","dsfd"
"fd","sdf","tdsfds","dsfd"
Thanks
According to the Loading Records Based on a Condition section of the SQL*Loader Control File Reference (11g):
"You can choose to load or discard a logical record by using the WHEN clause to test a condition in the record."
So you'd need something like this:
LOAD DATA ... INSERT INTO TABLE mytable WHEN mycol1 NOT LIKE '%!'
(mycol1.. ,mycol2 ..)
But the LIKE operator is not available! You only have = and !=
Maybe you could try an External Table instead.
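A hedged sketch of that route (the table and column names are illustrative): expose the raw rows through an external table and let plain SQL, where NOT LIKE is available, do the filtering.

-- Sketch only: mytable_ext is an assumed ORACLE_LOADER external table over the csv file.
insert into mytable (col1, col2, col3, col4)
select col1, col2, col3, col4
from   mytable_ext
where  col1 not like '%!';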
I'd stick a CONSTRAINT on the table and just let those rows be rejected. Maybe delete them after the load. Or use a Unix grep -v to clear them out of the file.