According to the docs, this command should return the table structure:
DESCRIBE schema.<table>
I tried:
-- Some CSV of 1000 lines (csv.extract_header=true)
DESCRIBE dfs.`sample_data.csv`
-- Empty result
-- Output format set to parquet
CREATE TABLE dfs.tmp.`test_parquet` AS SELECT * FROM dfs.`sample_data.csv`
-- > 1000 rows created
-- Querying newly created table
DESCRIBE dfs.tmp.`test_parquet`
-- Empty result
-- Querying my own parquet from somewhere else
DESCRIBE dfs.`another.parquet`
-- Empty result
REFRESH TABLE METADATA dfs.tmp.`test_parquet`
DESCRIBE dfs.tmp.`test_parquet`
-- Empty result
-- I've got a Postgres data source connected as "pg"
DESCRIBE pg.public.MyPostgresTable
-- > Returns valid structure
It seems that the DESCRIBE command does not describe Parquet and CSV files.
What am I missing?
Drill v1.15
According to the docs, this command may be used for views created in a workspace, tables created in Hive and HBase, or schemas.
DESCRIBE does not support tables created in a file system.
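Since the docs do say DESCRIBE works for views created in a workspace, one possible workaround (an untested sketch on 1.15; the view name is illustrative) is to wrap the file or CTAS table in a view in a writable workspace and describe the view:
-- Hypothetical workaround: expose the table through a view, then describe the view
CREATE VIEW dfs.tmp.`test_parquet_vw` AS SELECT * FROM dfs.tmp.`test_parquet`
DESCRIBE dfs.tmp.`test_parquet_vw`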
I would like to export/import tables from multiple schemas with the DBMS_DATAPUMP API.
E.g. user1.table1, user2.table2, user3.table3
I pass the tables in a parameter as a comma-separated list: 'user1.table1,user2.table2,user3.table3'
After that I store the list of tables in a table.
Then I read the content of that table in a cursor, loop through it, and pass the schema and table names one by one.
LOOP
  dbms_datapump.metadata_filter(handle => h1, name => 'NAME_EXPR',   value => 'IN(''table1'')');
  dbms_datapump.metadata_filter(handle => h1, name => 'SCHEMA_LIST', value => 'IN(''user1'')');
END LOOP;
The first table is successfully added to the DBMS_DATAPUMP job, but the second table exits with an error:
ORA-39071: Value of SCHEMA_LIST is badly formed.
ORA-00936: missing expression
I tried to find examples of how to export/import tables from different schemas with the DBMS_DATAPUMP API, but I haven't found any. The examples I did find only show export/import from a single schema.
Thanks in advance
-- For Table mode, only a single SCHEMA_EXPR filter is supported. If specified, it must only specify a single schema (for example, 'IN (''SCOTT'')').
DBMS_DATAPUMP.METADATA_FILTER(handle, 'SCHEMA_EXPR', 'IN(' || vschemas || ')');
-- You can enter more than one table name, but no more than 4000 characters (literal limit), including special characters.
DBMS_DATAPUMP.METADATA_FILTER(handle, 'NAME_EXPR', 'IN(' || vtables || ')', 'TABLE');
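A minimal, untested sketch of how those two calls might fit into a complete table-mode job, following the pattern above (the job name, dump file name, DATA_PUMP_DIR directory and the vschemas/vtables values are illustrative assumptions; note that each filter is applied exactly once, outside any loop, and that the single-schema restriction quoted above may still apply to SCHEMA_EXPR in table mode):
DECLARE
  h1       NUMBER;
  vschemas VARCHAR2(4000) := q'['USER1','USER2','USER3']';   -- list of quoted schema names
  vtables  VARCHAR2(4000) := q'['TABLE1','TABLE2','TABLE3']'; -- list of quoted table names
BEGIN
  h1 := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'TABLE', job_name => 'EXP_MULTI_SCHEMA');
  DBMS_DATAPUMP.ADD_FILE(handle => h1, filename => 'multi_schema.dmp', directory => 'DATA_PUMP_DIR');
  -- one SCHEMA_EXPR filter and one NAME_EXPR filter, each called a single time
  DBMS_DATAPUMP.METADATA_FILTER(h1, 'SCHEMA_EXPR', 'IN(' || vschemas || ')');
  DBMS_DATAPUMP.METADATA_FILTER(h1, 'NAME_EXPR', 'IN(' || vtables || ')', 'TABLE');
  DBMS_DATAPUMP.START_JOB(h1);
  DBMS_DATAPUMP.DETACH(h1);
END;
/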
I want to use an external table to load a CSV file as it's very convenient, but the problem is: how do I make sure I don't load the same file twice in a row? I can't validate the loaded data because it can be the same information as before; I need a way to make sure the user doesn't load the same file as, say, two hours ago.
I thought about uploading the file with a different name each time and issuing an ALTER TABLE command to change the file name in the definition of the external table, but that sounds rather risky.
I also thought about marking each row in the file with a sequence to help differentiate files, but I doubt the client would accept it, as they would need to do this manually (the file is exported from somewhere).
Is there any better way to make sure I don't load the same file into the external table, other than changing the file's name and executing an ALTER on the table?
Thank you
When you bring the data from the external table into your database, you can use the MERGE command instead of INSERT; it lets you stop worrying about duplicate data.
See this blog post about the Oracle MERGE command:
What's more, we can wrap up the whole transformation process into this
one Oracle MERGE command, referencing the external table and the table
function in the one command as the source for the MERGED Oracle data.
alter session enable parallel dml;
merge /*+ parallel(contract_dim,10) append */
into contract_dim d
using TABLE(trx.go(
CURSOR(select /*+ parallel(contracts_file,10) full (contracts_file) */ *
from contracts_file ))) f
on d.contract_id = f.contract_id
when matched then
update set desc = f.desc,
init_val_loc_curr = f.init_val_loc_curr,
init_val_adj_amt = f.init_val_adj_amt
when not matched then
insert values ( f.contract_id,
f.desc,
f.init_val_loc_curr,
f.init_val_adj_amt);
So there we have it - our complex ETL function all contained within a
single Oracle MERGE statement. No separate SQL*Loader phase, no
staging tables, and all piped through and loaded in parallel
I can only think of a solution somewhat like this:
Have a timestamp encoded in the datafile name (like: YYYYMMDDHHMISS-file.csv), where YYYYMMDDHHMISS is the timestamp.
Create a table with a timestamp field (in the format above); see the sketch after this list.
Create a shell script that:
extracts the timestamp from the data file name;
calls a SQL script with the timestamp as a parameter, which returns 0 if that timestamp does not exist and <> 0 if it already exists; in the latter case, exit the script with the error 'File YYYYMMDDHHMISS-file.csv already loaded';
copies YYYYMMDDHHMISS-file.csv to input-file.csv;
runs the SQL loader script that loads the input-file.csv file;
on success: runs a second SQL script, with the timestamp as a parameter, that inserts a record in the database to mark the file as loaded, and moves the original file to a backup folder;
on failure: reports the failure of the load script.
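A rough SQL sketch of the bookkeeping table and the two scripts (Oracle syntax assumed; the table name loaded_files, the column names and the &1 SQL*Plus parameter are only illustrative):
CREATE TABLE loaded_files (
  file_ts   VARCHAR2(14) PRIMARY KEY,  -- the YYYYMMDDHHMISS part of the file name
  loaded_at DATE DEFAULT SYSDATE
);

-- check script: returns 0 if the file has not been loaded yet, <> 0 otherwise
SELECT COUNT(*) FROM loaded_files WHERE file_ts = '&1';

-- final script, run only after a successful load
INSERT INTO loaded_files (file_ts) VALUES ('&1');
COMMIT;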
I have a dozen web servers each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded to hive using a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases, the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some of these failures the data is actually loaded, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
Example for such a "failure" where the data is loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query Hive for the file names that were loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
but in your case, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed data file)
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record, or at the beginning of each record, which can be done with good old sed)
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in terms of processing time /(extra Map step plus MapJoin)/...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  );
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess

# run the count query through the Hive CLI and capture its stdout
x = subprocess.check_output(
    ["hive", "-e", "select count(*) from my_table where dt='2015-08-17-05'"])
print type(x)
print x
But you may have to spend some time working with backslashes to get hive -e to work from Python; it can be fiddly. It may be easier to write a file with that simple query in it first, and then use hive -f filename. Then print the output of subprocess.check_output to see how the output is stored. You may need to do some regex or type conversions, but I think it should just come back as a string. Then simply use an if statement:
if int(x) > 0:
    pass  # the partition already has data, nothing to do
else:
    subprocess.call(
        ["hive", "-e",
         "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])
I have a piece of code that uses database tables as well as PL/SQL tables and collections.
This piece of code runs in multiple sessions (multiple companies, in our business terms).
create or replace TYPE TY_REC FORCE IS OBJECT
(
:
:
);
create or replace TYPE TY_TAB AS TABLE OF TY_REC ;
v_tab_nt.DELETE;
FETCH v_tab_cur BULK COLLECT INTO v_tab_nt;
CLOSE v_tab_cur ;
FOR i IN v_tab_nt.FIRST..v_tab_nt.LAST
LOOP
:
:
insert into xyz ...  -- this table is present in multiple schemas
END LOOP;
This works fine in my dev environment, but today in production I can see that v_tab_cur is fetching data from schema1 and inserting it into the xyz table of schema2, which looks strange to me; the amount of data is huge.
Can anyone make a guess at what is wrong with the bulk collect?
I want to load data from the output of a program rather than from an existing data file. Here's what I want:
CREATE TABLE MyTable (
X STRING);
INSERT OVERWRITE MyTable
BY PROGRAM "python MyProgram.py"; -- #!/usr/bin/python
-- print 'hello'
-- print 'world'
SELECT X FROM MyTable; -- I will get 2 records:
-- hello
-- world
But it seems Hive doesn't provide such an INSERT ... BY PROGRAM method. Is there an alternative way to do that?
What I have used in the past is the Hadoop HDFS REST API (http://hadoop.apache.org/docs/r1.0.4/webhdfs.html). I run my program (.py) from the shell and it pushes the data into HDFS/Hive via the API. If your Hive table is already set up, you can overwrite the existing Hive file.
Another approach I have used is to have a program (.py, .sh, etc.) write the data to a temp file; you can then add that file using a Hive command and delete the temp file afterwards.
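For the temp-file route, the Hive side could be as simple as the following (the path and table name are only placeholders):
LOAD DATA LOCAL INPATH '/tmp/myprogram_output.txt' INTO TABLE MyTable;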
Edit:
In response to your comment saying you don't have access to the shell, you could perhaps try to utilize the custom Map/Reduce functionality in Hive (https://cwiki.apache.org/confluence/display/Hive/Tutorial, scroll to the bottom to 'Custom Map/Reduce Scripts'): pass some dummy data, which is to be ignored, into the map/reduce step, write your .py script in the form of a reducer, and just have it emit your required data instead.
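A rough sketch of that TRANSFORM idea (untested; it assumes a one-row helper table named dummy, and that MyProgram.py reads stdin, possibly ignoring it, and prints one value per line):
ADD FILE MyProgram.py;
-- dummy must contain at least one row so the script gets invoked
INSERT OVERWRITE TABLE MyTable
SELECT TRANSFORM (d.x)
       USING 'python MyProgram.py'
       AS (X STRING)
FROM dummy d;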