SQL Server PolyBase with multiple files

I want to use PolyBase to read a directory of csv or xlsx files with similar schemas but different file names. The file names follow a pattern such as 'subjectXYZ_yyyy-mm-dd'.
The files are added daily and I don't want to create an External Table per file.
How should I set up the ODBC DSN and/or the PolyBase data source / external table parameters for this?

PolyBase external tables support either a single file name or a folder in the LOCATION argument, but all files must have the same structure. A simple example using CETAS (but the principle is the same):
CREATE EXTERNAL TABLE ext.lineitem_1995
WITH (
LOCATION = 'enriched/tpch/tpch10/lineitem_partitioned/1995',
DATA_SOURCE = [MyDataSource],
FILE_FORMAT = [ParquetFF]
) AS
SELECT *
FROM dbo.lineitem
WHERE YEAR(l_shipdate) = 1995;
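For the daily-drop scenario in the question, you would point a single external table at the folder rather than at individual files; any new file dropped into that folder with the same structure is picked up automatically on the next query. A minimal sketch, assuming the data source [MyDataSource] already exists and using a hypothetical file format name and illustrative column names (note that PolyBase reads delimited text, so xlsx files would need to be converted to csv first):
-- hypothetical CSV file format (name and options are assumptions)
CREATE EXTERNAL FILE FORMAT CsvFF
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);
-- the external table points at the folder, not at a single file;
-- every 'subjectXYZ_yyyy-mm-dd' file placed in the folder is queryable through it
CREATE EXTERNAL TABLE ext.daily_subjects (
    subject_id    VARCHAR(50),
    reading_date  DATE,
    reading_value DECIMAL(18, 4)
)
WITH (
    LOCATION = '/daily_subjects/',
    DATA_SOURCE = [MyDataSource],
    FILE_FORMAT = [CsvFF]
);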

Related

updating data in external table

Let's assume the following scenario:
I have several users who will prepare .csv files (they are not aware of each other, so concurrency is possible).
The .csv files will always be in the same format.
The data in the .csv file will contain a list of ids together with some other columns, like update_date.
Based on that data I will create a procedure that will update data in a real DB table.
The idea is to use external tables to make things as simple as possible for the .csv creators: they put files in a folder and things are done for them; the rest is my job.
The questions are:
Can I have several files as the source for one external table, or do I need one external table per file? (What I mean here is that whenever there is a new call to load data from a csv, it should be added to the existing external table, so not all files are loaded at once.)
Can I update records/fields in an external table?
An external table basically allows you to query the data stored in the external file(s), so from this point of view you can't issue an UPDATE on it.
You can
1) add new files to the directory and ALTER the table:
ALTER TABLE my_ex LOCATION ('file1.csv','file2.csv');
2) you can of course modify the existing files as well. There is no database state for an external table; each SELECT reads the data from the files, so you will always see the "updated" contents.
** UPDATE **
An attempt to modify the data (e.g. UPDATE) leads to ORA-30657: operation not supported on external organized table.
To be able to maintain state in the database, the data must first be copied into a regular table (CTAS: create table as select from the external table).
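A minimal sketch of that approach, with illustrative names (my_ex is the external table from the example above; real_table, id and update_date come from the question's scenario):
-- copy the external data into a regular table (CTAS)
CREATE TABLE csv_staging AS
  SELECT * FROM my_ex;
-- or update the real table directly from the external table,
-- matching on the id column mentioned in the question
MERGE INTO real_table t
USING my_ex s
  ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.update_date = s.update_date;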

How to point one Hive Table to Multiple External Files?

I would like to be able to append multiple HDFS files to one Hive table while leaving the HDFS files in their original directories. These files are located in different directories.
LOAD DATA INPATH moves the HDFS file into the Hive warehouse directory.
As far as I can tell, an External Table must be pointed to one file, or to one directory within which multiple files with the same schema can be placed. However, my files would not be underneath a single directory.
Is it possible to point a single Hive table to multiple external files in separate directories, or to otherwise copy multiple files into a single hive table without moving the files from their original HDFS location?
Expanded solution based on Pradeep's answer:
For example, my files look like this:
/root_directory/<job_id>/input/<dt>
Pretend the schema of each is (foo STRING, bar STRING, job_id STRING, dt STRING)
I first create an external table. However, note that my DDL does not contain an initial location, and it does not include the job_id and dt fields:
CREATE EXTERNAL TABLE hivetest (
foo STRING,
bar STRING
) PARTITIONED BY (job_id STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
;
Let's say I have two files I wish to insert located at:
/root_directory/b1/input/2014-01-01
/root_directory/b2/input/2014-01-02
I can load these two external files into the same Hive table like so:
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b1', dt='2014-01-01')
LOCATION '/root_directory/b1/input/2014-01-01';
ALTER TABLE hivetest
ADD PARTITION(job_id = 'b2', dt='2014-01-02')
LOCATION '/root_directory/b2/input/2014-01-02';
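Once the partitions are added, the data can be queried in place without the files ever leaving their original directories, for example:
SELECT foo, bar
FROM hivetest
WHERE job_id = 'b1' AND dt = '2014-01-01';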
If anyone happens to require the use of Talend to perform this, they might try the tHiveLoad component [edit: this doesn't work; see below].
The code Talend produces for tHiveLoad is actually LOAD DATA INPATH ..., which will remove the file from its original location in HDFS.
You will have to issue the earlier ALTER TABLE syntax in a tHiveLoad component instead.
The short answer is yes. A Hive external table can point to multiple files/directories. The long answer depends on the directory structure of your data. The typical way to do this is to create a partitioned table whose partition columns map to some part of your directory path.
E.g. we have a use case where an external table points to thousands of directories on HDFS. Our paths conform to the pattern /prod/${customer-id}/${date}/. Each of these directories contains approximately 100 files. To map this into a Hive table, we created two partition columns, customer_id and date. So every day we're able to load the data into Hive by doing:
ALTER TABLE x ADD PARTITION (customer_id = "blah", dt = "blah_date") LOCATION '/prod/blah/blah_date';
Try this:
LOAD DATA LOCAL INPATH '/path/local/file_1' INTO TABLE tablename;
LOAD DATA LOCAL INPATH '/path/local/file_2' INTO TABLE tablename;

How to delete a 000000 file in an S3 bucket in AWS using a Hive script

I've created a working Hive script to back up data from DynamoDB to a file in an S3 bucket in AWS. A code snippet is shown below:
INSERT OVERWRITE DIRECTORY '${hiveconf:S3Location}'
SELECT *
FROM DynamoDBDataBackup;
When I run the Hive script it presumably deletes the old file and creates a new one, but if there are errors in the backup process I guess it rolls back to the old data, because the old file is still there after an error has occurred.
Each day we want to make a backup, but I need to know if an error has occurred, so I want to delete the previous day's backup first and then create the new one. If the backup fails there is no file in the folder, which we can detect automatically.
The file gets automatically named 000000.
In my Hive script I've tried, unsuccessfully:
delete FILE '${hiveconf:S3Location}/000000'
and
delete FILE '${hiveconf:S3Location}/000000.0'
Perhaps the filename is wrong. I haven't set any permissions on the file.
I've just tried this but it fails at STORED:
SET dynamodb.endpoint= ${DYNAMODBENDPOINT};
SET DynamoDBTableName = "${DYNAMODBTABLE}";
SET S3Location = ${LOCATION};
DROP TABLE IF EXISTS DynamoDBDataBackupPreferenceStore;
CREATE TABLE IF NOT EXISTS DynamoDBDataBackupPreferenceStore(UserGuid STRING,PreferenceKey STRING,DateCreated STRING,DateEmailGenerated STRING,DateLastUpdated STRING,ReceiveEmail STRING,HomePage STRING,EmailFormat STRING,SavedSearchCriteria STRING,SavedSearchLabel STRING),
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
LOCATION '${hiveconf:S3Location}',
TBLPROPERTIES ("dynamodb.table.name" = ${hiveconf:DynamoDBTableName}, "dynamodb.column.mapping" = "UserGuid:UserGuid,PreferenceKey:PreferenceKey,DateCreated:DateCreated,DateEmailGenerated:DateEmailGenerated,DateLastUpdated:DateLastUpdated,ReceiveEmail:ReceiveEmail,HomePage:HomePage,EmailFormat:EmailFormat,SavedSearchCriteria:SavedSearchCriteria,SavedSearchLabel:SavedSearchLabel");
You can manage files directly using Hive table commands.
Firstly, if you want to use external data controlled outside Hive, use the EXTERNAL keyword when creating the table:
set S3Path='s3://Bucket/directory/';
CREATE EXTERNAL TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
You can now insert data into this table
INSERT OVERWRITE TABLE S3table
SELECT data
FROM DynamoDBtable;
This will create text files in S3 inside the directory location.
Note that depending on the data size and the number of reducers there may be multiple text files.
File names also contain a random GUID element, e.g. 03d3842f-7290-4a75-9c22-5cdb8cdd201b_000000.
DROP TABLE S3table;
Dropping the external table just breaks the link to the files; the files themselves remain in S3.
Now, if you want Hive to manage the directory, you can create a table that takes control of the S3 directory (note there is no EXTERNAL keyword):
CREATE TABLE IF NOT EXISTS S3table
( data STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
If you now issue a DROP TABLE command, all files in the folder are deleted immediately:
DROP TABLE S3table;
I suggest you create a non-external table, then drop it, and carry on with the rest of your script. If you encounter errors you will have a blank directory after the job finishes.
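A minimal sketch of that sequence, reusing the names from the snippets above (the cleanup table name itself is illustrative):
-- managed (non-external) table that takes control of the S3 directory
CREATE TABLE IF NOT EXISTS S3cleanup (data STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION ${hiveconf:S3Path};
-- dropping the managed table deletes the previous backup files from that directory
DROP TABLE S3cleanup;
-- then recreate the external table and run the INSERT OVERWRITE backup as above;
-- if the backup fails, the directory stays empty and the failure is easy to detect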
Hope this covers what you need

Hadoop: Approach to load local XML files from a share location to Hive

My requirement is to load XML files, which are collected into a network share folder by different sources, into Hive. I need confirmation of the approach to follow.
As I understand it, I have to:
1. load all the files into HDFS first
2. then, using MapReduce or Sqoop, transform the XML files into the required table shape and load them into Hive.
Please suggest a better approach if one exists.
To process and read XML files:
Mahout has an XML input format; see the links below to read more:
https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
http://xmlandhadoop.blogspot.com.au/2010/08/xml-processing-in-hadoop.html
Pig has XMLLoader
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/XMLLoader.html
After processing with either of the above approaches you can push the results to the Hive location.
Thanks
You are not required to copy the data into HDFS first; you can load the data directly into the Hive table using the command:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
1. a relative path, e.g. project/data1
2. an absolute path, e.g. /user/hive/project/data1
3. a full URI with scheme and (optionally) an authority, e.g. hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case hive will move the file into the table) or it can be a directory (in which case hive will move all the files within that directory into the table). In either case filepath addresses a set of files.
If the keyword LOCAL is specified, then:
1. the load command will look for filepath in the local file system. If a relative path is specified, it will be interpreted relative to the current directory of the user. The user can also specify a full URI for local files, for example: file:///user/hive/project/data1
2. the load command will try to copy all the files addressed by filepath to the target filesystem. The target filesystem is inferred by looking at the location attribute of the table. The copied data files will then be moved into the table.
If the keyword LOCAL is not specified, then:
Hive will use the full URI of filepath if one is specified. Otherwise the following rules are applied:
If scheme or authority are not specified, Hive will use the scheme and authority from hadoop configuration variable fs.default.name that specifies the Namenode URI.
If the path is not absolute, then Hive will interpret it relative to /user/
Hive will move the files addressed by filepath into the table (or partition)
If the OVERWRITE keyword is used, then the contents of the target table (or partition) will be deleted and replaced with the files referred to by filepath. Otherwise the files referred to by filepath will be added to the table.
Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath - then the existing file will be replaced with the new file.
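For instance, a hedged example assuming the network share is mounted on the Hive client machine at /mnt/share/xml and a target table xml_raw already exists (both names are illustrative):
-- copies every file under the mounted share directory into the xml_raw table
LOAD DATA LOCAL INPATH '/mnt/share/xml' INTO TABLE xml_raw;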

Loading multiple concatenated CSV files into Oracle with SQLLDR

I have a dump of several PostgreSQL tables in a self-contained CSV file which I want to import into an Oracle database with a matching schema. I found several posts on how to distribute data from one CSV "table" to multiple Oracle tables, but my problem is several DIFFERENT CSV "tables" in the same file.
Is it possible to specify table separators or somehow mark new tables in an SQLLDR control file, or do I have to split up the file manually before feeding it to SQLLDR?
That depends on your data. How do you determine which table a row is destined for? If you can determine the target table based on data in the row, then it is fairly easy to do with a WHEN clause:
LOAD DATA
INFILE 'bunchotables.dat'
INTO TABLE foo WHEN somecol = 'pick me, pick me' (
  ...column defs...
)
INTO TABLE bar WHEN somecol = 'leave me alone' (
  ...column defs...
)
If you've got some sort of header row that determines the target table then you are going to have to split it before hand with another utility.
