Running SQLLDR in DataStage - oracle

For folks familiar with DataStage, I was wondering if Oracle SQLLDR can be used in DataStage. I have some sets of control files that I would like to incorporate into DataStage. A step-by-step way of accomplishing this would be greatly appreciated. Thanks

My guess is that you can run it with an external stage in DataStage.
You simply put the SQLLDR command in the external stage and it will be executed.
Try it and let me know what happens.
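For reference, a minimal sqlldr invocation of the kind you would embed in such a stage might look like the sketch below; the connect string, control file and log/bad file paths are placeholders for your own:

# hypothetical example: run SQL*Loader against an existing control file
sqlldr userid=scott/tiger@ORCL \
       control=/data/ctl/customers.ctl \
       log=/data/log/customers.log \
       bad=/data/bad/customers.bad \
       errors=50 direct=true

direct=true switches to a direct path load; drop it (or set direct=false) for a conventional path load.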

We can use Oracle SQL*Loader in DataStage.
If you check the Oracle docs, SQL*Loader supports two load paths:
1) Direct path load
2) Conventional path load
A direct path load performs less validation on the database side than a conventional path load.
In the SQL*Loader process we have to specify things like:
direct path or not
parallel or not
constraint and index options
control, discard and log files
In DataStage we have the Oracle Enterprise and Oracle Connector stages.
Oracle Enterprise -
This stage has a Load option for fast loading, and we can pass SQL*Loader OPTIONS to Oracle through an environment variable, for example:
OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)
Oracle Connector -
This stage has a Bulk load option, and the other SQL*Loader-related properties are available on the Properties tab.
For example, the control and discard file values are all set by DataStage, but you can set these and other properties manually.
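For orientation, here is a minimal, hypothetical SQL*Loader control file of the kind these stages generate (or that you can supply yourself); the table, column and file names are placeholders, and the OPTIONS clause at the top is where settings like the DIRECT/PARALLEL example above live:

-- illustrative control file only; adjust paths, table and columns to your environment
OPTIONS (DIRECT=TRUE, PARALLEL=FALSE, ERRORS=50)
LOAD DATA
INFILE '/data/in/customers.dat'
BADFILE '/data/bad/customers.bad'
DISCARDFILE '/data/dsc/customers.dsc'
APPEND
INTO TABLE customers
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(cust_id, cust_name, created_dt DATE "YYYY-MM-DD")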

As you know, SQLLDR basically loads data from files into the database, and DataStage lets you do the same thing with any input data file: read it with something like a Sequential File stage, give it the format and the schema of the table (it creates an in-memory template table), and then use a database connector such as ODBC or DB2 to load the data into your table. Simple as that.
NOTE: if your table does not already exist at the backend, set the stage to Create for the first execution, then switch it to Append or Truncate.

Steps:
Read the data from the file (Sequential File stage).
Load it using the Oracle Connector. You can use Bulk load so that the direct load method of SQL*Loader is used, and the data file and control file settings can be configured manually. Bulk load operation: the stage receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space of existing blocks.
You can refer to the IBM documentation for more details.
Remember, there may be some restrictions around handling rejects, triggers or constraints when you use bulk load. It all depends on your requirement.
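One practical check after a bulk (direct path) load, since a failed direct path load can leave indexes unusable and constraints disabled; the table name below is a placeholder:

-- verify index and constraint state after a direct path load
SELECT index_name, status FROM user_indexes WHERE table_name = 'CUSTOMERS';
SELECT constraint_name, status, validated FROM user_constraints WHERE table_name = 'CUSTOMERS';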

Related

How to load data into multiple tables using the data load feature of Oracle APEX

I have two tables, one for the client and one for the client locations.
I want to upload a CSV file using data loading and insert the data into the corresponding tables.
So there are two options:
SQL*Loader is a command line utility that can load large amounts of delimited text data very quickly.
If you have Oracle Application Express installed, you can use the Data Workshop to load data. This would be quick to use if this is just a one-off data load.
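If you go the SQL*Loader route, a control file can also route rows from a single CSV into more than one table using WHEN clauses. A rough sketch, assuming a leading record-type column; the table and column names are made up:

LOAD DATA
INFILE 'clients.csv'
APPEND
INTO TABLE clients
WHEN (1:1) = 'C'
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(rec_type FILLER, client_id, client_name)
INTO TABLE client_locations
WHEN (1:1) = 'L'
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(rec_type FILLER POSITION(1), client_id, location_address)

POSITION(1) on the first field of the second INTO TABLE clause resets the scan to the start of each record, which is needed when splitting delimited data across tables.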

How to update data in CSV files in an Oracle stored procedure

Create a stored procedure that will read the .csv file from the Oracle server path using a file read operation, query the data in some X table, and write the output to a .csv file.
Here, after reading the .csv file, I need to compare the .csv file data with the table data and update a few columns in the .csv file.
Oracle works best with data in the database. UPDATE is one of the most frequently used commands.
But modifying a file that resides in some directory seems to be somewhat out of scope; there are other programming languages better suited to it, I believe. However, if a hammer is the only tool you have, every problem looks like a nail.
I can think of two options.
One is to load the file into the database. Use SQL*Loader to do that if the file resides on your PC, or - if you have access to the database server and the DBA granted you read/write privileges on a directory (an Oracle object which points to a filesystem directory) - use it as an external table. Once you load the data, modify it and export it back (i.e. create a new CSV file) using spool.
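For the spool step, a sketch of what that export could look like; the connect string, paths, table and column names are placeholders:

# hypothetical: rebuild the CSV from the updated table with SQL*Plus spool
sqlplus -s scott/tiger@ORCL <<'EOF'
SET PAGESIZE 0 FEEDBACK OFF HEADING OFF TRIMSPOOL ON
SPOOL /data/out/x_table_updated.csv
SELECT col1 || ',' || col2 || ',' || col3 FROM x_table;
SPOOL OFF
EXIT
EOF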
Another option is to use the UTL_FILE package. It also requires access to the database server's directory. Using the A(ppend) option you can add rows to the original file, but I don't think you can edit it in place, so this option - in the end - finishes like the previous one: by creating a new file (but this time using UTL_FILE).
Conclusion? Don't use a database management system to modify files. Use another tool.

Hive: modifying an external table's location takes too long

Hive has two kinds of tables, managed and external; for the difference, you can check Managed vs. External Tables.
Currently, to move external database from HDFS to Alluxio, I need to modify external table's location to alluxio://.
The statement is something like: alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"
According to my understanding, it should be a simple metastore modification; however, for some tables the modification takes dozens of minutes. The database itself contains about 1 TB of data, by the way.
Is there any way for me to accelerate the table alter process? If not, why is it so slow? Any comment is welcome, thanks.
I found the suggested way, which is the metatool under $HIVE_HOME/bin.
metatool -updateLocation <new-loc> <old-loc>
    Update the FS root location in the metastore to the new location. Both new-loc
    and old-loc should be valid URIs with valid host names and schemes. When run
    with the dryRun option, changes are displayed but are not persisted. When run
    with the serdepropKey/tablePropKey option, updateLocation looks for the
    serde-prop-key/table-prop-key that is specified and updates its value if found.
By using this tool, the location modification is very fast (maybe several seconds).
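For example, a dry run first and then the real update; the host names and paths below are illustrative:

# preview the change (nothing is persisted with -dryRun)
$HIVE_HOME/bin/metatool -updateLocation alluxio://node1:19998/user/root/tpcds hdfs://node1:8020/user/root/tpcds -dryRun
# apply it
$HIVE_HOME/bin/metatool -updateLocation alluxio://node1:19998/user/root/tpcds hdfs://node1:8020/user/root/tpcds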
Leave this thread here for anyone who might run into the same situation.

Connecting NiFi to Vertica

I'm trying to upload a CSV file grabbed from an SFTP server to Vertica as a new table. I have the GetSFTP processor configured, but I can't seem to understand how to set up the connection with Vertica and execute the SQL.
1 - You need to set up a DBCPConnectionPool with your Vertica JAR(s), as @mattyb mentioned.
2 - Create a staging area where you will keep your executable (copy scripts).
3 - Create a template to manage your scripts or loads (ReplaceText processor).
Note:
The parameters you see here come in the flow file from upstream processors.
This is a reusable process group, so there are many other PGs that will have their output sent to it.
Example:
A data_feed task will run a Start Data Feed PG (which holds its own parameters and values); if it executes with no errors it comes to this step, and if it fails it goes to another reusable PG that handles errors.
A daily ingest process (trickle load every 5 min): a PG will prepare the CSV file, move it to staging and make sure it is all in the right format; if it executes with no errors it comes to this step, and if it fails it goes to another reusable PG that handles errors.
And so on - many PGs will use this as a reusable PG to load data into the DB.
PG stands for Process Group.
This is what mine looks like:
. /home/dbadmin/.profile
/opt/vertica/bin/vsql -U $username -w $password -d analytics -c "
  copy ${TableSchema}.${TableToLoad} FROM '${folder}/*.csv'
    delimiter '|' enclosed by '~' null as ' '
    STREAM NAME '${TableToLoad} ${TaskType}'
    REJECTED DATA AS TABLE ${TableSchema}.${TableToLoad}_Rejects;
  select analyze_statistics('${TableSchema}.${TableToLoad}');"
- you can add your own parameters as well or create new ones
4 - UpdateAttribute processor so you can name the executable.
5 - PutFile processor that will place the Vertica load script on the machine.
6 - ExecuteStreamCommand - this will run the shell script.
- audit logs and any other stuff can be done here.
Even better - see the attached template of a reusable PG I use for my data loads into Vertica with NiFi:
http://www.aodba.com/bulk-load-data-vertica-apache-nifi/
As for the Vertica DBCP, the setup needs your IP address, port and database name in the connection URL, along the lines of the sketch below.
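A sketch of the DBCPConnectionPool properties (exact property labels vary a little between NiFi versions; the host, port, database, user and JAR path are placeholders, while the driver class name and URL format are the standard Vertica JDBC ones):

Database Connection URL     : jdbc:vertica://10.0.0.5:5433/analytics
Database Driver Class Name  : com.vertica.jdbc.Driver
Database Driver Location(s) : /opt/nifi/drivers/vertica-jdbc.jar
Database User               : dbadmin
Password                    : ********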
Note:
This DBCPConnectionPool can be at the project level (inside a PG) or at the NiFi level (create it on the main canvas using the Controller Services menu).
Besides the bulk loader idea from @sKwa, you can also create a DBCPConnectionPool with your Vertica JAR(s) and a PutSQL processor that will execute the SQL. If you need to convert from data to SQL you can use ConvertJSONToSQL; otherwise use PutDatabaseRecord, which is basically a "ConvertXToSQL -> PutSQL" combined.

What is the best way to ingest data from Teradata into Hadoop with Informatica?

What is the best way to ingest data in parallel from a Teradata database into Hadoop?
If we create a job which simply opens one session to the Teradata database, it will take a lot of time to load a huge table.
If we create a set of sessions to load data in parallel, and also issue a SELECT in each of the sessions, it will run a set of full table scans on Teradata to produce the data.
What is the recommended best practice to load data in parallel streams without creating unnecessary workload on Teradata?
If Teradata supports table partitioning like Oracle does, you could try reading the table based on partitioning points, which will enable parallelism in the read.
The other option you have is to split the table into multiple partitions by adding a WHERE clause on an indexed column. This will ensure an index scan and you can avoid a full table scan.
The most scalable way I have found to ingest data into Hadoop from Teradata is to use the Teradata connector for Hadoop. It is included in the Cloudera and Hortonworks distributions. I will show an example based on the Cloudera documentation, but the same works with Hortonworks as well.
Informatica Big Data Edition uses a standard Sqoop invocation via the command line, submitting a set of parameters to it. So the main question is which driver to use to make parallel connections between the two MPP systems.
Here is the link to the Cloudera documentation:
Using the Cloudera Connector Powered by Teradata
And here is the digest from this documentation (you will find that this connector supports different kinds of load balancing between connections):
Cloudera Connector Powered by Teradata supports the following methods for importing data from Teradata to Hadoop:
split.by.amp
split.by.value
split.by.partition
split.by.hash
split.by.amp Method
This optimal method retrieves data from Teradata. The connector creates one mapper per available Teradata AMP, and each mapper subsequently retrieves data from each AMP. As a result, no staging table is required. This method requires Teradata 14.10 or higher.
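As a rough illustration of such an invocation (the host, database, credentials and target directory are placeholders, and the connector-specific --input-method argument after the -- separator is an assumption you should verify against your connector version's documentation):

# assumed example of a Sqoop import through the Cloudera Connector Powered by Teradata
sqoop import \
  --connect jdbc:teradata://td-server/DATABASE=sales \
  --username etl_user --password '****' \
  --table ORDERS \
  --target-dir /user/etl/orders \
  --num-mappers 8 \
  -- --input-method split.by.amp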
If you use partition names in the SELECT clause, PowerCenter will select only the rows within that partition, so there won't be duplicate reads (don't forget to choose Database partitioning at the Informatica session level). However, if you use key range partitioning, you have to choose the ranges in the settings, as you mentioned. Usually we use the Oracle analytical function NTILE to split the table into multiple portions so that the reads are unique across the SELECTs. If you have a range/auto-generated/surrogate key column in the table, use it in the WHERE clause - write a sub-query to divide the table into multiple portions. Please let me know if you have any questions.
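For example, a sketch of how NTILE can produce the key ranges for, say, four parallel sessions; the table and column names are made up:

-- compute boundary values for four roughly equal slices of the source table
SELECT bucket, MIN(order_id) AS range_start, MAX(order_id) AS range_end
FROM (SELECT order_id, NTILE(4) OVER (ORDER BY order_id) AS bucket FROM orders)
GROUP BY bucket
ORDER BY bucket;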
