Connecting NiFi to Vertica

I'm trying to upload a CSV file grabbed from an SFTP server to Vertica as a new table. I have the GetSFTP processor configured, but I can't seem to understand how to set up the connection with Vertica and execute the SQL.

1 - You need to set up a DBCPConnectionPool with your Vertica JAR(s), like @mattyb mentioned.
2 - Create a staging area where you will keep your executables (COPY scripts).
3 - Create a template to manage your scripts or loads (ReplaceText processor).
Note:
The parameters you see here come in the flow file from upstream processors.
This is a reusable process group, so there are many other PGs that will have their output sent to it.
Example:
A data_feed task will run a Start Data Feed PG (which holds its own parameters and values); if it executes with no errors it comes to this step, and if it fails it goes to another reusable PG that handles errors.
A daily ingest process (trickle load every 5 minutes): a PG will prepare the CSV file, move it to staging, and make sure it is all in the right format; if it executes with no errors it comes to this step, and if it fails it goes to the error-handling PG.
And so on - many PGs will use this reusable PG to load data into the DB.
PG stands for Process Group.
This is what mine looks like:
. /home/dbadmin/.profile
/opt/vertica/bin/vsql -U $username -w $password -d analytics -c "
  copy ${TableSchema}.${TableToLoad} FROM '${folder}/*.csv'
    delimiter '|' enclosed by '~' null as ' '
    STREAM NAME '${TableToLoad} ${TaskType}'
    REJECTED DATA AS TABLE ${TableSchema}.${TableToLoad}_Rejects;
  select analyze_statistics('${TableSchema}.${TableToLoad}');"
You can add your own parameters as well or create new ones.
4 - UpdateAttribute processor so you can name the executable.
5 - PutFile processor that will place the Vertica load script on the machine.
6 - ExecuteStreamCommand - this will run the shell script.
Audit logs and any other extras can be handled here as well.
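For illustration, here is a rough sketch of what the script from step 3 might look like once NiFi has substituted the attributes; the schema, table, folder, and task-type values (staging, sales_daily, /data/staging/sales, trickle) are made up, and the credentials are assumed to come from the dbadmin profile:

# hypothetical expansion of the ReplaceText template
# (TableSchema=staging, TableToLoad=sales_daily, folder=/data/staging/sales, TaskType=trickle)
. /home/dbadmin/.profile
/opt/vertica/bin/vsql -U $username -w $password -d analytics -c "
  copy staging.sales_daily FROM '/data/staging/sales/*.csv'
    delimiter '|' enclosed by '~' null as ' '
    STREAM NAME 'sales_daily trickle'
    REJECTED DATA AS TABLE staging.sales_daily_Rejects;
  select analyze_statistics('staging.sales_daily');"

ExecuteStreamCommand then only needs to point at the file that PutFile dropped, e.g. Command Path /bin/bash with the script path as the command argument.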
Even better - see the template at the link below of a reusable PG I use for my data loads into Vertica with NiFi.
http://www.aodba.com/bulk-load-data-vertica-apache-nifi/
As for the Vertica DBCP, the setup should look like the screenshot in the linked post, where the connection URL is your ipaddress:port/databasename.
Note:
This DBCPConnectionPool can be at the project level (inside a PG) or at the NiFi level (create it on the main canvas using the Controller Services menu).
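In case the screenshot is not visible, a typical DBCPConnectionPool setup for Vertica looks roughly like the following; the host, port, database name, and JAR path are placeholders, and property names can vary slightly between NiFi versions:

Database Connection URL    : jdbc:vertica://<ip-address>:5433/<database>
Database Driver Class Name : com.vertica.jdbc.Driver
Database Driver Location(s): /opt/nifi/drivers/vertica-jdbc.jar
Database User              : dbadmin
Password                   : <your password>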

Besides the bulk loader idea from @sKwa, you can also create a DBCPConnectionPool with your Vertica JAR(s) and a PutSQL processor that will execute the SQL. If you need to convert from data to SQL you can use ConvertJSONToSQL; otherwise use PutDatabaseRecord, which is basically a ConvertXToSQL -> PutSQL combined.

Related

Greenplum: gpfdist file serving

I'm running through Greenplum tutorial.
I'm having trouble understanding how gpfdist works.
What does this mean: gpfdist: Serves data files to or writes data files out from Greenplum Database segments.
What does it mean to "serve a file"? I thought it read external tables. Is gpfdist running on both the client and server? How does it work in parallel? Is it calling gpfdist on several hosts, is that how?
I just need help understanding the big picture. In this tutorial http://greenplum.org/gpdb-sandbox-tutorials/ we call it twice, why? (It's confusing because the server and client are on the same machine.)
gpfdist can run on any host. It is basically lighttpd that you point to a directory and it sits there and listens for connections on the port you specify.
On the Greenplum server/database side, you create an external table definition that uses the LOCATION setting to point to your gpfdist location.
You can then query this table and gpfdist will "serve the file" to the database engine.
Read: http://gpdb.docs.pivotal.io/4380/utility_guide/admin_utilities/gpfdist.html
and http://gpdb.docs.pivotal.io/4380/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
An external table is made up of a few things; the two most important are the location of where to get (or put) data, and how to take that data and parse it into something that can be used as table records. When you create the external table you are just creating the definition of how it should work.
Only when you execute a query against an external table do the segments go out and do what has been set up in that definition. It should be noted they aren't creating a persistent connection or caching that data. Each time you execute that query, the cluster is going to look at its definitions, move that data across the wire, and use it for the length of that query.
In the case of gpfdist as an endpoint, it is really just a webserver. People frequently run one on an ETL node. When the location is gpfdist and you create a readable external table, each segment will reach out to gpfdist and ask for a chunk of the file and process it. This is the parallelism: multiple segments reaching out to gpfdist and getting chunks, which they then try to parse into tuples according to what was specified in the table definition, and then assemble it all to create a table of data for your query.
gpfdist can also be the endpoint for a writable external table. In this case the segments all push the data they have to that remote location, and gpfdist writes the data it is pushed down to disk. The thing to note here is that there is no sort order promised; the data is written to disk as it is streamed from multiple segments.
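To make that concrete, here is a minimal sketch of serving a directory with gpfdist and pointing a readable external table at it; the host name, directory, port, and column list are all made up for the example:

# on the ETL host: serve /data/loads on port 8081
gpfdist -d /data/loads -p 8081 -l /tmp/gpfdist.8081.log &

# on the Greenplum master, define an external table that reads
# pipe-delimited files served by that gpfdist instance
psql -d mydb -c "
CREATE EXTERNAL TABLE ext_sales (
    sale_id   int,
    sale_date date,
    amount    numeric(12,2)
)
LOCATION ('gpfdist://etl-host:8081/sales*.csv')
FORMAT 'TEXT' (DELIMITER '|');"

# each segment pulls its own chunks from gpfdist when you query it
psql -d mydb -c "SELECT count(*) FROM ext_sales;"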
Yes, gpfdist is a file distribution service; it is used for external tables.
Greenplum DB can directly query a file like a table from a directory (Unix or Windows).
We can select the flat-file data and do further processing on it. Unicode and wildcard characters can also be processed with a predefined encoding.
The external table concept works with the help of gpfdist.
Syntax to set it up on Windows:
gpfdist -d ${FLAT_FILES_DIR} -p 8081 -l /tmp/gpfdist.8081.log
Just make sure you have gpfdist.exe on your particular source machine.

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a stand alone EC2 VM, I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution that would be as easy as piping the results of a SQL query into a program that dumps that data into a cloud file system (S3, HDFS?); then I will want to convert that data into a format that can be read by DynamoDB.
So I need these things:
1) A transport device, I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system, S3 looks like the obvious choice since I want a turn key solution here.
3) This last part is a nice to have because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert to happen whenever you copy a new file to S3 is pretty simple, just have S3 trigger a Lambda function to process the data and insert into DynamoDB. I'm not clear on how you are going to get the data into S3 though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
You need to know that you can't append to a file on S3; you would need to write the entire file each time you push new data to S3. If you want to stream the data somehow, then using Kinesis instead of S3 might be a better option.
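As a rough sketch of that cron-driven approach (the connect string, query, bucket, and file names are placeholders, and it assumes SQL*Plus and the AWS CLI are installed on the EC2 VM):

#!/bin/bash
# dump the query results to a local CSV, then copy the file to S3;
# an S3 event notification on this prefix can invoke the Lambda that
# writes the rows into DynamoDB
sqlplus -s scott/tiger@ORCL > /tmp/export.csv <<'EOF'
SET PAGESIZE 0 FEEDBACK OFF HEADING OFF TRIMSPOOL ON
SET COLSEP ','
SELECT id, col_a, col_b FROM my_table;
EXIT
EOF

aws s3 cp /tmp/export.csv "s3://my-bucket/oracle-exports/export_$(date +%Y%m%d).csv"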

Writing autosys job information to Oracle DB

Here's my situation: we have no access to the autosys server other than using the autorep command. We need to keep detailed statistics on each of our jobs. I have written some Oracle database tables that will store start/end times, exit codes, JIL, etc.
What I need to know is what is the easiest way to output the data we require (which is all available in the autosys tables that we do not have access to) to an Oracle database.
Here are the technical details of our system:
autosys version - I cannot figure out how to get this information
Oracle version - 11g
We have two separate environments - one for UAT/QA/IT and several PROD servers
Do something like below:
Create a table with the parameters you want to capture. Include a key column that is auto-generated. The JIL column should be able to handle large amounts of data. Also add a column for sysdate.
Create a shell script. Inside it do as follows (a full sketch is given after these steps):
Run "autorep -j <job_name> -l0" to get all the jobs you want and put them in a file. -l0 is to ignore duplicate jobs: if a box contains a job, then without -l0 you will get the job twice.
Create a loop and read the job names one by one.
In the loop, set variables for jobname/starttime/endtime/status (all of which you can get from autorep -j <job_name>). Then use a variable to hold the JIL from autorep -q -j <job_name>.
Append all these variable values to a flat file.
End the loop. After exiting the loop you will end up with a file containing all the job details.
Then use SQL*Loader to put the data into your Oracle table. You can hard-code a control file and use it for every run, but the contents of the data file will change with every run.
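Putting those steps together, a rough sketch of such a script could look like this; the job filter, file paths, credentials, and awk field positions are all assumptions that you would adjust to your own autorep output and table layout:

#!/bin/bash
DATAFILE=/tmp/autosys_stats.dat
> "$DATAFILE"

# step 1: list the jobs once, -l0 so jobs inside boxes are not repeated
autorep -j MY_APP_% -l0 | awk 'NR>3 {print $1}' > /tmp/job_list.txt

# steps 2-5: loop over the jobs, collect start/end/status plus the JIL
while read -r JOB; do
    LINE=$(autorep -j "$JOB" | tail -1)
    START=$(echo "$LINE"  | awk '{print $2" "$3}')
    END=$(echo "$LINE"    | awk '{print $4" "$5}')
    STATUS=$(echo "$LINE" | awk '{print $6}')
    JIL=$(autorep -q -j "$JOB" | tr '\n' ' ')
    # one pipe-delimited record per job
    echo "$JOB|$START|$END|$STATUS|$JIL" >> "$DATAFILE"
done < /tmp/job_list.txt

# step 6: load the flat file with SQL*Loader using a fixed control file
sqlldr userid=app_user/app_pwd@ORCL control=autosys_stats.ctl \
       data="$DATAFILE" log=/tmp/autosys_stats.log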
Let me know if any part is not clear.

How to write datastage performance stats on a DB2 table?

My DataStage version is 8.5.
I have to populate a table in DB2 with the DataStage performance data: something like job_name, start_time, finish_time and execution_date.
There is a master sequence with A LOT of jobs. The sequence itself runs once a day.
After every run of this sequence I must gather the performance values and load them into a table on DB2, for reporting purposes.
I'm new to DataStage and I don't have any idea how to make it work. My DataStage environment is Windows, so I can't work on it using shell scripts.
Is there some way to get this info out of DataStage?
I tried to build a server routine and get the data using DSGetJobInfo, but I got stuck on parameter issues (how to pass xx jobs as a list to it).
Sorry about my English, it's not my native language.
Thanks in advance.
Is your server also on Windows? I am confused since you said "My DataStage environment is Windows" - most of the time the servers are installed on Linux/Unix and the clients are on Windows.
The best command to use would be (the same should work on both Windows and Linux servers):
dsjob -jobinfo [project name] [job name]
The output would be something like:
Job Status : RUN OK (1)
Job Controller : not available
Job Start Time : Tue Mar 17 09:03:37 2015
Job Wave Number : 9
User Status : not available
Job Control : 0
Interim Status : NOT RUNNING (99)
Invocation ID : not available
Last Run Time : Tue Mar 17 09:09:00 2015
Job Process ID : 0
Invocation List : [job name]
Job Restartable : 0
After all these years I found some ways to get a job's metadata, but none of them are as good as I wanted; all of them are kind of clunky to implement and they fail often. I found 3 ways to get job metadata:
Query directly from XMETA, on the tables that match the DATASTAGEX(*) naming.
Query from DSODB. DSODB is the database behind the Operations Console tool; it has all the log information about job runs, but Operations Console must be enabled for the data to be there (turn on the AppWatcher process).
For both of the above you can build an ETL that reads from these databases and writes wherever you want.
And the last solution:
Call an after-job subroutine that calls a script which writes the job's results to a custom table.
If this data is needed only for reporting and analysis, the first two solutions are just fine. For more specific behavior, the third one is necessary.
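For the third option, here is a minimal sketch of the script an after-job subroutine (ExecSH) could call; it assumes a Unix engine tier, the dsjob client, and the DB2 command-line client, and the database, credentials, and table name are made up:

#!/bin/bash
# called as: log_job_stats.sh <project> <job>
PROJECT=$1
JOB=$2

INFO=$(dsjob -jobinfo "$PROJECT" "$JOB")
START=$(echo "$INFO"  | grep "Job Start Time" | cut -d: -f2- | xargs)
END=$(echo "$INFO"    | grep "Last Run Time"  | cut -d: -f2- | xargs)
STATUS=$(echo "$INFO" | grep "Job Status"     | cut -d: -f2- | xargs)

# store the raw strings; convert to TIMESTAMP in the table if you prefer
db2 connect to MYDB user etl_user using etl_pwd > /dev/null
db2 "INSERT INTO ETL.JOB_RUN_STATS (JOB_NAME, START_TIME, FINISH_TIME, STATUS, EXECUTION_DATE)
     VALUES ('$JOB', '$START', '$END', '$STATUS', CURRENT DATE)"
db2 terminate > /dev/null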
What you are asking about is the ETL audit process, which is one of the mainstays of ETL development. I am surprised that your ETL design does not already have one.
Querying XMETA - in my experience across multiple DataStage environments, I have not seen companies use the XMETA DB to pull out job performance information.
Why? Because DataStage jobs are not recommended to access the XMETA DB, considering that XMETA holds the important metadata information about DS. Your DataStage administrator will probably also not agree to provide access to XMETA.
The old and most trusted way of capturing run metadata is to develop multiple-instance audit jobs (with runtime column propagation) and a few audit tables in the database of your choice.
My idea:
1. Create a table like ETL_RUN_STATS which has fields like JOB_NAME, STARTED_TS, FINISHED_TS, STATUS, etc.
2. Now create your multiple-instance audit jobs and include them in your DS master sequences.
If your DS sequence looks like this now:
START ------> MAIN_DSJOB -------> SUCCESS
then after adding your audit jobs your DS sequence should look like this:
START ----> AUDIT_JOB(started) -------> MAIN_DSJOB ------> AUDIT_JOB(finished) -------> SUCCESS
You can include as much functionality as you need in your AUDIT jobs to capture more runtime information (a sketch of the audit table and statements follows below).
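A minimal sketch of the audit table and the two statements the AUDIT_JOB instances would fire, shown here through the DB2 command line purely for illustration; the table, column, and job names are only suggestions:

#!/bin/bash
db2 connect to MYDB user etl_user using etl_pwd

# one-time setup of the audit table
db2 "CREATE TABLE ETL_RUN_STATS (
       RUN_ID      INTEGER GENERATED ALWAYS AS IDENTITY,
       JOB_NAME    VARCHAR(128) NOT NULL,
       STARTED_TS  TIMESTAMP,
       FINISHED_TS TIMESTAMP,
       STATUS      VARCHAR(20))"

# AUDIT_JOB(started): open a row when MAIN_DSJOB kicks off
db2 "INSERT INTO ETL_RUN_STATS (JOB_NAME, STARTED_TS, STATUS)
     VALUES ('MAIN_DSJOB', CURRENT TIMESTAMP, 'STARTED')"

# AUDIT_JOB(finished): close the latest open row for that job
db2 "UPDATE ETL_RUN_STATS
        SET FINISHED_TS = CURRENT TIMESTAMP, STATUS = 'FINISHED'
      WHERE JOB_NAME = 'MAIN_DSJOB' AND FINISHED_TS IS NULL"

db2 terminate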
I am suggesting this only because your DS version is really old - version 8.5.
With the newer versions of DS there are a lot of built-in features to access this information. Maybe you can convince your manager to upgrade DS :)
Let me know how it works

Running SQLLDR in DataStage

I was wondering, for folks familiar with DataStage, whether Oracle SQLLDR can be used with DataStage. I have some sets of control files that I would like to incorporate into DataStage. A step-by-step way of accomplishing this would be greatly appreciated. Thanks.
My guess is that you can run it with an External stage in DataStage.
You simply put the SQLLDR command in the External stage and it will be executed.
Try it and tell me what happens.
We can use Oracle SQL*Loader from DataStage.
If you check the Oracle docs there are two types of loading with SQL*Loader:
1) Direct path load - less validation on the database side
2) Conventional path load
There is less validation in a direct path load compared to a conventional path load.
In the SQL*Loader process we have to specify things like (a sample control file and invocation are sketched after this list):
Direct path or not
Parallel or not
Constraint and index options
Control, discard, and log files
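For reference, a rough sketch of a control file and a direct-path invocation covering those points; the file names, table, columns, and credentials are placeholders:

#!/bin/bash
# hypothetical control file covering rejects (bad file) and discards
cat > employees.ctl <<'EOF'
LOAD DATA
INFILE 'employees.dat'
BADFILE 'employees.bad'
DISCARDFILE 'employees.dsc'
APPEND INTO TABLE employees
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(emp_id, emp_name, hire_date DATE "YYYY-MM-DD")
EOF

# direct path, parallel load; note that indexes and constraints are
# handled differently under DIRECT=TRUE, so check the Oracle docs
sqlldr userid=scott/tiger@ORCL control=employees.ctl \
       log=employees.log direct=true parallel=true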
In DataStage, we have the Oracle Enterprise and Oracle Connector stages.
Oracle Enterprise -
We have a Load option in this stage to load data in fast mode, and we can set the Oracle load OPTIONS through an environment variable.
For Oracle, an example is below:
OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)
Oracle Connector -
We have a Bulk load option for it, and the other properties related to SQL*Loader are available in the properties tab.
Example - the control and discard file values are all set by DataStage, but you can set these properties and others manually.
As you know, SQLLDR basically loads data from files into the database, so DataStage lets you take the input from a data file via something like a Sequential File stage, pass it the format and the schema of the table, and it will create an in-memory template table; then you can use a database connector like ODBC or DB2, and that will load your data into your table, simple as that.
NOTE: if your table does not already exist at the backend, then for the first execution set the stage to create the table, then change it to append or truncate.
Steps:
Read the data from the file (Sequential File stage).
Load it using the Oracle Connector (you could use bulk load so that the direct load method via SQL*Loader is used; the data file and control file settings can be configured manually). Bulk load operation: it receives records from the input link and passes them to the Oracle database, which formats them into blocks and appends the blocks to the target table, as opposed to storing them in the available free space in existing blocks.
You can refer to the IBM documentation for more details.
Remember, there might be some restrictions on loading when it comes to handling rejects, triggers, or constraints when you use bulk load. It all depends on your requirements.

Resources