How to identify which input needs to be passed to Kettle jobs when running them in the PDI tool? - transformation

We have many .ktr Kettle jobs. We are new to these and have found a way to run them in the PDI (Pentaho Data Integration) tool, but we are unable to identify which input file needs to be passed to these Kettle jobs when executing them in PDI.
Can anyone please explain how to work out, or where to check, which input file needs to be passed to these Kettle jobs during execution?
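One way to get a first idea is to remember that a .ktr file is plain XML, so you can inspect it for declared parameters and hard-coded file paths before running it in Spoon. The sketch below is only a rough illustration and assumes the element names (parameter, name, filename) commonly found in .ktr files; sample.ktr is a placeholder path.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class KtrInputInspector {
    public static void main(String[] args) throws Exception {
        // Path to the transformation to inspect (placeholder).
        String ktrPath = args.length > 0 ? args[0] : "sample.ktr";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(ktrPath));

        // Named parameters the transformation expects to be supplied at run time
        // (assumes the usual <parameters>/<parameter>/<name> layout of a .ktr file).
        NodeList params = doc.getElementsByTagName("parameter");
        for (int i = 0; i < params.getLength(); i++) {
            Element p = (Element) params.item(i);
            NodeList names = p.getElementsByTagName("name");
            if (names.getLength() > 0) {
                System.out.println("parameter: " + names.item(0).getTextContent());
            }
        }

        // File paths hard-coded in input steps (e.g. <filename> elements).
        NodeList files = doc.getElementsByTagName("filename");
        for (int i = 0; i < files.getLength(); i++) {
            System.out.println("file reference: " + files.item(i).getTextContent());
        }
    }
}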

Related

How to get resources used for FINISHED Hadoop jobs from YARN logs using job names?

I have a Unix shell script which runs multiple Hive scripts. I have given a job name to every Hive query inside the Hive scripts.
What I need is that at the end of the shell script, I can retrieve the resources (in terms of memory used and containers) consumed by the Hive queries, based on the job names, from the YARN logs/applications whose status is 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN History Server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster level information on jobs executed and then how to get information on individual jobs.
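As a rough illustration, the sketch below fetches the list of completed MapReduce jobs from the JobHistory server's REST endpoint; the host name is a placeholder and 19888 is the usual default port, so adjust both for your cluster. The per-job memory and container figures would then come from the job-level resources (for example the counters endpoint) described on the same page.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HistoryServerQuery {
    public static void main(String[] args) throws Exception {
        // JobHistory server REST endpoint listing completed MapReduce jobs
        // (host/port are placeholders for your cluster).
        URL url = new URL("http://history-server-host:19888/ws/v1/history/mapreduce/jobs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        // Dump the raw JSON; in practice you would parse it and match the
        // "name" field against the job names set in your Hive scripts.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}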

Alarm notification in Hive/Hadoop

I have been using Apache Hive for a while, and I want to know if there is a way of setting alarms in Hive, i.e., whether I can run a shell script or send an email if there is a job failure. My Hive jobs generally take a couple of hours, and I want an immediate notification when a job fails so that I can take action right away. Or at least, please tell me if I can set up similar alarms in Hadoop?
When you call a Hive script from a Unix/Linux box, you invoke hive with a hyphen option (for example -e or -f) followed by your Hive SQL, or run your Hadoop command through a Unix script. Since you are already executing your Hive or Hadoop script from the Unix/Linux box, a wrapper shell script that checks the exit status and uses mail or mailx is enough to alert you.
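If you would rather drive the same pattern from Java than from a wrapper shell script, a rough sketch could look like the following. It assumes hive and mailx are available on the PATH; the script path and recipient address are placeholders.

import java.io.IOException;

public class HiveJobAlert {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Run the Hive script and wait for it to finish (script path is a placeholder).
        Process hive = new ProcessBuilder("hive", "-f", "/path/to/queries.hql")
                .inheritIO()
                .start();
        int exitCode = hive.waitFor();

        // On failure, send a mail via the local mailx command
        // (recipient address is a placeholder).
        if (exitCode != 0) {
            Process mail = new ProcessBuilder(
                    "sh", "-c",
                    "echo 'Hive job failed with exit code " + exitCode
                            + "' | mailx -s 'Hive job failure' you@example.com")
                    .start();
            mail.waitFor();
        }
    }
}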

Run a Hadoop job without an output file

Is it possible to run a Hadoop job without specifying an output file?
When I try to run a Hadoop job, a "no output file specified" exception is thrown.
Can anyone please give a procedure to do so using Java?
I am writing the data processed by the reducer to a non-relational database, so I no longer need it to write to HDFS.
Unfortunately, you can't really do this. Writing output is part of the framework. When you work outside of the framework, you basically have to just deal with the consequences.
You can use NullOutputFormat, which doesn't write any data to HDFS. I think it still creates the folder, though. You could always let Hadoop create the folder, then delete it.
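A minimal sketch of that approach with the newer mapreduce API is shown below; it uses the identity Mapper/Reducer so it compiles on its own, and in practice you would substitute a reducer that writes to your database. The input path is a placeholder passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "no-hdfs-output");
        job.setJarByClass(NoOutputJob.class);

        // Identity classes here; in practice the reducer would write its
        // results to the non-relational database itself.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Input path supplied as the first command-line argument.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Discard framework output entirely; no output path is required.
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}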

How can I submit a computation to a remote Hadoop cluster and track it until it finishes?

I have a Java program and I want to send a task (a jar) from it to a remote
Hadoop cluster. Of course, I need to pass special parameters to the jar.
When the calculation task has ended, the Java program must know this.
Can I do this through the Hadoop API?
Where can I find articles or other material on this as well?
Hadoop has APIs for this, so if you write Java code for a Hadoop job, you can define the job characteristics like:
job.setMapperClass(),
job.setReducerClass(),
job.setPartitionerClass(),
FileInputFormat.addInputPath(job, ...),
etc.
then you run your job, and you can wait for the job to finish by using
job.waitForCompletion(true)
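Putting those pieces together, a rough driver sketch is shown below. The cluster addresses, paths, and the custom parameter are placeholders (normally the addresses come from the cluster's *-site.xml files on the classpath), and the identity Mapper/Reducer stand in for your own classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteJobSubmitter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the remote cluster (host names are placeholders).
        conf.set("fs.defaultFS", "hdfs://remote-namenode:8020");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "remote-rm:8032");

        // "Special parameters" for the mapper/reducer can be passed through the
        // Configuration and read back with context.getConfiguration().get(...).
        conf.set("my.custom.param", "some-value");

        Job job = Job.getInstance(conf, "remote-job");
        job.setJarByClass(RemoteJobSubmitter.class);
        // Identity classes here; substitute your own Mapper/Reducer implementations.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        FileInputFormat.addInputPath(job, new Path("/input/path"));
        FileOutputFormat.setOutputPath(job, new Path("/output/path"));

        // Blocks until the job ends, so the calling program knows when it is done.
        boolean succeeded = job.waitForCompletion(true);
        System.out.println("Job finished, success = " + succeeded);
    }
}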

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run a job at a particular time?
i.e., is there any way to configure it to run the job daily at 3 PM, for example?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
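For example, a rough Quartz sketch that fires daily at 3 PM might look like the following; the job and trigger identities are placeholders, and the actual Hadoop submission would go inside execute().

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class DailyHadoopJob implements Job {
    @Override
    public void execute(JobExecutionContext context) {
        // Submit the Hadoop job here, e.g. by building a Job and calling
        // waitForCompletion(), or by shelling out to `hadoop jar ...`.
        System.out.println("Submitting the Hadoop job...");
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail detail = JobBuilder.newJob(DailyHadoopJob.class)
                .withIdentity("dailyHadoopJob")
                .build();

        // Fire every day at 15:00 (3 PM).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("daily3pm")
                .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(15, 0))
                .build();

        scheduler.scheduleJob(detail, trigger);
        scheduler.start();
    }
}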
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: the Oozie workflow specification supports both data
and time triggers. Users can specify an execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about having a script that executes your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
I'd use a commercial scheduling app, and/or a custom workflow solution, if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.
