HDInsight Error - windows

I am following the steps in the link below to use Hadoop 2.2 clusters with HDInsight: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-30/
In the "Run a Word Count Map Reduce Job" section, I cannot get step 4 to work. In PowerShell I type the following commands:
Submit the job
Select-AzureSubscription $subscriptionName
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
I keep getting an error that states there is a ParameterArgumentValidationError. What command could I use to avoid getting these errors?
I am new to using Azure and could really use some help :)

Those are two separate cmdlets; enter each one as its own command.
The first one is:
Select-AzureSubscription $subscriptionName
If you have only one subscription in your Azure account, you can skip this cmdlet.
The second cmdlet is:
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition

How can I have 2 input links in a DataStage sequence job?

When the SEQ_DIM_ACCOUNT job executes, it has 2 outcomes: Success and Failure.
I want to run execute_command_60 when the job fails. Once execute_command_60 has run, I want the flow to continue from execute_command_60 to SEQ_DIM_BUSINESS_PARTNER, but when I tried to link execute_command_60 to SEQ_DIM_BUSINESS_PARTNER I got the error "the destination stage cannot support any more input links".
Is there a way to do that?
Yes, it is possible with the help of a Sequencer stage.
Add it after the Execute_Command and before SEQ_DIM_BUSINESS_PARTNER. This stage can take any number of input links; you only have to specify whether All or Any of the input links must have run before the flow goes on.
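To make the All/Any choice concrete, here is a toy model in plain Java. It is only a sketch of the triggering rule described above, not DataStage code: the class name SequencerSketch, the Mode enum, and the link names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of a Sequencer stage (hypothetical names, not a DataStage API).
public class SequencerSketch {
    public enum Mode { ALL, ANY }

    private final Mode mode;
    private final Set<String> inputs;
    private final Set<String> fired = new HashSet<>();

    public SequencerSketch(Mode mode, String... inputLinks) {
        this.mode = mode;
        this.inputs = new HashSet<>(Arrays.asList(inputLinks));
    }

    // Records that an input link has run; returns true when the output may fire.
    public boolean trigger(String inputLink) {
        if (!inputs.contains(inputLink)) {
            throw new IllegalArgumentException("unknown input link: " + inputLink);
        }
        fired.add(inputLink);
        // ANY: one finished input is enough. ALL: every input must have finished.
        return mode == Mode.ANY || fired.containsAll(inputs);
    }

    public static void main(String[] args) {
        SequencerSketch any = new SequencerSketch(
                Mode.ANY, "SEQ_DIM_ACCOUNT_success", "execute_command_60");
        System.out.println("ANY fires: " + any.trigger("execute_command_60"));

        SequencerSketch all = new SequencerSketch(
                Mode.ALL, "SEQ_DIM_ACCOUNT_success", "execute_command_60");
        System.out.println("ALL fires: " + all.trigger("execute_command_60"));
    }
}
```

In the scenario from the question only one of the two paths runs (success, or failure followed by execute_command_60), so Any is presumably the setting that lets SEQ_DIM_BUSINESS_PARTNER start.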

Difference between tWarn and tAssert in Talend

Please let me know the difference between the two components tWarn and tAssert in Talend.
I don't see why these were created as separate components. What is the need for each one in a real use case?
tWarn: This component provides a priority-rated message to the next component. It does not stop your Job in case of error. If you want to kill a Job in case of error, see tDie. (taken from Talend Help center docs)
tAssert: This component evaluates the status of a Job execution. It concludes with a boolean result based on an assertive statement related to the execution and feeds the result to tAssertCatcher for proper Job status presentation. (taken from Talend Help center docs)
For a use case, see the link below:
Error Handling in Talend

storm rebalance command not updating the number of workers for a topology

I tried executing the following command for Storm 1.1.1:
storm rebalance [topologyName] -n [number_of_workers]
The command runs successfully, but the number of workers remains unchanged. I tried reducing the number of workers too. That also didn't work.
I have no clue what's happening. Any pointer will be helpful.
FYI:
I have implemented custom scheduling. Is it because of that?
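You can always check Storm's source code behind that CLI. Or code the rebalance yourself (tested against 1.0.2):
// Classes: org.apache.storm.generated.{Nimbus, RebalanceOptions}, org.apache.storm.utils.{NimbusClient, Utils}
// rebalance is an instance method, so obtain a Nimbus client from the local Storm config first
Nimbus.Client client = NimbusClient.getConfiguredClient(Utils.readStormConfig()).getClient();
RebalanceOptions rebalanceOptions = new RebalanceOptions();
rebalanceOptions.set_num_workers(newNumWorkers);
client.rebalance("foo", rebalanceOptions); // "foo" is the topology name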

Create Snapshot of FS from Spark Job

I would like to create a snapshot of the underlying HDFS when running a Spark job. The particular step involves deleting the contents of some Parquet files. I want to create a snapshot, perform the delete operation, verify the results, and proceed with the next steps.
However, I am unable to find a good way to access the HDFS API from my Spark job. The directory I want to snapshot is marked snapshottable in HDFS. The command-line method of creating the snapshot works; however, I need to do this programmatically.
I am running Spark 1.5 on CDH 5.5.
Any hints or clues as to how I can perform this operation?
Thanks
Ramdev
I have not verified this, but at least I do not get compile errors, and in theory this solution should work.
This is Scala code:
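import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

val sc = new SparkContext()
val fs = FileSystem.get(sc.hadoopConfiguration)
// createSnapshot takes the snapshottable directory as a Path and returns the snapshot's Path
val dir = new Path("path to createsnapshot of")
val snapshotPath = fs.createSnapshot(dir, "snapshot name")
// .....
// .....
if (conditionSatisfied) { // placeholder for your own verification result
  // deleteSnapshot takes the snapshottable directory, not the snapshot path
  fs.deleteSnapshot(dir, "snapshot name")
}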
I assume this will work in theory.

Hive execution hook

I need to register a custom execution hook in Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
Thanks,
Arun
There are several types of hooks, depending on the stage at which you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute() : ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
11. For each job at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval: ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
12. If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
13. Finish all the tasks
14. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
15. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
16. Return the result.
For each of the hooks I indicated the interface you have to implement. In brackets is the corresponding configuration property key you have to set at the beginning of the script in order to register the class.
E.g.: setting the pre-execution hook (9th stage of the workflow)
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks:
set hive.exec.pre.hooks=com.example.MyPreHook;
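For illustration, a class like the com.example.MyPreHook registered above might be shaped roughly as follows. This is a sketch under stated assumptions: so that it compiles without Hive on the classpath, it declares small stand-in interfaces. With Hive available you would delete the stand-ins and import the real org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext and HookContext. The queryString() accessor on the stand-in is hypothetical and is not Hive's API.

```java
import java.util.ArrayList;
import java.util.List;

public class PreHookSketch {
    // Stand-in for Hive's HookContext; queryString() is a hypothetical accessor.
    public interface HookContext {
        String queryString();
    }

    // Stand-in mirroring the one-method shape of Hive's ExecuteWithHookContext.
    public interface ExecuteWithHookContext {
        void run(HookContext hookContext) throws Exception;
    }

    // The kind of class you would list in hive.exec.pre.hooks.
    public static class MyPreHook implements ExecuteWithHookContext {
        public final List<String> seen = new ArrayList<>();

        @Override
        public void run(HookContext hookContext) {
            // Custom pre-execution logic goes here; this one only records the query.
            seen.add(hookContext.queryString());
        }
    }

    public static void main(String[] args) {
        MyPreHook hook = new MyPreHook();
        // Simulate the driver invoking the hook just before execution.
        hook.run(() -> "SELECT 1");
        System.out.println("pre-hook saw: " + hook.seen);
    }
}
```

The failure and post-execution hooks listed in the flow use the same ExecuteWithHookContext interface, so the same shape should apply to them; only the configuration key differs.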
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think that the Cloudera distribution differs (too much).
A good start: http://dharmeshkakadia.github.io/hive-hook/
There are examples there.
Note: when you run from the Hive CLI, the messages are shown on the console. If you execute from Hue, add a logger and you can see the results in the HiveServer2 role log.