Azure Databricks scheduled jobs taking longer to respond - job-scheduling

In Azure Databricks I have scheduled a job with a notebook attached that contains a simple Python script:
dbutils.widgets.text("input", "", "")
dbutils.widgets.get("input")
y = getArgument("input")
print("Param 'input':")
print(y)
Cluster: D8s_v3 (1 worker)
Even though it is quite simple code, it takes about 9 to 10 seconds to execute as a Databricks job. If I run the Python file directly, it executes in under 1 second.
Please guide me on how to optimize it for Databricks jobs.

Related

Spring Batch: alert with Grafana & Prometheus if a job failed in the last xx minutes

I am using Spring Batch (4.2.2.RELEASE) together with the Spring Boot Actuator (2.2.6.RELEASE). Since version 4.2, Spring Batch provides support for batch monitoring and metrics based on Micrometer (https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/monitoring-and-metrics.html).
For example, with the metric named spring_batch_job I am able to see how often a job was executed, its status, and its duration.
I want to monitor this metric with Grafana & Prometheus and alert if a job failed in the last xx minutes.
If the Spring Batch application runs as a service, it seems to sum up all the metrics until the service is stopped. For example, if a job was started 12 times in the last hour, the metrics output could be the following:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 10.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 354.354538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
So two instances of the mainJob failed. Assuming that in the next hour all 12 jobs complete successfully, the metrics output would be:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 22.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 708.704538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
How am I able to check whether a job failed in the last xx minutes? The following expression would still return the two failed job instances: spring_batch_job_seconds_count{status="FAILED"}[15m]
I'm not very familiar with PromQL, but I will try to help.
What you can do is to calculate the difference of this counter between the last hour and the hour before. If you see an increase in the number of failed instances, then at least one instance has failed and you can raise an alert. Otherwise, no job has failed in the previous hour.
Prometheus provides the increase function that is designed specifically for that. So you should be able to answer your question and raise an alert when:
increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0
As I said, I'm not an expert at Prometheus, so I will let you check the syntax, but that's the idea.
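For completeness, a minimal Prometheus alerting-rule sketch built on that expression could look like the following (the group and alert names are placeholders, not something from the original setup):
# alerting rule sketch: fire when at least one mainJob instance failed in the last 15 minutes
groups:
  - name: spring-batch-alerts            # placeholder group name
    rules:
      - alert: SpringBatchMainJobFailed  # placeholder alert name
        expr: increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "mainJob failed at least once in the last 15 minutes"
Grafana can also alert on the same increase() expression directly if you prefer to keep the alert definition in a dashboard.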

Issues with mkdbfile in a simple "read a file > Create a hashfile job"

Hello DataStage-savvy people.
Two days in a row, the same single DataStage job has failed (without stopping at all).
The job tries to create a hashfile using the command /logiciel/iis9.1/Server/DSEngine/bin/mkdbfile /[path to hashfile]/[name of hashfile] 30 1 4 20 50 80 1628
(last trace in the log)
Something to consider (or maybe not?):
The [name of hashfile] directory exists (and was last modified at the time of execution), but the file D_[name of hashfile] does not.
I am trying to understand what happened, in order to prevent the same incident from happening on the next run (tonight).
Before this, the job had been in production for a long time, and we had never had an issue with it.
Using DataStage 9.1.0.1.
Did you check the job log to see if it captured an error? When a DataStage job executes a system command via a command execution stage or a similar method, the stdout of the called command is captured and added to a message in the job log. So if the mkdbfile command produces any output (success messages, errors, etc.), it should be captured and logged. Depending on the return code, the event may not be flagged as an error in the job log, but the output should be there.
If there is no logged message revealing the cause of the failed create, a couple of things to check are:
-- Was the target directory on a disk that was possibly out of space at the time?
-- Do you have any antivirus software that scans directories on your system at certain times? If so, it can interfere with I/O. If it ran a scan at the same time you had the problem, you may want to update the AV settings to exclude the directory you were writing the DB file to.
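To rule out the disk-space theory and to see the command's own output and return code, one option is to re-run the create by hand on the server; a rough sketch (the bracketed paths are the placeholders from the question, not real values):
# check free space and inodes on the filesystem holding the hashfile
df -h /[path to hashfile]
df -i /[path to hashfile]
# re-run the create manually and capture the exit code
/logiciel/iis9.1/Server/DSEngine/bin/mkdbfile /[path to hashfile]/[name of hashfile] 30 1 4 20 50 80 1628
echo "mkdbfile exit code: $?"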

Mahout - ParallelALSFactorizationJob running too long?

I am trying to run the Mahout ALS recommendation job on an AWS EMR cluster; however, it takes much longer than I expected.
The following is the command I run:
aws emr add-steps --cluster-id <cluster_id> \
--steps Type=CUSTOM_JAR,\
Name="Mahout ALS Factorization Job",\
Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
"--output","s3://<my_bucket>/recproto/als-output/",\
"--implicitFeedback","true",\
"--lambda","150",\
"--alpha","0.05",\
"--numFeatures","100",\
"--numIterations","3",\
"--numThreadsPerSolver","4",\
"--usesLongIDs","true"]
In the userClicks.csv file, there are 1,567,808 ratings from 335,636 users and 23,934 items.
The job runs on a 10-node c3.xlarge EMR cluster and has been running for more than 2 hours. Is this normal? For a rating file of this scale, what EMR cluster size and parameters should I use to get a more acceptable running time?
I solved this problem by simply using Spark ALS. The training process takes less than 2 minutes on my laptop on the same dataset with the same parameters.
I can now understand why some machine learning algorithms are deprecated due to performance issues... (e.g., the MinHash algorithm).
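For reference, a minimal PySpark sketch of an equivalent training run could look like this (the column names and CSV layout are assumptions; the rank, regParam, alpha, and maxIter values mirror the Mahout parameters above):
# Spark ALS sketch: implicit-feedback factorization roughly matching the Mahout job
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-example").getOrCreate()

# assumed CSV layout: user,item,clicks
ratings = (spark.read.csv("userClicks.csv.gz", inferSchema=True)
           .toDF("user", "item", "clicks"))

als = ALS(userCol="user", itemCol="item", ratingCol="clicks",
          implicitPrefs=True,   # --implicitFeedback true
          rank=100,             # --numFeatures 100
          regParam=150.0,       # --lambda 150
          alpha=0.05,           # --alpha 0.05
          maxIter=3)            # --numIterations 3

model = als.fit(ratings)
model.userFactors.show(5)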

HDInsight Error

I am following the steps in the link below to use Hadoop 2.2 clusters with HDInsight: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-30/
In the "Run a Word Count MapReduce Job" section, I am having difficulty getting the commands in step 4 to work. In PowerShell I type the following:
# Submit the job
Select-AzureSubscription $subscriptionName
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
I keep getting an error stating there is a ParameterArgumentValidationError. What command could I use to avoid these errors?
I am new to using Azure and could really use some help. :)
Those are two separate cmdlets:
The first one is:
Select-AzureSubscription $subscriptionName
If you have only one subscription in your Azure account, you can skip this cmdlet.
The second cmdlet is:
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
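For context, a minimal end-to-end sketch of that step, assuming $subscriptionName, $clusterName, and $wordCountJobDefinition are already defined as in the tutorial, and using the classic Azure HDInsight PowerShell cmdlets (check the cmdlet names against your installed module version):
# Select the subscription (only needed if you have more than one)
Select-AzureSubscription $subscriptionName

# Submit the job and wait for it to complete
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600

# Fetch the job's standard output
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardOutput
A ParameterArgumentValidationError usually indicates that one of the values passed to the cmdlet (for example $clusterName or $wordCountJobDefinition) is empty or was never set in the current session, so it is worth echoing those variables before submitting the job.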

Hive execution hook

I need to plug a custom execution hook into Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop: Cloudera version 4.1.2
Operating system: CentOS
Thanks,
Arun
There are several types of hooks depending on at which stage you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
1. Driver.run() takes the command
2. HiveDriverRunHook.preDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
3. Driver.compile() starts processing the command: creates the abstract syntax tree
4. AbstractSemanticAnalyzerHook.preAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
5. Semantic analysis
6. AbstractSemanticAnalyzerHook.postAnalyze() (HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK)
7. Create and validate the query plan (physical plan)
8. Driver.execute(): ready to run the jobs
9. ExecuteWithHookContext.run() (HiveConf.ConfVars.PREEXECHOOKS)
10. ExecDriver.execute() runs all the jobs
    For each job, at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval, ClientStatsPublisher.run() is called to publish statistics (HiveConf.ConfVars.CLIENTSTATSPUBLISHERS)
    If a task fails: ExecuteWithHookContext.run() (HiveConf.ConfVars.ONFAILUREHOOKS)
11. Finish all the tasks
12. ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
13. Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
14. Return the result.
For each of the hooks I indicated the interface you have to implement. In brackets there's the corresponding configuration property key you have to set in order to register the class at the beginning of the script.
E.g., setting the pre-execution hook (stage 9 of the workflow):
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks:
set hive.exec.pre.hooks=com.example.MyPreHook;
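As a concrete illustration, a minimal pre-execution hook could look like the sketch below (com.example.MyPreHook is the placeholder class name from the set command above; the sketch assumes the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface as of Hive 0.11, and the compiled jar must be on the Hive classpath, e.g. via ADD JAR, before the hook is registered):
package com.example;

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Minimal pre-execution hook sketch: invoked before the query's jobs are launched.
public class MyPreHook implements ExecuteWithHookContext {

    @Override
    public void run(HookContext hookContext) throws Exception {
        // Log how many input and output entities the query touches.
        System.out.println("Pre-exec hook: " + hookContext.getInputs().size()
                + " input(s), " + hookContext.getOutputs().size() + " output(s)");
    }
}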
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think the Cloudera distribution differs (too much).
A good start: http://dharmeshkakadia.github.io/hive-hook/ (there are examples there).
Note: the Hive CLI shows the hook's messages in the console; if you execute from Hue, add a logger and you can see the results in the HiveServer2 role log.
