Call R notebooks on Databricks from second R notebook - sparkr

I am trying to call an R notebook on Databricks while passing parameters using spark-submit.
My approach looks like this:
com <- "spark-submit foo.R p1 & spark-submit foo.R p2"
system(com)
This should call the script foo.R and hand over the parameter p1.
This returns:
sh: 1: spark-submit: not found
sh: 1: spark-submit: not found
I would expect this to submit the two jobs to the Spark cluster. Any idea what I am missing? Thanks!

I assume you attempted to run these commands in an R notebook. The standard way to call other notebooks from a Databricks notebook is dbutils.notebook.run; currently it only works from Python and Scala.
You can work around it by adding a Python cell to your R notebook:
%python
dbutils.notebook.run("foo.R", 60, {"argument": "p1"})
dbutils.notebook.run("foo.R", 60, {"argument": "p2"})
If you generate the notebook parameters p1 and p2 in R, you can use a temporary view to pass them to the Python cell, as sketched below.
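A minimal sketch of that hand-off, assuming the R cell has already published the values in a one-row temporary view, e.g. createOrReplaceTempView(createDataFrame(data.frame(p1 = "...", p2 = "...")), "notebook_params") in SparkR (the view and column names are only illustrative):
%python
# Read the parameter values that the R cell published via the temporary view,
# then forward them to the child notebook runs.
row = spark.table("notebook_params").first()
dbutils.notebook.run("foo.R", 60, {"argument": row["p1"]})
dbutils.notebook.run("foo.R", 60, {"argument": row["p2"]})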

Related

Databricks init scripts halting

I am trying to install Confluent Kafka on my Databricks driver nodes using init scripts.
I am using the command below to write a script to DBFS:
%python
dbutils.fs.put("dbfs:/databricks/tmp/sample_n8.sh",
"""
#!/bin/bash
wget -P /dbfs/databricks/tmp/tmp1 http://packages.confluent.io/archive/1.0/confluent-1.0.1-2.10.4.zip
cd /dbfs/databricks/tmp/tmp1
unzip confluent-1.0.1-2.10.4.zip
cd confluent-1.0.1
./bin/zookeeper-server-start ./etc/kafka/zookeeper.properties &
exit 0
""")
Then I edit my init scripts and add an entry there pointing to the above location:
[![init scripts entry adding][1]][1]
However, when I try to start my cluster it never comes up; it always halts. The event log shows that it is stuck at 'Starting init scripts execution.'
I know there must be some tweak to my script to run it in the background, but I am already using & at the end of the ZooKeeper start command.
Can someone give me a hint on how to resolve this?
[1]: https://i.stack.imgur.com/CncIL.png
EDIT: I guess this question boils down to the same thing as asking how I can run the above bash script from a %sh Databricks cell and have the cell finish, because at the moment it always tells me that the command is still running.
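For reference, one common reason a script launched this way never appears to finish is that the backgrounded process keeps stdout/stderr open. A sketch of the same script with the ZooKeeper process fully detached (output redirected and run under nohup) is shown below; this is an assumption about the cause, not a verified fix:
%python
dbutils.fs.put("dbfs:/databricks/tmp/sample_n8.sh",
"""
#!/bin/bash
wget -P /dbfs/databricks/tmp/tmp1 http://packages.confluent.io/archive/1.0/confluent-1.0.1-2.10.4.zip
cd /dbfs/databricks/tmp/tmp1
unzip confluent-1.0.1-2.10.4.zip
cd confluent-1.0.1
# Detach the long-running process: redirect its output and use nohup,
# so that a bare '&' is not the only thing separating it from the script.
nohup ./bin/zookeeper-server-start ./etc/kafka/zookeeper.properties > /tmp/zookeeper.log 2>&1 &
exit 0
""")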

How to execute a bash pipeline using a Python wrapper?

Scenario:
I have a pipeline in a bash script and a list of processes along with their arguments. I want to run a Python script after the execution of each process (executable) in the pipeline, if the process is in my list.
(I use Python 2.7)
My proposed solution:
Use a Python wrapper script. I have replaced every executable in the pipeline with my custom Python script, which:
1) checks whether the process is in the list and, if so, sets FLAG=True
2) executes the process via the original executable using subprocess.Popen(process.command, shell=True).communicate()
3) if FLAG==True, does something (a rough sketch of such a wrapper is below).
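A minimal sketch of such a wrapper, with the executable names, the real binary location, and the follow-up hook all hypothetical:
#!/usr/bin/env python
# Hypothetical wrapper: the pipeline calls this script under the name of the
# original executable; the real binaries are assumed to live in /real/bin.
import os
import subprocess
import sys

WATCHED = {"P1", "P2"}  # processes of interest (illustrative names)

name = os.path.basename(sys.argv[0])
flag = name in WATCHED

# Run the real executable with the same arguments and do NOT capture stdout,
# so bash command substitution around the wrapper (Mean=`P1 ...`) still works.
rc = subprocess.call([os.path.join("/real/bin", name)] + sys.argv[1:])

if flag:
    pass  # run the follow-up Python step here

sys.exit(rc)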
Problem:
With the current solution, when I run the processes using subprocess.Popen().communicate(), they execute separately and the output of the inner process (the child) never reaches the outer process (the parent).
For example:
#!/bin/bash
Mean=`P1 $Image1 -M`
P2 "$Image2" $Mean -F
The output value of Mean is not available when the second line executes.
The second line will execute like:
subprocess.Popen("P2 $Image2 \nP1 $Image1 -M -F" , shell=True).communicate()
Therefore, it returns an error!
Is there a better way in Python to execute processes like this?
Please let me know if there is any other suggestion for this scenario (I'm a complete beginner in bash).
There's no need to use bash at all.
Assuming modern Python 3.x:
#!/usr/bin/env python3
import subprocess
import sys

image1 = sys.argv[1]
image2 = sys.argv[2]

# Run P1 and capture its output (the mean value)
p1 = subprocess.run(['P1', image1, '-M'], check=True, capture_output=True, text=True)
# Feed P1's output in as one of P2's arguments
p2 = subprocess.run(['P2', image2, p1.stdout.strip(), '-F'], check=True, capture_output=True, text=True)

print(p2.stdout)
Note that we refer to p1.stdout.strip() exactly where we need the mean value among P2's arguments.
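The two-line bash pipeline then collapses into a single call to this script, e.g. python pipeline.py "$Image1" "$Image2" (the script name here is assumed).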

Impala shell - Hive Beeline - "Argument list too long"

I have a Cloudera cluster on which multiple Impala jobs run all the time (i.e. cron jobs containing impala-shell commands). However, I have a few INSERT INTO queries that are unusually long: they contain a lot of 'CASE...WHEN...THEN' lines. When these queries are run in impala-shell, the command fails with the error "Argument list too long". They run just fine in Hue, but I can't get them to run on the command line.
Are there any workarounds for this?
I've tried running the command via Hive Beeline (instead of Impala) and setting 'hive.script.operator.truncate.env = true'; Beeline failed with the same error. I've also tried calling the query from a separate file to see whether that made any difference (it doesn't). Could I perhaps save the 'CASE...WHEN...THEN' lines in separate variables (using 'set') and reference those in the query, or would that be another dead end? A colleague mentioned a user-defined function (UDF) might help, but I'm not sure. Opinions?

Need to pass Variable from Shell Action to Oozie Shell using Hive

All,
I am looking to pass a variable from a shell action back to the Oozie workflow. I am running commands such as this in my script:
#!/bin/sh
evalDate="hive -e 'set hive.execution.engine=mr; select max(cast(create_date as int)) from db.table;'"
evalPartition=$(eval $evalBaais)
echo "evaldate=$evalPartition"
The trick is that it is a Hive command inside the shell script.
Then I am using this to read it in Oozie:
${wf:actionData('getPartitions')['evaldate']}
But it comes back blank every time! I can run those commands fine in my own shell, and they run fine on the other boxes of the cluster as well, but Oozie does not pick anything up. Any ideas?
The issue was a configuration problem on my cluster. When running as the oozie user, I had write-permission issues on /tmp/yarn. Because of that, I changed the command to run as:
baais="export HADOOP_USER_NAME=functionalid; hive yarn -hiveconf hive.execution.engine=mr -e 'select max(cast(create_date as int)) from db.table;'"
Where hive allows me to run as yarn.
The solution to your problem is to use the "-S" switch on the hive command for silent output (see below).
Also, what is "evalBaais"? You probably need to replace it with "evalDate". Your code should look like this:
#!/bin/sh
evalDate="hive -S -e 'set hive.execution.engine=mr; select max(cast(create_date as int)) from db.table;'"
evalPartition=$(eval $evalDate)
echo "evaldate=$evalPartition"
Now you should be able to capture the output.

Oozie shell Action - Running hive from shell issue

Based on a condition being true, I execute hive -e in a shell script, and it works fine. When I put this script in a shell action in Oozie and run it, I get a scriptName.sh: line 42: hive: command not found error.
I tried passing <env-var>PATH=/usr/lib/hive</env-var> in the shell action, but I guess I am making some mistake there, because I get the same error: scriptName.sh: line 42: hive: command not found.
Edited:
I used which hive in the shell script. Its output is not consistent; I get two variations:
1. /usr/bin/hive, along with a "Delegation Token can be issued only with kerberos or web authentication" Java IOException.
2. which: hive not in {.:/sbin:/usr/bin:/usr/sbin:...}
OK, I finally figured it out. It might be a trivial thing for shell experts, but it can help someone starting out.
1. hive: command not found. It was not a classpath issue; it was a shell issue. The environment I am running in is a Korn shell (echo $SHELL to find out), but the hive script (/usr/lib/hive/bin/hive.sh) is a bash script. So I changed the shebang in my script to #!/bin/bash and it worked.
2. Delegation Token can only be issued with kerberos or web authentication.
In my hive script I added SET mapreduce.job.credentials.binary = ${HADOOP_TOKEN_FILE_LOCATION}. HADOOP_TOKEN_FILE_LOCATION is a variable that holds the location of the job token. This token needs to be passed to authenticate access to HDFS data (in my case, an HDFS read through a Hive SELECT query) on a secure cluster. See the Hadoop documentation on delegation tokens for more.
Most likely you are missing shell environment variables.
To confirm it, run export in the shell that Oozie calls.
If Oozie is calling your shell script, a simple fix is to run it with /bin/bash -l your_script.
P.S. PATH is a list of directories, so you need to append ${HIVE_HOME}/bin to your PATH, not ${HIVE_HOME}/bin/hive.
