Execute Apache Spark (Scala) code in Bash script

I am a newbie to Spark and Scala.
I want to execute some Spark code from inside a bash script. I wrote the following code.
The Scala code is written in a separate .scala file as follows.
Scala Code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    println("x=" + args(0), "y=" + args(1))
  }
}
This is the bash script that invokes the Spark/Scala code.
Bash Code:
#!/usr/bin/env bash
ABsize=File_size1
ADsize=File_size2
for i in `seq 2 $ABsize`
do
  for j in `seq 2 $ADsize`
  do
    Abi=`sed -n "${i}p" < File_Path1`
    Adj=`sed -n "${j}p" < File_Path2`
    scala SimpleApp.scala $Abi $Adj
  done
done
But then I get the following errors.
Errors:
error: object apache is not a member of package org
import org.apache.spark.SparkContext
^
error: object apache is not a member of package org
import org.apache.spark.SparkContext._
^
error: object apache is not a member of package org
import org.apache.spark.SparkConf
^
error: not found: type SparkConf
val conf = new SparkConf().setAppName("Simple Application")
^
error: not found: type SparkContext
The above code works perfectly if the Scala file is written without any Spark functions (that is, a pure Scala file), but it fails when the Apache Spark imports are present.
What would be a good way to run and execute this from a bash script? Will I have to call the Spark shell to execute the code?

Set up Spark with the environment variables and run it as @puhlen said, with spark-submit --class SimpleApp simple-project_2.11-1.0.jar $Abi $Adj
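For context, running the file with 'scala SimpleApp.scala' compiles it against the plain Scala library only, which is why the org.apache imports fail; the application has to be compiled against the Spark jars (for example with sbt package) and launched through spark-submit, which supplies the Spark classes at runtime. A minimal sketch of the application as it might be packaged into simple-project_2.11-1.0.jar (the sc.stop() call and the single-string println are additions for illustration, not part of the original post):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // The two lines read by the bash loop arrive as ordinary program arguments
    println("x=" + args(0) + " y=" + args(1))
    sc.stop()  // release the context before the JVM exits
  }
}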

Related

How to interact with a running python script

As a start I've got a basic script which reads the local Unix syslog (/var/log/messages).
I want to build a tool which opens a socket (port 19999) locally and allows admin commands to be sent and processed.
As something I can build on, basically I want the script on startup to do the following:
- open port 19999 locally
- start reading the syslog, storing "line" as the last line it has processed
- when the admin command "printline" is seen, print the last known value of "line"
I've got the basics done, I think (script is below): it opens the relevant port and prints the commands sent to it from another client tool, however it never starts to read the syslog.
#!/usr/bin/python
import socket
import subprocess
import sys
import time
from threading import Thread

MAX_LENGTH = 4096

def handle(clientsocket):
    while 1:
        buf = clientsocket.recv(MAX_LENGTH)
        if buf == '': return  # client terminated connection
        print buf

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
PORT = 19999
HOST = '127.0.0.1'
serversocket.bind((HOST, PORT))
serversocket.listen(10)

while 1:
    # accept connections from outside
    (clientsocket, address) = serversocket.accept()
    ct = Thread(target=handle, args=(clientsocket,))
    ct.start()

def follow(thefile):
    thefile.seek(0, 2)
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)
            continue
        yield line

if __name__ == '__main__':
    logfile = open("/capture/log/uifitz/messages", "r")
    loglines = follow(logfile)
    for line in loglines:
        print line,
Any help would be appreciated. Python 2.6 by the way.

Multiprocessing Windows: how to prevent rerun?

I'm using this multiprocessing script:
Test class
from multiprocessing.pool import Pool

class Test:
    def __init__(self):
        print("init")

    def getData(self, x, ID):
        return x + str(ID)

    def process(self):
        pool = Pool(processes=10)
        IDList = range(10)
        data = []
        for ID in IDList:
            async_result = pool.apply_async(self.getData, ("wordl", ID))
            data.append(async_result.get())
        return data
main script
from TestClass import Test

def main():
    test = Test()
    test.process()

if __name__ == '__main__':
    main()
When I run this file, it keeps running. I found out that Windows is the problem here:
Just figured out the problem with this on Win 7 Anaconda PyScripter 2.6.0. PyScripter generates a .pyc file which is what is rerun every time you run the program; just deleted it and it worked fine for me. (source)
Is there a way to add code to this script so it will work on Windows? I tried using sys.dont_write_bytecode = True, but this didn't work.

How to pass command line input in Gatling using Scala script?

I want the user to be able to input 'Count, repeatCount, testServerUrl and definitionId' from the command line when executing with Gatling. From the command line I execute:
> export JAVA_OPTS="-DuserCount=1 -DflowRepeatCount=1 -DdefinitionId=10220101 -DtestServerUrl='https://someurl.com'"
> sudo bash gatling.sh
But it gives the following error:
url null/api/workflows can't be parsed into a URI: scheme
Basically a null value is passed there. The same happens with 'definitionId'. The code follows; you can try it with any URL, you just have to check whether the value you provide on the command line shows up or not.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class TestCLI extends Simulation {

  val userCount = Integer.getInteger("userCount", 1).toInt
  val holdEachUserToWait = 2
  val flowRepeatCount = Integer.getInteger("flowRepeatCount", 2).toInt
  val definitionId = java.lang.Long.getLong("definitionId", 0L)
  val testServerUrl = System.getProperty("testServerUrl")

  val httpProtocol = http
    .baseURL(testServerUrl)
    .inferHtmlResources()
    .acceptHeader("""*/*""")
    .acceptEncodingHeader("""gzip, deflate""")
    .acceptLanguageHeader("""en-US,en;q=0.8""")
    .authorizationHeader(envAuthenticationHeaderFromPostman)
    .connection("""keep-alive""")
    .contentTypeHeader("""application/vnd.v7811+json""")
    .userAgentHeader("""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36""")

  val headers_0 = Map(
    """Cache-Control""" -> """no-cache""",
    """Origin""" -> """chrome-extension://faswwegilgnpjigdojojuagwoowdkwmasem""")

  val scn = scenario("testabcd")
    .repeat(flowRepeatCount) {
      exec(http("asdfg")
        .post("""/api/workflows""")
        .headers(headers_0)
        .body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
        .pause(holdEachUserToWait)
    }

  setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
No main method is defined here, so I think it would be difficult to pass command line arguments directly. As a workaround, what you can do is read the properties from the environment variables.
For that you can find some help here: How to read environment variables in Scala
In the case of Gatling, see here: http://gatling.io/docs/2.2.2/cookbook/passing_parameters.html
I think this will get it done:
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class TestCLI extends Simulation {

  val count = Integer.getInteger("users", 50)
  val wait = 2
  val repeatCount = Integer.getInteger("repeatCount", 2)
  val testServerUrl = System.getProperty("testServerUrl")
  val definitionId = java.lang.Long.getLong("definitionId", 0L)

  val scn = scenario("testabcd")
    .repeat(repeatCount) {
      exec(http("asdfg")
        .post("""/xyzapi""")
        .headers(headers_0)
        .body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
        .pause(wait)
    }

  setUp(scn.inject(atOnceUsers(count))).protocols(httpProtocol)
}
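One detail worth flagging (this note is an addition, not part of the original answer): inside a plain triple-quoted string, $definitionId is sent as literal text rather than being replaced with the value read from the system property. A minimal sketch using Scala's s-interpolator, assuming definitionId is defined as above:

val definitionId = java.lang.Long.getLong("definitionId", 0L)
// the s prefix substitutes the Scala value into the JSON payload when the string is built
val body = StringBody(s"""{"definitionId":$definitionId}""")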
On the command line, first export the JAVA_OPTS environment variable by using this command directly in the terminal:
export JAVA_OPTS="-DuserCount=50 -DflowRepeatCount=2 -DdefinitionId=10220301 -DtestServerUrl='something'"
Windows 10 solution:
Create a simple my_gatling_with_params.bat file with content like this:
@ECHO OFF
REM You could pass JAVA_OPTS to this script as command-line arguments, e.g. '-Dusers=2 -Dgames=1'
set JAVA_OPTS=%*
REM Define this variable if you want to autoclose your .bat file after the script is done
set "NO_PAUSE=1"
REM To have a pause, uncomment the next line and comment out the previous one
rem set "NO_PAUSE="
gatling.bat -s computerdatabase.BJRSimulation_lite -nr -rsf c:\Work\gatling-charts-highcharts-bundle-3.3.1\_mydata\
exit
where:
- computerdatabase.BJRSimulation_lite is your .scala script
- users and games are the params that you want to pass to the script
So in your computerdatabase.BJRSimulation_lite file you can use the variables users and games in the following way:
package computerdatabase

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
import scala.util.Random
import java.util.concurrent.atomic.AtomicBoolean

class BJRSimulation_lite extends Simulation {

  val httpProtocol = ...

  val nbUsers = Integer.getInteger("users", 1).toInt
  val nbGames = Integer.getInteger("games", 1).toInt

  val scn = scenario("MyScen1")
    .group("Play") {
      // Set count of games
      repeat(nbGames) {
        ...
      }
    }

  // Set count of users
  setUp(scn.inject(atOnceUsers(nbUsers)).protocols(httpProtocol))
}
After that you can just invoke 'my_gatling_with_params.bat -Dusers=2 -Dgames=1' to pass your params into the test.

apache spark - check if file exists

I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, which is a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing the data.
I checked the Spark API and didn't find any method which checks if a file exists. Any ideas how to handle this?
The only method I found was sc.textFile("hdfs:///SUCCESS.txt").count(), which throws an exception when the file does not exist. I would have to catch that exception and write my program accordingly. I didn't really like this approach. Hoping to find a better alternative.
For a file in HDFS, you can use the hadoop way of doing this:
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))
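A side note, not part of the original answer: FileSystem.get(conf) resolves the default filesystem from the configuration. If the path could live on a different filesystem (a fully qualified hdfs:// or s3a:// URI, say), a sketch of a variant that resolves the filesystem from the path itself:

val conf = sc.hadoopConfiguration
val path = new org.apache.hadoop.fs.Path("hdfs:///SUCCESS.txt")
// Path.getFileSystem picks the FileSystem implementation matching the path's scheme
val exists = path.getFileSystem(conf).exists(path)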
For PySpark, you can achieve this without invoking a subprocess by using something like:
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
I will say the best way is to call this through a function which internally checks for the file's presence using the traditional Hadoop file check.
object OutputDirCheck {
  def dirExists(hdfsDirectory: String): Boolean = {
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
    fs.exists(new org.apache.hadoop.fs.Path(hdfsDirectory))
  }
}
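A possible usage sketch for the two-step flow described in the question (the path is illustrative, not from the original answer):

// gate the second step on the marker file written by the first step
if (OutputDirCheck.dirExists("/path/on/hdfs/to/SUCCESS.txt")) {
  // ... start processing the data
} else {
  println("SUCCESS.txt not found, skipping this run")
}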
Using Databricks dbutils:
def path_exists(path):
    try:
        if len(dbutils.fs.ls(path)) > 0:
            return True
    except:
        return False
For Spark 2.0 or higher you can use the exists method of org.apache.hadoop.fs.FileSystem:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object Test extends App {
  val spark = SparkSession.builder
    .master("local[*]")
    .appName("BigDataETL - Check if file exists")
    .getOrCreate()

  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  // This method returns a Boolean (true if the file exists, false if it doesn't)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}
For Spark 1.6 to 2.0:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {
  val sparkConf = new SparkConf().setAppName(s"BigDataETL - Check if file exists")
  val sc = new SparkContext(sparkConf)
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val fileExists = fs.exists(new Path("<path_to_file>"))
  if (fileExists) println("File exists!")
  else println("File doesn't exist!")
}
For Java coders:
SparkConf sparkConf = new SparkConf().setAppName("myClassname");
SparkContext sparky = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sparky);

FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(context.hadoopConfiguration());
Path path = new Path(sparkConf.get(path_to_File));

if (!hdfs.exists(path)) {
    // Path does not exist.
}
else {
    // Path exists.
}
For PySpark Python users:
I didn't find anything with Python or PySpark, so we need to execute the hdfs command from Python code. This has worked for me.
hdfs command to check whether a folder exists: returns 0 if true
hdfs dfs -test -d /folder-path
hdfs command to check whether a file exists: returns 0 if true
hdfs dfs -test -e /file-path
To put this in Python code, I followed the lines of code below:
import subprocess

def run_cmd(args_list):
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode

cmd = ['hdfs', 'dfs', '-test', '-d', "/folder-path"]
code = run_cmd(cmd)
if code == 0:
    print('folder exist')
print(code)
Output if the folder exists:
folder exists
0
For PySpark:
from py4j.protocol import Py4JJavaError

def path_exist(path):
    try:
        rdd = sc.textFile(path)
        rdd.take(1)
        return True
    except Py4JJavaError as e:
        return False
@Nandeesh's answer treats all Py4JJavaError exceptions the same. I propose adding another step to evaluate the Java exception's error message:
from py4j.protocol import Py4JJavaError

def file_exists(path):
    try:
        spark.sparkContext.textFile(path).take(1)
        return True  # readable without error, so the path exists
    except Py4JJavaError as e:
        if 'org.apache.hadoop.mapred.InvalidInputException: Input path does not exist' in str(e.java_exception):
            return False
        else:
            return True

Run shell commands in Scala code on Windows seems to require the full absolute path of the command

When I run shell commands on Mac, it works as expected, like this:
scala> import scala.sys.process._
import scala.sys.process._
scala> """protractor --version"""!
warning: there were 1 feature warning(s); re-run with -feature for details
Version 0.24.0
res12: Int = 0
scala>
But if I do it on Windows, I get this:
scala> import scala.sys.process._
import scala.sys.process._
scala> """protractor --version"""!
warning: there were 1 feature warning(s); re-run with -feature for details
java.io.IOException: Cannot run program "protractor": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
It seems like I have to do it like this on Windows:
scala> import scala.sys.process._
scala> """C:\Users\twer\AppData\Roaming\npm\protractor.cmd --version"""!
warning: there were 1 feature warning(s); re-run with -feature for details
Version 0.24.0
res11: Int = 0
scala>
I have to supply the full absolute path of the command, yet I am certain that the command is available on the PATH.
Is there any way to avoid this?
You could try this:
import scala.sys.process._

val command = Seq("protractor", "--version")
val os = sys.props("os.name").toLowerCase
val panderToWindows = os match {
  case x if x contains "windows" => Seq("cmd", "/C") ++ command
  case _ => command
}
panderToWindows.!
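As a small follow-up sketch (the helper name is made up for illustration), the same idea can be wrapped in a function that captures the command's output instead of just the exit code:

import scala.sys.process._

// runs the command, routing through cmd /C on Windows so PATH lookups resolve .cmd wrappers
def runCrossPlatform(command: Seq[String]): String = {
  val os = sys.props("os.name").toLowerCase
  val full = if (os contains "windows") Seq("cmd", "/C") ++ command else command
  full.!!.trim  // !! returns the process's stdout and throws if the exit code is non-zero
}

println(runCrossPlatform(Seq("protractor", "--version")))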
