Spark RDD map in yarn mode does not allow access to variables? - hadoop

I have got a brand new install of spark 1.2.1 over a mapr cluster and while testing it I find that it works nice in local mode but in yarn modes it seems not to be able to access variables, neither if broadcasted. To be precise, the following test code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object JustSpark extends App {
val conf = new org.apache.spark.SparkConf().setAppName("SimpleApplication")
val sc = new SparkContext(conf)
val a = List(1,3,4,5,6)
val b = List("a","b","c")
val bBC= sc.broadcast(b)
val data = sc.parallelize(a)
val transform = data map ( t => { "hi" })
transform.take(3) foreach (println _)
val transformx2 = data map ( t => { bBC.value.size })
transformx2.take(3) foreach (println _)
//val transform2 = data map ( t => { b.size })
//transform2.take(3) foreach (println _)
}
works in local mode but fails in yarn. More precisely, both methods, transform2 and transformx2, fail, and all of them work if --master local[8].
I am compiling it with sbt and sending with the submit tool
/opt/mapr/spark/spark-1.2.1/bin/spark-submit --class JustSpark --master yarn target/scala-2.10/simulator_2.10-1.0.jar
Any idea what is going on? The fail message just claims to have a java null pointer exception in the place where it should be accessing the variable. Is there other method to pass variables inside the RDD maps?

I'm going to take a pretty good guess: it's because you're using App. See https://issues.apache.org/jira/browse/SPARK-4170 for details. Write a main() method instead.

I presume the culprit were
val transform2 = data map ( t => { b.size })
In particular the accessing the local variable b . You may actually see in your log files java.io.NotSerializableException .
What is supposed to happen: Spark will attempt to serialize any referenced object. That means in this case the entire JustSpark class - since one of its members is referenced.
Why did this fail? Your class is not Serializable. Therefore Spark is unable to send it over the wire. In particular you have a reference to SparkContext - which does not extend Serializable
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
So - your first code - which does broadcast only the variable value - is the correct way.

This is the original example of broadcast, from spark sources, altered to use lists instead of arrays:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object MultiBroadcastTest {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Multi-Broadcast Test")
val sc = new SparkContext(sparkConf)
val slices = if (args.length > 0) args(0).toInt else 2
val num = if (args.length > 1) args(1).toInt else 1000000
val arr1 = (1 to num).toList
val arr2 = (1 to num).toList
val barr1 = sc.broadcast(arr1)
val barr2 = sc.broadcast(arr2)
val observedSizes: RDD[(Int, Int)] = sc.parallelize(1 to 10, slices).map { _ =>
(barr1.value.size, barr2.value.size)
}
observedSizes.collect().foreach(i => println(i))
sc.stop()
}}
I compiled it in my environment and it works.
So what is the difference?
The problematic example uses extends App while the original example is a plain singleton.
So I demoted the code to a "doIt()" function
object JustDoSpark extends App{
def doIt() {
...
}
doIt()
and guess what. It worked.
Surely the problem is related to Serialization indeed, but in a different way. Having the code in the body of the object seems to cause problems.

Related

How to convert an RDD (that read in a directory of text files) into dataFrame in Apache Spark in Scala?

I'm developing a Scala feature extracting app using Apache Spark TF-IDF. I need to read in from a directory of text files. I'm trying to convert an RDD to a dataframe but I'm getting the error "value toDF() is not a member of org.apache.spark.rdd.RDD[streamedRDD]". This is what I have right now ...
I have spark-2.2.1 & Scala 2.1.11. Thanks in advance.
Code:
// Creating the Spark context that will interface with Spark
val conf = new SparkConf()
.setMaster("local")
.setAppName("TextClassification")
val sc = new SparkContext(conf)
// Load documents (one per line)
val data = sc.wholeTextFiles("C:/Users/*")
val text = data.map{case(filepath,text) => text}
val id = data.map{case(filepath, text) => text.split("#").takeRight(1)(0)}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class dataStreamed(id: String, input: String)
val tweetsDF = data
.map{case (filepath, text) =>
val id = text.split("#").takeRight(1)(0)
val input = text.split(":").takeRight(2)(0)
dataStreamed(id, input)}
.as[dataStreamed]
.toDF()
.cache()
// -------------------- TF-IDF --------------------
// From spark.apache.org
// URL http://spark.apache.org/docs/latest/ml-features.html#tf-idf
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache() // Hashed words
// Compute for the TFxIDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf)
Data: (Text files like these in a folder is what I need read in)
https://www.dropbox.com/s/cw3okhaosu7i1md/cars.txt?dl=0
https://www.dropbox.com/s/29tgqg7ifpxzwwz/Italy.txt?dl=0
The problem here is that map function returns a type of Dataset[Row] which you assign to tweetsDF. It should be:
case class dataStreamed(id: String, input: String)
def test() = {
val sparkConf = new SparkConf().setAppName("TextClassification").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._
// Load documents (one per line)
val data = spark.sparkContext.wholeTextFiles("C:\\tmp\\stackoverflow\\*")
val dataset = spark.createDataset(data)
val tweetsDF = dataset
.map{case (id : String, input : String) =>
val file = id.split("#").takeRight(1)(0)
val content = input.split(":").takeRight(2)(0)
dataStreamed(file, content)}
.as[dataStreamed]
tweetsDF.printSchema()
tweetsDF.show(10)
}
First data will be an RDD(String, String) then I create a new Dataset with spark.createDataset in order to be able to use map properly together with the case class. Please note that you must define dataStreamedclass out of your method (test in this case)
Good luck
We can do this with couple of commands/functions:
Invoke the spark/scala shell, you can use driver-memory, executor-memory, executor-cores etc as suits your job
spark-shell
Read the text file from HDFS
val text_rdd = sc.textFile("path/to/file/on/hdfs")
Convert the text rdd to DataFrame
val text_df = text_rdd.toDF
Save as plan text format in HDFS
text_df.saveAsTextFile("path/to/hdfs")
Save as splittable compressed format in HDFS
text_df.coalesce(1).write.parquet("path/to/hdfs")

Efficient Sequence conversion to String

I have a string sequence Seq[String] which represents stdin input lines.
Those lines map to a model entity, but it is not guaranteed that 1 line = 1 entity instance.
Each entity is delimited with a special string that will not occur anywhere else in the input.
My solution was something like:
val entities = lines.mkString.split(myDelimiter).map(parseEntity)
parseEntity implementation is not relevant, it gets a String and maps to a case class which represents the model entity
The problem is with a given input, I get an OutOfMemoryException on the lines.mkString. Would a fold/foldLeft/foldRight be more efficient? Or do you have any better alternative?
You can solve this using akka streams and delimiter framing. See this section of the documentation for the basic approach.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Source}
import akka.util.ByteString
val example = (0 until 100).mkString("delimiter").grouped(8).toIndexedSeq
val framing = Framing.delimiter(ByteString("delimiter"), 1000)
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
Source(example)
.map(ByteString.apply)
.via(framing)
.map(_.utf8String)
.runForeach(println)
The conversion to and from ByteString is a bit annoying, but Framing.delimiter is only defined for ByteString.
If you are fine with a more pure functional approach, fs2 will also offer primitives to solve this problem.
Something that worked for me if you are reading from a stream (your mileage may vary). Slightly modified version of Scala LineIterator:
class EntityIterator(val iter: BufferedIterator[Char]) extends AbstractIterator[String] with Iterator[String] {
private[this] val sb = new StringBuilder
def getc() = iter.hasNext && {
val ch = iter.next
if (ch == '\n') false // Replace with your delimiter here
else {
sb append ch
true
}
}
def hasNext = iter.hasNext
def next = {
sb.clear
while (getc()) { }
sb.toString
}
}
val entities =
new EnityIterator(scala.io.Source.fromInputStream(...).iter.buffered)
entities.map(...)

How to pass command line input in Gatling using Scala script?

I want that user can input 'Count, repeatCount, testServerUrl and definitionId' from command line while executing from Gatling. From command line I execute
> export JAVA_OPTS="-DuserCount=1 -DflowRepeatCount=1 -DdefinitionId=10220101 -DtestServerUrl='https://someurl.com'"
> sudo bash gatling.sh
But gives following error:
url null/api/workflows can't be parsed into a URI: scheme
Basically null value pass there. Same happens to 'definitionId'. Following is the code. you can try with any url. you just have to check the value which you provides by commandline is shown or not?
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
class TestCLI extends Simulation {
val userCount = Integer.getInteger("userCount", 1).toInt
val holdEachUserToWait = 2
val flowRepeatCount = Integer.getInteger("flowRepeatCount", 2).toInt
val definitionId = java.lang.Long.getLong("definitionId", 0L)
val testServerUrl = System.getProperty("testServerUrl")
val httpProtocol = http
.baseURL(testServerUrl)
.inferHtmlResources()
.acceptHeader("""*/*""")
.acceptEncodingHeader("""gzip, deflate""")
.acceptLanguageHeader("""en-US,en;q=0.8""")
.authorizationHeader(envAuthenticationHeaderFromPostman)
.connection("""keep-alive""")
.contentTypeHeader("""application/vnd.v7811+json""")
.userAgentHeader("""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36""")
val headers_0 = Map(
"""Cache-Control""" -> """no-cache""",
"""Origin""" -> """chrome-extension://faswwegilgnpjigdojojuagwoowdkwmasem""")
val scn = scenario("testabcd")
.repeat (flowRepeatCount) {
exec(http("asdfg")
.post("""/api/workflows""")
.headers(headers_0)
.body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
.pause(holdEachUserToWait)
}
setUp(scn.inject(atOnceUsers(userCount))).protocols(httpProtocol)
}
Here no main method is defined so I think it would be difficult to pass the command line argument here. But for the work around what you can do is Read the property from the Environment variables.
For that you can find some help here !
How to read environment variables in Scala
In case of gatling See here : http://gatling.io/docs/2.2.2/cookbook/passing_parameters.html
I think this will get you done :
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
class TestCLI extends Simulation {
val count = Integer.getInteger("users", 50)
val wait = 2
val repeatCount = Integer.getInteger("repeatCount", 2)
val testServerUrl = System.getProperty("testServerUrl")
val definitionId = java.lang.Long.getLong("definitionId", 0L)
val scn = scenario("testabcd")
.repeat (repeatCount ) {
exec(http("asdfg")
.post("""/xyzapi""")
.headers(headers_0)
.body(StringBody("""{"definitionId":$definitionId}"""))) // I also want to get this value dynamic from CLI and put here
.pause(wait)
}
setUp(scn.inject(atOnceUsers(count))).protocols(httpProtocol)
}
On the command line firstly export the JAVA_OPTS environment variable
by using this command directly in terminal.
export JAVA_OPTS="-Duse rCount=50 -DflowRepeatCount=2 -DdefinitionId=10220301 -DtestServerUrl='something'"
Windows 10 solution:
create simple my_gatling_with_params.bat file with content, e.g.:
#ECHO OFF
#REM You could pass to this script JAVA_OPTS in cammandline arguments, e.g. '-Dusers=2 -Dgames=1'
set JAVA_OPTS=%*
#REM Define this variable if you want to autoclose your .bat file after script is done
set "NO_PAUSE=1"
#REM To have a pause uncomment this line and comment previous one
rem set "NO_PAUSE="
gatling.bat -s computerdatabase.BJRSimulation_lite -nr -rsf c:\Work\gatling-charts-highcharts-bundle-3.3.1\_mydata\
exit
where:
computerdatabase.BJRSimulation_lite - your .scala script
users and games params that you want to pass to script
So in your computerdatabase.BJRSimulation_lite file you could use variables users and games in the following way:
package computerdatabase
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
import scala.util.Random
import java.util.concurrent.atomic.AtomicBoolean
class BJRSimulation_lite extends Simulation {
val httpProtocol = ...
val nbUsers = Integer.getInteger("users", 1).toInt
val nbGames = Integer.getInteger("games", 1).toInt
val scn = scenario("MyScen1")
.group("Play") {
//Set count of games
repeat(nbGames) {
...
}
}
// Set count of users
setUp(scn.inject(atOnceUsers(nbUsers)).protocols(httpProtocol))
}
After that you could just invoke 'my_gatling_with_params.bat -Dusers=2 -Dgames=1' to pass yours params into test

Using par map to increase performance

Below code runs a comparison of users and writes to file. I've removed some code to make it as concise as possible but speed is an issue also in this code :
import scala.collection.JavaConversions._
object writedata {
def getDistance(str1: String, str2: String) = {
val zipped = str1.zip(str2)
val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
val p = zipped.count(_ == ('1', '1')).toFloat * 2
val q = zipped.count(_ == ('1', '0')).toFloat * 2
val r = zipped.count(_ == ('0', '1')).toFloat * 2
val s = zipped.count(_ == ('0', '0')).toFloat * 2
(q + r) / (p + q + r)
} //> getDistance: (str1: String, str2: String)Float
case class UserObj(id: String, nCoordinate: String)
val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
for (a <- 1 to 100) {
userList.add(new UserObj("2", "101010"))
}
def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
def appendToFile(fileName: String, textData: String) =
using(new java.io.FileWriter(fileName, true)) {
fileWriter =>
using(new java.io.PrintWriter(fileWriter)) {
printWriter => printWriter.println(textData)
}
} //> appendToFile: (fileName: String, textData: String)Unit
var counter = 0; //> counter : Int = 0
for (xUser <- userList.par) {
userList.par.map(yUser => {
if (!xUser.id.isEmpty && !yUser.id.isEmpty)
synchronized {
appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
}
})
}
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 10 but in the code im working on
the size is 8000 which translates to 64'000'000 comparisons. I'm
using a synchronized block so that multiple threads are not writing
to same file at same time. A performance improvment im considering
is populating a separate collection within the inner loop ( userList.par.map(yUser => {)
and then writing that collection out to seperate file.
Are there other methods I can use to improve performance. So that I can
handle a List that contains 8000 items instead of above example of 100 ?
I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.
EDIT:
One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.

how to read immutable data structures from file in scala

I have a data structure made of Jobs each containing a set of Tasks. Both Job and Task data are defined in files like these:
jobs.txt:
JA
JB
JC
tasks.txt:
JB T2
JA T1
JC T1
JA T3
JA T2
JB T1
The process of creating objects is the following:
- read each job, create it and store it by id
- read task, retrieve job by id, create task, store task in the job
Once the files are read this data structure is never modified. So I would like that tasks within jobs would be stored in an immutable set. But I don't know how to do it in an efficient way. (Note: the immutable map storing jobs may be left immutable)
Here is a simplified version of the code:
class Task(val id: String)
class Job(val id: String) {
val tasks = collection.mutable.Set[Task]() // This sholud be immutable
}
val jobs = collection.mutable.Map[String, Job]() // This is ok to be mutable
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = new Job(line.trim)
jobs += (job.id -> job)
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = jobs(tokens(0).trim)
val task = new Task(job.id + "." + tokens(1).trim)
job.tasks += task
}
Thanks in advance for every suggestion!
The most efficient way to do this would be to read everything into mutable structures and then convert to immutable ones at the end, but this might require a lot of redundant coding for classes with a lot of fields. So instead, consider using the same pattern that the underlying collection uses: a job with a new task is a new job.
Here's an example that doesn't even bother reading the jobs list--it infers it from the task list. (This is an example that works under 2.7.x; recent versions of 2.8 use "Source.fromPath" instead of "Source.fromFile".)
object Example {
class Task(val id: String) {
override def toString = id
}
class Job(val id: String, val tasks: Set[Task]) {
def this(id0: String, old: Option[Job], taskID: String) = {
this(id0 , old.getOrElse(EmptyJob).tasks + new Task(taskID))
}
override def toString = id+" does "+tasks.toString
}
object EmptyJob extends Job("",Set.empty[Task]) { }
def read(fname: String):Map[String,Job] = {
val map = new scala.collection.mutable.HashMap[String,Job]()
scala.io.Source.fromFile(fname).getLines.foreach(line => {
line.split("\t") match {
case Array(j,t) => {
val jobID = j.trim
val taskID = t.trim
map += (jobID -> new Job(jobID,map.get(jobID),taskID))
}
case _ => /* Handle error? */
}
})
new scala.collection.immutable.HashMap() ++ map
}
}
scala> Example.read("tasks.txt")
res0: Map[String,Example.Job] = Map(JA -> JA does Set(T1, T3, T2), JB -> JB does Set(T2, T1), JC -> JC does Set(T1))
An alternate approach would read the job list (creating jobs as new Job(jobID,Set.empty[Task])), and then handle the error condition of when the task list contained an entry that wasn't in the job list. (You would still need to update the job list map every time you read in a new task.)
I did a feel changes for it to run on Scala 2.8 (mostly, fromPath instead of fromFile, and () after getLines). It may be using a few Scala 2.8 features, most notably groupBy. Probably toSet as well, but that one is easy to adapt on 2.7.
I don't have the files to test it, but I changed this stuff from val to def, and the type signatures, at least, match.
class Task(val id: String)
class Job(val id: String, val tasks: Set[Task])
// read tasks
val tasks = (
for {
line <- io.Source.fromPath("tasks.txt").getLines().toStream
tokens = line.split("\t")
jobId = tokens(0).trim
task = new Task(jobId + "." + tokens(1).trim)
} yield jobId -> task
).groupBy(_._1).map { case (key, value) => key -> value.map(_._2).toSet }
// read jobs
val jobs = Map() ++ (
for {
line <- io.Source.fromPath("jobs.txt").getLines()
job = new Job(line.trim, tasks(line.trim))
} yield job.id -> job
)
You could always delay the object creation until you have all the data read in from the file, like:
case class Task(id: String)
case class Job(id: String, tasks: Set[Task])
import scala.collection.mutable.{Map,ListBuffer}
val jobIds = Map[String, ListBuffer[String]]()
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = line.trim
jobIds += (job.id -> new ListBuffer[String]())
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = tokens(0).trim
val task = job.id + "." + tokens(1).trim
jobIds(job) += task
}
// create objects
val jobs = jobIds.map { j =>
Job(j._1, Set() ++ j._2.map { Task(_) })
}
To deal with more fields, you could (with some effort) make a mutable version of your immutable classes, used for building. Then, convert as needed:
case class Task(id: String)
case class Job(val id: String, val tasks: Set[Task])
object Job {
class MutableJob {
var id: String = ""
var tasks = collection.mutable.Set[Task]()
def immutable = Job(id, Set() ++ tasks)
}
def mutable(id: String) = {
val ret = new MutableJob
ret.id = id
ret
}
}
val mutableJobs = collection.mutable.Map[String, Job.MutableJob]()
// read jobs
for (line <- io.Source.fromFile("jobs.txt").getLines) {
val job = Job.mutable(line.trim)
jobs += (job.id -> job)
}
// read tasks
for (line <- io.Source.fromFile("tasks.txt").getLines) {
val tokens = line.split("\t")
val job = jobs(tokens(0).trim)
val task = Task(job.id + "." + tokens(1).trim)
job.tasks += task
}
val jobs = for ((k,v) <- mutableJobs) yield (k, v.immutable)
One option here is to have some mutable but transient configurer class along the lines of the MutableMap above but then pass this through in some immutable form to your actual class:
val jobs: immutable.Map[String, Job] = {
val mJobs = readMutableJobs
immutable.Map(mJobs.toSeq: _*)
}
Then of course you can implement readMutableJobs along the lines you have already coded

Resources