mapValues function in DStream class Not Found - spark-streaming

I want to make some modifications to the StreamingKMeans algorithm provided in Spark Streaming, so I created a project containing the necessary files, but unfortunately I cannot find the mapValues function in the DStream class!
def predictOnValues[K: ClassTag](data: DStream[(K, Vector)]): DStream[(K, Int)] = {
  assertInitialized()
  data.mapValues(model.predict) // ERROR here !!!
}
Could someone tell me where I can find the mapValues function? Thanks.

import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
should fix it.
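mapValues is not defined on DStream itself; it comes from PairDStreamFunctions via that implicit conversion, so the import just needs to be in scope in the file. A minimal sketch of how the method from the question fits together once the import is added (the surrounding class, the model field and assertInitialized below are stand-ins for the StreamingKMeans code you copied, not part of the answer):

import scala.reflect.ClassTag
import org.apache.spark.mllib.clustering.StreamingKMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream
// this import brings the implicit conversion that adds mapValues
// (and the other PairDStreamFunctions) to DStream[(K, V)]
import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions

class MyStreamingKMeans {                      // stand-in for the copied StreamingKMeans class
  private var model: StreamingKMeansModel = _  // initialised elsewhere, as in the original code

  private def assertInitialized(): Unit =
    require(model != null, "initialise the model before calling predictOnValues")

  def predictOnValues[K: ClassTag](data: DStream[(K, Vector)]): DStream[(K, Int)] = {
    assertInitialized()
    data.mapValues(model.predict) // compiles once the implicit conversion is in scope
  }
}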

Related

Is it possible to access the underlying org.apache.hadoop.mapreduce.Job from a Scalding job?

In my Scalding job, I have code like this:
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class MyJob(args: Args) extends Job(args) {
  FileInputFormat.setInputPathFilter(???, classOf[MyFilter])
  // ... rest of job ...
}

class MyFilter extends PathFilter {
  def accept(path: Path): Boolean = true
}
My problem is that the first argument of the FileInputFormat.setInputPathFilter method needs to be of type org.apache.hadoop.mapreduce.Job. How can I access the Hadoop job object in my Scalding job?
Disclaimer
There is no way to extract the Job class, but you can (though you never should!) extract the JobConf. After that you will be able to use FileInputFormat.setInputPathFilter from the old mapred (v1) API (org.apache.hadoop.mapred.JobConf), which will let you achieve the filtering.
But I suggest you not do this; read the end of the answer.
How can you do this?
Override the stepStrategy method of scalding.Job to provide a FlowStepStrategy. For example, this implementation changes the name of each MapReduce job:
import java.util
import cascading.flow.{Flow, FlowStep, FlowStepStrategy}
import org.apache.hadoop.mapred.JobConf

override def stepStrategy: Option[FlowStepStrategy[_]] = Some(new FlowStepStrategy[AnyRef] {
  override def apply(flow: Flow[AnyRef], predecessorSteps: util.List[FlowStep[AnyRef]], step: FlowStep[AnyRef]): Unit =
    step.getConfig match {
      case conf: JobConf =>
        // here you can modify the JobConf of each job
        conf.setJobName(...)
      case _ =>
    }
})
Why should one not do this?
Accessing the JobConf to add path filtering will only work if you are using specific Sources, and will break if you are using others. You will also be mixing different levels of abstraction. And that is before the question of how you are supposed to know which JobConf you actually need to modify (most Scalding jobs I have seen are multi-step).
How should one resolve this problem?
I suggest you look closely at the type of Source you are using. I am pretty sure there is a way to apply path filtering there, during or before Pipe (or TypedPipe) construction, as sketched below.
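For example, if the input is plain text, one option (just a sketch; the Source, the filtering rule and the argument names below are illustrative, not taken from the question) is to resolve and filter the paths up front with the Hadoop FileSystem API and hand only the surviving paths to the Source:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.scalding._

class FilteredInputJob(args: Args) extends Job(args) {
  // list the input directory on the submitting side and keep only the paths we want
  private val fs = FileSystem.get(new URI(args("input")), new Configuration())
  private val inputs: Seq[String] =
    fs.listStatus(new Path(args("input")))
      .map(_.getPath)
      .filter(p => !p.getName.startsWith("_"))   // your filtering rule goes here
      .map(_.toString)

  // only the filtered paths ever reach the Source, so no JobConf surgery is needed
  TypedPipe.from(MultipleTextLineFiles(inputs: _*))
    .write(TypedTsv[String](args("output")))
}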

How to allow spark to ignore missing input files?

I want to run a spark job (spark v1.5.1) over some generated S3 paths containing avro files. I'm loading them with:
val avros = paths.map(p => sqlContext.read.avro(p))
Some of the paths will not exist though. How can I get spark to ignore those empty paths? Previously I've used this answer, but I'm not sure how to use that with the new dataframe API.
Note: I'm ideally looking for a similar approach to the linked answer that just makes input paths optional. I don't particularly want to have to explicitly check for the existence of paths in S3 (since that's cumbersome and may make development awkward), but I guess that's my fallback if there's no clean way to implement this now.
I would use the Scala Try type to handle the possibility of failure when reading a directory of Avro files. With Try we can make the possibility of failure explicit in our code and handle it in a functional manner:
object Main extends App {
  import scala.util.{Success, Try}
  import org.apache.spark.{SparkConf, SparkContext}
  import com.databricks.spark.avro._

  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("example"))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // the first path exists, the second one doesn't
  val paths = List("/data/1", "/data/2")

  // Wrap the attempt to read the paths in a Try, then use collect to filter
  // and map with a single partial function.
  val avros =
    paths
      .map(p => Try(sqlContext.read.avro(p)))
      .collect {
        case Success(df) => df
      }

  // Do whatever you want with your list of dataframes
  avros.foreach { df =>
    println(df.collect())
  }

  sc.stop()
}
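If you then want a single DataFrame rather than a list, the successful reads can be folded together; a small sketch using Spark 1.x's unionAll, assuming at least one path was readable and that all files share the same schema:

// collapse the successfully read DataFrames into one
// (reduce assumes avros is non-empty; reduceOption is the safe variant)
val all = avros.reduce(_ unionAll _)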

How to return Multiple Tuples Using Custom Loader Function in Pig

I have written a custom loader function by implementing the LoadFunc class.
Now I want to return multiple lines as input from the getNext() method.
I have used a DataBag like
databag.add(tuple1);
databag.add(tuple2);
then
tuple3.set(0,databag);
and return tuple3 in the getNext() method.
But I got an error
org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a bag to a String
Can you please guide me on how to proceed and, if the approach is incorrect, how to tackle this problem?
Thanks, cheers :))
If you want a bag with multiple tuples in it, then this is the way to generate it: create (and populate) the tuples first, then add those tuples to the DataBag. Note that getNext() still has to return a Tuple, so wrap the bag in a single-field tuple, and make sure the loader's schema declares that field as a bag; the ERROR 1071 above usually means Pig is treating the field as a chararray.
BagFactory bf = BagFactory.getInstance();
DataBag output = bf.newDefaultBag();
...
TupleFactory tp = TupleFactory.getInstance();
Tuple t1 = tp.newTuple(2);
....
t1.set(0, key_out);
t1.set(1, value_out);
output.add(t1);
// getNext() has to return a Tuple, so wrap the bag in a one-field tuple
Tuple result = tp.newTuple(1);
result.set(0, output);
return result;

Sentence detection using opennlp on hadoop

I want to do sentence detection using OpenNLP and Hadoop. I have implemented it successfully in plain Java and now want to implement it on the MapReduce platform. Can anyone help me out?
I have done this in two different ways.
One way is to push your sentence detection model out to each node, to a standard directory (e.g. /opt/opennlpmodels/), read in the serialized model at the class level in your mapper class, and then use it appropriately in your map or reduce function.
Another way is to put the model in a database or in the distributed cache (as a blob or something; I have used Accumulo to store document categorization models this way before). Then, at the class level, make the connection to the database and get the model as a ByteArrayInputStream.
I have used Puppet to push out the models, but use whatever you typically use to keep files up to date on your cluster.
Depending on your Hadoop version, you may be able to sneak the model in as a property on job setup, so that only the master (or wherever you launch jobs from) needs to have the actual model file on it. I've never tried this.
If you need to know how to actually use the OpenNLP sentence detector let me know and I'll post an example.
HTH
import java.io.File;
import java.io.FileInputStream;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

  SentenceDetector sd;

  public Span[] getSentences(String docTextFromMapFunction) throws Exception {
    if (sd == null) {
      sd = new SentenceDetectorME(new SentenceModel(
          new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))));
    }
    /**
     * this gives you the actual sentences as a string array
     */
    // String[] sentences = sd.sentDetect(docTextFromMapFunction);

    /**
     * this gives you the spans (the char indexes of the start and end of each
     * sentence in the doc)
     */
    Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);

    /**
     * you can do this as well to get the actual sentence strings based on the spans
     */
    // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);

    return sentenceSpans;
  }
}
HTH... just make sure the file is in place. There are more elegant ways of doing this, but this works and it's simple.
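For completeness, a rough sketch (Scala here; the class name and the choice of output key are illustrative only, not part of the answer above) of wiring the detector into a Mapper along the lines of the first approach: load the model once in setup() from the standard per-node path, then emit one record per detected sentence in map():

import java.io.{File, FileInputStream}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}

class SentenceMapper extends Mapper[LongWritable, Text, LongWritable, Text] {

  private var detector: SentenceDetectorME = _

  // load the model once per task from the standard per-node location
  override def setup(context: Mapper[LongWritable, Text, LongWritable, Text]#Context): Unit = {
    val in = new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))
    try detector = new SentenceDetectorME(new SentenceModel(in))
    finally in.close()
  }

  // one output record per sentence, keyed by the input offset
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, LongWritable, Text]#Context): Unit = {
    detector.sentDetect(value.toString).foreach { sentence =>
      context.write(key, new Text(sentence))
    }
  }
}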

How to write Scala 2.9 code that will allow dropping into an interpreter

I am not sure how to write code that will allow dropping into an interpreter from within Scala 2.9 code. This question is a follow-up to this one, which asked for the Scala equivalent of Python's
import pdb
pdb.set_trace()
The advice given there was primarily for Scala 2.8, and the related packages no longer exist in their previous form. Namely,
scala.tools.nsc.Interpreter.{break, breakIf} has been moved to scala.tools.nsc.interpreter.ILoop.{break, breakIf}
DebugParam is now NamedParam in scala.tools.nsc.interpreter
As noted in the original post, the classpath of the parent process is not passed to the new interpreter automatically, so a workaround was presented here. Unfortunately, many of the classes/methods invoked there have since changed, and I'm not quite sure how to modify the code to behave as expected.
Thanks!
EDIT: Here is my test code, which currently compiles and runs, but attempting to execute anything in the debugger causes the application to freeze when it is compiled with scalac and executed with scala:
import scala.tools.nsc.interpreter.ILoop._

object Main extends App {

  case class C(a: Int, b: Double, c: String) {
    def throwAFit(): Unit = {
      println("But I don't wanna!!!")
    }
  }

  // main
  override def main(args: Array[String]): Unit = {
    val c = C(1, 2.0, "davis")
    0.until(10).foreach { i =>
      println("i = " + i)
      breakIf(i == 5)
    }
  }
}
EDIT 2: As my current setup runs through sbt, I have discovered that this topic is covered in the FAQ (bottom of the page). However, I do not understand the explanation given, and any clarification on MyType would be invaluable.
EDIT3: another discussion on the topic without a solution: http://permalink.gmane.org/gmane.comp.lang.scala.simple-build-tool/1622
So I know this is an old question, but if your REPL is hanging, I wonder if the problem is that you need to supply the -Yrepl-sync option? When my embedded REPL was hanging in a similar situation, that solved it for me.
To set -Yrepl-sync in an embedded REPL, instead of using breakIf you'll need to work with the ILoop directly so you can access the Settings object:
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.{ILoop, SimpleReader}

// create the ILoop
val repl = new ILoop
repl.settings = new Settings
repl.in = SimpleReader()

// set the "-Yrepl-sync" option
repl.settings.Yreplsync.value = true

// start the interpreter and then close it after you :quit
repl.createInterpreter()
repl.loop()
repl.closeInterpreter()
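If the embedded interpreter then can't see your application's classes (the classpath issue mentioned at the top of the question), one commonly suggested tweak, assuming your classes are on the launching JVM's classpath, is to enable usejavacp on the same Settings object:

// reuse the launching JVM's classpath inside the embedded REPL
repl.settings.usejavacp.value = true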
