I have a Spark Kafka streaming job. Below is the main processing logic.
val processedStream = rawStream.transform(x => {
  val offsetRanges = x.asInstanceOf[HasOffsetRanges].offsetRanges
  val spark = SparkSession.builder.config(x.sparkContext.getConf).getOrCreate()
  val parsedRDD = x.map(cr => cr.value())
  var df = spark.sqlContext.read.schema(KafkaRawEvent.getStructure()).json(parsedRDD)

  // Explode Events array as individual Event
  if (DFUtils.hasColumn(df, "events")) {
    // Rename the dow and hour
    if (DFUtils.hasColumn(df, "dow"))
      df = df.withColumnRenamed("dow", "hit-dow")
    if (DFUtils.hasColumn(df, "hour"))
      df = df.withColumnRenamed("hour", "hit-hour")

    df = df
      .withColumn("event", explode(col("events")))
      .drop("events")

    if (DFUtils.hasColumn(df, "event.key")) {
      df = df.select(
        "*", "event.key",
        "event.count", "event.hour",
        "event.dow", "event.sum",
        "event.timestamp",
        "event.segmentation")
    }

    if (DFUtils.hasColumn(df, "key")) {
      df = df.filter("key != '[CLY]_view'")
    }

    df = df.select("*", "segmentation.*")
      .drop("segmentation")
      .drop("event")

    if (DFUtils.hasColumn(df, "metrics")) {
      df = df.select("*", "metrics.*").drop("metrics")
    }

    df = df.withColumnRenamed("timestamp", "eventTimeString")
    df = df.withColumn("eventtimestamp", df("eventTimeString").cast(LongType).divide(1000).cast(TimestampType).cast(DateType))
      .withColumn("date", current_date())

    if (DFUtils.hasColumn(df, "restID")) {
      df = df.join(broadcast(restroCached), df.col("restID") === restro.col("main_r_id"), "left_outer")
    }

    val SAVE_PATH = Conf.getSavePath()

    // Write dataframe to file
    df.write.partitionBy("date").mode("append").parquet(SAVE_PATH)

    // filter out app launch events and group by adId and push to kafka
    val columbiaDf = df.filter(col("adId").isNotNull)
      .select(col("adId")).distinct().toDF("cui").toJSON

    // push columbia df to kafka for further processing
    columbiaDf.foreachPartition(partitionOfRecords => {
      val factory = columbiaProducerPool.value
      val producer = factory.getOrCreateProducer()
      partitionOfRecords.foreach(record => producer.send(record))
    })

    df.toJSON.foreachPartition(partitionOfRecords => {
      val factory = producerPool.value
      val producer = factory.getOrCreateProducer()
      partitionOfRecords.foreach(record => producer.send(record))
    })
  }

  rawStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  df.toJSON.rdd
})
val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60))

windowOneHourEveryMinute.foreachRDD(windowRDD => {
  val prefix = Conf.getAnalyticsPrefixesProperties().getProperty("view.all.last.3.hours")
  val viewCount = spark.sqlContext.read.schema(ProcessedEvent.getStructure()).json(windowRDD)
    .filter("key == 'RestaurantView'")
    .groupBy("restID")
    .count()
    .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))))
  spark.sparkContext.toRedisKV(viewCount, Conf.getRedisKeysTTL())
})

streamingContext.start()
streamingContext.awaitTermination()
This job had been running for almost a month without a single failure, but today the processing time suddenly started increasing exponentially, even though there are no events being processed.
I am not able to figure out why this is happening. Below I am attaching a screenshot of the application master.
Below is the graph of the processing time.
From the Jobs tab in the Spark UI, most of the time is spent on the line .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1)))).
Below is the DAG for the stage
This is just a small fraction of the DAG. The actual DAG is quite large, and the same tasks are repeated up to the total number of RDDs in the window, i.e. with a 1-minute batch interval a 30-minute window contains 30 copies of the same repetitive tasks.
Is there any concrete reason why the processing time suddenly started growing exponentially?
Spark version: 2.2.0
Hadoop version: 2.7.3
NOTE: I am running this job on an EMR 5.8 cluster with 1 driver of 2.5 GB and 1 executor of 3.5 GB.
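One mitigation I am considering (sketched below, not part of the job above, and untested) is to enable checkpointing so that the lineage of the windowed stream is truncated periodically instead of every windowed batch dragging along the DAG of all its parent batches; the checkpoint directory and interval here are assumptions:

  // Sketch (assumed directory and interval): checkpoint the streaming context and the
  // windowed stream so Spark can periodically truncate the growing RDD lineage.
  streamingContext.checkpoint("hdfs:///tmp/streaming-checkpoints")
  val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60))
  windowOneHourEveryMinute.checkpoint(Minutes(10))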
Related
I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
  .withPartitionStrategy(new DynamicPartitionStrategy)

val hiveOps: HiveOps = ...

val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))

hiveOps.createTable(
  testDBName,
  testTableName,
  schema,
  partitionKeys = Seq("pk", "pk1"),
  dialect = ParquetHiveDialect(),
  tableType = TableType.EXTERNAL_TABLE,
  overwrite = true
)

val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
Getting error: Number of partitions scanned (=32767) exceeds limit (=10000).
The number 32767 is Short.MaxValue (2^15 - 1), but I still can't figure out what is wrong. Any idea?
A related question, "Spark + Hive: Number of partitions scanned exceeds limit (=4000)", suggests these workarounds:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"
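For reference, the same two flags can also be set programmatically when the session is created; a minimal sketch, assuming Spark 2.x with Hive support (the app name is made up):

  import org.apache.spark.sql.SparkSession

  // Sketch: apply the partition-pruning workarounds while building the session.
  val spark = SparkSession.builder()
    .appName("hive-partition-workaround")
    .config("spark.sql.hive.convertMetastoreOrc", "false")
    .config("spark.sql.hive.metastorePartitionPruning", "false")
    .enableHiveSupport()
    .getOrCreate()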
I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think that I can do two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think that the second architecture is the best, since Spark Streaming will use the indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming (after Elasticsearch), must we use this model in that place (after Elasticsearch), or can we use it in Spark Streaming directly after Kafka? (By "use" I mean predict in real time.)
- Does creating models after Elasticsearch make our models static (i.e. not a real-time approach)?
Thank you.
You mean this?
Kafka -> Spark Streaming -> Elasticsearch database
val sqlContext = new SQLContext(sc)

// kafka group
val group_id = "receiveScanner"
// kafka topic
val topic = Map("testStreaming" -> 1)
// zk connect
val zkParams = Map(
  "zookeeper.connect" -> "localhost",
  "zookeeper.connection.timeout.ms" -> "10000",
  "group.id" -> group_id)

// Kafka
val kafkaConsumer = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, zkParams, topic, StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2)

// print kafka data
receiveData.print()

receiveData.foreachRDD { rdd =>
  val transform = rdd.map { line =>
    val data = Json.parse(line)
    // play json parse
    val id = (data \ "id").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val name = (data \ "name").asOpt[String] match { case Some(x) => x; case None => "" }
    val age = (data \ "age").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val address = (data \ "address").asOpt[String] match { case Some(x) => x; case None => "" }
    Row(id, name, age, address)
  }
  val transfromrecive = sqlContext.createDataFrame(transform, schameType)

  import org.apache.spark.sql.functions._
  import org.elasticsearch.spark.sql._

  // filter age < 20, to ES database
  transfromrecive.where(col("age").<(20)).orderBy(col("age").asc)
    .saveToEs("member/user", Map("es.mapping.id" -> "id"))
}

/**
 * dataframe schema
 */
def schameType = StructType(
  StructField("id", IntegerType, false) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) ::
  StructField("address", StringType, false) ::
  Nil
)
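For the read-back path in option 2 (Elasticsearch back into Spark for the ML step), the elasticsearch-hadoop connector can also load an index into a DataFrame; a minimal sketch, assuming the connector is on the classpath and reusing the member/user index written above:

  import org.elasticsearch.spark.sql._

  // Sketch (untested): read the previously indexed documents back into Spark
  // so a batch or streaming ML job can train on them.
  val membersDF = sqlContext.esDF("member/user")
  membersDF.printSchema()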
I'm trying to implement an Akka Stream that reads frames from a video file and applies a SVM Classifier in order to detect objects on each frame. The detection can run in parallel because the order of the video frames does not matter. My idea is to create a graph that follows the Akka Streams Cookbook (Balancing jobs to a fixed pool of workers) having two detection stages marked as .async.
It works to a certain extent as expected, but I noticed that the memory pressure of my system (only 8 GB available) increases dramatically and goes off the charts, slowing the system down significantly. Comparing this with a different approach that uses .mapAsync (Akka Docs) to integrate three actors into the stream performing the object detection, the memory pressure is significantly lower.
What am I missing? Why does running two stages in parallel increase the memory pressure while three parallel running actors seem to work fine?
Additional remarks: I'm using OpenCV for reading the video file. Due to the 4K resolution, each video frame of type Mat is about 26.5 MB.
Running two stages in parallel with .async dramatically increasing memory pressure
implicit val materializer = ActorMaterializer(
ActorMaterializerSettings(actorSystem)
.withInputBuffer(initialSize = 1, maxSize = 1)
.withOutputBurstLimit(1)
.withSyncProcessingLimit(2)
)
val greyscaleConversion: Flow[Frame, Frame, NotUsed] =
Flow[Frame].map { el => Frame(el.videoPos, FrameTransformation.transformToGreyscale(el.frame)) }
val objectDetection: Flow[Frame, DetectedObjectPos, NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._

    val numberOfDetectors = 2
    val frameBalance: UniformFanOutShape[Frame, Frame] = builder.add(Balance[Frame](numberOfDetectors, waitForAllDownstreams = true))
    val detectionMerge: UniformFanInShape[DetectedObjectPos, DetectedObjectPos] = builder.add(Merge[DetectedObjectPos](numberOfDetectors))

    for (i <- 0 until numberOfDetectors) {
      val detectionFlow: Flow[Frame, DetectedObjectPos, NotUsed] = Flow[Frame].map { greyFrame =>
        val classifier = new CascadeClassifier()
        classifier.load("classifier.xml")
        val detectedObjects: MatOfRect = new MatOfRect()
        classifier.detectMultiScale(greyFrame.frame, detectedObjects, 1.08, 5, 0 | Objdetect.CASCADE_SCALE_IMAGE, new Size(40, 20), new Size(100, 80))
        DetectedObjectPos(greyFrame.videoPos, detectedObjects)
      }
      frameBalance.out(i) ~> detectionFlow.async ~> detectionMerge.in(i)
    }

    FlowShape(frameBalance.in, detectionMerge.out)
  })
def createGraph(videoFile: Video): RunnableGraph[NotUsed] = {
  Source.fromGraph(new VideoSource(videoFile))
    .via(greyscaleConversion).async
    .via(objectDetection)
    .to(Sink.foreach[DetectedObjectPos](detectionDisplayActor ! _))
}
Integrating actors with .mapAsync not increasing memory pressure
val greyscaleConversion: Flow[Frame, Frame, NotUsed] =
Flow[Frame].map { el => Frame(el.videoPos, FrameTransformation.transformToGreyscale(el.frame)) }
val detectionRouter: ActorRef =
actorSystem.actorOf(RandomPool(numberOfDetectors).props(Props[DetectionActor]), "detectionRouter")
val detectionFlow: Flow[Frame, DetectedObjectPos, NotUsed] =
Flow[Frame].mapAsyncUnordered(parallelism = 3)(el => (detectionRouter ? el).mapTo[DetectedObjectPos])
def createGraph(videoFile: Video): RunnableGraph[NotUsed] = {
  Source.fromGraph(new VideoSource(videoFile))
    .via(greyscaleConversion)
    .via(detectionFlow)
    .to(Sink.foreach[DetectedObjectPos](detectionDisplayActor ! _))
}
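One knob worth experimenting with in the .async variant (a sketch under assumptions, not a verified fix for this pipeline) is the input buffer that Akka Streams attaches to every asynchronous boundary: with frames of roughly 26.5 MB, even a small per-boundary buffer pins a lot of memory, so capping it at one element on the detection stages keeps at most one frame queued per detector.

  import akka.NotUsed
  import akka.stream.Attributes
  import akka.stream.scaladsl.Flow

  // Sketch (untested): cap the per-stage input buffer at one element. Here
  // `detectionFlow` stands for the per-detector Flow[Frame, DetectedObjectPos, NotUsed]
  // built inside the GraphDSL loop of the first snippet.
  def boundedDetection(detectionFlow: Flow[Frame, DetectedObjectPos, NotUsed]): Flow[Frame, DetectedObjectPos, NotUsed] =
    detectionFlow
      .addAttributes(Attributes.inputBuffer(initial = 1, max = 1))
      .async

  // inside the GraphDSL loop:
  // frameBalance.out(i) ~> boundedDetection(detectionFlow) ~> detectionMerge.in(i)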
There is a table with two columns, books and readers of these books, where books and readers are book and reader IDs, respectively:
   books  readers
1:     1       30
2:     2       10
3:     3       20
4:     1       20
5:     1       10
6:     2       30
Record book = 1, reader = 30 means that book with id = 1 was read by user with id = 30.
For each book pair I need to count the number of readers who read both of these books, with this algorithm:
for each book
  for each reader of the book
    for each other_book in books of the reader
      increment common_reader_count((book, other_book), cnt)
The advantage of using this algorithm is that it requires a small number of operations compared to counting over all pairwise book combinations directly.
To implement the above algorithm I organize this data into two groups: 1) keyed by book, an RDD containing the readers of each book, and 2) keyed by reader, an RDD containing the books read by each reader, as in the following program:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object Small {

  case class Book(book: Int, reader: Int)
  case class BookPair(book1: Int, book2: Int, cnt: Int)

  val recs = Array(
    Book(book = 1, reader = 30),
    Book(book = 2, reader = 10),
    Book(book = 3, reader = 20),
    Book(book = 1, reader = 20),
    Book(book = 1, reader = 10),
    Book(book = 2, reader = 30))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // set up environment
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(recs)

    val bookMap = data.map(r => (r.book, r))
    val bookGrps = bookMap.groupByKey

    val readerMap = data.map(r => (r.reader, r))
    val readerGrps = readerMap.groupByKey

    // *** Calculate book pairs
    // Iterate book groups
    val allBookPairs = bookGrps.map(bookGrp => bookGrp match {
      case (book, recIter) =>
        // Iterate user groups
        recIter.toList.map(rec => {
          // Find readers for this book
          val aReader = rec.reader
          // Find all books (including this one) that this reader read
          val allReaderBooks = readerGrps.filter(readerGrp => readerGrp match {
            case (reader2, recIter2) => reader2 == aReader
          })
          val bookPairs = allReaderBooks.map(readerTuple => readerTuple match {
            case (reader3, recIter3) => recIter3.toList.map(rec => ((book, rec.book), 1))
          })
          bookPairs
        })
    })

    val x = allBookPairs.flatMap(identity)
    val y = x.map(rdd => rdd.first)
    val z = y.flatMap(identity)
    val p = z.reduceByKey((cnt1, cnt2) => cnt1 + cnt2)

    val result = p.map(bookPair => bookPair match {
      case ((book1, book2), cnt) => BookPair(book1, book2, cnt)
    })

    val resultCsv = result.map(pair => resultToStr(pair))
    resultCsv.saveAsTextFile("./result.csv")
  }

  def resultToStr(pair: BookPair): String = {
    val sep = "|"
    pair.book1 + sep + pair.book2 + sep + pair.cnt
  }
}
This implementation in fact results in a different, inefficient algorithm:
for each book
  find each reader of the book, scanning all readers every time!
  for each other_book in books of the reader
    increment common_reader_count((book, other_book), cnt)
which contradicts the main goal of the algorithm discussed above: instead of reducing the number of operations, it increases it. Finding a user's books requires filtering all users for every book, so the number of operations is ~ N * M, where N is the number of users and M is the number of books.
Questions:
Is there any way to implement the original algorithm in Spark without filtering complete reader collection for every book?
Any other algorithms to compute book pair counts efficiently?
Also, when actually running this code I get a filter exception whose cause I cannot figure out. Any ideas?
Please, see exception log below:
15/05/29 18:24:05 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
15/05/29 18:24:05 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/05/29 18:24:09 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/29 18:24:10 INFO Remoting: Starting remoting
15/05/29 18:24:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:12 ERROR executor.Executor: Exception in task 0.0 in stage 6.0 (TID 4)
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:58)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:54)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at Small$$anonfun$4.apply(Small.scala:54)
at Small$$anonfun$4.apply(Small.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Update:
This code:
val df = sc.parallelize(Array((1,30),(2,10),(3,20),(1,10),(2,30))).toDF("books","readers")
val results = df.join(
df.select($"books" as "r_books", $"readers" as "r_readers"),
$"readers" === $"r_readers" and $"books" < $"r_books"
)
.groupBy($"books", $"r_books")
.agg($"books", $"r_books", count($"readers"))
Gives the following result:
books  r_books  COUNT(readers)
    1        2               2
So COUNT here is the number of times two books (here 1 and 2) were read together (the count of pairs).
This kind of thing is a lot easier if you convert the original RDD to a DataFrame:
val df = sc.parallelize(
Array((1,30),(2,10),(3,20),(1,10), (2,30))
).toDF("books","readers")
Once you do that, just do a self-join on the DataFrame to make book pairs, then count how many readers have read each book pair:
val results = df.join(
df.select($"books" as "r_books", $"readers" as "r_readers"),
$"readers" === $"r_readers" and $"books" < $"r_books"
).groupBy(
$"books", $"r_books"
).agg(
$"books", $"r_books", count($"readers")
)
As for additional explanation about that join, note that I am joining df back onto itself -- a self-join: df.join(df.select(...), ...). What you are looking to do is to stitch together book #1 -- $"books" -- with a second book -- $"r_books" -- from the same reader -- $"readers" === $"r_readers". But if you joined only with $"readers" === $"r_readers", you would get the same book joined back onto itself. Instead, I use $"books" < $"r_books" to ensure that the ordering in the book pairs is always (<lower_id>,<higher_id>).
Once you do the join, you get a DataFrame with a line for every reader of every book pair. The groupBy and agg functions do the actual counting of the number of readers per book pairing.
Incidentally, if a reader read the same book twice, I believe you would end up with a double-counting, which may or may not be what you want. If that's not what you want just change count($"readers") to countDistinct($"readers").
If you want to know more about the agg functions count() and countDistinct() and a bunch of other fun stuff, check out the scaladoc for org.apache.spark.sql.functions
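If you would rather stay in the RDD API, a roughly equivalent approach (a sketch, untested) is to key the records by reader and join that RDD with itself, which avoids filtering the whole reader collection for every book. It also sidesteps the NullPointerException in the posted code, which comes from referencing readerGrps inside a transformation on bookGrps; RDDs cannot be used inside transformations of other RDDs. data and BookPair below are the ones defined in the question.

  // Sketch: self-join on reader, then count each unordered book pair once.
  val byReader = data.map(r => (r.reader, r.book))        // (reader, book)
  val pairCounts = byReader.join(byReader)                // (reader, (book1, book2))
    .filter { case (_, (b1, b2)) => b1 < b2 }             // keep each pair once, drop self-pairs
    .map { case (_, pair) => (pair, 1) }
    .reduceByKey(_ + _)                                   // ((book1, book2), common readers)
  val bookPairCounts = pairCounts.map { case ((b1, b2), cnt) => BookPair(b1, b2, cnt) }

The same duplicate-reader caveat applies here as for the DataFrame version: if the input contains the same (book, reader) row twice, deduplicate it first (e.g. with distinct) to avoid double-counting.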
I am using Spark Streaming 1.1.0 locally (not in a cluster).
I created a simple app that parses the data (about 10,000 entries), stores it in a stream and then makes some transformations on it. Here is the code:
def main(args: Array[String]) {
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))

  val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
  val parsedStream = parse(stream)

  parsedStream.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))

  val result1 = parsedStream
    .filter(entry => entry.symbol.contains("walking")
      && entry.symbol.contains("true") && entry.symbol.contains("id0"))
    .map(_.time)

  val result2 = parsedStream
    .filter(entry =>
      entry.symbol == "disappear" && entry.symbol.contains("id0"))
    .map(_.time)

  val result3 = result1
    .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))

  result3.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))

  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
      tuple match {
        case pattern(symbol, time) =>
          new Data(symbol, time.toInt)
      }
    }
  }
}

case class Data(symbol: String, time: Int)
I have a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark would be very fast, but in this case it takes about 3.5 seconds to execute the rule (between "RULE STARTS" and "RULE ENDS"). Am I doing something wrong, or is this the expected time? Any advice?
So I was using case matching in a lot of my jobs and it killed performance, more so than when I introduced a JSON parser. Also try tweaking the batch time on the StreamingContext; it made quite a bit of difference for me. Also, how many local workers do you have?
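One concrete thing to check in the posted parse function (sketched below, not tested against your data): the regex is recompiled for every entry inside the map. Hoisting the pattern out of the closure, and dropping non-matching entries instead of letting them throw a MatchError, removes that per-record cost; Data and DStream are the same types as in the question.

  import org.apache.spark.streaming.dstream.DStream

  // Sketch: compile the pattern once instead of once per entry.
  val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r

  def parse(stream: DStream[String]): DStream[Data] =
    stream.flatMap { line =>
      line.split("assert")
        .filter(_.nonEmpty)
        .flatMap {
          case pattern(symbol, time) => Some(Data(symbol, time.toInt))
          case _                     => None // skip entries that do not match
        }
    }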