I am using spark streaming 1.1.0 locally (not in a cluster).
I created a simple app that parses the data (about 10,000 entries), stores it in a stream, and then runs some transformations on it. Here is the code:
def main(args: Array[String]) {
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))
  val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
  val parsedStream = parse(stream)

  parsedStream.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))

  val result1 = parsedStream
    .filter(entry => entry.symbol.contains("walking")
      && entry.symbol.contains("true") && entry.symbol.contains("id0"))
    .map(_.time)

  val result2 = parsedStream
    .filter(entry =>
      entry.symbol == "disappear" && entry.symbol.contains("id0"))
    .map(_.time)

  val result3 = result1
    .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))

  result3.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))

  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
      tuple match {
        case pattern(symbol, time) =>
          new Data(symbol, time.toInt)
      }
    }
  }
}
case class Data (symbol: String, time: Int)
I use a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark would be very fast. In this case, it takes about 3.5 seconds to execute the rule (between "RULE STARTS" and "RULE ENDS"). Am I doing something wrong, or is this the expected time? Any advice?
So I was using case matching in a lot of my jobs and it killed performance, more than when I introduced a JSON parser. Also, try tweaking the batch time on the StreamingContext; it made quite a bit of difference for me. Also, how many local workers do you have?
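For example, a minimal sketch of the parse function with the regex compiled once at object level instead of once per record, and with non-matching entries dropped rather than throwing a MatchError (this assumes the Data case class from the question and is not guaranteed to account for the whole 3.5 s):

import org.apache.spark.streaming.dstream.DStream

object FastParse extends Serializable {
  // Compiled once per JVM instead of once per parsed entry.
  private val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r

  def parse(stream: DStream[String]): DStream[Data] =
    stream.flatMap { line =>
      line.split("assert")
        .filter(_.nonEmpty)
        .collect { case pattern(symbol, time) => Data(symbol, time.toInt) }
    }
}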
Related
I have a Spark Kafka streaming job. Below is the main job processing logic.
val processedStream = rawStream.transform(x => {
  var offsetRanges = x.asInstanceOf[HasOffsetRanges].offsetRanges;
  val spark = SparkSession.builder.config(x.sparkContext.getConf).getOrCreate();
  val parsedRDD = x.map(cr => cr.value());
  var df = spark.sqlContext.read.schema(KafkaRawEvent.getStructure()).json(parsedRDD);

  // Explode Events array as individual Event
  if (DFUtils.hasColumn(df, "events")) {
    // Rename the dow and hour
    if (DFUtils.hasColumn(df, "dow"))
      df = df.withColumnRenamed("dow", "hit-dow");
    if (DFUtils.hasColumn(df, "hour"))
      df = df.withColumnRenamed("hour", "hit-hour");

    df = df
      .withColumn("event", explode(col("events")))
      .drop("events");

    if (DFUtils.hasColumn(df, "event.key")) {
      df = df.select(
        "*", "event.key",
        "event.count", "event.hour",
        "event.dow", "event.sum",
        "event.timestamp",
        "event.segmentation");
    }
    if (DFUtils.hasColumn(df, "key")) {
      df = df.filter("key != '[CLY]_view'");
    }

    df = df.select("*", "segmentation.*")
      .drop("segmentation")
      .drop("event");
    if (DFUtils.hasColumn(df, "metrics")) {
      df = df.select("*", "metrics.*").drop("metrics");
    }

    df = df.withColumnRenamed("timestamp", "eventTimeString");
    df = df.withColumn("eventtimestamp", df("eventTimeString").cast(LongType).divide(1000).cast(TimestampType).cast(DateType))
      .withColumn("date", current_date());

    if (DFUtils.hasColumn(df, "restID")) {
      df = df.join(broadcast(restroCached), df.col("restID") === restro.col("main_r_id"), "left_outer");
    }

    val SAVE_PATH = Conf.getSavePath();
    // Write dataframe to file
    df.write.partitionBy("date").mode("append").parquet(SAVE_PATH);

    // filter out app launch events and group by adId and push to kafka
    val columbiaDf = df.filter(col("adId").isNotNull)
      .select(col("adId")).distinct().toDF("cui").toJSON;

    // push columbia df to kafka for further processing
    columbiaDf.foreachPartition(partitionOfRecords => {
      val factory = columbiaProducerPool.value;
      val producer = factory.getOrCreateProducer();
      partitionOfRecords.foreach(record => {
        producer.send(record);
      });
    });

    df.toJSON
      .foreachPartition(partitionOfRecords => {
        val factory = producerPool.value;
        val producer = factory.getOrCreateProducer();
        partitionOfRecords.foreach(record => {
          producer.send(record);
        });
      });
  }

  rawStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges);
  df.toJSON.rdd;
});

val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60));

windowOneHourEveryMinute.foreachRDD(windowRDD => ({
  val prefix = Conf.getAnalyticsPrefixesProperties().getProperty("view.all.last.3.hours");
  val viewCount = spark.sqlContext.read.schema(ProcessedEvent.getStructure()).json(windowRDD)
    .filter("key == 'RestaurantView'")
    .groupBy("restID")
    .count()
    .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
  spark.sparkContext.toRedisKV(viewCount, Conf.getRedisKeysTTL());
}));

streamingContext.start();
streamingContext.awaitTermination();
This job had been running for almost a month without a single failure; today the processing time suddenly started increasing exponentially, although there are no events being processed.
I am not able to figure out why this is happening. Below I am attaching a screenshot of the application master.
Below is the graph of processing time:
From the Jobs tab in the Spark UI, most of the time is spent on the line .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
Below is the DAG for the stage.
This is just a small fraction of the DAG. The actual DAG is quite large, but the same tasks are repeated up to the total number of RDDs in the window, i.e. I am running a batch of 1 minute, so a window of 30 minutes contains 30 of the same repetitive tasks.
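For context, a windowed DStream is computed by unioning the parent stream's batch RDDs that fall inside the window, which is why each batch's lineage reappears in the DAG. A sketch of what that implies here (persisting the parent stream is an assumption on my part, not part of the original job):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds}

// Each output of window(Minutes(60), Seconds(60)) references the batch RDDs
// currently inside the window. If those RDDs are not persisted, their full
// lineage can be re-evaluated for every window output.
processedStream.persist(StorageLevel.MEMORY_AND_DISK_SER)
val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60));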
Is there any concrete reason why, all of a sudden, the processing time started growing exponentially?
Spark version: 2.2.0
Hadoop version: 2.7.3
NOTE: I am running this job on an EMR 5.8 cluster with 1 driver of 2.5 GB and 1 executor of 3.5 GB.
I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think that I can do one of two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think that the second architecture is the better one, since Spark Streaming will use indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming (after Elasticsearch), must we use this model in that place (after Elasticsearch), or can we use it in Spark Streaming (directly after Kafka)? (By "use" I mean predict in real time.)
- Does creating models after Elasticsearch make our models static (i.e. no longer a real-time approach)?
Thank you.
You mean something like this?
Kafka -> Spark Streaming -> Elasticsearch DB
val sqlContext = new SQLContext(sc)

// kafka group
val group_id = "receiveScanner"
// kafka topic
val topic = Map("testStreaming" -> 1)
// zk connect
val zkParams = Map(
  "zookeeper.connect" -> "localhost",
  "zookeeper.connection.timeout.ms" -> "10000",
  "group.id" -> group_id)

// Kafka
val kafkaConsumer = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, zkParams, topic, StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2)
// print kafka data
receiveData.print()

receiveData.foreachRDD { rdd =>
  val transform = rdd.map { line =>
    val data = Json.parse(line)
    // parse with play-json, defaulting missing fields
    val id = (data \ "id").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val name = (data \ "name").asOpt[String] match { case Some(x) => x; case None => "" }
    val age = (data \ "age").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val address = (data \ "address").asOpt[String] match { case Some(x) => x; case None => "" }
    Row(id, name, age, address)
  }
  val transfromrecive = sqlContext.createDataFrame(transform, schameType)
  import org.apache.spark.sql.functions._
  import org.elasticsearch.spark.sql._
  // filter age < 20, write to the ES index
  transfromrecive.where(col("age").<(20)).orderBy(col("age").asc)
    .saveToEs("member/user", Map("es.mapping.id" -> "id"))
}
}

/**
 * dataframe schema
 */
def schameType = StructType(
  StructField("id", IntegerType, false) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) ::
  StructField("address", StringType, false) ::
  Nil
)
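Note that the snippet assumes an existing sc/ssc and an Elasticsearch cluster reachable through the elasticsearch-hadoop ("elasticsearch-spark") connector. A minimal sketch of that setup (the es.nodes/es.port values and the 10-second batch interval are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: the connector reads "es.*" settings from the SparkConf,
// so saveToEs knows where to write.
val sparkConf = new SparkConf()
  .setAppName("KafkaToElasticsearch")
  .set("es.nodes", "localhost")   // assumed ES host
  .set("es.port", "9200")         // assumed ES REST port
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(10))  // assumed batch interval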
I'm trying to slice an observable stream by itself, e.g.:
val source = Observable.from(1 to 10).share
val boundaries = source.filter(_ % 3 == 0)
val result = source.tumblingBuffer(boundaries)
result.subscribe((buf) => println(buf.toString))
The output is:
Buffer()
Buffer()
Buffer()
Buffer()
source is probably iterated on the boundaries line before it reaches result, so it only creates the boundaries and the resulting buffers, but there's nothing left to fill them with.
My approach to this is using publish/connect:
val source2 = Observable.from(1 to 10).publish
val boundaries2 = source2.filter(_ % 3 == 0)
val result2 = source2.tumblingBuffer(boundaries2)
result2.subscribe((buf) => println(buf.toString))
source2.connect
This produces output alright:
Buffer(1, 2)
Buffer(3, 4, 5)
Buffer(6, 7, 8)
Buffer(9, 10)
Now I just need to hide connect from the outside world and call it when result gets subscribed to (I am doing this inside a class and I don't want to expose it). Something like:
val source3 = Observable.from(1 to 10).publish
val boundaries3 = source3.filter(_ % 3 == 0)
val result3 = source3
  .tumblingBuffer(boundaries3)
  .doOnSubscribe(() => source3.connect)
result3.subscribe((buf) => println(buf.toString))
But now the doOnSubscribe action never gets called, so the published source never gets connected...
What's wrong?
You were on the right track with your publish solution. There is, however, an alternative publish operator that takes a lambda of type Observable[T] => Observable[R] as its argument (see the documentation). The argument of this lambda is the original stream, to which you can safely subscribe multiple times. Within the lambda you transform the original stream to your liking; in your case you filter the stream and buffer it on that filter.
Observable.from(1 to 10)
  .publish(src => src.tumblingBuffer(src.filter(_ % 3 == 0)))
  .subscribe(buf => println(buf.toString()))
The best thing about this operator is that you don't need to call anything like connect afterwards.
My Spark version is 1.2.0, and here's the scenario:
There are two RDDs, namely RDD_A and RDD_B, whose data structures are both RDD[(spid, the_same_spid)]. RDD_A has 20,000 lines whereas RDD_B has 3,000,000,000 lines. I intend to count the lines of RDD_B whose 'spid' exists in RDD_A.
My first implementation is quite mainstream, applying the join method from RDD_B on RDD_A:
val currentDay = args(0)
val conf = new SparkConf().setAppName("Spark-MonitorPlus-LogStatistic")
val sc = new SparkContext(conf)

//---RDD A transforming to RDD[(spid, spid)]---
val spidRdds = sc.textFile("/diablo/task/spid-date/" + currentDay + "-spid-media")
  .map(line => line.split(",")(0).trim)
  .map(spid => (spid, spid))
  .partitionBy(new HashPartitioner(32));
val logRdds: RDD[(LongWritable, Text)] = MzFileUtils.getFileRdds(sc, currentDay, "")
val logMapRdds = MzFileUtils.mapToMzlog(logRdds)

//---RDD B transforming to RDD[(spid, spid)]---
val tongYuanRdd = logMapRdds
  .filter(kvs => kvs("plt") == "0" && kvs("tp") == "imp")
  .map(kvs => kvs("p").trim)
  .map(spid => (spid, spid))
  .partitionBy(new HashPartitioner(32));

//---join---
val filteredTongYuanRdd = tongYuanRdd.join(spidRdds);
println("Total TongYuan Imp: " + filteredTongYuanRdd.count())
However, the result is incorrect (bigger than expected) when compared to Hive's result. When changing the join from a reduce-side join to a map-side join as below, the result is the same as Hive's:
val conf = new SparkConf().setAppName("Spark-MonitorPlus-LogStatistic")
val sc = new SparkContext(conf)

//---RDD A transforming to RDD[(spid, spid)]---
val spidRdds = sc.textFile("/diablo/task/spid-date/" + currentDay + "-spid-media")
  .map(line => line.split(",")(0).trim)
  .map(spid => (spid, spid))
  .partitionBy(new HashPartitioner(32));
val logRdds: RDD[(LongWritable, Text)] = MzFileUtils.getFileRdds(sc, currentDay, "")
val logMapRdds = MzFileUtils.mapToMzlog(logRdds)

//---RDD B transforming to RDD[(spid, spid)]---
val tongYuanRdd = logMapRdds
  .filter(kvs => kvs("plt") == "0" && kvs("tp") == "imp")
  .map(kvs => kvs("p").trim)
  .map(spid => (spid, spid))
  .partitionBy(new HashPartitioner(32));

//---join---
val globalSpids = sc.broadcast(spidRdds.collectAsMap());
val filteredTongYuanRdd = tongYuanRdd.mapPartitions({ iter =>
  val m = globalSpids.value
  for {
    (spid, spid_cp) <- iter
    if m.contains(spid)
  } yield spid
}, preservesPartitioning = true);
println("Total TongYuan Imp: " + filteredTongYuanRdd.count())
As you can see, the only difference between the above two code snippets is the 'join' part.
So, are there any suggestions on addressing this problem? Thanks in advance!
Spark's join doesn't enforce uniqueness of keys; when a key is duplicated, it actually outputs the cross product for that key. Using cogroup and only outputting one k/v pair for each key, or mapping to just the ids and then using intersection, will do the trick.
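A rough sketch of both suggestions, reusing the RDD names from the question (an illustration only, not tested against your data):

// cogroup produces, per key, one (Iterable of B values, Iterable of A values)
// pair, so duplicated keys no longer blow up into a cross product. Counting
// RDD_B's entries only where RDD_A has at least one entry matches the
// map-side-join result.
val filteredCount = tongYuanRdd.cogroup(spidRdds)
  .map { case (_, (bValues, aValues)) =>
    if (aValues.nonEmpty) bValues.size.toLong else 0L
  }
  .fold(0L)(_ + _)
println("Total TongYuan Imp: " + filteredCount)

// Alternatively, map both sides down to just the ids and intersect. Note this
// yields the distinct spids common to both RDDs, not a line count of RDD_B.
val commonSpids = tongYuanRdd.map(_._1).intersection(spidRdds.map(_._1))
println("Distinct common spids: " + commonSpids.count())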
There is a table with two columns, books and readers of these books, where books and readers are book and reader IDs, respectively:
books readers
1: 1 30
2: 2 10
3: 3 20
4: 1 20
5: 1 10
6: 2 30
The record book = 1, reader = 30 means that the book with id = 1 was read by the user with id = 30.
For each book pair I need to count the number of readers who read both of those books, with this algorithm:
for each book
  for each reader of the book
    for each other_book in books of the reader
      increment common_reader_count ((book, other_book), cnt)
The advantage of using this algorithm is that it requires a small number of operations compared to counting over all two-book combinations.
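For reference, a direct Spark rendering of this pseudocode is to key by reader and emit book pairs per reader (a sketch only, using the Book case class from the program below; it is not the implementation shown next):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Book is the same case class as in the program below:
// case class Book(book: Int, reader: Int)
def bookPairCounts(data: RDD[Book]): RDD[((Int, Int), Int)] =
  data.map(r => (r.reader, r.book))
    .groupByKey()                  // books read by each reader
    .flatMap { case (_, books) =>
      val bs = books.toSeq
      for {
        b1 <- bs
        b2 <- bs
        if b1 != b2                // skip pairing a book with itself
      } yield ((b1, b2), 1)
    }
    .reduceByKey(_ + _)            // common_reader_count per (book, other_book)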
To implement the above algorithm I organize this data into two groups: 1) keyed by book, an RDD containing the readers of each book, and 2) keyed by reader, an RDD containing the books read by each reader, as in the following program:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object Small {

  case class Book(book: Int, reader: Int)
  case class BookPair(book1: Int, book2: Int, cnt: Int)

  val recs = Array(
    Book(book = 1, reader = 30),
    Book(book = 2, reader = 10),
    Book(book = 3, reader = 20),
    Book(book = 1, reader = 20),
    Book(book = 1, reader = 10),
    Book(book = 2, reader = 30))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // set up environment
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(recs)
    val bookMap = data.map(r => (r.book, r))
    val bookGrps = bookMap.groupByKey
    val readerMap = data.map(r => (r.reader, r))
    val readerGrps = readerMap.groupByKey

    // *** Calculate book pairs
    // Iterate book groups
    val allBookPairs = bookGrps.map(bookGrp => bookGrp match {
      case (book, recIter) =>
        // Iterate user groups
        recIter.toList.map(rec => {
          // Find readers for this book
          val aReader = rec.reader
          // Find all books (including this one) that this reader read
          val allReaderBooks = readerGrps.filter(readerGrp => readerGrp match {
            case (reader2, recIter2) => reader2 == aReader
          })
          val bookPairs = allReaderBooks.map(readerTuple => readerTuple match {
            case (reader3, recIter3) => recIter3.toList.map(rec => ((book, rec.book), 1))
          })
          bookPairs
        })
    })

    val x = allBookPairs.flatMap(identity)
    val y = x.map(rdd => rdd.first)
    val z = y.flatMap(identity)
    val p = z.reduceByKey((cnt1, cnt2) => cnt1 + cnt2)
    val result = p.map(bookPair => bookPair match {
      case ((book1, book2), cnt) => BookPair(book1, book2, cnt)
    })

    val resultCsv = result.map(pair => resultToStr(pair))
    resultCsv.saveAsTextFile("./result.csv")
  }

  def resultToStr(pair: BookPair): String = {
    val sep = "|"
    pair.book1 + sep + pair.book2 + sep + pair.cnt
  }
}
This implementation in fact results in a different, inefficient algorithm:
for each book
  find each reader of the book, scanning all readers every time!
  for each other_book in books of the reader
    increment common_reader_count ((book, other_book), cnt)
which contradicts the main goal of the algorithm discussed above: instead of reducing the number of operations, it increases it. Finding a user's books requires filtering all users for every book, so the number of operations is ~ N * M, where N is the number of users and M is the number of books.
Questions:
Is there any way to implement the original algorithm in Spark without filtering complete reader collection for every book?
Any other algorithms to compute book pair counts efficiently?
Also, when actually running this code I get a filter exception whose cause I cannot figure out. Any ideas?
Please see the exception log below:
15/05/29 18:24:05 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
15/05/29 18:24:05 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/05/29 18:24:09 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/29 18:24:10 INFO Remoting: Starting remoting
15/05/29 18:24:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:12 ERROR executor.Executor: Exception in task 0.0 in stage 6.0 (TID 4)
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:58)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:54)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at Small$$anonfun$4.apply(Small.scala:54)
at Small$$anonfun$4.apply(Small.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Update:
This code:
val df = sc.parallelize(
  Array((1,30), (2,10), (3,20), (1,10), (2,30))
).toDF("books", "readers")

val results = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
)
.groupBy($"books", $"r_books")
.agg($"books", $"r_books", count($"readers"))
Gives the following result:
books r_books COUNT(readers)
1 2 2
So COUNT here is the number of times two books (here, 1 and 2) were read together (the count of pairs).
This kind of thing is a lot easier if you convert the original RDD to a DataFrame:
val df = sc.parallelize(
Array((1,30),(2,10),(3,20),(1,10), (2,30))
).toDF("books","readers")
Once you do that, just do a self-join on the DataFrame to make book pairs, then count how many readers have read each book pair:
val results = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
).groupBy(
  $"books", $"r_books"
).agg(
  $"books", $"r_books", count($"readers")
)
As for additional explanation about that join, note that I am joining df back onto itself -- a self-join: df.join(df.select(...), ...). What you are looking to do is to stitch together book #1 -- $"books" -- with a second book -- $"r_books" -- from the same reader -- $"readers" === $"r_readers". But if you joined only on $"readers" === $"r_readers", you would get the same book joined back onto itself. Instead, I use $"books" < $"r_books" to ensure that the ordering in the book pairs is always (<lower_id>, <higher_id>).
Once you do the join, you get a DataFrame with a line for every reader of every book pair. The groupBy and agg functions do the actual counting of the number of readers per book pairing.
Incidentally, if a reader read the same book twice, I believe you would end up with double-counting, which may or may not be what you want. If that's not what you want, just change count($"readers") to countDistinct($"readers").
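For example, the countDistinct variant of the aggregation above (assuming the same df and $ interpolator as before):

import org.apache.spark.sql.functions.countDistinct

// Count each reader at most once per book pair, so re-reads don't double-count.
val distinctResults = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
).groupBy($"books", $"r_books")
 .agg(countDistinct($"readers") as "distinct_readers")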
If you want to know more about the agg functions count() and countDistinct() and a bunch of other fun stuff, check out the scaladoc for org.apache.spark.sql.functions