Elasticsearch to Spark Streaming

I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think I can do one of two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think the second architecture is best, since Spark Streaming will work on the indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming (after Elasticsearch), must we use that model in the same place (after Elasticsearch), or can we use it in Spark Streaming (directly after Kafka)? Here "use" means predict in real time.
- Does creating models after Elasticsearch make our models static (i.e. no longer a real-time approach)?
Thank you.
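On the third question, one thing a short sketch can make concrete is that a model does not have to be scored where it was trained: a model fitted on data pulled from Elasticsearch can be persisted and then loaded inside the Kafka-facing streaming job for real-time prediction. A minimal sketch, assuming MLlib KMeans, a hypothetical model path, an existing SparkContext sc, and a DStream of already-extracted feature vectors:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Assumption: a KMeans model was trained elsewhere (e.g. on data exported from
// Elasticsearch) and saved to a shared location; the path below is hypothetical.
val model = KMeansModel.load(sc, "hdfs:///models/log-clusters")

// features: DStream[Array[Double]] produced from the Kafka stream (assumed to exist).
// The loaded model is applied to every incoming record, so prediction stays real time
// even though training happened downstream of Elasticsearch.
val predictions = features.map(f => model.predict(Vectors.dense(f)))
predictions.print()

The trade-off the last question points at is real, though: a model trained on indexed data is only as fresh as its last training run, so it has to be retrained and reloaded periodically if the data drifts.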

You mean this?
Kafka -> Spark Streaming -> Elasticsearch
import kafka.serializer.StringDecoder
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import play.api.libs.json.Json

val sqlContext = new SQLContext(sc)
// Kafka consumer group
val group_id = "receiveScanner"
// Kafka topic
val topic = Map("testStreaming" -> 1)
// ZooKeeper connection
val zkParams = Map(
  "zookeeper.connect" -> "localhost",
  "zookeeper.connection.timeout.ms" -> "10000",
  "group.id" -> group_id)
// Kafka receiver stream
val kafkaConsumer = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, zkParams, topic, StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2)
// Print the raw Kafka data
receiveData.print()

receiveData.foreachRDD { rdd =>
  val transform = rdd.map { line =>
    // Parse each record with Play JSON, falling back to defaults for missing fields
    val data = Json.parse(line)
    val id      = (data \ "id").asOpt[Int].getOrElse(0)
    val name    = (data \ "name").asOpt[String].getOrElse("")
    val age     = (data \ "age").asOpt[Int].getOrElse(0)
    val address = (data \ "address").asOpt[String].getOrElse("")
    Row(id, name, age, address)
  }
  val transformedDf = sqlContext.createDataFrame(transform, schemaType)

  import org.apache.spark.sql.functions._
  import org.elasticsearch.spark.sql._
  // Keep only rows with age < 20 and write them to the Elasticsearch "member/user" index
  transformedDf.where(col("age") < 20).orderBy(col("age").asc)
    .saveToEs("member/user", Map("es.mapping.id" -> "id"))
}

/**
 * DataFrame schema
 */
def schemaType = StructType(
  StructField("id", IntegerType, false) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) ::
  StructField("address", StringType, false) ::
  Nil
)
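For the reverse direction in option 2 (reading the indexed data back into Spark for model building), the elasticsearch-hadoop connector can expose an index as a DataFrame. A minimal sketch, assuming the member/user index written above and a local Elasticsearch node:

import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

// Read the documents indexed above back into a DataFrame.
// "es.nodes" points at the Elasticsearch host; adjust for your cluster.
val sqlContext = new SQLContext(sc)
val memberDf = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .load("member/user")

memberDf.filter(memberDf("age") < 20).show()

Note that this reads the index as it exists at query time, so anything trained on it is effectively batch-trained on indexed data rather than fitted on the live stream.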

Related

Number of partitions scanned(=32767) exceeds limit

I'm trying to use the Eel SDK to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
  .withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps: HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))
hiveOps.createTable(
  testDBName,
  testTableName,
  schema,
  partitionKeys = Seq("pk", "pk1"),
  dialect = ParquetHiveDialect(),
  tableType = TableType.EXTERNAL_TABLE,
  overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
I'm getting the error: Number of partitions scanned (=32767) exceeds limit (=10000).
32767 is 2^15 - 1 (Short.MaxValue), but I still can't figure out what is wrong. Any idea?
A related question, Spark + Hive: Number of partitions scanned exceeds limit (=4000), mentions these settings:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Spark kafka streaming processing time increases exponentially

I have a spark kafka streaming job. Below is the main job processing logic.
val processedStream = rawStream.transform(x => {
  var offsetRanges = x.asInstanceOf[HasOffsetRanges].offsetRanges;
  val spark = SparkSession.builder.config(x.sparkContext.getConf).getOrCreate();
  val parsedRDD = x.map(cr => cr.value());
  var df = spark.sqlContext.read.schema(KafkaRawEvent.getStructure()).json(parsedRDD);

  // Explode Events array as individual Event
  if (DFUtils.hasColumn(df, "events")) {
    // Rename the dow and hour
    if (DFUtils.hasColumn(df, "dow"))
      df = df.withColumnRenamed("dow", "hit-dow");
    if (DFUtils.hasColumn(df, "hour"))
      df = df.withColumnRenamed("hour", "hit-hour");

    df = df
      .withColumn("event", explode(col("events")))
      .drop("events");

    if (DFUtils.hasColumn(df, "event.key")) {
      df = df.select(
        "*", "event.key",
        "event.count", "event.hour",
        "event.dow", "event.sum",
        "event.timestamp",
        "event.segmentation");
    }
    if (DFUtils.hasColumn(df, "key")) {
      df = df.filter("key != '[CLY]_view'");
    }

    df = df.select("*", "segmentation.*")
      .drop("segmentation")
      .drop("event");

    if (DFUtils.hasColumn(df, "metrics")) {
      df = df.select("*", "metrics.*").drop("metrics");
    }

    df = df.withColumnRenamed("timestamp", "eventTimeString");
    df = df.withColumn("eventtimestamp", df("eventTimeString").cast(LongType).divide(1000).cast(TimestampType).cast(DateType))
      .withColumn("date", current_date());

    if (DFUtils.hasColumn(df, "restID")) {
      df = df.join(broadcast(restroCached), df.col("restID") === restro.col("main_r_id"), "left_outer");
    }

    val SAVE_PATH = Conf.getSavePath();

    // Write dataframe to file
    df.write.partitionBy("date").mode("append").parquet(SAVE_PATH);

    // filter out app launch events and group by adId and push to kafka
    val columbiaDf = df.filter(col("adId").isNotNull)
      .select(col("adId")).distinct().toDF("cui").toJSON;

    // push columbia df to kafka for further processing
    columbiaDf.foreachPartition(partitionOfRecords => {
      val factory = columbiaProducerPool.value;
      val producer = factory.getOrCreateProducer();
      partitionOfRecords.foreach(record => {
        producer.send(record);
      });
    });

    df.toJSON
      .foreachPartition(partitionOfRecords => {
        val factory = producerPool.value;
        val producer = factory.getOrCreateProducer();
        partitionOfRecords.foreach(record => {
          producer.send(record);
        });
      });
  }

  rawStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges);
  df.toJSON.rdd;
});

val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60));

windowOneHourEveryMinute.foreachRDD(windowRDD => {
  val prefix = Conf.getAnalyticsPrefixesProperties().getProperty("view.all.last.3.hours");
  val viewCount = spark.sqlContext.read.schema(ProcessedEvent.getStructure()).json(windowRDD)
    .filter("key == 'RestaurantView'")
    .groupBy("restID")
    .count()
    .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
  spark.sparkContext.toRedisKV(viewCount, Conf.getRedisKeysTTL());
});

streamingContext.start();
streamingContext.awaitTermination();
This job had been running for almost a month without a single failure; today the processing time suddenly started increasing exponentially, although there are no events being processed.
I am not able to figure out why this is happening. Below I am attaching a screenshot of the application master.
Below is the graph of processing time.
From the Jobs tab in the Spark UI, most of the time is spent on the line .rdd.map(r => (prefix + String.valueOf(r.get(0)), String.valueOf(r.get(1))));
Below is the DAG for the stage.
This is just a small fraction of the DAG. The actual DAG is quite large, but the same tasks are repeated up to the total number of RDDs in the window, i.e. I am running a 1-minute batch, so a 30-minute window has 30 of the same repetitive tasks.
Is there any concrete reason why the processing time suddenly started increasing exponentially?
Spark version: 2.2.0
Hadoop version: 2.7.3
NOTE: I am running this job on an EMR 5.8 cluster with 1 driver of 2.5 GB and 1 executor of 3.5 GB.
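One detail worth illustrating about those repeated tasks: a windowed DStream re-evaluates the lineage of every batch it covers unless the parent stream is persisted, so the per-window work grows with the amount of recomputation. A sketch of persisting and checkpointing the parent stream before windowing (a general Spark Streaming technique, not a confirmed diagnosis of this particular job):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, Seconds}

// Materialize each batch of processedStream once so the 60-minute window reuses
// the cached RDDs instead of re-running the whole transform lineage, and
// checkpoint periodically to truncate that lineage (this requires a checkpoint
// directory to be set on the StreamingContext).
processedStream.persist(StorageLevel.MEMORY_AND_DISK_SER)
processedStream.checkpoint(Minutes(10))

val windowOneHourEveryMinute = processedStream.window(Minutes(60), Seconds(60))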

join in Spark outputs wrong result whereas map-side join is correct

My Spark version is 1.2.0, and here's the scenario:
There are two RDDs, RDD_A and RDD_B, both with the structure RDD[(spid, the_same_spid)]. RDD_A has 20,000 lines whereas RDD_B has 3,000,000,000. I intend to count the lines of RDD_B whose 'spid' exists in RDD_A.
My first implementation is quite mainstream, joining RDD_B with RDD_A:
val currentDay = args(0)
val conf = new SparkConf().setAppName("Spark-MonitorPlus-LogStatistic")
val sc = new SparkContext(conf)
//---RDD A transforming to RDD[(spid, spid)]---
val spidRdds = sc.textFile("/diablo/task/spid-date/" + currentDay + "-spid-media")
  .map(line => line.split(",")(0).trim).map(spid => (spid, spid)).partitionBy(new HashPartitioner(32));
val logRdds: RDD[(LongWritable, Text)] = MzFileUtils.getFileRdds(sc, currentDay, "")
val logMapRdds = MzFileUtils.mapToMzlog(logRdds)
//---RDD B transforming to RDD[(spid, spid)]---
val tongYuanRdd = logMapRdds.filter(kvs => kvs("plt") == "0" && kvs("tp") == "imp").map(kvs => kvs("p").trim).map(spid => (spid, spid)).partitionBy(new HashPartitioner(32));
//---join---
val filteredTongYuanRdd = tongYuanRdd.join(spidRdds);
println("Total TongYuan Imp: " + filteredTongYuanRdd.count())
However, the result is incorrect (larger) when compared with the Hive result. After changing the join from a reduce-side join to a map-side join as below, the result is the same as Hive's:
val conf = new SparkConf().setAppName("Spark-MonitorPlus-LogStatistic")
val sc = new SparkContext(conf)
//---RDD A transforming to RDD[(spid, spid)]---
val spidRdds = sc.textFile("/diablo/task/spid-date/" + currentDay + "-spid-media")
  .map(line => line.split(",")(0).trim).map(spid => (spid, spid)).partitionBy(new HashPartitioner(32));
val logRdds: RDD[(LongWritable, Text)] = MzFileUtils.getFileRdds(sc, currentDay, "")
val logMapRdds = MzFileUtils.mapToMzlog(logRdds)
//---RDD B transforming to RDD[(spid, spid)]---
val tongYuanRdd = logMapRdds.filter(kvs => kvs("plt") == "0" && kvs("tp") == "imp").map(kvs => kvs("p").trim).map(spid => (spid, spid)).partitionBy(new HashPartitioner(32));
//---join---
val globalSpids = sc.broadcast(spidRdds.collectAsMap());
val filteredTongYuanRdd = tongYuanRdd.mapPartitions({ iter =>
  val m = globalSpids.value
  for {
    (spid, spid_cp) <- iter
    if m.contains(spid)
  } yield spid
}, preservesPartitioning = true);
println("Total TongYuan Imp: " + filteredTongYuanRdd.count())
As you can see, the only difference between the two snippets is the 'join' part.
So, are there any suggestions for addressing this problem? Thanks in advance!
Spark's join doesn't enforce uniqueness of keys; when a key is duplicated, it actually outputs the cross product for that key. Using cogroup and emitting only one k/v pair per key, or mapping to just the ids and then using intersection, will do the trick.
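A sketch of the cogroup variant described above, reusing the names from the first snippet; duplicate spids on the RDD_A side no longer inflate the count:

// cogroup collects the values from both sides per key, so a spid that appears
// several times in spidRdds contributes each RDD_B record only once,
// instead of producing a cross product as join does.
val filteredTongYuanRdd = tongYuanRdd.cogroup(spidRdds).flatMap {
  case (spid, (bRecords, aRecords)) =>
    if (aRecords.nonEmpty) bRecords.map(_ => spid) else Iterable.empty[String]
}
println("Total TongYuan Imp: " + filteredTongYuanRdd.count())

Whether this or the intersection variant matches the Hive number depends on whether the Hive query counts every RDD_B line or only distinct spids.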

read json key-values with hive/sql and spark

I am trying to read this JSON file into a Hive table; the top-level keys, i.e. 1, 2, ..., are not consistent.
{
"1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
"2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
"position": 0
}
I only need the time and readings 1 and 2 as columns in my Hive table; position can be ignored.
I can also use a combination of a Hive query and Spark map-reduce code.
Thank you for the help.
Update: here is what I am trying
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work if you rename "1" and "2" (the key names) to "x1" and "x2" (inside the JSON file or in the RDD):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading2, x2.time, x2.reading1, x2.reading2 from table123 ")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
resultrdd.flatMap(row => (Array( Record(row.getLong(0),row.getDouble(1),row.getDouble(2)),
Record(row.getLong(3),row.getDouble(4),row.getDouble(5)) )))
val schrdd = sqlContext.createSchemaRDD(resultrdd)
Update:
If there are many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap(row => {
  var recs = Array[Record]();
  for (col <- (0 to row.length - 1)) {
    row(col) match {
      case r: Row => recs = recs :+ Record(r.getLong(2), r.getDouble(0), r.getDouble(1));
      case _ => ;
    }
  };
  recs
})

Slow performance in spark streaming

I am using Spark Streaming 1.1.0 locally (not in a cluster).
I created a simple app that parses the data (about 10,000 entries), stores it in a stream, and then makes some transformations on it. Here is the code:
def main(args: Array[String]) {
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))
  val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
  val parsedStream = parse(stream)

  parsedStream.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE STARTS " + System.currentTimeMillis()))

  val result1 = parsedStream
    .filter(entry => entry.symbol.contains("walking")
      && entry.symbol.contains("true") && entry.symbol.contains("id0"))
    .map(_.time)

  val result2 = parsedStream
    .filter(entry =>
      entry.symbol == "disappear" && entry.symbol.contains("id0"))
    .map(_.time)

  val result3 = result1
    .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))

  result3.foreachRDD(rdd =>
    println(rdd.first() + "\nRULE ENDS " + System.currentTimeMillis()))

  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
      tuple match {
        case pattern(symbol, time) =>
          new Data(symbol, time.toInt)
      }
    }
  }
}

case class Data(symbol: String, time: Int)
I set a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark would be very fast, but in this case it takes about 3.5 s to execute the rule (between "RULE STARTS" and "RULE ENDS"). Am I doing something wrong, or is this the expected time? Any advice?
So I was using case matching in a lot of my jobs and it killed performance, more than when I introduced a JSON parser. Also try tweaking the batch time on the StreamingContext; it made quite a bit of difference for me. Also, how many local workers do you have?
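To make the point about per-record matching cost concrete: in the parse function above, the regex is recompiled for every entry, and any entry that does not match throws a MatchError and fails the batch. A sketch of one way to avoid both, reusing the Data case class from the question (an illustration of the idea, not a benchmarked fix):

import org.apache.spark.streaming.dstream.DStream

object Parser extends Serializable {
  // Compiled once per JVM instead of once per record.
  private val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r

  def parse(stream: DStream[String]): DStream[Data] =
    stream.flatMap { line =>
      // collect keeps only the entries that match the pattern and skips the rest,
      // instead of throwing a MatchError on malformed input.
      line.split("assert")
        .filter(_.nonEmpty)
        .collect { case pattern(symbol, time) => Data(symbol, time.toInt) }
    }
}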
