Snowflake performance with EMR Spark

I need to process data stored in S3 and store the result in Snowflake.
During several tests, I discovered a performance issue.
Common config:
Data A (stored in S3): 20 GB
Data B (stored in both S3 and Snowflake): 8.5 KB
Operation: left outer join
Using EMR (Spark), r5.4xlarge (5)
When I read Data A from S3 and Data B from Snowflake, the job took more than 1 hour 12 minutes:
val Adf = spark.read.parquet("s3://path")
val sfOptions = Map(
  "sfURL" -> "XXXXX.us-east-1.snowflakecomputing.com",
  "sfUser" -> "XXXXX",
  "sfPassword" -> "XXXX",
  "sfDatabase" -> "XXX",
  "sfSchema" -> "XXXX",
  "sfWarehouse" -> "XXXX"
)
// Data B is read from Snowflake
val Bdf: DataFrame = spark.sqlContext.read
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "XXXX")
  .load()
val resultDF = Adf.join(Bdf, Seq("CNTY"), "leftouter")
resultDF.write
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "t_result_from spark")
  .option("parallelism", "8")
  .mode(SaveMode.Overwrite)
  .save()
But when I read both Data A and Data B from S3, the job took just 10 minutes:
val Adf = spark.read.parquet("s3://path")
// Data B is read from S3 as CSV
val Bdf = spark.read.option("header", "true").csv("s3://path")
val resultDF = Adf.join(Bdf, Seq("CNTY"), "leftouter")
resultDF.write
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "t_result_from spark_2")
  .option("parallelism", "8")
  .mode(SaveMode.Overwrite)
  .save()
Why is there such a performance difference between reading Data B from S3 and reading it from Snowflake?
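One possible explanation, which the post itself does not confirm, is the join strategy: Spark can see that the 8.5 KB CSV is tiny and broadcast it, whereas it may have no usable size estimate for the Snowflake relation and fall back to a shuffle-based join of the 20 GB side. Below is a minimal sketch for checking this with an explicit broadcast hint. The names largeDf and smallDf are hypothetical stand-ins for Adf and Bdf, and the local session is only there to keep the sketch self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder
  .master("local[*]") // assumption: local session for illustration only
  .appName("broadcast-hint-sketch")
  .getOrCreate()
import spark.implicits._

// Disable automatic broadcasting to mimic a source with no usable size estimate.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical stand-ins for Adf (large, from S3) and Bdf (small, from Snowflake).
val largeDf = Seq(("US", 1), ("KR", 2), ("FR", 3)).toDF("CNTY", "value")
val smallDf = Seq(("US", "United States"), ("KR", "Korea")).toDF("CNTY", "countryName")

// Same left outer join, once without and once with a broadcast hint on the small side.
val plainJoin = largeDf.join(smallDf, Seq("CNTY"), "leftouter")
val hintedJoin = largeDf.join(broadcast(smallDf), Seq("CNTY"), "leftouter")

plainJoin.explain()  // inspect which join operator is chosen without the hint
hintedJoin.explain() // the hinted plan should use a broadcast hash join

val hintedPlan = hintedJoin.queryExecution.executedPlan.toString
```

If the plan for the real job shows a shuffle-based join on the Snowflake variant only, wrapping Bdf in broadcast(...) may be worth trying; if both plans already match, the difference lies elsewhere.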

Related

Sql Window function on whole dataframe in spark

I am working on a Spark Streaming project that consumes data from Kafka every 3 minutes. I want to calculate a moving sum of a value. Below is sample logic on an RDD-backed DataFrame that works well, and I want to know whether it will also work with Spark Streaming. I read in some docs that you have to specify a range of rows, e.g. Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1), but I want to apply the logic to the whole DataFrame. Does the logic below cover all rows of the DataFrame, or only a limited range of rows?
val customers = spark.sparkContext.parallelize(List(("Alice", "2016-05-01", 50.00),
("Alice", "2016-05-03", 45.00),
("Alice", "2016-05-04", 55.00),
("Bob", "2016-05-01", 25.00),
("Bob", "2016-05-04", 29.00),
("Bob", "2016-05-06", 27.00))).
toDF("name", "date", "amountSpent")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec1 = Window.partitionBy("name").orderBy("date")
customers.withColumn( "movingSum",
sum(customers("amountSpent")).over(wSpec1) ).show()
output
+-----+----------+-----------+---------+
| name| date|amountSpent|movingSum|
+-----+----------+-----------+---------+
| Bob|2016-05-01| 25.0| 25.0|
| Bob|2016-05-04| 29.0| 54.0|
| Bob|2016-05-06| 27.0| 81.0|
|Alice|2016-05-01| 50.0| 50.0|
|Alice|2016-05-03| 45.0| 95.0|
|Alice|2016-05-04| 55.0| 150.0|
+-----+----------+-----------+---------+
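With orderBy and no explicit frame, Spark's default frame runs from the start of the partition to the current row, which is why movingSum above is a running total per name. The sketch below contrasts that with an explicit frame spanning the whole partition. The column name totalPerName is made up for illustration, and Window.unboundedPreceding / Window.unboundedFollowing assume Spark 2.1 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder
  .master("local[*]") // assumption: local session for illustration only
  .appName("window-frame-sketch")
  .getOrCreate()
import spark.implicits._

val customers = Seq(
  ("Alice", "2016-05-01", 50.00),
  ("Alice", "2016-05-03", 45.00),
  ("Alice", "2016-05-04", 55.00),
  ("Bob", "2016-05-01", 25.00),
  ("Bob", "2016-05-04", 29.00),
  ("Bob", "2016-05-06", 27.00)
).toDF("name", "date", "amountSpent")

// Default frame with orderBy: partition start up to the current row (running sum).
val running = Window.partitionBy("name").orderBy("date")

// Explicit frame: every row of the partition, regardless of the current row.
val wholePartition = running.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = customers
  .withColumn("movingSum", sum($"amountSpent").over(running))
  .withColumn("totalPerName", sum($"amountSpent").over(wholePartition))

result.show()
```

Either way, the window only sees the rows present in the DataFrame it is applied to. If that DataFrame is built from a single micro-batch, the sums cover that batch only, not the full history of the stream.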

How to convert Iterable[(String, String, String)] to DataFrame?

I have a dataset of (String, String, String) tuples which is about 6 GB. After parsing the dataset I did a groupBy using (element => element._2) and got an RDD[(String, Iterable[(String, String, String)])]. Then, for each element of the grouped RDD, I call toList in order to convert it to a DataFrame:
val dataFrame = groupbyElement._2.toList.toDF()
But it takes a huge amount of time to save the data in Parquet format.
Is there a more efficient way to do this?
N.B. I have a five-node cluster. Each node has 28 GB RAM and 4 cores. I am using standalone mode and giving 16 GB RAM to each executor.
You can try using the dataframe/dataset methods instead of those for RDD. It can look something like this:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = Seq(
("ABC", "123", "a"),
("ABC", "321", "b"),
("BCA", "123", "c")).toDF("Col1", "Col2", "Col3")
scala> df.show
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
| ABC| 123| a|
| ABC| 321| b|
| BCA| 123| c|
+----+----+----+
import org.apache.spark.sql.functions.collect_list
val df2 = df
  .groupBy($"Col2")
  .agg(
    collect_list($"Col1") as "Col1_list",
    collect_list($"Col3") as "Col3_list")
scala> df2.show
+----+----------+---------+
|Col2| Col1_list|Col3_list|
+----+----------+---------+
| 123|[ABC, BCA]| [a, c]|
| 321| [ABC]| [b]|
+----+----------+---------+
Additionally, instead of reading the data into an RDD, you could use the DataFrame reader methods to load it as a DataFrame directly.
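If the goal of grouping is to end up with one set of Parquet files per key (an assumption, since the question does not say so explicitly), the writer can do the split itself through partitionBy, with no groupBy or toList on the driver. A minimal sketch follows; the temporary output path is hypothetical and only keeps the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]") // assumption: local session for illustration only
  .appName("partitioned-parquet-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("ABC", "123", "a"),
  ("ABC", "321", "b"),
  ("BCA", "123", "c")
).toDF("Col1", "Col2", "Col3")

// Hypothetical output location: a fresh sub-path of a temp directory.
val outputPath = Files.createTempDirectory("grouped-parquet").resolve("out").toString

// One sub-directory per distinct Col2 value, written in parallel by the executors.
df.write.partitionBy("Col2").parquet(outputPath)

// Reading back: Col2 is recovered from the directory names.
val readBack = spark.read.parquet(outputPath)
readBack.show()
```

Filtering readBack on Col2 then only touches the matching sub-directory, which is often what per-key files were meant to achieve.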

Elasticsearch to Spark Streaming

I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think I can do one of two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think the second architecture is the best, since Spark Streaming would use the indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming after Elasticsearch, must we use the model at that point (after Elasticsearch), or can we use it in Spark Streaming directly after Kafka? (Here "use" means predicting in real time.)
- Does creating models after Elasticsearch make our models static (that is, not real time)?
Thank you.
Do you mean this?
Kafka -> Spark Streaming -> Elasticsearch DB
val sqlContext = new SQLContext(sc)
//kafka group
val group_id = "receiveScanner"
// kafka topic
val topic = Map("testStreaming"-> 1)
// zk connect
val zkParams = Map(
"zookeeper.connect" ->"localhost",
"zookeeper.connection.timeout.ms" -> "10000",
"group.id" -> group_id)
// Kafka
val kafkaConsumer = KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc,zkParams,topic,StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2 )
// printer kafka data
receiveData.print()
receiveData.foreachRDD{ rdd=>
val transform = rdd.map{ line =>
val data = Json.parse(line)
// play json parse
val id = (data \ "id").asOpt[Int].getOrElse(0)
val name = (data \ "name").asOpt[String].getOrElse("")
val age = (data \ "age").asOpt[Int].getOrElse(0)
val address = (data \ "address").asOpt[String].getOrElse("")
Row(id,name,age,address)
}
val transfromrecive = sqlContext.createDataFrame(transform,schameType)
import org.apache.spark.sql.functions._
import org.elasticsearch.spark.sql._
//filter age < 20 , to ES database
transfromrecive.where(col("age").<(20)).orderBy(col("age").asc)
.saveToEs("member/user",Map("es.mapping.id" -> "id"))
}
}
/**
* dataframe schame
* */
def schameType = StructType(
StructField("id",IntegerType,false)::
StructField("name",StringType,false)::
StructField("age",IntegerType,false)::
StructField("address",StringType,false)::
Nil
)

Complete a RDD based on RDDs depending on data

I'm using Spark 2.1 on a YARN cluster. I have an RDD containing data that I would like to complete using other RDDs (which correspond to different Mongo databases that I read through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage; I don't think that matters, but I mention it just in case).
My problem is that the RDD I need for completing the data depends on the data itself, because each record names the database to use. Here is a simplified example of what I have to do:
/*
* The RDD which needs information from databases
*/
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
/*
* Artificial databases for the example. In reality, mongo-hadoop is used: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
* Generating these RDDs is expensive, so we don't want to generate every possible RDD, only the ones needed
*/
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
val generateRDDfromdbName = Map("A" -> A, "B" -> B, "C" -> C)
The desired output is:
Map(dbName -> A, id -> id8, other data -> some other other data, new data -> data8)
Map(dbName -> A, id -> id1, other data -> some data, new data -> data1)
Map(dbName -> C, id -> id6, other data -> some other data, new data -> data6)
Since nested RDDs are not possible, I would like to find the approach that makes the most of Spark's parallelism. I thought of two solutions.
The first is to create a collection with the contents of the needed databases, then convert it to an RDD to benefit from RDD scalability (if the collection doesn't fit into driver memory, I could do it in several passes). Finally, do a join and filter the content on id.
The second is to get the RDDs for all needed databases, key them by dbName and id, and then do the join.
Here is the Scala code:
Solution 1
// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Fill a list with key value pairs as (dbName,db content)
var dbContents = List[(String,Array[Map[String,String]])]()
dbList.foreach(dbName => dbContents = (dbName,generateRDDfromdbName(dbName).collect()) :: dbContents)
// Generate an RDD from this list to benefit from the advantages of RDDs
val RDDdbs = sc.parallelize(dbContents)
// Key the initial RDD by dbName and join with the contents of dbs
val joinedRDD = RDDtoDevelop.keyBy(map => map("dbName")).join(RDDdbs)
// Check for matched ids between RDD data to develop and dbContents
val result = joinedRDD.map({ case (s,(maptoDeveleop,content)) => maptoDeveleop + ("new data" -> content.find(mapContent => mapContent("id") == maptoDeveleop("id")).get("data"))})
Solution 2
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Create the list of the database RDDs keyed by (dbName, id)
var dbRDDList = List[RDD[((String,String),Map[String,String])]]()
dbList.foreach(dbName => dbRDDList = generateRDDfromdbName(dbName).keyBy(map => (dbName,map("id"))) :: dbRDDList)
// Create a RDD containing all dbRDD
val RDDdbs = sc.union(dbRDDList)
// Join the initial RDD based on the key with the dbRDDs
val joinedRDD = RDDtoDevelop.keyBy(map => (map("dbName"), map("id"))).join(RDDdbs)
// Reformat the result
val result = joinedRDD.map({ case ((dbName,id),(maptoDevelop,dbmap)) => maptoDevelop + ("new data" -> dbmap("data"))})
Both give the desired output. To my mind the second one seems better, since both the database match and the id match use Spark's parallelism, but I'm not sure. Could you please help me choose the better one or, even better, give me pointers to a better solution than mine?
Any other comment is appreciated (it's my first question on the site ;) ).
Thanks in advance,
Matt
I would suggest converting your RDDs to DataFrames; joins, distinct, and the other functions you want to apply to the data then become very easy.
DataFrames are distributed, and in addition to the DataFrame API you can use SQL queries. More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, you would no longer need the foreach and collect calls, which make your code run slowly.
An example of converting RDDtoDevelop to a DataFrame is below:
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
Converting the above RDD to dataFrame
val developColumns=RDDtoDevelop.take(1).flatMap(map=>map.keys)
val developDF = RDDtoDevelop.map{value=>
val list=value.values.toList
(list(0),list(1),list(2))
}.toDF(developColumns:_*)
And the dataFrame looks as below
+------+---+---------------------+
|dbName|id |other data |
+------+---+---------------------+
|A |id1|some data |
|C |id6|some other data |
|A |id8|some other other data|
+------+---+---------------------+
Converting your A RDD to a DataFrame is shown below.
Source code for A:
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
DataFrame code for A :
val aColumns=A.take(1).flatMap(map=>map.keys)
val aDF = A.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(aColumns:_*).withColumn("name", lit("A"))
A new column, name, holding the database name is added so that the final join with developDF matches on the correct database.
Output for DataFrame A:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id1|data1|A |
|id8|data8|A |
+---+-----+----+
You can convert B and C in similar ways.
Source for B:
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
DataFrame for B :
val bColumns=B.take(1).flatMap(map=>map.keys)
val bDF = B.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(bColumns:_*).withColumn("name", lit("B"))
Output for B :
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1bis|B |
|id5|data5 |B |
+---+--------+----+
Source for C:
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
DataFrame code for C:
val cColumns=C.take(1).flatMap(map=>map.keys)
val cDF = C.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(cColumns:_*).withColumn("name", lit("C"))
Output for C:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id2|data2|C |
|id6|data6|C |
+---+-----+----+
After the conversion, A, B and C can be merged using union
var unionDF = aDF.union(bDF).union(cDF)
Which would be
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1 |A |
|id8|data8 |A |
|id1|data1bis|B |
|id5|data5 |B |
|id2|data2 |C |
|id6|data6 |C |
+---+--------+----+
Then it's just a matter of joining developDF and unionDF, after renaming the id column of unionDF so that it can be dropped later.
unionDF = unionDF.withColumnRenamed("id", "id1")
unionDF = developDF.join(unionDF, developDF("id") === unionDF("id1") && developDF("dbName") === unionDF("name"), "left").drop("id1", "name")
Finally we have
+------+---+---------------------+-----+
|dbName|id |other data |data |
+------+---+---------------------+-----+
|A |id1|some data |data1|
|C |id6|some other data |data6|
|A |id8|some other other data|data8|
+------+---+---------------------+-----+
You can continue processing from there.
Note: the lit function requires the following import:
import org.apache.spark.sql.functions._
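The conversions above rely on value.values.toList returning the fields in the same order as the column names, which depends on Map iteration order. A minimal variant that looks each field up by key instead, and joins on column names so that no rename-and-drop step is needed, could look like the sketch below. The helper dbToDF is hypothetical, and only databases A and C are included because they are the ones the sample data references.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder
  .master("local[*]") // assumption: local session for illustration only
  .appName("map-rdd-to-df-sketch")
  .getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

val RDDtoDevelop = sc.parallelize(Seq(
  Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
  Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
  Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
val A = sc.parallelize(Seq(
  Map("id" -> "id1", "data" -> "data1"),
  Map("id" -> "id8", "data" -> "data8")))
val C = sc.parallelize(Seq(
  Map("id" -> "id2", "data" -> "data2"),
  Map("id" -> "id6", "data" -> "data6")))

// Look fields up by key so the column order never depends on Map iteration order.
val developDF = RDDtoDevelop
  .map(m => (m("dbName"), m("id"), m("other data")))
  .toDF("dbName", "id", "other data")

// Hypothetical helper: one database RDD to a DataFrame tagged with its database name.
def dbToDF(rdd: RDD[Map[String, String]], dbName: String): DataFrame =
  rdd.map(m => (m("id"), m("data"))).toDF("id", "data").withColumn("dbName", lit(dbName))

val dbDF = dbToDF(A, "A").union(dbToDF(C, "C"))

// Joining on column names keeps a single copy of dbName and id in the result.
val result = developDF.join(dbDF, Seq("dbName", "id"), "left")
result.show(false)
```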

Spark withColumn performance

I wrote some code in spark as follows:
val df = sqlContext.read.json("s3n://blah/blah.gz").repartition(200)
val newdf = df.select("KUID", "XFF", "TS","UA").groupBy("KUID", "XFF","UA").agg(max(df("TS")) as "TS" ).filter(!(df("UA")===""))
val dfUdf = udf((z: String) => {
val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser();
val readableua = parser.parse(z)
Array(readableua.getName,readableua.getOperatingSystem.getName,readableua.getDeviceCategory.getName)
})
val df1 = newdf.withColumn("useragent", dfUdf(col("UA"))) // PROBLEM LINE 1
val df2= df1.map {
case org.apache.spark.sql.Row(col1:String,col2:String,col3:String,col4:String, col5: scala.collection.mutable.WrappedArray[String]) => (col1,col2,col3,col4, col5(0), col5(1), col5(2))
}.toDF("KUID", "XFF","UA","TS","browser", "os", "device")
val dataset =df2.dropDuplicates(Seq("KUID")).drop("UA")
val mobile = dataset.filter(dataset("device") === "Smartphone" || dataset("device") === "Tablet")
mobile.write.format("com.databricks.spark.csv").save("s3n://blah/blah.csv")
Here is a sample of the input data
{"TS":"1461762084","XFF":"85.255.235.31","IP":"10.75.137.217","KUID":"JilBNVgx","UA":"Flixster/1066 CFNetwork/758.3.15 Darwin/15.4.0" }
In the code snippet above, I am reading a 2.4 GB gz file. The read takes 9 minutes. Then I group by ID and take the max timestamp. However, the line that adds a column with withColumn (PROBLEM LINE 1) takes 2 hours. This line takes a user agent string and tries to derive OS, device, and browser info. Is this the wrong way to do things?
I am running this on a 4-node AWS cluster of r3.4xlarge instances (8 cores and 122 GB memory) with the following configuration:
--executor-memory 30G --num-executors 9 --executor-cores 5
The problem here is that gzip is not splittable, and cannot be read in parallel. What happens in the background is that a single process will download the file from the bucket and then it will repartition it to distribute the data across the cluster. Please re-encode the input data to a splittable format. If the input file does not change a lot, you could for example consider bzip2 (because encoding is quite expensive and might take some time).
Update: Picking up the answer from Roberto and adding it here for the benefit of all:
You are creating a new parser for every row within the UDF: val parser: UserAgentStringParser = UADetectorServiceFactory.getResourceModuleParser(). It's probably expensive to construct, so you should construct it once outside the UDF and use it through the closure.
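A minimal sketch of that idea is below. It uses a slightly different technique than capturing the parser in the closure: a holder object with a lazy val, so the parser is built at most once per JVM and does not need to be serializable. ExpensiveParser is a hypothetical stand-in for the real UserAgentStringParser, its parse logic is made up, and the sketch assumes the real parser can safely be shared between tasks running in the same executor.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for an expensive-to-build parser such as UserAgentStringParser.
class ExpensiveParser {
  ParserHolder.constructions.incrementAndGet() // count constructions for the demo
  def parse(ua: String): String = ua.takeWhile(_ != '/') // made-up parsing logic
}

// The lazy val is initialised at most once per JVM, so once per executor.
object ParserHolder {
  val constructions = new AtomicInteger(0)
  lazy val parser = new ExpensiveParser
}

object ParserOncePerJvm {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]") // assumption: local session for illustration only
      .appName("parser-once-per-jvm-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      "Flixster/1066 CFNetwork/758.3.15 Darwin/15.4.0",
      "Mozilla/5.0 (X11; Linux x86_64)",
      "curl/7.64.1"
    ).toDF("UA")

    // The UDF only references the holder; no parser is created per row.
    val parseUa = udf((ua: String) => ParserHolder.parser.parse(ua))

    df.withColumn("product", parseUa(col("UA"))).show(false)
    println(s"parsers constructed: ${ParserHolder.constructions.get()}")
    spark.stop()
  }
}
```

In the original code the equivalent change would be to replace the per-row getResourceModuleParser() call inside the UDF with a reference to such a holder.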
