I'm using spark 2.1 on yarn cluster. I have a RDD that contains data I would like to complete based on other RDDs (which correspond to different mongo databases that I get through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, but I don't think that is important, just mention it in case)
My problem is that the RDD I have to use to complete data depends on data itself because data contain the database to use. Here is a simplified exemple of what I have to do :
/*
* The RDD which needs information from databases
*/
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
/*
* Artificial databases for the exemple. Actually, mongo-hadoop is used. https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
* This means that generate these RDDs COSTS so we don't want to generate all possible RDDs but only needed ones
*/
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
val generateRDDfromdbName = Map("A" -> A, "B" -> B, "C" -> C)
and the wanted output is :
Map(dbName -> A, id -> id8, other data -> some other other data, new data -> data8)
Map(dbName -> A, id -> id1, other data -> some data, new data -> data1)
Map(dbName -> C, id -> id6, other data -> some other data, new data -> data6)
Since nested RDDs are not possible, I would like to find the best way to use as possible as I can for Spark paralellism. I thought about 2 solutions.
First is creating a collection with the contents of the needed db, then convert it to RDD to benefit of RDD scalability (if the collection doesn't fit into driver memory, I could do it in several times). Finally do a join and filter the content on id.
Second is get the RDDs from all needed databases, key them by dbname and id and then do the join.
Here is the scala code :
Solution 1
// Get all needed DB
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Fill a list with key value pairs as (dbName,db content)
var dbContents = List[(String,Array[Map[String,String]])]()
dbList.foreach(dbName => dbContents = (dbName,generateRDDfromdbName(dbName).collect()) :: dbContents)
// Generate a RDD from this list to benefit to advantages of RDD
val RDDdbs = sc.parallelize(dbContents)
// Key the initial RDD by dbName and join with the contents of dbs
val joinedRDD = RDDtoDevelop.keyBy(map => map("dbName")).join(RDDdbs)
// Check for matched ids between RDD data to develop and dbContents
val result = joinedRDD.map({ case (s,(maptoDeveleop,content)) => maptoDeveleop + ("new data" -> content.find(mapContent => mapContent("id") == maptoDeveleop("id")).get("data"))})
Solution 2
val dbList = RDDtoDevelop.map(map => map("dbName")).distinct().collect()
// Create the list of the database RDDs keyed by (dbName, id)
var dbRDDList = List[RDD[((String,String),Map[String,String])]]()
dbList.foreach(dbName => dbRDDList = generateRDDfromdbName(dbName).keyBy(map => (dbName,map("id"))) :: dbRDDList)
// Create a RDD containing all dbRDD
val RDDdbs = sc.union(dbRDDList)
// Join the initial RDD based on the key with the dbRDDs
val joinedRDD = RDDtoDevelop.keyBy(map => (map("dbName"), map("id"))).join(RDDdbs)
// Reformate the result
val result = joinedRDD.map({ case ((dbName,id),(maptoDevelop,dbmap)) => maptoDevelop + ("new data" -> dbmap("data"))})
Both of them give the wanted output. To my mind, second one seems better since the match of the db and of the id use the paralellism of Spark, but I'm not sure of that. Could you please help me to choose the best, or even better, give me clues for a better solution than mines.
Any other comment is appreciated ( It's my first question on the site ;) ).
Thanks by advance,
Matt
I would suggest you to convert your RDDs to dataframes and then joins, distinct and other functions that you would want to apply to the data would be very easy.
Dataframes are distributed and with addition to dataframe apis, sql queries can be used. More information can be found in Spark SQL, DataFrames and Datasets Guide and Introducing DataFrames in Apache Spark for Large Scale Data Science Moreover your need of foreach and collect functions which makes your code run slow won't be needed.
Example to convert RDDtoDevelop to dataframe is as below
val RDDtoDevelop = sc.parallelize(Array(
Map("dbName" -> "A", "id" -> "id1", "other data" -> "some data"),
Map("dbName" -> "C", "id" -> "id6", "other data" -> "some other data"),
Map("dbName" -> "A", "id" -> "id8", "other data" -> "some other other data")))
.cache()
Converting the above RDD to dataFrame
val developColumns=RDDtoDevelop.take(1).flatMap(map=>map.keys)
val developDF = RDDtoDevelop.map{value=>
val list=value.values.toList
(list(0),list(1),list(2))
}.toDF(developColumns:_*)
And the dataFrame looks as below
+------+---+---------------------+
|dbName|id |other data |
+------+---+---------------------+
|A |id1|some data |
|C |id6|some other data |
|A |id8|some other other data|
+------+---+---------------------+
Coverting your A rdd to dataframe is as below
Source code for A:
val A = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1"),
Map("id" -> "id8", "data" -> "data8")
))
DataFrame code for A :
val aColumns=A.take(1).flatMap(map=>map.keys)
val aDF = A.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(aColumns:_*).withColumn("name", lit("A"))
A new column name is added with database name to have the correct join at the end with developDF.
Output for DataFrame A:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id1|data1|A |
|id8|data8|A |
+---+-----+----+
You can convert B and C in similar ways.
Source for B:
val B = sc.parallelize(Array(
Map("id" -> "id1", "data" -> "data1bis"),
Map("id" -> "id5", "data" -> "data5")
))
DataFrame for B :
val bColumns=B.take(1).flatMap(map=>map.keys)
val bDF = B.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(bColumns:_*).withColumn("name", lit("B"))
Output for B :
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1bis|B |
|id5|data5 |B |
+---+--------+----+
Source for C:
val C = sc.parallelize(Array(
Map("id" -> "id2", "data" -> "data2"),
Map("id" -> "id6", "data" -> "data6")
))
DataFrame code for C:
val cColumns=C.take(1).flatMap(map=>map.keys)
val cDF = C.map{value =>
val list=value.values.toList
(list(0),list(1))
}.toDF(cColumns:_*).withColumn("name", lit("C"))
Output for C:
+---+-----+----+
|id |data |name|
+---+-----+----+
|id2|data2|C |
|id6|data6|C |
+---+-----+----+
After the conversion, A, B and C can be merged using union
var unionDF = aDF.union(bDF).union(cDF)
Which would be
+---+--------+----+
|id |data |name|
+---+--------+----+
|id1|data1 |A |
|id8|data8 |A |
|id1|data1bis|B |
|id5|data5 |B |
|id2|data2 |C |
|id6|data6 |C |
+---+--------+----+
Then its just joining the developDF and unionDF after renaming of id column of unionDF for dropping it later on.
unionDF = unionDF.withColumnRenamed("id", "id1")
unionDF = developDF.join(unionDF, developDF("id") === unionDF("id1") && developDF("dbName") === unionDF("name"), "left").drop("id1", "name")
Finally we have
+------+---+---------------------+-----+
|dbName|id |other data |data |
+------+---+---------------------+-----+
|A |id1|some data |data1|
|C |id6|some other data |data6|
|A |id8|some other other data|data8|
+------+---+---------------------+-----+
You can do the needful after that.
Note : lit function would work with following import
import org.apache.spark.sql.functions._
Related
I need to process the data stored in s3 and store it in snowflake.
During several tests, I discovered a performance issue.
common config
Data A (stored in s3): 20GB
Data B (stored in s3 and snowflake): 8.5KB
Operation: left outer join
Using EMR(spark) r5.4xlarge(5)
when i read Data A and Data B(snowflake), it elapsed more than 1 hour, 12 mins
val Adf= spark.read.parquet("s3://path")
var sfOptions = Map.apply(
"sfURL" -> "XXXXX.us-east-1.snowflakecomputing.com",
"sfUser" -> XXXXX",
"sfPassword" -> "XXXX",
"sfDatabase" -> "XXX",
"sfSchema" -> "XXXX",
"sfWarehouse" -> "XXXX"
)
val Bdf: DataFrame = spark.sqlContext.read
.format(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable","XXXX")
.load()
val resultDF = Adf.join(Bdf, Seq("CNTY"), "leftouter")
resultDF.write
.fomat(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable","t_result_from spark")
.opton("parallelism","8")
.mode(SaveMode.Overwrite)
.save()
but when i read Data A and Data B(s3), it elapsed just 10 mins.
val Adf= spark.read.parquet("s3://path")
val Bdf spark.read.option("header","true").csv("s3://path")
val resultDF = Adf.join(Bdf, Seq("CNTY"), "leftouter")
resultDF.write
.fomat(SNOWFLAKE_SOURCE_NAME)
.options(sfOptions)
.option("dbtable","t_result_from spark_2")
.opton("parallelism","8")
.mode(SaveMode.Overwrite)
.save()
why is there a different performance between reading on S3 and reading on snowflake?
I have a dataset of (String, String, String) which is about 6GB. After parsing the dataset I did groupby using (element => element._2) and got RDD[(String, Iterable[String, String, String])]. Then foreach element in groupby I am doing toList in-order to convert it to DataFrame.
val dataFrame = groupbyElement._2.toList.toDF()
But It is taking a huge amount of time to save data as parquet file format.
Is there any efficient way I can use?
N.B. I have five node cluster. Each node has 28 GB RAM and 4 cores. I am using standalone mode and giving 16 GB RAM to each executor.
You can try using the dataframe/dataset methods instead of those for RDD. It can look something like this:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = Seq(
("ABC", "123", "a"),
("ABC", "321", "b"),
("BCA", "123", "c")).toDF("Col1", "Col2", "Col3")
scala> df.show
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
| ABC| 123| a|
| ABC| 321| b|
| BCA| 123| c|
+----+----+----+
val df2 = df
.groupBy($"Col2")
.agg(
collect_list($"Col1") as "Col1_list"),
collect_list($"Col3") as "Col3_list"))
scala> df2.show
+----+----------+---------+
|Col2| Col1_list|Col3_list|
+----+----------+---------+
| 123|[ABC, BCA]| [a, c]|
| 321| [ABC]| [b]|
+----+----------+---------+
Additionally, instead of reading the data into a RDD you could make use of the methods to get a dataframe directly.
I'm analyzing logs and I have this architecture:
kafka->spark streaming -> elastic search
My main goal is to create machine learning models in streaming. I think that I can do two things:
1) Kafka->spark Streaming (ML) -> elastic search
2) Kafka->spark Streaming-> elasticsearch -> spark streaming(ML)
-I think that the second architecture is the best since spark streaming will use indexed data directely. What do you think? is that correct?
-Can we easly connecte spark streaming to elasticsearch in real time?
-If we create a model in spark streaming (after elastic search) must we use this model in this place (after elasticsearch) or we can use it in spark streaming (directery after kafka) ? #use== predict in real time
-Does creating models after elasticsearch made our models static (or not in the real time approch)
Thank you.
You mean that?
kafka -> spark Streaming -> elasticsearch db
val sqlContext = new SQLContext(sc)
//kafka group
val group_id = "receiveScanner"
// kafka topic
val topic = Map("testStreaming"-> 1)
// zk connect
val zkParams = Map(
"zookeeper.connect" ->"localhost",
"zookeeper.connection.timeout.ms" -> "10000",
"group.id" -> group_id)
// Kafka
val kafkaConsumer = KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc,zkParams,topic,StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2 )
// printer kafka data
receiveData.print()
receiveData.foreachRDD{ rdd=>
val transform = rdd.map{ line =>
val data = Json.parse(line)
// play json parse
val id = (data \ "id").asOpt[Int] match { case Some(x) => x; case None => 0}
val name = ( data \ "name" ).asOpt[String] match { case Some(x)=> x ; case None => "" }
val age = (data \ "age").asOpt[Int] match { case Some(x) => x; case None => 0}
val address = ( data \ "address" ).asOpt[String] match { case Some(x)=> x ; case None => "" }
Row(id,name,age,address)
}
val transfromrecive = sqlContext.createDataFrame(transform,schameType)
import org.apache.spark.sql.functions._
import org.elasticsearch.spark.sql._
//filter age < 20 , to ES database
transfromrecive.where(col("age").<(20)).orderBy(col("age").asc)
.saveToEs("member/user",Map("es.mapping.id" -> "id"))
}
}
/**
* dataframe schame
* */
def schameType = StructType(
StructField("id",IntegerType,false)::
StructField("name",StringType,false)::
StructField("age",IntegerType,false)::
StructField("address",StringType,false)::
Nil
)
This question already has answers here:
How to remove parentheses around records when saveAsTextFile on RDD[(String, Int)]?
(6 answers)
Closed 6 years ago.
How do I remove the parenthesis "(" and ")" from the output by the below spark job?
When I try to read the spark output using PigScript it creates a problem.
My code:
scala> val words = Array("HI","HOW","ARE")
words: Array[String] = Array(HI, HOW, ARE)
scala> val wordsRDD = sc.parallelize(words)
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25
scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27
scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
Output as per above code :
hadoop dfs -cat /user/cloudera/outputfiles/part*
(HOW,1)
(ARE,1)
(HI,1)
But I want the output of spark to be stored as below as without parenthesis
HOW,1
ARE,1
HI,1
Now I want to read the above output using a PigScript.
LOAD statement in Pigscript treats "(HOW" as first atom and "1)" as second atom
Is there anyway we can get rid off parenthesis in spark code itself as I don't want to apply the fix for this in pigscript..
Pig script :
records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray);
dump records;
Pig output :
((HOW)
((ARE)
((HI)
Use map transformation before you save the records to outputfiles directory, e.g.
wordcountRDD.map { case (k, v) => s"$k, $v" }.saveAsTextFile("/user/cloudera/outputfiles")
See Spark's documentation about map.
I strongly recommend using Datasets instead.
scala> words.toSeq.toDS.groupBy("value").count().show
+-----+-----+
|value|count|
+-----+-----+
| HOW| 1|
| ARE| 1|
| HI| 1|
+-----+-----+
scala> words.toSeq.toDS.groupBy("value").count.write.csv("outputfiles")
$ cat outputfiles/part-00199-aa752576-2f65-481b-b4dd-813262abb6c2-c000.csv
HI,1
See Spark SQL, DataFrames and Datasets Guide.
This format is a format of Tuple. You can manually define your format:
val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
// here we set custom format
.map(x => x._1 + "," + x._2)
wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
I am trying to read this json file into a hive table, the top level keys i.e. 1,2.., here are not consistent.
{
"1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
"2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
"position": 0
}
I only need the time and readings 1,2 in my hive table as columns ignore position.
I can also do a combo of hive query and spark map-reduce code.
Thank you for the help.
Update , here is what I am trying
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work, if you rename "1" and "2" (key names) to "x1" and "x2" (inside the json file or in the rdd):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading1, x2.time, x2.reading1, x2.reading2 from table123 ")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
resultrdd.flatMap(row => (Array( Record(row.getLong(0),row.getDouble(1),row.getDouble(2)),
Record(row.getLong(3),row.getDouble(4),row.getDouble(5)) )))
val schrdd = sqlContext.createSchemaRDD(resultrdd)
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap(row=>{
var recs = Array[Record]();
for(col <- (0 to row.length-1)) {
row(col) match {
case r:Row => recs = recs :+ Record(r.getLong(2),r.getDouble(0),r.getDouble(1));
case _ => ;
}
};
recs
})