How to convert DStream[(Array[String], Long)] to a DataFrame in Spark Streaming

I am trying to convert my DStream to a DataFrame. Here is the code that I am using:
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "ffff.dl.uk.fff.com:8002",
  "security.protocol" -> "SASL_PLAINTEXT",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("mytopic")
val from_kafkastream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
val strmk = from_kafkastream.map(record =>
  (record.value, record.timestamp))

val splitup2 = strmk.map { case (line1, line2) =>
  (line1.split(","), line2)
}

case class Record(name: String, trQ: String, traW: String, traNS: String,
                  traned: String, tranS: String, transwer: String, trABN: String,
                  kafkatime: Long)
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
splitup2.foreachRDD((rdd) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  spark.sparkContext.setLogLevel("ERROR")
  import sqlContext.implicits._
  val requestsDataFrame = rdd.map(w => Record(w(0).toString,
    w(1).toString, w(2).toString, w(3).toString, w(4).toString,
    w(5).toString, w(6).toString, w(7).toString, w(8).toString)).toDF()
  // am getting issue here
  requestsDataFrame.show()
})
ssc.start()
I am getting an error at the line marked above.
Can someone help me convert my DStream to a DataFrame? I am new to the Spark world.

The mistake is probably in how you build the Record object: you never pass the kafkatime (only string values), and since each element is a tuple you cannot index into the array that way.
You can try this:
import sqlContext.implicits._

val requestsDataFrame = rdd.map(w => Record(
  w._1(0), w._1(1), w._1(2), w._1(3),
  w._1(4), w._1(5), w._1(6), w._1(7),
  w._2)).toDF()
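Putting it together, here is a minimal sketch of the whole foreachRDD block with the corrected tuple/array indexing. It assumes the Record case class and SQLContextSingleton from the question are defined at the top level of your object (outside any method), which is the usual requirement for the toDF() implicits to resolve the case class:

splitup2.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._

  // w._1 is the Array[String] produced by split(","), w._2 is the Kafka timestamp (Long)
  val requestsDataFrame = rdd.map { w =>
    Record(w._1(0), w._1(1), w._1(2), w._1(3),
           w._1(4), w._1(5), w._1(6), w._1(7), w._2)
  }.toDF()

  requestsDataFrame.show()
}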

Related

Exceptions not captured in opencsv parsing

I'm trying to parse a CSV file and map it onto a data class. I've set up some validations for the columns, and I'm testing them by sending incorrect values for those columns. opencsv throws a generic exception:
Basic instantiation of the given bean type (and subordinate beans created through recursion, if applicable) was determined to be impossible.
Code details
Data class:
data class UserInfo(
    @CsvBindByName(column = "Id", required = true) val id: Long,
    @CsvBindByName(column = "FirstName", required = true) val firstName: String,
    @CsvBindByName(column = "LastName", required = true) val lastName: String,
    @CsvBindByName(column = "Email", required = true) val email: String,
    @CsvBindByName(column = "PhoneNumber", required = true) val phoneNumber: String,
    @PreAssignmentValidator(
        validator = MustMatchRegexExpression::class, paramString = "^[0-9]{10}$")
    @CsvBindByName(column = "Age", required = true)
    val age: Int
)
CSV parsing logic:
fun uploadCsvFile(file: MultipartFile): List<UserInfo> {
    throwIfFileEmpty(file)
    var fileReader: BufferedReader? = null
    try {
        fileReader = BufferedReader(InputStreamReader(file.inputStream))
        val csvToBean = createCSVToBean(fileReader)
        val mappingStrategy: HeaderColumnNameMappingStrategy<Any> =
            HeaderColumnNameMappingStrategy<Any>()
        mappingStrategy.type = UserInfo::class.java
        val userInfos = csvToBean.parse()
        userInfos.stream().forEach { user -> println("Parsed data:$user") }
        csvToBean.capturedExceptions.stream().forEach { ex -> println(ex.message) }
        return userInfos
    } catch (ex: Exception) {
        throw CsvImportException("Error during csv import")
    } finally {
        closeFileReader(fileReader)
    }
}

private fun createCSVToBean(fileReader: BufferedReader?): CsvToBean<UserInfo> =
    CsvToBeanBuilder<UserInfo>(fileReader)
        .withType(UserInfo::class.java)
        .withThrowExceptions(false)
        .withIgnoreLeadingWhiteSpace(true)
        .build()
I'm looking for the proper error message for the validation / missing-field failures so that I can include it in the error response.

Can @SqlResultSetMapping be used to map a complex DTO object

I currently have a named native query set up in a CrudRepository where I'm joining a few tables, and I need to map the query result into a DTO.
select
event_id, replaced_by_match_id, scheduled, start_time_tbd, status, away_team_competitor_id, home_team_competitor_id, round_round_id, season_season_id, tournament_tournament_id, venue_venue_id,
competitorHome.competitor_id as home_competitor_competitor_id, competitorHome.abbreviation as home_competitor_competitor_abbreviation, competitorHome.country_code as home_competitor_ccountry_code, competitorHome.ioc_code as home_competitor_ioc_code, competitorHome.rotation_number as home_competitor_rotation_number, competitorHome.virtual as home_competitor_virtual,
competitorAway.competitor_id as away_competitor_competitor_id, competitorAway.abbreviation as away_competitor_competitor_abbreviation, competitorAway.country_code as away_competitor_ccountry_code, competitorAway.ioc_code as away_competitor_ioc_code, competitorAway.rotation_number as away_competitor_rotation_number, competitorAway.virtual as away_competitor_virtual,
homeTeamTranslation.competitor_competitor_id as home_team_translation_competitor_competitor_id, homeTeamTranslation.language_language_id as home_team_translation_language_language_id, homeTeamTranslation.competitor_name as home_team_translation_competitor_name, homeTeamTranslation.competitor_country as home_team_competitor_country,
awayTeamTranslation.competitor_competitor_id as away_team_translation_competitor_competitor_id, awayTeamTranslation.language_language_id as away_team_translation_language_language_id, awayTeamTranslation.competitor_name as away_team_translation_competitor_name, awayTeamTranslation.competitor_country as away_team_competitor_country
from "event" as e
left join competitor as competitorAway on competitorAway.competitor_id = e.away_team_competitor_id
left join competitor as competitorHome on competitorHome.competitor_id = e.home_team_competitor_id
left join competitor_translation as homeTeamTranslation on competitorHome.competitor_id = homeTeamTranslation.competitor_competitor_id
left join competitor_translation as awayTeamTranslation on competitorAway.competitor_id = awayTeamTranslation.competitor_competitor_id
where awayTeamTranslation.language_language_id = 'en' and homeTeamTranslation.language_language_id = 'en'
I'm trying to use the @SqlResultSetMapping annotation to map the result into DTO classes, but without success.
I've set up the mapping this way:
@SqlResultSetMapping(
    name = "mapLocalizedEvent",
    classes = [ConstructorResult(
        targetClass = TranslatedLocalEvent::class,
        columns = arrayOf(
            ColumnResult(name = "event_id"),
            ColumnResult(name = "scheduled"),
            ColumnResult(name = "start_time_tbd"),
            ColumnResult(name = "status"),
            ColumnResult(name = "replaced_by_match_id")
        )
    )]
)
and it works fine as long as all of the ColumnResult entries are simple types such as String or Boolean. It maps to a TranslatedLocalEvent object that looks like this:
class TranslatedLocalEvent(
    val eventId: String? = null,
    val scheduled: String? = null,
    val startTimeTbd: Boolean? = null,
    val status: String? = null,
    val replacedByMatchId: String? = null
)
Is there a way I can use this approach to map a complex object? The TranslatedLocalEvent object needs to contain a TranslatedLocalCompetitor object built from some of the columns the query returns:
class TranslatedLocalEvent(
    val eventId: String? = null,
    val scheduled: String? = null,
    val startTimeTbd: Boolean? = null,
    val status: String? = null,
    val replacedByMatchId: String? = null,
    val homeTeam: TranslatedLocalCompetitor? = null
)

class TranslatedLocalCompetitor(
    val competitorId: String? = null,
    val competitorName: String? = null,
    val competitorCountry: String? = null
)
The easiest way I see is to accept all the columns in your TranslatedLocalEvent constructor and, inside that constructor, create and assign the TranslatedLocalCompetitor object.

How to add the Kafka timestamp when converting to a DataFrame in Spark Streaming

I am doing Spark Streaming from Kafka and want to convert the RDDs from Kafka into DataFrames.
I am using the following approach:
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(4))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
"security.protocol" -> "SASL_PLAINTEXT",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "1",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("csv")
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
val strmk = stream.map(record => (record.value))
val rdd1 = strmk.map(line => line.split(',')).map(s => (s(0).toString, s(1).toString,s(2).toString,s(3).toString,s(4).toString, s(5).toString,s(6).toString,s(7).toString))
rdd1.foreachRDD((rdd, time) => {
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
val requestsDataFrame = rdd.map(w => Record(w._1, w._2, w._3,w._4, w._5, w._6,w._7, w._8)).toDF()
requestsDataFrame.createOrReplaceTempView("requests")
val word_df =sqlContext.sql("select * from requests ")
println(s"========= $time =========")
word_df.show()
})
But I also want to include the timestamp from Kafka in the DataFrame. Can someone help me with how to do that?
Kafka records have various attributes.
See https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Note that there are both a streaming and a batch approach to Kafka.
An example:
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()

import sparkSession.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")

val socketStreamDs = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "AAA")
  .option("startingOffsets", "earliest")
  .load()
  //.as[String]
  //.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp AS STRING)")
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Append())
  .start().awaitTermination()
My sample output is as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+-----------------------+
|key |value|timestamp |
+----+-----+-----------------------+
|null|RRR |2019-02-07 04:37:34.983|
|null|HHH |2019-02-07 04:37:36.802|
|null|JJJ |2019-02-07 04:37:39.1 |
+----+-----+-----------------------+
For non-structured (DStream-based) streaming, though, you just need to expand your map statement above:
stream.map { record => (record.timestamp(), record.key(), record.value()) }
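Applied to the DStream code from the question, here is a minimal sketch of carrying the Kafka timestamp through to the DataFrame; the column names passed to toDF are placeholders, and SQLContextSingleton is the same helper used in the question:

val strmk = stream.map(record => (record.value, record.timestamp))

val rdd1 = strmk.map { case (line, kafkaTime) =>
  val s = line.split(',')
  (s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7), kafkaTime)
}

rdd1.foreachRDD { (rdd, time) =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  // The last column holds the Kafka record timestamp in epoch milliseconds
  val requestsDataFrame =
    rdd.toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "kafkatime")
  println(s"========= $time =========")
  requestsDataFrame.show()
}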

How to use a connection pool for Impala (JDBC to Kudu) in Spark Streaming

I use Impala (JDBC) twice: once to read the Kafka offsets and once to save data in foreachRDD.
But Impala and Kudu keep shutting down. Now I want to set up a connection pool, but there is little material on this for Scala.
Here is my pseudo-code:
// node-1
val newOffsets = getNewOffset() // JDBC read of the Kafka offsets stored in Kudu
val messages = KafkaUtils.createDirectStream(*,newOffsets,)
messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()

  // node-2
  Class.forName(jdbcDriver)
  val con = DriverManager.getConnection("impala url")
  val stmt = con.createStatement()
  stmt.executeUpdate(sql)

  // node-3
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { r =>
    val rt_upsert = s"UPSERT into ${execTable} values('${r.topic}',${r.partition},${r.untilOffset})"
    stmt.executeUpdate(rt_upsert)
    stmt.close()
    con.close()
  }
})
How can I do this with c3p0 or something similar? I would appreciate your help.
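One common pattern, sketched below with c3p0 (which the question mentions), under the assumption that c3p0 is on the classpath: keep the pool in a singleton object so each JVM builds it lazily once, then borrow a connection and close it (which returns it to the pool) wherever you need JDBC. The driver class, URL, pool sizes, and UPSERT statement are placeholders, and messages is the DStream from your createDirectStream call; the same getConnection/close pattern applies inside rdd.foreachPartition for executor-side writes.

import java.sql.Connection
import com.mchange.v2.c3p0.ComboPooledDataSource

object ImpalaPool {
  // Built lazily, once per JVM (driver or executor)
  private lazy val ds: ComboPooledDataSource = {
    val cpds = new ComboPooledDataSource()
    cpds.setDriverClass("your.impala.jdbc.Driver")       // placeholder driver class
    cpds.setJdbcUrl("jdbc:impala://host:21050/default")  // placeholder URL
    cpds.setMinPoolSize(1)
    cpds.setMaxPoolSize(5)
    cpds
  }
  def getConnection: Connection = ds.getConnection
}

messages.foreachRDD { rdd =>
  val con = ImpalaPool.getConnection // borrowed from the pool instead of DriverManager.getConnection
  try {
    val stmt = con.createStatement()
    stmt.executeUpdate("UPSERT INTO kafka_offsets VALUES ('topic', 0, 0)") // placeholder statement
    stmt.close()
  } finally {
    con.close() // close() hands the connection back to the pool
  }
}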
Below is code for reading data from Kafka and inserting it into Kudu.
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kudu.client.KuduClient
import org.apache.kudu.client.KuduSession
import org.apache.kudu.client.KuduTable
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.immutable.List
import scala.collection.mutable
import scala.util.control.NonFatal
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.kudu.Schema
import org.apache.kudu.Type._
import org.apache.kudu.spark.kudu.KuduContext
import scala.collection.mutable.ArrayBuffer
object KafkaKuduConnect extends Serializable {
  def main(args: Array[String]): Unit = {
    try {
      val TopicName = "TestTopic"
      val kafkaConsumerProps = Map[String, String]("bootstrap.servers" -> "localhost:9092")
      val KuduMaster = ""
      val KuduTable = ""
      val sparkConf = new SparkConf().setAppName("KafkaKuduConnect")
      val sc = new SparkContext(sparkConf)
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._
      val ssc = new StreamingContext(sc, Milliseconds(1000))
      val kuduContext = new KuduContext(KuduMaster, sc)
      val kuduclient: KuduClient = new KuduClient.KuduClientBuilder(KuduMaster).build()
      // Open the table
      val kudutable: KuduTable = kuduclient.openTable(KuduTable)
      // Get the table schema
      val tableschema: Schema = kudutable.getSchema

      // Build the DataFrame schema from the Kudu table schema
      // (defined before its first use so the reference compiles)
      def generateStructure(tableSchema: Schema): StructType = {
        var structFieldList: List[StructField] = List()
        for (index <- 0 until tableSchema.getColumnCount) {
          val col = tableSchema.getColumnByIndex(index)
          val coltype = col.getType.toString
          println(coltype)
          col.getType match {
            case INT32 =>
              structFieldList = structFieldList :+ StructField(col.getName, IntegerType)
            case STRING =>
              structFieldList = structFieldList :+ StructField(col.getName, StringType)
            case _ =>
              println("No Class Type Found")
          }
        }
        StructType(structFieldList)
      }

      // Create a Row object with values cast according to the schema
      def getRow(schema: StructType, Data: List[String]): Row = {
        val RowData = ArrayBuffer[Any]()
        schema.zipWithIndex.foreach { each =>
          val Index = each._2
          each._1.dataType match {
            case IntegerType =>
              // Guard against null/empty values before converting to Int
              if (Data(Index) == null || Data(Index) == "") RowData += null
              else RowData += Data(Index).toInt
            case StringType =>
              RowData += Data(Index)
            case _ =>
              RowData += Data(Index)
          }
        }
        Row.fromSeq(RowData.toList)
      }

      // The schema for the DataFrame, derived from the table schema
      val FinalTableSchema = generateStructure(tableschema)
      val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerProps, Set(TopicName))
      messages.foreachRDD { eachrdd =>
        // Build an RDD[Row] so we can create a DataFrame with our schema
        val StructuredRdd = eachrdd.map { eachmessage =>
          val record = eachmessage._2
          getRow(FinalTableSchema, record.split(",").toList)
        }
        // DataFrame with the required structure according to the table
        val DF = sqlContext.createDataFrame(StructuredRdd, FinalTableSchema)
        kuduContext.upsertRows(DF, KuduTable)
      }

      // Output operations only run once the streaming context is started
      ssc.start()
      ssc.awaitTermination()
    }
    catch {
      case NonFatal(e) =>
        print("Error in main : " + e)
    }
  }
}

Apache Spark with Hive

How can I read/write data from/to Hive?
Is it necessary to compile Spark with the hive profile to interact with Hive?
Which Maven dependencies are required to interact with Hive?
I could not find good documentation to follow step by step to get this working with Hive.
Here is my current code:
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sqlCon = new SQLContext(sc)
val schemaString = "Date:string,Open:double,High:double,Low:double,Close:double,Volume:double,Adj_Close:double"
val schema =
  StructType(
    schemaString.split(",").map(fieldName =>
      StructField(fieldName.split(":")(0), getFieldTypeInSchema(fieldName.split(":")(1)), true)))
val rdd = sc.textFile("hdfs://45.55.159.119:9000/yahoo_stocks.csv")
//val rdd = sc.parallelize(arr)
val rowRDDx = noHeader.map(p => {
  var list: collection.mutable.Seq[Any] = collection.mutable.Seq.empty[Any]
  var index = 0
  val regex = rowSplittingRegexBuilder(Seq(","))
  var tokens = p.split(regex)
  tokens.foreach(value => {
    var valType = schema.fields(index).dataType
    var returnVal: Any = null
    valType match {
      case IntegerType => returnVal = value.toString.toInt
      case DoubleType => returnVal = value.toString.toDouble
      case LongType => returnVal = value.toString.toLong
      case FloatType => returnVal = value.toString.toFloat
      case ByteType => returnVal = value.toString.toByte
      case StringType => returnVal = value.toString
      case TimestampType => returnVal = value.toString
    }
    list = list :+ returnVal
    index += 1
  })
  Row.fromSeq(list)
})
val df = sqlCon.applySchema(rowRDDx, schema)
HiveContext.sql("create table yahoo_orc_table (date STRING, open_price FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT, adj_price FLOAT) stored as orc")
df.saveAsTable("hive", "org.apache.spark.sql.hive.orc", SaveMode.Append)
I am getting the following exception:
15/10/12 14:57:36 INFO storage.BlockManagerMaster: Registered BlockManager
15/10/12 14:57:38 INFO scheduler.EventLoggingListener: Logging events to hdfs://host:9000/spark/logs/local-1444676256555
Exception in thread "main" java.lang.VerifyError: Bad return type
Exception Details:
Location:
org/apache/spark/sql/catalyst/expressions/Pmod.inputType()Lorg/apache/spark/sql/types/AbstractDataType; #3: areturn
Reason:
Type 'org/apache/spark/sql/types/NumericType$' (current frame, stack[0]) is not assignable to 'org/apache/spark/sql/types/AbstractDataType' (from method signature)
Current Frame:
bci: #3
flags: { }
locals: { 'org/apache/spark/sql/catalyst/expressions/Pmod' }
stack: { 'org/apache/spark/sql/types/NumericType$' }
Bytecode:
0000000: b200 63b0
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
at java.lang.Class.getConstructor0(Class.java:2895)
at java.lang.Class.getDeclaredConstructor(Class.java:2066)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.expression(FunctionRegistry.scala:267)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.<init>(FunctionRegistry.scala:148)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.<clinit>(FunctionRegistry.scala)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:414)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:413)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:39)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:203)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
Thanks.
As dr_stein mentioned, this error is usually due to incompatible compile-time and runtime JDK versions, such as running jars compiled for JDK 1.7 on a 1.6 runtime.
I would also check whether your Hive libraries reflect the correct versions and whether your Hive server is running on the same JDK as you are.
You can also try running with the -noverify JVM option, which disables bytecode verification.
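For the read/write part of the question, a minimal sketch along the lines of the code above: depend on the spark-hive module (the version below is a placeholder for the Spark 1.5-era setup shown in the question) and go through HiveContext. The table name being read matches the question; the copy table name is a placeholder.

// build.sbt (placeholder version):
// libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("hive-example"))
val hiveContext = new HiveContext(sc)

// Read an existing Hive table into a DataFrame
val stocks = hiveContext.sql("SELECT * FROM yahoo_orc_table")

// Write a DataFrame back to Hive as an ORC table
stocks.write
  .mode(SaveMode.Append)
  .format("orc")
  .saveAsTable("yahoo_orc_table_copy")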
