How to add the Kafka timestamp when converting to a DataFrame in Spark Streaming - spark-streaming

I am doing Spark Streaming from Kafka and want to convert the RDD coming from Kafka into a DataFrame.
I am using the following approach:
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(4))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
  "security.protocol" -> "SASL_PLAINTEXT",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("csv")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val strmk = stream.map(record => record.value)
val rdd1 = strmk.map(_.split(','))
  .map(s => (s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7)))

rdd1.foreachRDD((rdd, time) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  val requestsDataFrame = rdd.map(w => Record(w._1, w._2, w._3, w._4, w._5, w._6, w._7, w._8)).toDF()
  requestsDataFrame.createOrReplaceTempView("requests")
  val word_df = sqlContext.sql("select * from requests")
  println(s"========= $time =========")
  word_df.show()
})
But I also want to include the Kafka timestamp in the DataFrame. Can someone help me with how to do that?

Kafka records carry several attributes besides the key and value, including a timestamp.
See https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Note that there are both Streaming and Batch approaches to the Kafka integration.
An example:
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()
import sparkSession.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")

val socketStreamDs = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "AAA")
  .option("startingOffsets", "earliest")
  .load()
  //.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(timestamp AS STRING)")
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
  .writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Append())
  .start().awaitTermination()
My sample output is as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+-----------------------+
|key |value|timestamp |
+----+-----+-----------------------+
|null|RRR |2019-02-07 04:37:34.983|
|null|HHH |2019-02-07 04:37:36.802|
|null|JJJ |2019-02-07 04:37:39.1 |
+----+-----+-----------------------+
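If you also want the CSV fields from your value as separate columns while keeping the Kafka timestamp, a sketch using the split function could look like this (the column names c0, c1, ... are made-up placeholders):

import org.apache.spark.sql.functions._

val parsed = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "csv")
  .load()
  // keep the Kafka timestamp and split the CSV payload into an array column
  .select(col("timestamp"), split(col("value").cast("string"), ",").as("fields"))
  .select(
    col("timestamp"),
    col("fields").getItem(0).as("c0"),
    col("fields").getItem(1).as("c1")
    // ... repeat getItem for the remaining fields
  )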
For the non-Structured Streaming (DStream) approach, you just need to expand your map statement above:
stream.map { record => (record.timestamp(), record.key(), record.value()) }
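To carry that timestamp through to the DataFrame in your existing foreachRDD code, a minimal sketch based on the question's code could look like this (it assumes your Record case class gains an extra kafkaTime: Long field, which is not shown in the question):

val withTs = stream.map(record => (record.timestamp(), record.value()))

val rdd1 = withTs.map { case (ts, line) =>
  val s = line.split(',')
  (ts, s(0), s(1), s(2), s(3), s(4), s(5), s(6), s(7))
}

rdd1.foreachRDD((rdd, time) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  // Record is assumed here to take kafkaTime: Long as its first field
  val requestsDataFrame = rdd
    .map(w => Record(w._1, w._2, w._3, w._4, w._5, w._6, w._7, w._8, w._9))
    .toDF()
  requestsDataFrame.show()
})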

Related

What is the best Scala library for writing/reading PNG image files?

What is the best Scala library for writing/reading PNG image files? I'm looking for the Scala-equivalent of libpng (c++) and pypng (Python)
Since Java libraries can be used from Scala, you can use javax.imageio (ImageIO), java.awt, or both. You can find plenty of examples of them on the internet; here I'm sharing one.
Below is sample code for the same.
import org.opencv.core.{Core, CvType, Mat, MatOfKeyPoint}
import org.opencv.features2d.{FeatureDetector, Features2d}
import org.opencv.highgui.Highgui
import javax.imageio.ImageIO
import java.awt.image.{BufferedImage, DataBufferByte}
import java.io.{ByteArrayInputStream, File}

object OpenCVOps {
  System.loadLibrary(Core.NATIVE_LIBRARY_NAME)

  // MatSer is assumed to be a serializable wrapper around Mat defined elsewhere
  def imageToMat(byteArray: Array[Byte]): MatSer = {
    val bufferedImage = ImageIO.read(new ByteArrayInputStream(byteArray))
    val mat = new Mat(bufferedImage.getHeight(), bufferedImage.getWidth(), CvType.CV_8UC3)
    val data = bufferedImage.getRaster().getDataBuffer.asInstanceOf[DataBufferByte].getData()
    mat.put(0, 0, data)
    new MatSer(mat)
  }

  def matToImage(mat: Mat, file: String): Boolean = {
    val imageType = BufferedImage.TYPE_3BYTE_BGR
    val image = new BufferedImage(mat.cols(), mat.rows(), imageType)
    val x = image.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData
    mat.get(0, 0, x)
    // write to the path passed in as the file parameter
    val fileName = new File(file)
    ImageIO.write(image, "jpg", fileName)
  }

  def detectFeatures(mat: Mat): Unit = {
    val featureDetector = FeatureDetector.create(FeatureDetector.SIFT)
    val matKeyPoint = new MatOfKeyPoint()
    featureDetector.detect(mat, matKeyPoint)
    println(mat.get(0, 0))
    println(matKeyPoint.toList())
    //writeToImage(mat, matKeyPoint)
  }

  def writeToImage(mat: Mat, matKeyPoint: MatOfKeyPoint): Unit = {
    val outImage = new Mat()
    Features2d.drawKeypoints(mat, matKeyPoint, outImage)
    Highgui.imwrite("myfile.jpg", outImage)
  }
}
Here is one more example:
import javax.imageio.ImageIO

trait Settings {
  this: com.jme3.app.SimpleApplication =>

  protected val appSettings = new com.jme3.system.AppSettings(true)

  Seq(
    ("Width", 1920)
    , ("Height", 1080)
    , ("BitsPerPixel", 32)
    , ("Frequency", 60)
    , ("DepthBits", 24)
    , ("StencilBits", 0)
    , ("Samples", 2)
    , ("Fullscreen", false)
    , ("Title", "LW3D")
    , ("Renderer", com.jme3.system.AppSettings.LWJGL_OPENGL2)
    , ("AudioRenderer", com.jme3.system.AppSettings.LWJGL_OPENAL)
    , ("DisableJoysticks", true)
    , ("UseInput", true)
    , ("VSync", false)
    , ("FrameRate", 0)
    , ("SettingsDialogImage", "")
    , ("MinHeight", 1920)
    , ("MinWidth", 1080)
  ) foreach { case (k, v: Object) => appSettings.put(k, v) }

  appSettings.setIcons(
    List(
      ImageIO.read(getClass.getClassLoader.getResource("Yx/logo/Y256.png")),
      ImageIO.read(getClass.getClassLoader.getResource("Yx/logo/Y128.png")),
      ImageIO.read(getClass.getClassLoader.getResource("Yx/logo/Y64.png")),
      ImageIO.read(getClass.getClassLoader.getResource("Yx/logo/Y32.png")),
      ImageIO.read(getClass.getClassLoader.getResource("Yx/logo/Y16.png"))
    ).toArray
  )

  setSettings(appSettings)
  setShowSettings(false)
  setDisplayFps(true)
  setDisplayStatView(false)
  setPauseOnLostFocus(false)
}
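For the original question (plain PNG reading and writing without OpenCV or jME3), a minimal sketch using only javax.imageio is enough; the file paths below are placeholders:

import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

object PngExample {
  def main(args: Array[String]): Unit = {
    // Read a PNG into a BufferedImage (path is a placeholder)
    val img: BufferedImage = ImageIO.read(new File("input.png"))
    // Pixels can be inspected or changed via getRGB/setRGB
    val argb = img.getRGB(0, 0)
    println(f"top-left pixel: 0x$argb%08x")
    // Write the image back out as PNG
    ImageIO.write(img, "png", new File("output.png"))
  }
}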

How to convert DStream[(Array[String], Long)] to dataframe in Spark Streaming

I am trying to convert my DStream to a DataFrame. Here is the code I am using:
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "ffff.dl.uk.fff.com:8002",
  "security.protocol" -> "SASL_PLAINTEXT",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("mytopic")
val from_kafkastream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val strmk = from_kafkastream.map(record => (record.value, record.timestamp))
val splitup2 = strmk.map { case (line1, line2) => (line1.split(","), line2) }

case class Record(name: String, trQ: String, traW: String, traNS: String,
                  traned: String, tranS: String, transwer: String, trABN: String,
                  kafkatime: Long)

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

splitup2.foreachRDD((rdd) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  spark.sparkContext.setLogLevel("ERROR")
  import sqlContext.implicits._
  val requestsDataFrame = rdd.map(w => Record(w(0).toString,
    w(1).toString, w(2).toString, w(3).toString, w(4).toString,
    w(5).toString, w(6).toString, w(7).toString, w(8).toString)).toDF()
  // am getting issue here
  requestsDataFrame.show()
})

ssc.start()
I am getting an error at the line marked above. Can someone help me convert my DStream to a DataFrame? I am new to the Spark world.
Maybe the mistake is in how you build the Record object: you don't pass the kafkatime, only string values, and because each element is a tuple (Array[String], Long) you can't index it directly like an array. The array is the tuple's first element (w._1) and the timestamp is its second element (w._2).
You can try this:
import sqlContext.implicits._
val requestsDataFrame = rdd.map(w => Record(
  w._1(0).toString,
  w._1(1).toString, w._1(2).toString, w._1(3).toString, w._1(4).toString,
  w._1(5).toString, w._1(6).toString, w._1(7).toString, w._2))
requestsDataFrame.toDF()
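Putting that together with the foreachRDD from the question, a sketch of the corrected block (using the same SQLContextSingleton and Record as above) would be:

splitup2.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  // w._1 is the Array[String] from the split, w._2 is the Kafka timestamp
  val requestsDataFrame = rdd.map { w =>
    Record(w._1(0), w._1(1), w._1(2), w._1(3),
           w._1(4), w._1(5), w._1(6), w._1(7), w._2)
  }.toDF()
  requestsDataFrame.show()
}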

How to convert parquet schema to avro in Java/Scala

Let's say I have a Parquet file on the file system. How can I get the Parquet schema and convert it to an Avro schema?
Use Hadoop's ParquetFileReader to get the Parquet schema, then pass it to AvroSchemaConverter to convert it to an Avro schema.
Scala code example:
import org.apache.avro.Schema
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroSchemaConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object ParquetToAvroSchemaConverter {

  def main(args: Array[String]): Unit = {
    val path = new Path("###PATH_TO_PARQUET_FILE###")
    val avroSchema = convert(path)
  }

  def convert(parquetPath: Path): Schema = {
    val cfg = new Configuration
    // Create the Parquet reader
    val rdr = ParquetFileReader.open(HadoopInputFile.fromPath(parquetPath, cfg))
    try {
      // Get the Parquet schema
      val schema = rdr.getFooter.getFileMetaData.getSchema
      println("Parquet schema: ")
      println("#############################################################")
      println(schema.toString)
      println("#############################################################")
      println()

      // Convert to Avro
      val avroSchema = new AvroSchemaConverter(cfg).convert(schema)
      println("Avro schema: ")
      println("#############################################################")
      println(avroSchema.toString(true))
      println("#############################################################")
      avroSchema
    } finally {
      rdr.close()
    }
  }
}
You need the following dependencies in your SBT project:
libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-avro" % "1.10.0",
  "org.apache.parquet" % "parquet-hadoop" % "1.10.0",
  "org.apache.hadoop" % "hadoop-client" % "2.7.3"
)
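If you also want to persist the converted schema, a small follow-up sketch (the paths are placeholders) can write it out as an .avsc file, since Avro schemas are plain JSON:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.hadoop.fs.Path

val avroSchema = ParquetToAvroSchemaConverter.convert(new Path("###PATH_TO_PARQUET_FILE###"))
Files.write(Paths.get("schema.avsc"), avroSchema.toString(true).getBytes(StandardCharsets.UTF_8))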

How to use a connection pool for Impala (JDBC to Kudu) in Spark Streaming

I use Impala (via JDBC) twice: to get the Kafka offsets and to save data in foreachRDD, but Impala and Kudu keep shutting down. Now I want to set up a connection pool, but there is little material on this for Scala.
Here is my pseudo-code:
#node-1
val newOffsets = getNewOffset() // JDBC read of the Kafka offsets stored in Kudu
val messages = KafkaUtils.createDirectStream(*, newOffsets, )
messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()

  #node-2
  Class.forName(jdbcDriver)
  val con = DriverManager.getConnection("impala url")
  val stmt = con.createStatement()
  stmt.executeUpdate(sql)

  #node-3
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { r =>
    val rt_upsert = s"UPSERT INTO ${execTable} VALUES('${r.topic}', ${r.partition}, ${r.untilOffset})"
    stmt.executeUpdate(rt_upsert)
    stmt.close()
    con.close()
  }
})
How can I code this with c3p0 or something else? I'd appreciate your help.
Below is code for reading data from Kafka and inserting it into Kudu.
import kafka.serializer.StringDecoder
import org.apache.kudu.Schema
import org.apache.kudu.Type._
import org.apache.kudu.client.{KuduClient, KuduTable}
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object KafkaKuduConnect extends Serializable {
  def main(args: Array[String]): Unit = {
    try {
      val TopicName = "TestTopic"
      val kafkaConsumerProps = Map[String, String]("bootstrap.servers" -> "localhost:9092")
      val KuduMaster = ""
      val KuduTable = ""

      val sparkConf = new SparkConf().setAppName("KafkaKuduConnect")
      val sc = new SparkContext(sparkConf)
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val ssc = new StreamingContext(sc, Milliseconds(1000))
      val kuduContext = new KuduContext(KuduMaster, sc)
      val kuduclient: KuduClient = new KuduClient.KuduClientBuilder(KuduMaster).build()

      // Opening the table
      val kudutable: KuduTable = kuduclient.openTable(KuduTable)
      // Getting the table schema
      val tableschema: Schema = kudutable.getSchema

      // To create the schema for the data frame from the Kudu table schema
      def generateStructure(tableSchema: Schema): StructType = {
        var structFieldList: List[StructField] = List()
        for (index <- 0 until tableSchema.getColumnCount) {
          val col = tableSchema.getColumnByIndex(index)
          val coltype = col.getType.toString
          println(coltype)
          col.getType match {
            case INT32 =>
              structFieldList = structFieldList :+ StructField(col.getName, IntegerType)
            case STRING =>
              structFieldList = structFieldList :+ StructField(col.getName, StringType)
            case _ =>
              println("No Class Type Found")
          }
        }
        StructType(structFieldList)
      }

      // To create the Row object with values cast according to the schema
      def getRow(schema: StructType, Data: List[String]): Row = {
        val RowData = ArrayBuffer[Any]()
        schema.zipWithIndex.foreach { each =>
          val Index = each._2
          each._1.dataType match {
            case IntegerType =>
              // empty or null values become null, everything else is cast to Int
              if (Data(Index) == "" || Data(Index) == null)
                RowData += null
              else
                RowData += Data(Index).toInt
            case StringType =>
              RowData += Data(Index)
            case _ =>
              RowData += Data(Index)
          }
        }
        Row.fromSeq(RowData.toList)
      }

      // Creating the schema for the data frame using the table schema
      val FinalTableSchema = generateStructure(tableschema)

      val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaConsumerProps, Set(TopicName))

      messages.foreachRDD(
        // Looping through each RDD
        eachrdd => {
          // Creating the RDD[Row] to build a data frame with our schema
          val StructuredRdd = eachrdd.map(eachmessage => {
            val record = eachmessage._2
            getRow(FinalTableSchema, record.split(",").toList)
          })
          // DataFrame with the required structure according to the table
          val DF = sqlContext.createDataFrame(StructuredRdd, FinalTableSchema)
          kuduContext.upsertRows(DF, KuduTable)
        }
      )
    } catch {
      case NonFatal(e) =>
        print("Error in main : " + e)
    }
  }
}
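The code above reads from Kafka and writes to Kudu but does not use a connection pool. For the pooling part of the question, a minimal sketch of a per-executor singleton built on c3p0 could look like the following (the driver class, JDBC URL and pool sizes are placeholders, not tested settings):

import java.sql.Connection
import com.mchange.v2.c3p0.ComboPooledDataSource

// One pool per executor JVM: the lazy val is initialized on first use on each executor
object ImpalaConnectionPool {
  lazy val dataSource: ComboPooledDataSource = {
    val ds = new ComboPooledDataSource()
    ds.setDriverClass("com.cloudera.impala.jdbc41.Driver") // placeholder driver class
    ds.setJdbcUrl("jdbc:impala://impala-host:21050")        // placeholder URL
    ds.setMinPoolSize(1)
    ds.setMaxPoolSize(10)
    ds
  }
  def getConnection: Connection = dataSource.getConnection
}

// Usage inside foreachRDD: borrow a connection per partition and return it when done
messages.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    val con = ImpalaConnectionPool.getConnection
    try {
      val stmt = con.createStatement()
      stmt.executeUpdate("...") // your UPSERT statement here
      stmt.close()
    } finally {
      con.close() // returns the connection to the pool instead of closing the physical connection
    }
  }
}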

How to connect Hbase With JDBC driver of Apache Drill programmatically

I tried to use the JDBC driver of Apache Drill programmatically.
Here's the code:
import java.sql.DriverManager

object SearchHbaseWithHbase {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.drill.jdbc.Driver")
    val zkIp = "192.168.3.2:2181"
    val connection = DriverManager.getConnection(s"jdbc:drill:zk=${zkIp};schema:hbase")
    connection.setSchema("hbase")
    println(connection.getSchema)

    val st = connection.createStatement()
    val rs = st.executeQuery("SELECT * FROM Label")
    while (rs.next()) {
      println(rs.getString(1))
    }
  }
}
I have set the database schema to hbase, like:
connection.setSchema("hbase")
But it fails with the following error:
Exception in thread "main" java.sql.SQLException: VALIDATION ERROR:
From line 1, column 15 to line 1, column 19: Table 'Label' not found
SQL Query null
The Label table definitely exists in my HBase. I can find my data when I use sqlline, like:
sqlline -u jdbc:drill:zk....
use hbase;
select * from Label;
I have solved this problem. I had confused Drill's schema with the JDBC driver's schema setting; the schema has to be passed as a connection property.
The correct code should look like this:
object SearchHbaseWithHbase {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.drill.jdbc.Driver")
    val zkIp = "192.168.3.2:2181"

    val p = new java.util.Properties
    p.setProperty("schema", "hbase")

    // val connectionInfo = new ConnectionInfo
    val url = s"jdbc:drill:zk=${zkIp}"
    val connection = DriverManager.getConnection(url, p)
    // connection.setSchema("hbase")
    // println(connection.getSchema)

    val st = connection.createStatement()
    val rs = st.executeQuery("SELECT * FROM Label")
    while (rs.next()) {
      println(rs.getString(1))
    }
  }
}
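If I remember Drill's JDBC URL syntax correctly, the schema can also be passed directly in the connection URL instead of a Properties object (note the = sign rather than the : used in the question):

val connection = DriverManager.getConnection(s"jdbc:drill:zk=${zkIp};schema=hbase")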