PySpark: write DStream data to Elasticsearch using saveAsNewAPIHadoopFile

I am trying to transform the Kafka Stream into RDDs and insert these RDDs into an Elasticsearch database. This is my code:
conf = SparkConf().setAppName("ola")
sc = SparkContext(conf=conf)
es_write_conf = {
"es.nodes": "localhost",
"es.port": "9200",
"es.resource": "pipe/word"
}
ssc = StreamingContext(sc, 2)
brokers, topic = sys.argv[1:]
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])
value_counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
value_counts.transform(lambda rdd: rdd.map(f))
value_counts.saveAsNewAPIHadoopFile(
path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf=es_write_conf)
ssc.start()
ssc.awaitTermination()
The saveAsNewAPIHadoopFile function should write those RDDs to ES. However I get this error:
value_counts.saveAsNewAPIHadoopFile(
AttributeError: 'TransformedDStream' object has no attribute 'saveAsNewAPIHadoopFile'
I expected the transform function to turn the stream into RDDs that I could save. How can I write these RDDs to Elasticsearch? Thanks!

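saveAsNewAPIHadoopFile is a method of RDD, not of DStream, and transform returns another DStream. Use foreachRDD to reach the RDD behind each batch:
value_counts.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(...))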
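A minimal sketch of the foreachRDD approach, assuming each record in value_counts is a (word, count) pair. The helper names to_es_doc and save_batch_to_es and the document field names word and count are hypothetical; the saveAsNewAPIHadoopFile arguments are copied from the question.

```python
def to_es_doc(pair):
    """Turn a (word, count) pair into the (key, document) tuple written to Elasticsearch."""
    word, count = pair
    # The key is a placeholder (keyClass is NullWritable); only the dict is indexed.
    # The field names "word" and "count" are assumptions, not taken from the question.
    return ("key", {"word": word, "count": count})


def save_batch_to_es(rdd, es_write_conf):
    """Write one micro-batch RDD of (word, count) pairs to Elasticsearch."""
    if rdd.isEmpty():
        # Nothing arrived in this batch interval, so skip the write.
        return
    rdd.map(to_es_doc).saveAsNewAPIHadoopFile(
        path="-",
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_write_conf,
    )


# Wiring it into the stream from the question (value_counts is the DStream):
# value_counts.foreachRDD(lambda rdd: save_batch_to_es(rdd, es_write_conf))
```

Keeping the per-batch logic in a named function rather than a lambda makes it easier to add the empty-batch check and to try the conversion outside a running stream.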

new = rawUser.rdd.map(lambda item: ('key', {'id': item['entityId'],'targetEntityId': item['targetEntityId']}))
Here rawUser is a DataFrame and new is a PipelinedRDD.
new.saveAsNewAPIHadoopFile(
path='/home/aakash/test111/',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={ "es.resource" : "index/test" ,"es.mapping.id":"id","es.nodes" : "localhost","es.port" : "9200","es.nodes.wan.only":"false"})
The most important thing here is to download a compatible JAR:
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-hadoop
Check your Elasticsearch version and download the matching elasticsearch-hadoop JAR.
Then start pyspark with that JAR:
pyspark --jars elasticsearch-hadoop-6.2.4.jar

Related

Spark Streaming With JDBC Source and Redis Stream

I'm trying to combine a few technologies to implement a solution at work. Since I'm new to most of them, I sometimes get stuck, but so far I have found solutions to most of the problems I faced. Right now both applications run on Spark, but I can't work out why the streaming is not working.
Maybe it is the way Redis implements its sink on the write side of the stream, or maybe it is the way I'm trying to do the job. Almost all of the streaming examples I found are based on the Spark samples, such as streaming text or TCP, and the only solutions I found for relational databases are based on Kafka Connect, which I can't use right now because the company doesn't have the Oracle option for CDC on Kafka.
My scenario is as follows: build an Oracle -> Redis Stream -> MongoDB Spark application.
I built my code on the spark-redis examples and used the sample code to try to implement a solution for my case. I load the Oracle data day by day and send it to a Redis stream, from which it will later be read and saved to Mongo. Right now the sample below only tries to read from the stream and show the records on the console, but nothing is shown.
The little 'trick' I tried was to create a CSV directory, read from it as a stream, take the date from the CSV and use it to query the Oracle database, and then save the Oracle DataFrame to Redis inside foreachBatch. The data is saved, but I think not in the right way, because nothing is received when I read the stream with the sample code.
This is the code:
** Writing to Stream **
object SendData extends App {
Logger.getLogger("org").setLevel(Level.INFO)
val oracleHost = scala.util.Properties.envOrElse("ORACLE_HOST", "<HOST_IP>")
val oracleService = scala.util.Properties.envOrElse("ORACLE_SERVICE", "<SERVICE>")
val oracleUser = scala.util.Properties.envOrElse("ORACLE_USER", "<USER>")
val oraclePwd = scala.util.Properties.envOrElse("ORACLE_PWD", "<PASSWD>")
val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
val oracleUrl = "jdbc:oracle:thin:@//" + oracleHost + "/" + oracleService
val userSchema = new StructType().add("DTPROCESS", "string")
val spark = SparkSession
.builder()
.appName("Send Data")
.master("local[*]")
.config("spark.redis.host", redisHost)
.config("spark.redis.port", redisPort)
.getOrCreate()
val sc = spark.sparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val csvDF = spark.readStream.option("header", "true").schema(userSchema).csv("/tmp/checkpoint/*.csv")
val output = csvDF
.writeStream
.outputMode("update")
.foreachBatch {(df :DataFrame, batchId: Long) => {
val dtProcess = df.select(col("DTPROCESS")).first.getString(0).take(10)
val query = s"""
(SELECT
<FIELDS>
FROM
TABLE
WHERE
DTPROCESS BETWEEN (TO_TIMESTAMP('$dtProcess 00:00:00.00', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
AND (TO_TIMESTAMP('$dtProcess 23:59:59.99', 'YYYY-MM-DD HH24:MI:SS.FF') + 1)
) Table
"""
// Named oracleDf so that it does not shadow the batch DataFrame df used above
val oracleDf = spark.read
.format("jdbc")
.option("url", oracleUrl)
.option("dbtable", query)
.option("user", oracleUser)
.option("password", oraclePwd)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.load()
oracleDf.cache()
if (oracleDf.count() > 0) {
oracleDf.write.format("org.apache.spark.sql.redis")
.option("table", "process")
.option("key.column", "PRIMARY_KEY")
.mode(SaveMode.Append)
.save()
}
if ((new DateTime(dtProcess).toLocalDate()).equals(new LocalDate()))
Seq(dtProcess).toDF("DTPROCESS")
.coalesce(1)
.write.format("com.databricks.spark.csv")
.mode("overwrite")
.option("header", "true")
.save("/tmp/checkpoint")
else {
val nextDay = new DateTime(dtProcess).plusDays(1)
Seq(nextDay.toString(DateTimeFormat.forPattern("YYYY-MM-dd"))).toDF("DTPROCESS")
.coalesce(1)
.write.format("com.databricks.spark.csv")
.mode("overwrite")
.option("header", "true")
.save("/tmp/checkpoint")
}
}}
.start()
output.awaitTermination()
}
** Reading from Stream **
object ReceiveData extends App {
Logger.getLogger("org").setLevel(Level.INFO)
val mongoPwd = scala.util.Properties.envOrElse("MONGO_PWD", "bpedes")
val redisHost = scala.util.Properties.envOrElse("REDIS_HOST", "<REDIS_IP>")
val redisPort = scala.util.Properties.envOrElse("REDIS_PORT", "6379")
val spark = SparkSession
.builder()
.appName("Receive Data")
.master("local[*]")
.config("spark.redis.host", redisHost)
.config("spark.redis.port", redisPort)
.getOrCreate()
val processes = spark
.readStream
.format("redis")
.option("stream.keys", "process")
.schema(StructType(Array(
StructField("FIELD_1", StringType),
StructField("PRIMARY_KEY", StringType),
StructField("FIELD_3", TimestampType),
StructField("FIELD_4", LongType),
StructField("FIELD_5", StringType),
StructField("FIELD_6", StringType),
StructField("FIELD_7", StringType),
StructField("FIELD_8", TimestampType)
)))
.load()
val query = processes
.writeStream
.format("console")
.start()
query.awaitTermination()
}
This code writes the DataFrame to Redis as hashes, not to a Redis stream, so a reader that listens on the stream key receives nothing:
df.write.format("org.apache.spark.sql.redis")
.option("table", "process")
.option("key.column", "PRIMARY_KEY")
.mode(SaveMode.Append)
.save()
spark-redis does not support writing to a Redis stream out of the box.
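Since the connector does not write to streams, the entries have to be appended with a Redis client. The following is only a Python sketch of that idea, not part of spark-redis and not the asker's Scala code: rows_to_stream is a hypothetical helper, and client is assumed to be any object with an xadd(name, fields) method, such as a redis-py Redis instance.

```python
def rows_to_stream(client, stream_key, rows):
    """Append each row (a dict) to a Redis stream with XADD and return the entry IDs."""
    entry_ids = []
    for row in rows:
        # Values are stored as strings; None values are dropped in this sketch.
        fields = {name: str(value) for name, value in row.items() if value is not None}
        if fields:
            # XADD rejects an empty field map, so rows with no values are skipped.
            entry_ids.append(client.xadd(stream_key, fields))
    return entry_ids


# Possible use with redis-py (assumed to be installed; host and port are placeholders):
# import redis
# client = redis.Redis(host="localhost", port=6379)
# rows_to_stream(client, "process", [{"PRIMARY_KEY": "1", "FIELD_1": "a"}])
```

In a Spark job the same call would run inside foreachPartition, with one client created per partition, so that the connection is opened on the executors rather than serialized from the driver.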

Pyspark Impala jdbc Driver does not support this optional feature

I am using PySpark for Spark Streaming. I can stream and create the DataFrame with no issues. I was also able to insert data into an Impala table created with only a few (5) sampled columns out of the 72 columns in the Kafka message. But when I create a new table with the proper data types and all the columns, and the DataFrame likewise contains every column from the Kafka message, I get the exception below.
java.sql.SQLFeatureNotSupportedException: [Cloudera]JDBC Driver does not support this optional feature.
at com.cloudera.impala.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.cloudera.impala.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown Source)
at com.cloudera.impala.jdbc.common.SPreparedStatement.setNull(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:627)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:782)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:782)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have searched a lot but could not find a solution. I also enabled debug logs, but they still don't say which feature the driver does not support.
Any help or guidance would be appreciated.
Thank you
Version details:
pyspark : 2.2.0
Kafka : 0.10.2
Cloudera : 5.15.0
Cloudera Impala : 2.12.0-cdh5.15.0
Cloudera Impala JDBC driver : 2.6.4
The code I used:
import json
from pyspark import SparkContext,SparkConf,HiveContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession,Row
from pyspark.sql.functions import lit
from pyspark.sql.types import *
conf = SparkConf().setAppName("testkafkarecvstream")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)
spark = SparkSession.builder.appName("testkafkarecvstream").getOrCreate()
jdbcUrl = "jdbc:impala://hostname:21050/dbName;AuthMech=0;"
fields = [
StructField("column_name01", StringType(), True),
StructField("column_name02", StringType(), True),
StructField("column_name03", DoubleType(), True),
StructField("column_name04", StringType(), True),
StructField("column_name05", IntegerType(), True),
StructField("column_name06", StringType(), True),
.....................
StructField("column_name72", StringType(), True),
]
schema = StructType(fields)
def make_rows(parts):
    customRow = Row(column_name01=datatype(parts['column_name01']),
                    .....,
                    column_name72=datatype(parts['column_name72'])
                    )
    return customRow
def createDFToParquet(rdd):
    try:
        df = spark.createDataFrame(rdd, schema)
        df.show()
        df.write.jdbc(jdbcUrl,
                      table="table_name",
                      mode="append")
    except Exception as e:
        print(str(e))
zkNode = "zkNode_name:2181"
topic = "topic_name"
# Receiver method
kvs = KafkaUtils.createStream(ssc,
zkNode,
"consumer-group-id",
{topic:5},
{"auto.offset.reset" : "smallest"})
lines = kvs.map(lambda x: x[1])
conv = lines.map(lambda x: json.loads(x))
table = conv.map(make_rows)
table.foreachRDD(createDFToParquet)
table.pprint()
ssc.start()
ssc.awaitTermination()

Can I create sequence file using spark dataframes?

I have a requirement to create a sequence file. Right now we have a custom API written on top of the Hadoop API, but since we are moving to Spark we have to achieve the same with Spark. Can this be achieved using Spark DataFrames?
AFAIK there is no native API for this directly on DataFrame; the approach below goes through the RDD instead.
Try something like the example below, which works on the RDD behind a DataFrame and is inspired by SequenceFileRDDFunctions.scala and its saveAsSequenceFile method. That class is described as:
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable
object driver extends App {
val conf = new SparkConf()
.setAppName("HDFS writable test")
val sc = new SparkContext(conf)
val empty = sc.emptyRDD[Any].repartition(10)
val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
val seq = new SequenceFileRDDFunctions(data)
// seq.saveAsSequenceFile("/tmp/s1", None)
seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
sc.stop()
}
For further information, see:
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file
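For PySpark users, the same route through the RDD can be sketched as follows. This is an illustration under assumptions rather than the answer's Scala code: row_to_pair and save_df_as_sequence_file are hypothetical helpers, and serializing each row as a JSON string is just one possible choice of value format.

```python
import json


def row_to_pair(row, key_field):
    """Convert a Row (or a plain dict) into a (key, value) pair of strings."""
    record = row.asDict() if hasattr(row, "asDict") else dict(row)
    # default=str covers values such as dates that json cannot encode directly.
    return (str(record[key_field]), json.dumps(record, sort_keys=True, default=str))


def save_df_as_sequence_file(df, key_field, path):
    """Write a DataFrame as a Hadoop SequenceFile of string key/value pairs."""
    # saveAsSequenceFile is defined on RDDs of pairs, so go through df.rdd first.
    df.rdd.map(lambda row: row_to_pair(row, key_field)).saveAsSequenceFile(path)


print(row_to_pair({"id": 7, "name": "x"}, "id"))
# prints ('7', '{"id": 7, "name": "x"}')
```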

How to read a record from HBase then store into Spark RDD (Resilient Distributed Datasets); and read one RDD record then write into HBase?

I want to write code that reads a record from Hadoop HBase and stores it in a Spark RDD (Resilient Distributed Dataset), and that reads one RDD record and writes it into HBase. I have zero knowledge of either of the two, and I need to use the AWS cloud or a Hadoop virtual machine. Could someone please guide me to start from scratch?
Here is some basic Scala code that reads data from HBase into an RDD. In a similar way, you can create a table and write data into HBase.
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
object HBaseApp {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("HBaseApp").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "table1"
System.setProperty("user.name", "hdfs")
System.setProperty("HADOOP_USER_NAME", "hdfs")
conf.set("hbase.master", "localhost:60000")
conf.setInt("timeout", 100000)
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
val tableDesc = new HTableDescriptor(tableName)
admin.createTable(tableDesc)
}
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())
sc.stop()
}
}

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me on how to export to CSV in Spark SQL.
You can use the statement below to write the contents of a DataFrame in CSV format:
df.write.csv("/data/home/csv")
If you need to write the whole DataFrame into a single CSV file, use:
df.coalesce(1).write.csv("/data/home/sample.csv")
For Spark 1.x, you can use spark-csv to write the results into CSV files.
The Scala snippet below should help:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.x, spark-csv is integrated as a native data source, so the necessary statement simplifies to the following (Windows):
df.write
.option("header", "true")
.csv("file:///C:/out.csv")
or on UNIX:
df.write
.option("header", "true")
.csv("/var/out.csv")
Note: as the comments say, this creates a directory with that name containing the partition files, not a single standard CSV file. This, however, is most likely what you want, since otherwise you are either crashing your driver (out of RAM) or working in a non-distributed environment.
The answer above with spark-csv is correct, but there is an issue: the library creates several files based on the DataFrame partitioning, which is usually not what we need. You can combine all partitions into one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and rename the output of the library (named "part-00000") to the desired filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
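The rename step can be scripted. The sketch below assumes the output directory is on the local filesystem and was written with coalesce(1), so it holds exactly one part file; promote_single_part_file is a hypothetical helper, and for HDFS or object storage the Hadoop filesystem tools would be needed instead.

```python
import glob
import os
import shutil


def promote_single_part_file(output_dir, target_path):
    """Move the single part-* file that Spark wrote into output_dir to target_path."""
    # glob ignores hidden files such as .part-00000.crc, and _SUCCESS does not match.
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.move(parts[0], target_path)
    return target_path


# Possible use after df.coalesce(1).write.option("header", "true").csv("/tmp/out_dir"):
# promote_single_part_file("/tmp/out_dir", "/tmp/myfile.csv")
```

Raising an error when more than one part file is present avoids silently keeping only a fragment of the data.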
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that),
df.map(r=>r.mkString(",")) would do the same.
If you want CSV escaping, you can use Apache Commons Lang for that. For example, here is the code we're using:
def DfToTextFile(path: String,
df: DataFrame,
delimiter: String = ",",
csvEscape: Boolean = true,
partitions: Int = 1,
compress: Boolean = true,
header: Option[String] = None,
maxColumnLength: Option[Int] = None) = {
def trimColumnLength(c: String) = {
val col = maxColumnLength match {
case None => c
case Some(len: Int) => c.take(len)
}
if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
}
def rowToString(r: Row) = {
val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") //remove control characters
st.split("~-~").map(trimColumnLength).mkString(delimiter)
}
def addHeader(r: RDD[String]) = {
val rdd = for (h <- header;
if partitions == 1; //headers only supported for single partitions
tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
rdd.getOrElse(r)
}
val rdd = df.map(rowToString).repartition(partitions)
val headerRdd = addHeader(rdd)
if (compress)
headerRdd.saveAsTextFile(path, classOf[GzipCodec])
else
headerRdd.saveAsTextFile(path)
}
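In PySpark the escaping can be left to the standard csv module instead of Apache Commons Lang. The sketch below is a rough Python counterpart of the row-to-string step only; row_to_csv_line is a hypothetical helper, and writing None as an empty field is an assumption.

```python
import csv
import io


def row_to_csv_line(values, delimiter=","):
    """Format one row as a CSV line, quoting fields that contain the delimiter or quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="")
    # None is written as an empty field; everything else is stringified by csv.
    writer.writerow(["" if value is None else value for value in values])
    return buffer.getvalue()


print(row_to_csv_line(["a", "b,c", 'say "hi"', None, 3]))
# prints a,"b,c","say ""hi""",,3

# Possible use with a DataFrame (a Row iterates over its values):
# df.rdd.map(lambda row: row_to_csv_line(row)).saveAsTextFile("/data/home/csv")
```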
With the help of spark-csv we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header","true").save("output.csv")
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
Reading a delimited file into a DataFrame:
val p=spark.read.format("csv").options(Map("header"->"true","delimiter"->"^")).load("filename.csv")

Resources