I am using PySpark for Spark Streaming. I am able to stream from Kafka and build the DataFrame properly with no issues. I was also able to insert data into an Impala table created with only a few (5) sampled columns out of the 72 columns in the Kafka message. But when I create a new table with the proper data types for all of the columns, so that the DataFrame now contains every column from the Kafka message, I get the exception below.
java.sql.SQLFeatureNotSupportedException: [Cloudera]JDBC Driver does not support this optional feature.
at com.cloudera.impala.exceptions.ExceptionConverter.toSQLException(Unknown Source)
at com.cloudera.impala.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown Source)
at com.cloudera.impala.jdbc.common.SPreparedStatement.setNull(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:627)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:782)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:782)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2064)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have searched a lot but could not find any solution for this. I enabled debug logs as well, but they still do not say which feature the driver does not support.
Any help or proper guidance would be appreciated.
Thank you
Version details:
pyspark : 2.2.0
Kafka : 0.10.2
Cloudera : 5.15.0
Cloudera Impala : 2.12.0-cdh5.15.0
Cloudera Impala JDBC driver : 2.6.4
The code I have used:
import json
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession,Row
from pyspark.sql.functions import lit
from pyspark.sql.types import *
conf = SparkConf().setAppName("testkafkarecvstream")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)
spark = SparkSession.builder.appName("testkafkarecvstream").getOrCreate()
jdbcUrl = "jdbc:impala://hostname:21050/dbName;AuthMech=0;"
fields = [
    StructField("column_name01", StringType(), True),
    StructField("column_name02", StringType(), True),
    StructField("column_name03", DoubleType(), True),
    StructField("column_name04", StringType(), True),
    StructField("column_name05", IntegerType(), True),
    StructField("column_name06", StringType(), True),
    .....................
    StructField("column_name72", StringType(), True),
]
schema = StructType(fields)
def make_rows(parts):
    customRow = Row(column_name01=datatype(parts['column_name01']),
                    .....,
                    column_name72=datatype(parts['column_name72'])
                    )
    return customRow
def createDFToParquet(rdd):
    try:
        df = spark.createDataFrame(rdd, schema)
        df.show()
        df.write.jdbc(jdbcUrl,
                      table="table_name",
                      mode="append")
    except Exception as e:
        print(str(e))
zkNode = "zkNode_name:2181"
topic = "topic_name"
# Reciever method
kvs = KafkaUtils.createStream(ssc,
                              zkNode,
                              "consumer-group-id",
                              {topic: 5},
                              {"auto.offset.reset": "smallest"})
lines = kvs.map(lambda x: x[1])
conv = lines.map(lambda x: json.loads(x))
table = conv.map(make_rows)
table.foreachRDD(createDFToParquet)
table.pprint()
ssc.start()
ssc.awaitTermination()
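Based on the setNull call in the stack trace, my current guess is that the driver fails when Spark tries to write NULL for some of the 72 columns. A minimal sketch of how I would test that guess (not a confirmed fix; the fill values are arbitrary):
# Hypothetical check: fill nulls so the JDBC writer never has to call setNull().
# na.fill("") only affects string columns, na.fill(0) only numeric columns.
df_no_nulls = df.na.fill("").na.fill(0)
df_no_nulls.write.jdbc(jdbcUrl, table="table_name", mode="append")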
env
spark 3.1.2
hive 3.1.2
hadoop 3.2.1
problem
spark sql with hive metastore connection
For example, I made table A over data that is held in memory via HDFS caching, and table B over the same data on HDFS without caching (the data content is identical).
I executed the same query against tables A and B.
I expected the query on table A to run faster than the query on table B, but it did not; the execution times were almost the same.
Is there anything I need to do to enable HDFS caching with Spark SQL through the Hive metastore?
example code
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as Fd
from pyspark.sql.window import Window as W
spark_executor_cores = 8
spark_executor_memory = '3g'
spark_instances = 50
spark_shuffle_partitions = 1000
spark_default_parallelism = 1000
conf = SparkConf()
conf.setAppName("test application")
conf.set('spark.yarn.queue', 'default')
conf.set('spark.executor.memory', str(spark_executor_memory))
conf.set('spark.executor.instances', str(spark_instances))
conf.set('spark.shuffle.sort.bypassMergeThreshold', spark_instances * int(spark_executor_cores))
conf.set("spark.dynamicAllocation.enabled", "false")
conf.set("spark.sql.shuffle.partitions", str(spark_shuffle_partitions))
conf.set("spark.default.parallelism", str(spark_default_parallelism))
conf.set('spark.sql.adaptive.enabled', 'true')
conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
conf.set('spark.sql.adaptive.localShuffleReader.enabled', 'true')
conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')
conf.set('spark.sql.warehouse.dir', metastore_dir)
conf.set('spark.hadoop.javax.jdo.option.ConnectionURL', metastore_url)
ss = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
sc = SparkContext.getOrCreate()
spark = ss
sql = spark.sql
# example query
# partition = hour, logtype
sql("""
SELECT <columns>,
...
...
FROM <table name>
WHERE hour = <hour>
AND logtype = <logtype>
group by logtype
""").show()
I'm trying to read tables from a MariaDB database using PySpark, and an error occurs while running the code below.
jdbcHostname = "localhost"
jdbcDatabase = "pucsl"
jdbcPort = 3307
jdbcUrl = "jdbc:mariadb://{0}:{1}/{2}?user={3}&password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, "root", "ravi")
df = spark.read.jdbc(url=jdbcUrl, table="m00_02_lic_lic_reln",properties={"driver": 'com.mariadb.jdbc.Driver'})
Currently Spark does not correctly recognize MariaDB-specific JDBC connection strings, so the jdbc:mysql syntax must be used. The following shows a simple PySpark script that queries the results from the ColumnStore UM server columnstore_1 into a Spark DataFrame:
from pyspark import SparkContext
from pyspark.sql import DataFrameReader, SQLContext
url = 'jdbc:mysql://columnstore_1:3306/test'
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
sc = SparkContext("local", "ColumnStore Simple Query Demo")
sqlContext = SQLContext(sc)
df = DataFrameReader(sqlContext).jdbc(url='%s' % url, table='results', properties=properties)
df.show()
P.S. I assume you have already added the MariaDB JAR in place (something like /spark3.1.2/lib/maridabjar...).
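Applied to the connection details from your question, that would look roughly like this (untested sketch; only the URL prefix changes, the MariaDB driver still does the work):
# Same host/port/database/credentials as in the question, but with the
# jdbc:mysql prefix that Spark recognizes.
jdbcUrl = "jdbc:mysql://localhost:3307/pucsl"
df = spark.read.jdbc(
    url=jdbcUrl,
    table="m00_02_lic_lic_reln",
    properties={"user": "root", "password": "ravi", "driver": "org.mariadb.jdbc.Driver"})
df.show()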
I have tried debugging all the possible solutions but am unable to run and scale this on a cluster, and I need to process 100 million records. The script runs as expected on a local node but fails to run on the Cloudera Amazon cluster. Here is the sample data that works on the local node. As far as I can tell, the problem is that the two files I use in the UDF are not getting distributed to the executors/containers or nodes, so the job just keeps running and processing is very slow. I am unable to fix this code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import os
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
# StanfordNERTagger comes from NLTK
from nltk.tag import StanfordNERTagger
def stanford(str):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(str.split())
    organizations = []
    organization = ""
    for t in output:
        # The word
        word = t[0]
        # What is the current tag
        tag = t[1]
        # print(word, tag)
        # If the current tag is the same as the previous tag, append the current word to the previous word
        if (tag == "ORGANIZATION"):
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final
stanford_lassification = udf(stanford, StringType())
################### Pyspark Section ###############
# Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
# Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
# Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_lassification(df['affiliation_string']))
# Save result
df.select('pmid', 'affiliation_string', 'organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
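One thing that might be worth trying (just a sketch, I have not verified that it fixes the cluster run): ship the two files from the driver with sc.addFile, so SparkFiles.get() resolves them on every executor even if the --files option is not being picked up. The paths are the ones from the spark-submit command above, and these calls would go right after the SparkContext is created, before the DataFrame is written.
# Explicitly distribute the NER jar and the classifier model to all executors.
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar")
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz")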
I am trying to transform the Kafka Stream into RDDs and insert these RDDs into an Elasticsearch database. This is my code:
conf = SparkConf().setAppName("ola")
sc = SparkContext(conf=conf)
es_write_conf = {
"es.nodes": "localhost",
"es.port": "9200",
"es.resource": "pipe/word"
}
ssc = StreamingContext(sc, 2)
brokers, topic = sys.argv[1:]
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
lines = kvs.map(lambda x: x[1])
value_counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
value_counts.transform(lambda rdd: rdd.map(f))
value_counts.saveAsNewAPIHadoopFile(
path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf=es_write_conf)
ssc.start()
ssc.awaitTermination()
The saveAsNewAPIHadoopFile function should write those RDDs to ES. However, I get this error:
value_counts.saveAsNewAPIHadoopFile(
AttributeError: 'TransformedDStream' object has no attribute 'saveAsNewAPIHadoopFile'
The transform function should be able to convert the stream to Spark dataframes. How can I write these RDDs into Elasticsearch? Thanks!
You can use foreachRDD:
value_counts.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(...))
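For example, something along these lines (a sketch: the mapping to (key, document-dict) pairs is my guess at what the undefined f in your code was meant to do, and skipping empty RDDs simply avoids failed batches while the stream is idle):
def write_to_es(rdd):
    # Reuse the same output format and es_write_conf as in the question.
    # The (key, dict) shape is what EsOutputFormat + LinkedMapWritable expect.
    if not rdd.isEmpty():
        rdd.map(lambda kv: ('key', {'word': kv[0], 'count': kv[1]})).saveAsNewAPIHadoopFile(
            path='-',
            outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
            keyClass="org.apache.hadoop.io.NullWritable",
            valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
            conf=es_write_conf)
value_counts.foreachRDD(write_to_es)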
new = rawUser.rdd.map(lambda item: ('key', {'id': item['entityId'],'targetEntityId': item['targetEntityId']}))
rawUser is a DataFrame and new is a PipelinedRDD.
new.saveAsNewAPIHadoopFile(
    path='/home/aakash/test111/',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.resource": "index/test", "es.mapping.id": "id", "es.nodes": "localhost", "es.port": "9200", "es.nodes.wan.only": "false"})
The most important thing here is to download the proper, compatible JAR:
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-hadoop
Check your Elasticsearch version and download the matching JAR.
Command to make PySpark use the JAR:
pyspark --jars elasticsearch-hadoop-6.2.4.jar
I want to write code that reads a record from Hadoop HBase and stores it into a Spark RDD (Resilient Distributed Dataset), and that reads one RDD record and writes it into HBase. I have zero knowledge about either of the two, and I need to use the AWS cloud or a Hadoop virtual machine. Could someone please guide me on how to start from scratch?
Please make use of the basic Scala code below, where we read data from HBase using Spark. Similarly, you can create a table and write data into HBase.
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
object HBaseApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "table1"
    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 100000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)
    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println("Number of Records found : " + hBaseRDD.count())
    sc.stop()
  }
}
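The Scala snippet above only covers reading. For the write direction the question also asks about, here is a rough PySpark sketch modeled on the hbase_outputformat example that ships with Spark; it assumes the spark-examples JAR is on the classpath (the two Python converters live there), and the table, column family, and values are placeholders.
from pyspark import SparkContext
sc = SparkContext(appName="HBaseWriteSketch")
write_conf = {
    "hbase.zookeeper.quorum": "localhost",
    "hbase.mapred.outputtable": "table1",
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}
# Each record is (rowkey, [rowkey, column family, qualifier, value]).
rows = sc.parallelize([("row1", ["row1", "cf", "col1", "value1"])])
rows.saveAsNewAPIHadoopDataset(
    conf=write_conf,
    keyConverter="org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.StringListToPutConverter")
sc.stop()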