HDFS caching with SparkSQL

Environment
Spark 3.1.2
Hive 3.1.2
Hadoop 3.2.1
Problem
I am running Spark SQL with a Hive metastore connection.
For example, I created table A whose data is cached in memory with HDFS caching, and table B whose data sits on HDFS without caching (the data content is identical).
I executed the same query against tables A and B.
I expected the query on table A to run faster than the query on table B, but it did not.
In fact, the query execution times for tables A and B were almost the same.
Is there anything I need to do to enable HDFS caching with Spark SQL when using the Hive metastore?
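Note that HDFS centralized caching is transparent to the reading client, whereas Spark also has its own table cache that explicitly changes how scans are served. Purely as a point of comparison (not the HDFS mechanism itself), here is a minimal sketch of Spark-side caching, assuming a Hive-enabled SparkSession named spark and a placeholder table name table_a:
# Spark-side caching of a Hive table, for comparison with HDFS caching (table_a is a placeholder)
spark.sql("CACHE TABLE table_a")                  # eagerly materialize the table in Spark's storage
print(spark.catalog.isCached("table_a"))          # True once the table is cached
spark.sql("SELECT count(*) FROM table_a").show()  # subsequent scans hit Spark's cache
spark.sql("UNCACHE TABLE table_a")                # release the cached data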
Example code
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as Fd
from pyspark.sql.window import Window as W
spark_executor_cores = 8
spark_executor_memory = '3g'
spark_instances = 50
spark_shuffle_partitions = 1000
spark_default_parallelism = 1000
conf = SparkConf()
conf.setAppName("test application")
conf.set('spark.yarn.queue', 'default')

# Executor resources (dynamic allocation disabled so the instance count is fixed)
conf.set('spark.executor.memory', str(spark_executor_memory))
conf.set('spark.executor.instances', str(spark_instances))
conf.set('spark.shuffle.sort.bypassMergeThreshold', str(spark_instances * spark_executor_cores))
conf.set("spark.dynamicAllocation.enabled", "false")

# Parallelism
conf.set("spark.sql.shuffle.partitions", str(spark_shuffle_partitions))
conf.set("spark.default.parallelism", str(spark_default_parallelism))

# Adaptive query execution and Arrow
conf.set('spark.sql.adaptive.enabled', 'true')
conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
conf.set('spark.sql.adaptive.localShuffleReader.enabled', 'true')
conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')

# Hive metastore connection (metastore_dir and metastore_url are defined elsewhere)
conf.set('spark.sql.warehouse.dir', metastore_dir)
conf.set('spark.hadoop.javax.jdo.option.ConnectionURL', metastore_url)
ss = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
sc = SparkContext.getOrCreate()
spark = ss
sql = spark.sql
# Example query (table partitioned by hour and logtype)
sql("""
SELECT <columns>,
       ...
       ...
FROM <table name>
WHERE hour = <hour>
  AND logtype = <logtype>
GROUP BY logtype
""").show()

Related

pyspark An error occurred while calling o23.jdbc. : java.lang.ClassNotFoundException: com.mariadb.jdbc.Driver

I'm trying to read tables from a MariaDB database using PySpark, and an error occurs while running the code below:
jdbcHostname = "localhost"
jdbcDatabase = "pucsl"
jdbcPort = 3307
jdbcUrl = "jdbc:mariadb://{0}:{1}/{2}?user={3}&password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, "root", "ravi")
df = spark.read.jdbc(url=jdbcUrl, table="m00_02_lic_lic_reln",properties={"driver": 'com.mariadb.jdbc.Driver'})

Currently Spark does not correctly recognize MariaDB-specific JDBC connection strings, so the jdbc:mysql syntax must be used. The following shows a simple PySpark script to query the results from the ColumnStore UM server columnstore_1 into a Spark DataFrame:
from pyspark import SparkContext
from pyspark.sql import DataFrameReader, SQLContext
url = 'jdbc:mysql://columnstore_1:3306/test'
properties = {'user': 'root', 'driver': 'org.mariadb.jdbc.Driver'}
sc = SparkContext("local", "ColumnStore Simple Query Demo")
sqlContext = SQLContext(sc)
df = DataFrameReader(sqlContext).jdbc(url='%s' % url, table='results', properties=properties)
df.show()
P.S. I believe you have already added the MariaDB JAR in place (something like /spark3.1.2/lib/maridabjar...).
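Applying the suggestion above to the code in the question, here is a sketch of the read using the mysql-style URL together with the MariaDB driver class (host, port, database, credentials, and table name are the ones from the question; the MariaDB connector JAR is assumed to be on the driver and executor classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mariadb read").getOrCreate()

# mysql-style JDBC URL, but with the MariaDB driver class
jdbcUrl = "jdbc:mysql://localhost:3307/pucsl"
properties = {"user": "root", "password": "ravi", "driver": "org.mariadb.jdbc.Driver"}

df = spark.read.jdbc(url=jdbcUrl, table="m00_02_lic_lic_reln", properties=properties)
df.show()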

Hive internal table takes block size larger than configured

My dfs.blocksize configuration is 128M, and if I upload or create any file, it takes a block size of 128M, which is fine. But when I create a Hive table, however small it may be, it takes a block size of 256M.
Can we set the block size of a table when it is created? I don't know how it's done.
UPDATE
I am using Spark SQL.
SparkSession spark = SparkSession.builder()
    .appName("Java Spark SQL basic example")
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
    .config("mapred.input.dir.recursive", true)
    .config("hive.mapred.supports.subdirectories", true)
    .config("spark.sql.hive.thriftServer.singleSession", true)
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    //.master("local")
    .getOrCreate();

String query1 = String.format("INSERT INTO TABLE bm_top." + orc_table + " SELECT icode, store_code, division, from_unixtime(unix_timestamp(bill_date,'dd-MMM-yy'),'yyyy-MM-dd'), qty, bill_no, mrp FROM bm_top.temp_ext_table");
spark.sql(query1);
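If the goal is simply to control the HDFS block size of the files Spark writes, one knob to try is forwarding dfs.blocksize to the Hadoop client through Spark's spark.hadoop.* prefix. A minimal PySpark sketch (the value is in bytes; the target table name is a placeholder modeled on the question, and whether the file-format writer honors the setting should be verified):
from pyspark.sql import SparkSession

# spark.hadoop.* properties are copied into the Hadoop Configuration used by Spark's writers.
# 134217728 bytes = 128 MB requested HDFS block size.
spark = (SparkSession.builder
         .appName("block size example")
         .enableHiveSupport()
         .config("spark.hadoop.dfs.blocksize", "134217728")
         .getOrCreate())

# Placeholder table name; subsequent inserts should request 128 MB blocks from HDFS
spark.sql("INSERT INTO TABLE bm_top.orc_table_placeholder SELECT * FROM bm_top.temp_ext_table")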

Can we access tables from two different HiveServer2 servers using multiple SparkSessions

Can we access tables from two different HiveServer2 servers using two SparkSessions, like below:
val spark = SparkSession.builder().master("local")
  .appName("spark remote")
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.160:9083")
  .enableHiveSupport()
  .getOrCreate()

val sparkdestination = SparkSession.builder().master("local")
  .appName("spark remote2")
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
  .config("javax.jdo.option.ConnectionUserName", "hiveroot")
  .config("javax.jdo.option.ConnectionPassword", "hivepassword")
  .config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
  .config("hive.metastore.uris", "thrift://192.168.175.42:9083")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

val source = spark.sql("SELECT * from sample.source limit 20")

import sparkdestination.implicits._
import sparkdestination.sql

val destination = sparkdestination.sql("select * from sample.destination limit 20")
Or is there any possible way to access tables from two different databases, such as MySQL and Hive, for comparison using multiple SparkSessions?
Thank you
Finally I found the solution to my question of how to use multiple SparkSessions; it is achieved by:
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
After clearing the sessions, we can reassign them with different configuration values.
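For the MySQL-versus-Hive comparison asked about above, an alternative that avoids switching sessions is to keep a single Hive-enabled session and pull the second source in over JDBC. A minimal PySpark sketch (URL, credentials, driver class, and table names are placeholders; the JDBC driver JAR must be on the classpath):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compare sources").enableHiveSupport().getOrCreate()

# Hive side: read through the metastore this session is configured for
hive_df = spark.sql("SELECT * FROM sample.source LIMIT 20")

# MySQL side: read over JDBC (placeholder URL, credentials, and driver class)
mysql_df = spark.read.jdbc(
    url="jdbc:mysql://192.168.175.42:3306/sample",
    table="destination",
    properties={"user": "hiveroot", "password": "hivepassword", "driver": "com.mysql.cj.jdbc.Driver"},
)

# Rows present in one source but not the other (schemas must match)
only_in_hive = hive_df.exceptAll(mysql_df)
only_in_mysql = mysql_df.exceptAll(hive_df)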

Stanford CoreNLP use case with a PySpark script runs fine on a local node but runs very slowly in YARN cluster mode

I have tried all the solutions I could find, but I am unable to run and scale this on the cluster, and I need to process 100 million records. The script runs very well on a local node as expected, but fails to run on the Cloudera/Amazon cluster. Here is the sample data that works on the local node. As far as I can tell, the problem is that the two files used in the UDF are not being distributed to the executors/containers/nodes, so the job just keeps running and processing is very slow. I have been unable to fix this code so that it executes on the cluster.
## Links to the 2 files used in the script
## https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
## Link to the data set
## https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
## spark-submit command used:
# spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
import os
from pyspark import SparkConf, SparkContext, SparkFiles
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# StanfordNERTagger comes from NLTK; the import was missing from the original snippet
from nltk.tag import StanfordNERTagger
def stanford(text):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    # Resolve the files shipped with --files on this executor
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(text.split())
    organizations = []
    organization = ""
    for t in output:
        word = t[0]  # the word
        tag = t[1]   # its NER tag
        # If the current tag is ORGANIZATION, append the word to the running name
        if tag == "ORGANIZATION":
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final

stanford_classification = udf(stanford, StringType())

################### PySpark section ###############
# Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)

# Read the input data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")

# Create a new dataframe with the new column "organizations"
df = df.withColumn("organizations", stanford_classification(df['affiliation_string']))

# Save the result
df.select('pmid', 'affiliation_string', 'organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
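Since the question centers on whether the two Stanford NER files actually reach the executors, here is a minimal sketch of the distribution mechanism itself: shipping the files programmatically with SparkContext.addFile and resolving them on each executor with SparkFiles.get (the paths are the ones from the spark-submit line above):
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()

# Ship the model and jar to every executor (same files as in the spark-submit command above)
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar")
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz")

def resolve_paths():
    # On each executor, SparkFiles.get returns the local path of the shipped file
    classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    ner_jar = SparkFiles.get("stanford-ner.jar")
    return classifier, ner_jar

# Quick sanity check on the driver
print(resolve_paths())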

How to read a record from HBase and store it in a Spark RDD (Resilient Distributed Dataset), and then read an RDD record and write it into HBase?

I want to write code to read a record from Hadoop HBase and store it in a Spark RDD (Resilient Distributed Dataset), and then read one RDD record and write it into HBase. I have zero knowledge about either of the two, and I need to use the AWS cloud or a Hadoop virtual machine. Please guide me on how to start from scratch.
Please make use of the basic Scala code below, which reads data from HBase. Similarly, you can write table-creation code to write data into HBase.
import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
object HBaseApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "table1"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 100000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println("Number of Records found : " + hBaseRDD.count())
    sc.stop()
  }
}
