How to change the HDFS block size of a DataFrame in pyspark - hadoop

This seems related to
How to change hdfs block size in pyspark?
I can successfully change the HDFS block size with rdd.saveAsTextFile,
but not with the corresponding DataFrame.write.parquet, so I am unable to save in Parquet format with the block size I want.
I am unsure whether this is a bug in the pyspark DataFrame API or whether I did not set the configuration correctly.
The following is my test code:
##########
# init
##########
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import hdfs
from hdfs import InsecureClient
import os
import numpy as np
import pandas as pd
import logging
os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
block_size = 512 * 1024
conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size))
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
##########
# main
##########
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 128MB-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')

I found the answer at the following link:
https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html
I can successfully set the Parquet block size with spark.hadoop.parquet.block.size.
The following is the sample code:
# init
block_size = 512 * 1024
conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077").set('spark.cores.max', 20).set("spark.executor.cores", 10).set("spark.executor.memory", "10g").set('spark.hadoop.parquet.block.size', str(block_size)).set("spark.hadoop.dfs.blocksize", str(block_size)).set("spark.hadoop.dfs.block.size", str(block_size)).set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072))
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 512k-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using DataFrameWriter.csv, resulting 512k-block-size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv')
# save using DataFrameWriter.text, resulting 512k-block-size
df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
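The block-size keys used in the sample above can be gathered into one small helper so that none of them is forgotten. This is only a sketch: the names block_size_settings and apply_settings are made up for illustration, and the multiple-of-512 check assumes the default HDFS checksum chunk size (dfs.bytes-per-checksum = 512), which a cluster may override.

```python
def block_size_settings(block_size, bytes_per_checksum=512):
    """Return the Spark conf entries used above for a block size in bytes.

    bytes_per_checksum=512 is an assumed default; HDFS expects the block
    size to be a multiple of it.
    """
    if block_size <= 0 or block_size % bytes_per_checksum != 0:
        raise ValueError(
            "block_size must be a positive multiple of %d" % bytes_per_checksum
        )
    value = str(block_size)
    return {
        "spark.hadoop.parquet.block.size": value,  # Parquet row-group size
        "spark.hadoop.dfs.blocksize": value,       # current HDFS key
        "spark.hadoop.dfs.block.size": value,      # older, deprecated HDFS key
    }


def apply_settings(conf, settings):
    """Copy the entries onto any object with a SparkConf-style set(key, value)."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf


settings = block_size_settings(512 * 1024)
```

With pyspark installed, apply_settings(SparkConf(), settings) would be called before the SparkContext or SparkSession is created, since Hadoop settings passed through spark.hadoop.* are read at startup.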

Hadoop and Spark are two independent tools, each with its own strategy. Spark and Parquet work in terms of data partitions, so the HDFS block size is not meaningful to them. Let Spark write the data its own way, then do what you want with the files inside HDFS.
You can change the number of Parquet partitions (and therefore output files) with
df_txt.repartition(6).write.format("parquet").save("hdfs://...")
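One way to pick the argument to repartition is to work backwards from a target file size. The sketch below assumes you already know, or can estimate, the total output size in bytes. The function name and the example numbers are illustrative, and the on-disk size of Parquet files also depends on compression, so treat the result as a rough guide.

```python
import math


def partitions_for_target_size(total_bytes, target_file_bytes):
    """Number of partitions so each output file is about target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always write at least one partition, even for empty input.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example: roughly 3 GiB of data, aiming for files of about 512 MiB each.
n = partitions_for_target_size(3 * 1024**3, 512 * 1024**2)
print(n)  # 6
```

The result would then be passed to the writer, for example df_txt.repartition(n).write.format("parquet").save("hdfs://...").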

Related

HDFS caching with SparkSQL

env
spark 3.1.2
hive 3.1.2
hadoop 3.2.1
problem
I run Spark SQL with a Hive metastore connection.
For example, I made table A over data that is cached in memory with HDFS caching,
and table B over data that is stored in HDFS without caching (the data content is the same).
I executed the same query against tables A and B.
I expected the query on table A to be faster than the query on table B, but it was not.
The query execution times for tables A and B were almost the same.
Is there anything I need to do to enable HDFS caching with Spark SQL using the Hive metastore?
example code
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as Fd
from pyspark.sql.window import Window as W
spark_executor_cores = 8
spark_executor_memory = '3g'
spark_instances = 50
spark_shuffle_partitions = 1000
spark_default_parallelism = 1000
conf = SparkConf()
conf.setAppName("test application")
conf.set('spark.yarn.queue', 'default')
conf.set('spark.executor.memory', str(spark_executor_memory))
conf.set('spark.executor.instances', str(spark_instances))
conf.set('spark.shuffle.sort.bypassMergeThreshold', spark_instances * int(spark_executor_cores))
conf.set("spark.dynamicAllocation.enabled", "false")
conf.set("spark.sql.shuffle.partitions", str(spark_shuffle_partitions))
conf.set("spark.default.parallelism", str(spark_default_parallelism))
conf.set('spark.sql.adaptive.enabled', 'true')
conf.set('spark.sql.adaptive.coalescePartitions.enabled', 'true')
conf.set('spark.sql.adaptive.localShuffleReader.enabled', 'true')
conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
conf.set('spark.sql.execution.arrow.pyspark.fallback.enabled', 'true')
conf.set('spark.sql.warehouse.dir', metastore_dir)
conf.set('spark.hadoop.javax.jdo.option.ConnectionURL', metastore_url)
ss = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()
sc = SparkContext.getOrCreate()
spark = ss
sql = spark.sql
# example query
# partition = hour, logtype
sql("""
SELECT <columns>,
...
...
FROM <table name>
WHERE hour = <hour>
AND logtype = <logtype>
group by logtype
""").show()

How to get the leader feature importances in h2o automl pysparkling water

I am using a Spark standalone cluster and running H2O PySparkling on it.
I am unable to find the function for getting the leader model's feature importances. Please help.
Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pyspark import SparkFiles
from pysparkling.ml import H2OAutoML
spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
conf = H2OConf()
hc = H2OContext.getOrCreate(conf)
def xgb_automl_features_importance(data, target_metric):
    # Converting DataFrame in H2OFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    # Identify predictors and response
    y = target_metric
    aml = H2OAutoML(labelCol=y)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    aml.fit(sparkDF)
    print('-----------****************')
    print(aml.getLeaderboard().show(truncate=False))
The fit method on H2OAutoML returns the leader model. Each model in Sparkling Water has the method getFeatureImportances(), which returns a Spark DataFrame with the feature importances.
model=aml.fit(sparkDF)
model.getFeatureImportances().show()

How to check if a file exists in Google Storage from Spark Dataproc?

I was assuming that the Google Storage connector would allow querying GS directly, as if it were HDFS, from Spark in Dataproc, but it looks like the following does not work (from the Spark shell):
scala> import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.FileSystem
scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path
scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("gs://samplebucket/file"))
java.lang.IllegalArgumentException: Wrong FS: gs://samplebucket/file, expected: hdfs://dataprocmaster-m
Is there a way to access Google Storage files using just the Hadoop API?
That's because FileSystem.get(...) returns the default FileSystem, which according to your configuration is HDFS and only works with paths starting with hdfs://. Use the following to get the correct FS:
Path p = new Path("gs://...");
FileSystem fs = p.getFileSystem(...);
fs.exists(p);
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
val p = "gs://<your dir>"
val path = new Path(p)
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.exists(path)
fs.isDirectory(path)
I translated @Pradeep Gollakota's answer to PySpark, thanks!
def path_exists(spark, path):
    # path = gs://...; returns True if the path exists
    p = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = p.getFileSystem(spark._jsc.hadoopConfiguration())
    return fs.exists(p)

Stanford CoreNLP use case using Pyspark script runs fine on local node but on yarn cluster mode it runs very slow

I have tried every fix I could think of, but I cannot get this script to run at scale on the cluster, where I need to process 100 million records. The script runs as expected on a local node but fails to run on a Cloudera cluster on Amazon. Below is the sample data that works on the local node. I suspect the problem is that the 2 files used in the UDF are not being distributed to the executors/containers or nodes, so the job just keeps running and processing is very slow. I have not been able to fix the code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import os
from pyspark import SparkFiles
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql import SQLContext
from nltk.tag import StanfordNERTagger  # missing from the imports above
def stanford(str):
    os.environ['JAVA_HOME']='/usr/java/jdk1.8.0_131/'
    # Resolve the local copies of the files shipped with --files
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(str.split())
    organizations = []
    organization = ""
    for t in output:
        #The word
        word = t[0]
        #What is the current tag
        tag = t[1]
        #print(word, tag)
        #If the current tag is the same as the previous tag Append the current word to the previous word
        if (tag == "ORGANIZATION"):
            organization += " " + word
    organizations.append(organization)
    final = "-".join(organizations)
    return final
stanford_lassification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_lassification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
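One thing worth checking, independent of how the files are distributed: the UDF above builds a new StanfordNERTagger for every row. A common alternative is to build the tagger once per partition and reuse it. The sketch below shows that shape in plain Python so it can be tried without a cluster. Here make_tagger is a hypothetical factory standing in for the StanfordNERTagger construction, and extract_organizations is a variant of the loop above that groups consecutive ORGANIZATION tokens into separate names instead of concatenating them all, so its output differs from the original.

```python
def extract_organizations(tagged):
    """Join runs of consecutive ORGANIZATION tokens into names, '-'-separated."""
    names, current = [], []
    for word, tag in tagged:
        if tag == "ORGANIZATION":
            current.append(word)
        elif current:
            # A non-organization token closes the name collected so far.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return "-".join(names)


def tag_partition(rows, make_tagger):
    """Build the tagger once, then tag every text in the partition.

    rows is an iterable of strings; make_tagger is a zero-argument callable
    returning an object with a tag(list_of_tokens) method.
    """
    tagger = make_tagger()
    for text in rows:
        yield text, extract_organizations(tagger.tag(text.split()))
```

With Spark, this could be applied to an RDD of affiliation strings as rdd.mapPartitions(lambda rows: tag_partition(rows, make_tagger)), so the tagger setup cost is paid once per partition rather than once per record.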

Read Lzo file in PySpark

I am new to Spark. I have a bunch of LZO indexed files in a folder. The indexing was done as indicated on https://github.com/twitter/hadoop-lzo.
The files are as follows:
1.lzo
1.lzo.index
2.lzo
2.lzo.index
and so on
I want to read these files. I am using newAPIHadoopFile().
Following the instructions at https://github.com/twitter/hadoop-lzo,
I did the following:
val files = sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val lzoRDD = files.map(_._2.toString)
It worked fine in Scala (spark-shell).
But, I want to use pyspark (python-spark application). I am doing the following:
files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
lzoRDD = files.map(_._2.toString)
I get the following error: AttributeError: 'RDD' object has no attribute '_2'
The whole code is as follows:
import sys
from pyspark import SparkContext,SparkConf
if __name__ == "__main__":
    #Create the SparkContext
    conf = (SparkConf().setMaster("local[2]").setAppName("abc").set("spark.executor.memory", "10g").set("spark.cores.max",10))
    sc = SparkContext(conf=conf)
    path='/x/y/z/*.lzo'
    files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
    lzoRDD = files.map(_._2.toString)
    #stop the SparkContext
    sc.stop()
And I am submitting using spark-submit.
Any help would be appreciated.
Thank You
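The expression _._2.toString is Scala syntax. In PySpark the records returned by newAPIHadoopFile are (key, value) tuples, so the value is element 1 of each pair. A minimal sketch, assuming the hadoop-lzo jar is on the classpath as it was for the Scala run; read_lzo_lines and value_of are illustrative names:

```python
def value_of(pair):
    """Return the value of a (key, value) record; the key is the byte offset."""
    return pair[1]


def read_lzo_lines(sc, path):
    """Read LZO text files and return an RDD of lines (sc is a SparkContext)."""
    files = sc.newAPIHadoopFile(
        path,
        "com.hadoop.mapreduce.LzoTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
    )
    # Python equivalent of the Scala files.map(_._2.toString)
    return files.map(value_of)
```

In the script above this would be called as lzoRDD = read_lzo_lines(sc, path). Calling files.values() on the pair RDD should give the same result.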

Resources