Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark? [duplicate]

This question already has answers here:
How to set hadoop configuration values from pyspark
(3 answers)
Closed 5 years ago.
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>'))
spark = (SparkSession.builder
         .config(conf=conf)
         .getOrCreate())
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking that you can set these properties dynamically via spark.hadoop.*, since that is the name you use to set them in spark-defaults.conf, and since you get no error when you set them this way at runtime.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf, not dynamically in pyspark.

It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
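To make the working pattern concrete, here is a minimal sketch that reuses the placeholder credentials from the question; the bucket and object path are likewise placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the properties on the Hadoop configuration directly;
# note the 'spark.hadoop.' prefix is dropped at this level
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

# Reading now uses the credentials set above
df = spark.read.csv('s3a://<bucket>/<path>.csv')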

Related

Accessing streaming data from ADLS in PyFlink

Attempting a PyFlink stream setup from ADLS, currently trying to read a JSON file using the StreamExecutionEnvironment.from_source() method.
Here's what the code looks like:
from pyflink.datastream.functions import SourceFunction
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import (FileSource, StreamFormat, FileSink,
                                                       OutputFileConfig, RollingPolicy, BucketAssigner)
from pyflink.common import WatermarkStrategy, Encoder, Types
from azure.storage.filedatalake import FileSystemClient
file_system = FileSystemClient.from_connection_string(connection_str, file_system_name="my_fs")
# setting the stream environment object
env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
env.add_jars("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")
env.add_classpaths("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")
file_client = file_system.get_file_client(my_file)
input_path = 'abfss://' + file_client.url[8:]
print('URL is ===== >>>>',file_client.url)
# Source
ds = env.from_source(
    source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                               input_path)
                     .process_static_file_set()
                     .build(),
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="file_source"
)
ds.sink_to(
    sink=FileSink.for_row_format(base_path=output_path,
                                 encoder=Encoder.simple_string_encoder())
                 .with_bucket_assigner(BucketAssigner.base_path_bucket_assigner())
                 .build())
ds.print()
env.execute()
I am getting the error below:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'abfs'. The scheme is directly supported by Flink through the following plugin(s): flink-fs-azure-hadoop. Please ensure that each plugin resides within its own subfolder within the plugins directory. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/ for more information. If you want to use a Hadoop file system for that scheme, please add the scheme to the configuration fs.allowed-fallback-filesystems. For a full list of supported file systems, please see https://nightlies.apache.org/flink/flink-docs-stable/ops/filesystems/.
The jar file has already been added into the plugins folder as described in the documentation:
https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/
The storage account key has also been added to the config.yaml file.
Alternatively, I tried adding the source as a DataStream using:
ds = env.read_text_file(input_path)
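Based on the error message's own hints, two things are worth verifying. First, each filesystem plugin must sit in its own subfolder directly under the plugins directory and should not also be loaded via add_jars/add_classpaths; assuming a standard Flink distribution, with azure-fs-hadoop as an illustrative (not mandated) subfolder name, the layout would look like:
/opt/flink/plugins/
    azure-fs-hadoop/
        flink-azure-fs-hadoop-1.16.0.jar
Second, per the same message, the abfs/abfss schemes can alternatively be routed to a plain Hadoop filesystem by adding them to fs.allowed-fallback-filesystems in the Flink configuration.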

Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext, this error goes away, but I receive another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
I think a combination of the following might help you:
Generally, let's avoid manually interfering with the current working directory, so remove the os.chdir call in your notebook and construct absolute paths where possible.
In your init_spark_session, pass an absolute path to addPyFile. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg') (see the sketch below).
Not sure why you would need to add the PyFile though, but maybe you have a specific reason.
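For illustration, a minimal sketch of the adjusted method, assuming self.project_path is the project-root path that Kedro's context provides; only the addPyFile line changes from the original:
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in the project's conf folder."""
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    _spark_session = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
        .getOrCreate()
    )
    _spark_session.sparkContext.setLogLevel("WARN")
    # Absolute path, so it resolves correctly regardless of the notebook's working directory
    _spark_session.sparkContext.addPyFile(
        f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg')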

Is there any way in Spark to keep each stage's run time?

I am measuring the run times of a Spark job under different resource configurations and need to compare the run time of each stage. I can see them in the UI only while the job is running.
I run my job on a Hadoop cluster and use YARN as the resource manager.
Is there any way to keep each stage's run time? Is there any log for them?
UPDATE:
I read the monitoring documentation mentioned in the comment and added the following lines, but it doesn't work:
in spark-defaults.conf :
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///[nameNode]:8020/[PathToSparkEventLogDir]
spark.history.fs.logDirectory hdfs:///[nameNode]:8020/[PathTosparkLogDirectory]
in spark-env.sh:
export SPARK_PUBLIC_DNS=[nameNode]
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$sparkHistoryDir"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.interval=7d"
It looks for the /tmp/spark-events/ folder, and when I create it and start the history server, it doesn't show any complete or incomplete applications.
Note that I tried the logDirectory value without the port number too, but it didn't work.
I could run the Spark History Server and see the history of completed and incomplete applications by applying the following steps:
Set the public DNS value in conf/spark-env.sh
export SPARK_PUBLIC_DNS=NameNode-IP
Add these properties to SparkConf in my Java code:
SparkConf conf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///user/[user-path]/sparkEventLog")
    .set("spark.history.fs.logDirectory", "hdfs:///user/[user-path]/sparkEventLog");
Create the property file (spark/conf/history.properties) containing the following lines:
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/[user-path]/sparkEventLog
spark.history.fs.logDirectory hdfs:///user/[user-path]/sparkEventLog
Start the history server:
./sbin/start-history-server.sh --properties-file ./conf/history.properties
Note: The properties spark.eventLog.dir and spark.history.fs.logDirectory should have the same value.
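Beyond the UI, the history server also exposes stage timings through its REST API (see the Spark monitoring page). A minimal sketch in Python, assuming the server runs on its default port 18080 and using namenode as a hypothetical host name:
import requests

base = 'http://namenode:18080/api/v1'  # hypothetical host; 18080 is the default history-server port
app_id = requests.get(f'{base}/applications').json()[0]['id']  # most recent application
for stage in requests.get(f'{base}/applications/{app_id}/stages').json():
    # submissionTime and completionTime bracket each stage's wall-clock run time
    print(stage['stageId'], stage['name'], stage['status'],
          stage.get('submissionTime'), stage.get('completionTime'))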

Adding some hadoop configuration to a spark application at runtime (through spark-submit)?

I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which allow you to specify such arbitrary configuration options. You can then retrieve the values you pass using SparkConf's get function:
val conf = new SparkConf()
val mrRecursive = conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
// Strip the "spark." prefix when handing the value to Hadoop
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
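For example, a submission might look like the following, where the application class and jar are placeholders:
spark-submit --class com.example.MyApp --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true my-app.jar
Note that configuration keys prefixed with spark.hadoop. are copied into the Hadoop configuration automatically, so passing --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true should make the manual hadoopConfiguration.set call unnecessary.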
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
This approach can also be used without code modification.
A Hadoop Configuration reads the file "core-default.xml" from the classpath during creation, as described here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file on the classpath via spark-submit's "driver-class-path" parameter, it can work.
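As a sketch, such a file would use the standard Hadoop configuration XML format, here carrying the property from the question:
<configuration>
  <property>
    <name>mapreduce.input.fileinputformat.input.dir.recursive</name>
    <value>true</value>
  </property>
</configuration>
It would then be picked up via something like spark-submit --driver-class-path /path/to/conf-dir ..., where the directory path is a placeholder.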

Pig script to read Cassandra table

Trying to write a Pig script that will extract data from a Cassandra table. The Pig script looks like this:
REGISTER ./cassandra-all-2.0.8.39.jar
REGISTER ./datastax-agent-4.1.4-standalone.jar
REGISTER ./cassandra-driver-core-2.0.2.1.jar
REGISTER ./apache-cassandra-thrift-2.0.12.jar
A = LOAD 'cql://username:password/mykeyspace/mycolumnfamily'
    USING org.apache.cassandra.hadoop.pig.CqlStorage()
    AS (user_id:long, fname:chararray, last_update_date:chararray, lname:chararray);
DUMP A;
I keep getting the following message:
java.lang.NoSuchMethodError: org.apache.cassandra.thrift.ITransportFactory.supportedOptions()Ljava/util/Set
So far I've tried the following versions of cassandra-thrift.jar: 2.0.8.39, 2.0.12, and 2.1.2. None of them has helped.
Note: The following environment variables have been set outside this script:
export PIG_INITIAL_ADDRESS=machine-name
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
Any suggestions?