I am trying to checkpoint the RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra file system, so I am planning to use Amazon S3. However, I am not able to find any good example of using AWS S3 for this.
Questions
How do I use Amazon S3 as the checkpoint directory? Is it enough to just call
ssc.checkpoint(amazons3url)?
Is it possible to use any other reliable data storage besides the Hadoop file system for checkpointing?
From the answer in the link:
Solution 1:
export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>
ssc.checkpoint(checkpointDirectory)
Set the checkpoint directory to an S3 URL:
s3n://spark-streaming/checkpoint
Then launch your Spark application using spark-submit.
This works in Spark 1.4.2.
Solution 2:
import org.apache.hadoop.conf.Configuration

val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

StreamingContext.getOrCreate(checkPointDir, () => {
  createStreamingContext(checkPointDir, config)
}, hadoopConf)
To checkpoint to S3, you can pass the following notation to the StreamingContext method def checkpoint(directory: String): Unit:
s3n://<aws-access-key>:<aws-secret-key>#<s3-bucket>/<prefix ...>
Another reliable file system not listed in the Spark documentation for checkpointing is Tachyon.
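For completeness, here is a minimal, untested PySpark sketch that combines the two solutions above; the bucket name, prefix, and credentials are placeholders. The S3 credentials go into the Hadoop configuration inside the setup function, and the checkpoint directory is the S3 URL.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "s3n://spark-streaming/checkpoint"  # placeholder bucket/prefix

def create_streaming_context():
    sc = SparkContext(conf=SparkConf().setAppName("s3-checkpoint-demo"))
    # Same idea as Solution 2: pass the S3 credentials via the Hadoop configuration
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "id-1")            # placeholder
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "secret-key")  # placeholder
    ssc = StreamingContext(sc, 10)  # 10-second batch interval
    ssc.checkpoint(checkpoint_dir)
    return ssc

# Recovers the context from the S3 checkpoint if one exists, otherwise creates it
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_streaming_context)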
Related
I am measuring the run times of a Spark job with different resource configurations and need to compare the run time of each stage. I can see them in the UI only while the job is running.
I run my job on a Hadoop cluster and use YARN as the resource manager.
Is there any way to keep each stage's run time? Is there any log for them?
UPDATE:
I read the monitoring document mentioned in the comment and added the following lines, but it doesn't work:
in spark-defaults.conf :
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///[nameNode]:8020/[PathToSparkEventLogDir]
spark.history.fs.logDirectory hdfs:///[nameNode]:8020/[PathTosparkLogDirectory]
in spark-env.sh:
export SPARK_PUBLIC_DNS=[nameNode]
SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$sparkHistoryDir"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.enabled=true"
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.cleaner.interval=7d"
It looks for the /tmp/spark-events/ folder, and when I create it and start the history server, it doesn't show any complete or incomplete applications.
Note: I tried the logDirectory value without the port number too, but it didn't work.
I could run the Spark History Server and see the history of completed and incomplete applications by doing the following:
Set the public DNS value in conf/spark-env.sh:
export SPARK_PUBLIC_DNS=[NameNode-IP]
Add these properties to SparkConf in my Java code:
SparkConf conf = new SparkConf()
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs:///user/[user-path]/sparkEventLog")
    .set("spark.history.fs.logDirectory", "hdfs:///user/[user-path]/sparkEventLog");
Create the property file (spark/conf/history.properties) containing the following lines:
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/[user-path]/sparkEventLog
spark.history.fs.logDirectory hdfs:///user/[user-path]/sparkEventLog
Start the history server:
./sbin/start-history-server.sh --properties-file ./conf/history.properties
Note: The properties spark.eventLog.dir and spark.history.fs.logDirectory should have the same value.
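Once the history server is running, per-stage run times can also be pulled from its REST API (described in the same monitoring documentation) rather than read off the UI. A hedged sketch in Python follows; the host is a placeholder, 18080 is the default history server port, and the exact field names may vary between Spark versions.
import requests

base = "http://[nameNode]:18080/api/v1"

# List applications known to the history server
for app in requests.get(base + "/applications").json():
    app_id = app["id"]
    # Fetch per-stage information for this application
    for stage in requests.get(base + "/applications/" + app_id + "/stages").json():
        # executorRunTime is reported in milliseconds
        print(app_id, stage.get("stageId"), stage.get("name"), stage.get("executorRunTime"))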
I'm attempting to serve TensorFlow models out of HDFS using the TensorFlow Serving project.
I'm running the TensorFlow Serving Docker container, tag 1.10.1:
https://hub.docker.com/r/tensorflow/serving
I can see tensorflow/serving repo referencing Hadoop at
https://github.com/tensorflow/serving/blob/628702e1de1fa3d679369e9546e7d74fa91154d3/tensorflow_serving/model_servers/BUILD#L341
"#org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system"
This is a reference to
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.cc
I have set the following environment variables:
HADOOP_HDFS_HOME to point to my HDFS home (/etc/hadoop in my case).
MODEL_BASE_PATH set to "hdfs://tensorflow/models"
MODEL_NAME set to name of model I wish to load
I mount the Hadoop home into the Docker container and can verify it using docker exec.
When I run the Docker container, I get the following in the logs:
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:369] FileSystemStoragePathSource encountered a file-system access error: Could not find base path hdfs://tensorflow/models/my_model for servable my_model
I have found examples of TensorFlow doing training over HDFS, but not of serving models from HDFS with TensorFlow Serving.
Can Tensorflow Serving serve models from HDFS?
If so, how do you do this?
In the model_servers BUILD file, under the cc_test for get_model_status_impl_test, add the line "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system", as shown below:
cc_test(
    name = "get_model_status_impl_test",
    size = "medium",
    srcs = ["get_model_status_impl_test.cc"],
    data = [
        "//tensorflow_serving/servables/tensorflow/testdata:saved_model_half_plus_two_2_versions",
    ],
    deps = [
        ":get_model_status_impl",
        ":model_platform_types",
        ":platform_config_util",
        ":server_core",
        "//tensorflow_serving/apis:model_proto",
        "//tensorflow_serving/core:availability_preserving_policy",
        "//tensorflow_serving/core/test_util:test_main",
        "//tensorflow_serving/servables/tensorflow:saved_model_bundle_source_adapter_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_config_proto",
        "//tensorflow_serving/servables/tensorflow:session_bundle_source_adapter_proto",
        "//tensorflow_serving/test_util",
        "@org_tensorflow//tensorflow/cc/saved_model:loader",
        "@org_tensorflow//tensorflow/cc/saved_model:signature_constants",
        "@org_tensorflow//tensorflow/contrib/session_bundle",
        "@org_tensorflow//tensorflow/core:test",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
    ],
)
I think this would solve your problem.
Ref: Fail to load the models from HDFS
I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which allow you to specify such arbitrary configuration options. You can then get the values you pass by using the SparkConf.get function:
val conf = new SparkConf()
val mrRecursive =
  conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
Spark-submit/spark-shell --help:
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
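As a rough PySpark equivalent of the Scala snippet above (a sketch, assuming the property was passed as --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true to spark-submit):
from pyspark import SparkConf, SparkContext

conf = SparkConf()  # picks up properties passed via spark-submit --conf / --properties-file
sc = SparkContext(conf=conf)

# Read the value that was passed at submit time ...
recursive = conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")

# ... and forward it to the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", recursive)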
Such an approach can be used without code modification.
A Hadoop Configuration reads the file "core-default.xml" during creation; the description is here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file in the classpath via the spark-submit "--driver-class-path" parameter, it can work.
This question already has answers here:
How to set hadoop configuration values from pyspark
(3 answers)
Closed 5 years ago.
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>'))
spark = SparkSession.builder.\
config(conf=conf).\
getOrCreate()
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking that you can set them dynamically via spark.hadoop.*: that is the prefix you use for these properties in spark-defaults.conf, and you don't get an error when you try to set them this way at runtime.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf, not dynamically in pyspark.
It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
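Putting that together, a minimal PySpark sketch of the working approach (the bucket, path, and credentials below are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-config-demo").getOrCreate()

# Set the Hadoop properties on the underlying Hadoop configuration,
# without the spark.hadoop. prefix
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<access>")
hadoop_conf.set("fs.s3a.secret.key", "<secret>")

# S3A reads now pick up the credentials
df = spark.read.csv("s3a://my-bucket/path/to/file.csv", header=True)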
I have an ALS model that I train on one Spark installation. I persist it like so:
model.save(sc, './recommender_model')
On the filesystem it looks like this:
$ find ./recommender_model
./recommender_model
./recommender_model/metadata
./recommender_model/metadata/_SUCCESS
./recommender_model/metadata/._SUCCESS.crc
./recommender_model/metadata/.part-00000.crc
./recommender_model/metadata/part-00000
./recommender_model/data
./recommender_model/data/product
./recommender_model/data/product/part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_common_metadata
./recommender_model/data/product/.part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/.part-r-00001-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet.crc
./recommender_model/data/product/_SUCCESS
./recommender_model/data/product/._metadata.crc
./recommender_model/data/product/._SUCCESS.crc
./recommender_model/data/product/._common_metadata.crc
./recommender_model/data/product/part-r-00000-406655c7-5c12-44d4-9b39-d5367ccabe29.gz.parquet
./recommender_model/data/product/_metadata
./recommender_model/data/user
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/.part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/.part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet.crc
./recommender_model/data/user/._metadata.crc
./recommender_model/data/user/part-r-00000-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/part-r-00001-f8bf36d3-2145-4af2-9780-6271d68ea25c.gz.parquet
./recommender_model/data/user/_metadata
I would like to move this folder to another Spark installation so that I can train on one installation and use the other for predictions.
Can I simply tar up this folder and unpack it on the other Spark instance, where I can then load the model? E.g.
model = MatrixFactorizationModel.load(sc, './recommender_model')
my_movie = sc.parallelize([(0, 500)]) # Quiz Show (1994)
individual_movie_rating_RDD = model.predictAll(my_movie)
individual_movie_rating_RDD.collect()
I performed some basic testing, and the approach of zipping up the whole folder and unzipping it on the other machine worked well for me.
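For reference, a small sketch of how the packaging and reload could look (paths are placeholders; copy the archive to the target machine by whatever means you prefer, e.g. scp):
# On the source machine: package the saved model directory
import shutil
shutil.make_archive("recommender_model", "gztar", root_dir=".", base_dir="recommender_model")
# produces ./recommender_model.tar.gz

# On the target machine, after copying the archive over:
import shutil
shutil.unpack_archive("recommender_model.tar.gz", ".")

from pyspark.mllib.recommendation import MatrixFactorizationModel
model = MatrixFactorizationModel.load(sc, "./recommender_model")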