Save an H2O.ai model to an S3 bucket in Python

I have been using the command below to save my H2O model into an S3 bucket in Python 3 (I am using Amazon EMR):
h2o.save_model(model=best_gbm1,path='s3://bucketname/folder1/folder2', force=False)
but I get the following error:
H2OServerError: HTTP 500 Server Error:
Server error java.lang.RuntimeException:
Error: Not implemented
Request: None
Do you know if it is possible to save an H2O model directly to an S3 bucket?

I ran into the same problem in version 3.26.0.6 and solved it by changing the protocol to s3a instead of s3:
h2o.save_model(model=best_gbm1,path='s3a://bucketname/folder1/folder2', force=False)
I.e. use the object-based S3 overlay instead of the block-based one.

Saving an H2O-3 model to S3 via save_model() is not currently supported (as of version 3.22.1.6).
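If you are on an older build where direct S3 export is not available, a common workaround (not from the original answers, just a sketch assuming boto3 is installed; the bucket and key names are placeholders) is to save the model locally and upload it yourself:
import os
import boto3
import h2o

# h2o.save_model returns the full path of the file it wrote locally
local_path = h2o.save_model(model=best_gbm1, path='/tmp/h2o_models', force=True)

# Copy that file to S3; bucket name and key prefix are placeholders
s3 = boto3.client('s3')
s3.upload_file(local_path, 'bucketname', 'folder1/folder2/' + os.path.basename(local_path))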

Related

How to set User-Agent (prefix) for every upload request to S3 from Amazon EMR application

AWS has requested that the product I'm working on identify the requests it makes to our users' S3 resources on their behalf, so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request made against an S3 bucket from an EMR application. I'm wondering how this can be achieved.
Hadoop's documentation mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the s3a protocol seems to be deprecated (Work with Storage and File Systems), so I'm not sure whether this property will work.
To give a bit more context on what I need to do: with the AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
AWSCredentials credentials;
ClientConfiguration conf = new ClientConfiguration()
.withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then, every request's User-Agent HTTP header will have a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF Hadoop; I would argue that it is now ahead of what EMR's own connector does. If you are using EMR you may be able to use it; otherwise you get to work with what they implement.
FWIW, in S3A we're looking at what it would take to dynamically change the header for a specific query, so you can go beyond specific users to specific Hive/Spark queries in shared clusters. That would be fairly complex, though, as it has to be done on a per-request basis.
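For completeness: if your job does go through the ASF S3A connector (i.e. you read and write s3a:// paths rather than relying on EMRFS), fs.s3a.user.agent.prefix can be set like any other Hadoop property. A minimal PySpark sketch, reusing the prefix value from the question:
import pyspark

conf = pyspark.SparkConf()
# Properties prefixed with spark.hadoop. are handed to the Hadoop/S3A configuration
conf.set('spark.hadoop.fs.s3a.user.agent.prefix', 'APN/1.0 PARTNER/1.0 PRODUCT/1.0')

sc = pyspark.SparkContext(conf=conf)
# Requests made through s3a:// paths should now carry the prefixed User-Agent;
# anything routed through EMRFS (s3://) is unaffected, as the next answer notes.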
The solution in my case was to include an awssdk_config_default.json file inside the JAR submitted to the EMR job. This file is used by the AWS SDK to allow developers to override some internal settings.
I added this JSON file to the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to the EMR job didn't work. AWS EMR uses EMRFS when handling files stored in S3, which in turn uses the AWS SDK. I realized this because of an exception occasionally thrown in AWS EMR; part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here for future reference. Some links of interest:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS

How do you use s3a with spark 2.1.0 on aws us-east-2?

Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have been going through to test out the various connectivity paths that I plan to use. The issue I came across is the uncertainty around the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol vs. the s3n protocol.
I finally came across the hadoop aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
Being on us-east-2 put me far enough off the standard configuration that I wasn't sure whether the problem was with the JAR files. To eliminate the region issue, I set things back up in the regular us-east-1 region, and I was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the notebook console that the packages below download after creating the context; adding them took me from being completely broken to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the AWS config, I tried both the method below and just using the conf above, with the conf.set('spark.hadoop.fs.<config_string>', <config_value>) pattern equivalent to what I do below, except that this way I set the values on conf before creating the Spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note is that I also tried an alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. It seems like the endpoint config isn't being used for some reason.
us-east-2 is a V4-auth-only S3 region, so, as you attempted, the fs.s3a.endpoint value must be set.
If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.
Some tactics
set the value in spark-defaults
using the config you've just created, try to explicitly load the filesystem: a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf) will return the filesystem with that config (unless it's already cached). I don't know how to make that call in .py though.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get call (a combined sketch follows)
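Putting the first and third tactics together in PySpark form, a minimal sketch (not tested on EMR; endpoint, credentials and bucket name are taken from the question) that sets everything on the SparkConf before the context, and therefore before any cached filesystem instance, is created:
import pyspark

conf = pyspark.SparkConf().setMaster('local[1]')
# Set the S3A options before the context exists, so the first (cached)
# filesystem instance is built with the right endpoint and credentials
conf.set('spark.hadoop.fs.s3a.endpoint', 's3.us-east-2.amazonaws.com')
conf.set('spark.hadoop.fs.s3a.access.key', access_id)
conf.set('spark.hadoop.fs.s3a.secret.key', access_key)
# Optional: bypass the filesystem cache entirely while debugging config pickup
conf.set('spark.hadoop.fs.s3a.impl.disable.cache', 'true')

sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')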
Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I could review it and get it in.

How to write a stream to Google Cloud Storage?

I want to write a file in GCS from a stream object, but I've only found the create_file function, which creates a new file object given a path to a local file to upload and the path to store it at within the bucket.
Is there any function to create a file in GCS from a stream?
Fuse over GCS
You could try gcsfuse, which layers a user-space filesystem over a bucket, but it is only beta software at present. There's a useful section on limitations which you should read first.
I use fog to access GCS, but that is a thin layer which doesn't try to impose any additional semantics on the bucket/object model.
Warning: if your problem really requires a standard filesystem underneath any possible solution, then GCS is not a good fit.
The ability to provide an IO object instead of a File object was only recently added. It came in PR 1335 and will be included in the next release.
Until then, the quickest way is to write the stream to a tempfile and upload that. For more, see Issue 305.

Storing data in Couchbase Server (without metadata)

I have created data in a Couchbase Lite DB and replicated it to Couchbase Server, but during replication extra, unused data also gets created on the server. Is there any method to store pure data (without metadata)?
Is bucket shadowing useful for this problem?
You can use Couchbase Server 5.1, where the extra metadata is stored in XATTRs, so the document will not have metadata inside it. If required, this metadata can still be found in the extended attributes.
For that, you will need to set up Sync Gateway so that one Sync Gateway node in the cluster has import_docs: "continuous" and all Sync Gateway nodes have enable_shared_bucket_access: true.
With this change to the Sync Gateway configuration, using Sync Gateway 1.5 or 2.0, you will be able to implement this functionality.
A nice side effect is that if the data is changed directly on the server, it will also flow to the devices.
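For reference, a minimal sketch of the relevant part of a Sync Gateway config with those two settings (the server address, bucket and database name are placeholders):
{
  "databases": {
    "mydb": {
      "server": "http://couchbase-server:8091",
      "bucket": "mybucket",
      "enable_shared_bucket_access": true,
      "import_docs": "continuous"
    }
  }
}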

Determine S3 Bucket Region

I'm using the AWS Ruby SDK v2 to access various buckets across a couple of regions. Is it possible to determine the region of each bucket at runtime before I access it, so I can avoid the error below, which I get if I configure the AWS S3 client with the wrong region?
The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
I know I can shell out and use the command below and parse the response, but ideally, I want to stay within the Ruby SDK.
aws s3api get-bucket-location
I couldn't find any official documentation for this, but from the aws-sdk spec you should be able to use the following code to get the region:
client = Aws::S3::Client.new()
resp = client.get_bucket_location(bucket: bucket_name)
s3_region = resp.data.location_constraint
This calls the same API as aws s3api get-bucket-location.
For aws-sdk (2.6.5), it is:
client.get_bucket_location(bucket: bucket_name).location_constraint
But then, how do we get a list of buckets that belong to a specific region?
