Our company is migrating from s3 to GCS. While the command-line utility gsutil works fine, I am facing difficulty in configuring Hadoop (core-site.xml) to enable access to GCS. This google page https://storage.googleapis.com/hadoop-conf/gcs-core-default.xml lists the name-value pairs that need to be added, but I don't find any of these in the ~/.boto file. The .boto file only has the following set:
gs_oauth2_refresh_token under [Credentials]
default_project_id under [GSUtil]
Few others like api_version etc..
The [OAuth2] section is empty.
Can I somehow generate the necessary keys using gs_oauth2_refresh_token and add them to Hadoop config? Or can I get these from any other gsutil config files?
For hadoop configuration you'll likely want to use a service-account rather than gsutil credentials that are associated with an actual email address; see these instructions for manual installation of the GCS connector for more details about setting up a p12 keyfile along with the other necessary configuration parameters.
Related
AWS has requested that the product I'm working on identifies requests that it makes to our users' S3 resources on their behalf so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request done against a S3 bucket from an EMR application. I'm wondering how this can be achieved?
Hadoop's doc mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the protocol s3a seems to be deprecated (Work with Storage and File Systems), so I'm not sure if this property will work.
To give a bit of more context what I need to do, with AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
AWSCredentials credentials;
ClientConfiguration conf = new ClientConfiguration()
.withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then, every request's User-Agent http header will has a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF hadoop; I will argue that it is now ahead of what EMR's own connector will do. If you are using EMR you may be able to use it, otherwise you get to work with what they implement.
FWIW in S3A we're looking at what it'd take to actually dynamically change the header for a specific query, so you go beyond specific users to specific hive/spark queries in shared clusters. Be fairly complex to do this though as you need to do it on a per request setting.
The solution in my case was to include a awssdk_config_default.json file inside the JAR submitted to EMR job. This file it used by AWS SDK to allow developers to override some custom settings.
I've added this json file within the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to EMR job didn't work. AWS EMR uses EMRFS when handling files stored in S3 which uses AWS SDK. I realized it because of an exception thrown in AWS EMR that I see sometimes, part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here to future references. Some interests links:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS
Is there a way to configure the names of the files exported from Logging?
Currently the file exported includes colons. This are invalid characters as a path element in hadoop, so PySpark for instance cannot read these files. Obviously the easy solution is to rename the files, but this interferes with syncing.
Is there a way to configure the names or change them to no include colons? Any other solutions are appreciated. Thanks!
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
At this time, there is no way to change the naming convention when exporting log files as this process is automated on the backend.
If you would like to request to have this feature available in GCP, I would suggest creating a PIT. This page allows you to report bugs and request new features to be implemented within GCP.
I am trying to use GCS as a storage for my files in Parse Server. I followed the tutorial, installing the parse-server-gcs-adapter and setting the environment variables in my .bashrc:
export PARSE_SERVER_FILES_ADAPTER=parse-server-gcs-adapter
export GCP_PROJECT_ID=my-project-id
export GCP_KEYFILE_PATH=path-to-keyfile.json
export GCS_BUCKET=my-bucket-name
export GCS_DIRECT_ACCESS=true
I can upload files in my Parse Dashboard, and they are correctly saved in the class, but the files cannot be seen in the bucket browser.
Some sources such as this one talk about a config file, but I cannot find info about this file anywhere.
I want to know if there is any means to debug what is happening, or if there is anything obvious that I am missing.
After some more research I could get the solution for the issue.
The first thing (debug what was happening) was to take a look (tail -f) at the log file, located at /opt/bitnami/apps/parse/htdocs/logs/parse.log. There, I could see nothing was happening, thus confirming my suspicion about not changing the right file. Then, I located the config file that I was looking for to be /opt/bitnami/apps/parse/htdocs/server.js, being able to configure it as written in the Parse docs.
As I had another problem with the GCS Adapter, I also found an issue with Bitnami machines, that was identical to this answer to an issue in Github. Now it's all working.
Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have been just going through to test out the various connectivity paths that I plan to use. The issue I came across is with the uncertainty around the correct way to interact with s3. I have followed the trail on how to provide the dependencies for spark to connect to data on aws s3 using the s3a protocol, vs s3n protocol.
I finally came across the hadoop aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
I ended up being too far off the standard configuration by being on us-east-2, making me uncertain if I had a problem with the jar files. To eliminate the region issue, I set things back up on the regular us-east-1 region, and I was able to finally connect with s3a. So I have narrowed down the problem to the region, but thought I was doing everything required to operate on the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the console for the notebook these download after creating the context, and adding these took me from being completely broken, to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the aws config, I tried both the below method and by just using the above conf, and doing conf.set(spark.hadoop.fs.<config_string>, <config_value>) pattern equivalent to what I do below, except doing it this was I set the values on conf before creating the spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note, is that I also tried an alternative endpoint for us-east-2 of s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1, and commenting out the endpoint config, the above works for me. To me, it seems like endpoint config isn't being used for some reason.
us-east-2 is a V4 auth S3 instance so, as you attemped, the fs.s3a.endpoint value must be set.
if it's not being picked up then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes, the config, even when its lacking in auth details.
Some tactics
set the value is spark-defaults
using the config you've just created, try to explicitly load the filesystem via a call to Filesystem.get(new URI("s3a://bucket-name/parquet-data-name", myConf) will return the bucket with that config (unless it's already there). I don't know how to make that call in .py though.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command
Adding more more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in
I would like to access external data from my aws ec2 instance.
In more detail: I would like to specify inside by user-data the name of a folder containing about 2M of binary data. When my aws instance starts up, I would like it to download the files in that folder and copy them to a specific location on the local disk. I only need to access the data once, at startup.
I don't want to store the data in S3 because, as I understand it, this would require storing my aws credentials on the instance itself, or passing them as userdata which is also a security risk. Please correct me if I am wrong here.
I am looking for a solution that is both secure and highly reliable.
which operating system do you run ?
you can use an elastic block storage. it's like a device you can mount at boot (without credentials) and you have permanent storage there.
You can also sync up instances using something like Gluster filesystem. See this thread on it.