Make files available locally on Elastic MapReduce - ruby

The Hadoop documentation states it's possible to make files available locally by use of the -file option.
How can I do this using the Elastic MapReduce Ruby CLI?

You could use the DistributedCache with EMR to do this.
With the ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the DistributedCache. You give the location of the file (s3n or hdfs), followed by # and the name by which it will be referenced in the application's current working directory. The file is then placed locally on your task nodes, in the directory identified by mapred.local.dir (I think).
You can then access the file in your Mapper/Reducer tasks easily. I believe you can open it directly like any normal file, but you may have to call something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks.
An example to do this in the Ruby client taken from Amazon's forums:
./elastic-mapreduce --create --stream \
--input s3n://your_bucket/wordcount/input \
--output s3n://your_bucket/wordcount/output \
--mapper s3n://your_bucket/wordcount/wordSplitter.py \
--reducer aggregate \
--cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list
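To illustrate how a streaming mapper might consume the cached file, here is a minimal Python sketch. It is not the actual wordSplitter.py from the forum post. It assumes the stop-word list has one word per line, that words are separated by whitespace, and that the mapper emits `LongValueSum:<word>\t1` records for the built-in `aggregate` reducer. The function names are hypothetical.

```python
import sys


def load_stop_words(path="stop-word-list"):
    # The name after '#' in --cache is the file name seen in the task's
    # current working directory, so a relative path is enough here.
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def map_lines(lines, stop_words):
    # Emit one "LongValueSum:<word>\t1" record per word that is not a stop word.
    for line in lines:
        for word in line.split():
            word = word.lower()
            if word not in stop_words:
                yield "LongValueSum:%s\t1" % word


def main():
    # Entry point for use as the actual mapper: read stdin, write records to stdout.
    for record in map_lines(sys.stdin, load_stop_words()):
        print(record)
```

To use this as the mapper script, end the file with a call to `main()`. Keeping the file loading and the mapping logic in separate functions also makes it possible to try the mapper locally with a small stop-word file before submitting the job flow.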

Related

Is there a way to load the install-interpreter.sh file in EMR in order to load 3rd party interpreters?

I have an Apache Zeppelin notebook running and I'm trying to load the jdbc and/or postgres interpreter to my notebook in order to write to a postgres DB from Zeppelin.
The main resource to load new interpreters here tells me to run the code below to get other interpreters:
./bin/install-interpreter.sh --all
However, when I run this command in the EMR terminal, I find that the EMR cluster does not seem to come with an install-interpreter.sh executable.
What is the recommended path?
1. Should I find the install-interpreter.sh file and load that to the EMR cluster under ./bin/?
2. Is there an EMR configuration on start time that would enable the install-interpreter.sh file?
Currently all tutorials and documentation assume that you can run the install-interpreter.sh file.
The solution is not to run the command below from whatever directory you happen to be in (./):
./bin/install-interpreter.sh --all
Instead, on EMR, run the command above from the Zeppelin installation directory, which on an EMR cluster is /usr/lib/zeppelin.

Running oozie job using a modified hadoop config file to support S3 to HDFS

Hello I am trying to copy a file in my S3 bucket into HDFS using the cp command.
I do something like
hadoop --config config fs -cp s3a://path hadooppath
This works well when my config directory is on my local machine.
However, I am now trying to set this up as an Oozie job, and I am unable to pass the configuration files from the config directory on my local system. Even if the directory is in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option in Hadoop and passed name/value pairs, but it still throws an error. It works only from my local system.
Have you tried DistCp in Oozie? Hadoop 2.7.2 supports S3 as a data source. You can schedule the copy with coordinators; just pass the credentials to the coordinator, either through the REST API or in a properties file. It is an easy way to copy data periodically (on a schedule).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/

Spark yarn-cluster mode - read file passed with --files

I'm running my spark application using yarn-cluster master.
What does the app do?
External service generates a jsonFile based on HTTP request to a RESTService
Spark needs to read this file and do some work after parsing the json
Simplest solution that came to mind was to use --files to load that file.
In yarn-cluster mode reading a file means it must be available on hdfs (if I'm right?) and my file is being copied to path like this:
/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json
I can of course read it from there, but I cannot find a way to get this path from any configuration / SparkEnv object, and hardcoding .sparkStaging in Spark code seemed like a bad idea.
Why is it that a simple:
val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)
cannot read a file passed with --files and throws FileNotFoundException? Why is Spark looking for files in hadoop_user_folder only?
My solution which works for now:
Just before running Spark, I copy the file to the proper HDFS folder, pass the file name as a Spark argument, process the file from a known path, and after the job is done I delete the file from HDFS.
I thought passing the file with --files would let me forget about saving and deleting this file. Something like pass-process-and-forget.
How do you read a file passed with --files then? Is the only solution to build the path by hand, hardcoding the ".sparkStaging" folder path?
The question is written very ambiguously. However, what I gather is that you want to read a file from any location on your local OS file system, and not just from HDFS.
Spark uses URIs to identify paths, and when a valid Hadoop/HDFS environment is available, it defaults to HDFS. In that case, to point to your local OS file system on, for example, UNIX/Linux, you can use something like:
file:///home/user/my_file.txt
If you are using an RDD to read from this file, you run in yarn-cluster mode, or the file is accessed within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, using the same path. That is why it is easier to put the file on HDFS first; alternatively, that is what the --files option is supposed to do for you.
See more info on Spark, External Datasets.
For any files that were added through the --files option, or were added through SparkContext.addFile, you can get information about their location using the SparkFiles helper class.
Answer from @hartar worked for me. Here is the complete solution.
Add the required files during spark-submit using --files:
spark-submit --name "my_job" --master yarn --deploy-mode cluster --files /home/xyz/file1.properties,/home/xyz/file2.properties --class test.main /home/xyz/my_test_jar.jar
Get the Spark session inside the main method:
SparkSession ss = new SparkSession.Builder().getOrCreate();
Since I am interested only in .properties files, I filter for them; if you already know the name of the file you want to read, you can pass it directly to FileInputStream.
spark.yarn.dist.files stores the value as file:/home/xyz/file1.properties,file:/home/xyz/file2.properties, so I split the string on (,) and (/) to discard everything except the file names.
// Split on "/" and ",", then keep only the tokens that contain ".properties"
String[] files = Pattern.compile("/|,").splitAsStream(ss.conf().get("spark.yarn.dist.files")).filter(s -> s.contains(".properties")).toArray(String[]::new);
// Load all files into a single Properties object
Properties props = new Properties();
for (String f : files) {
    props.load(new FileInputStream(f));
}
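The name-extraction step can be checked on its own, without a cluster. The following is a minimal Python sketch of the same idea, not a translation of the Java above: it assumes the configuration value is a comma-separated list of URIs like the one quoted in this answer, and the helper name `dist_file_names` is hypothetical. Unlike the Java filter, which keeps any token containing ".properties", it keeps only the last path component and requires the suffix at the end of the name.

```python
import posixpath


def dist_file_names(conf_value, suffix=".properties"):
    """Return base names of the comma-separated entries that end with suffix."""
    names = []
    for entry in conf_value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Keep only the last path component, e.g. "file1.properties".
        name = posixpath.basename(entry)
        if name.endswith(suffix):
            names.append(name)
    return names


value = "file:/home/xyz/file1.properties,file:/home/xyz/file2.properties"
print(dist_file_names(value))  # prints ['file1.properties', 'file2.properties']
```

The resulting names are relative, which matches the way the Java loop opens each file by its bare name.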
I had the same problem as you. When you submit an executable together with files, they all end up at the same level, so inside your executable it is enough to refer to the file by its name, since paths are resolved relative to the executable's own folder.
You do not need to use SparkFiles or any other class. Just call a method like readFile("myFile.json");
I have come across an easy way to do it.
We are using Spark 2.3.0 on Yarn in pseudo distributed mode. We need to query a postgres table from spark whose configurations are defined in a properties file.
I passed the properties file using the --files option of spark-submit. To read the file in my code I simply used the java.util.Properties class.
I just need to ensure that the path I specify when loading the file is the same as the one passed in the --files argument,
e.g. if the spark-submit command looked like:
spark-submit --class <main-class> --master yarn --deploy-mode client --files test/metadata.properties myjar.jar
Then my code to read the file will look like:
Properties props = new Properties();
props.load(new FileInputStream(new File("test/metadata.properties")));
Hope you find this helpful.

Issue with Hadoop and Google Cloud Storage Connector

I've deployed a hadoop cluster via Deployments interface in google console. (Hadoop 2.x)
My task was to filter data stored in one Google Storage (GS) bucket and put the results to another. So, this is a map only job with simple python script. Note that cluster and output bucket are in the same zone (EU).
Leveraging Google Cloud Storage Connector, I run the following streaming job:
hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-file file_1 \
-file mymapper.py \
-input gs://inputbucket/somedir/somedir2/*-us-* \
-output gs://outputbucket/somedir3/somedir2 \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper mymapper.py
What happens is all the mappers process data and store the results in temporary directory in GS, which looks like:
gs://outputbucket/somedir3/somedir2/_temporary/1/mapper-0000/part-0000.gz
After all mappers are finished, job progress hangs at 100% map, 0% reduce. Looking at output bucket with gsutil, I see that result files are being copied to the destination directory:
gs://outputbucket/somedir3/somedir2
This process takes a very long time and kills the whole benefit of using Hadoop.
My questions are:
1) Is this a known issue, or have I just done something wrong? I couldn't find any relevant info.
2) Am I correct saying that normally hdfs would move those files to destination dir, but GS can't perform move and thus the files are copied?
3) What can I do to avoid this pattern?
You're almost certainly running into the "Slow FileOutputCommitter" issue which affects Hadoop 2.0 through 2.6 inclusive and is fixed in 2.7.
If you're looking for a nice managed Hadoop option on Google Cloud Platform, you should consider Google Cloud Dataproc (documentation here), where we maintain our distro to ensure we pick up patches relevant to Google Cloud Platform quickly. Dataproc indeed configures the mapreduce.fileoutputcommitter.algorithm.version so that the final commitJob is fast.
For something more "do-it-yourself", you can use our command-line bdutil tool, which also has the latest update to use the fast FileOutputCommitter.
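For a cluster you manage yourself, the property named above can also be passed per job as a generic -D option. The sketch below is a hypothetical Python helper (`with_fast_committer` is not part of Hadoop or bdutil) that adds the option to a streaming command line. It assumes a Hadoop version that recognizes mapreduce.fileoutputcommitter.algorithm.version (2.7 or later) and that the value 2 selects the faster commit algorithm; on the 2.4.1 cluster from the question the property would have no effect.

```python
def with_fast_committer(streaming_args, version=2):
    """Return a copy of a 'hadoop jar <streaming jar> ...' argument list
    with the FileOutputCommitter algorithm version added as a -D option."""
    prop = "mapreduce.fileoutputcommitter.algorithm.version=%d" % version
    args = list(streaming_args)
    # Generic -D options have to precede streaming options such as -input,
    # so insert the pair directly after the jar path.
    jar_index = next(i for i, arg in enumerate(args) if arg.endswith(".jar"))
    args[jar_index + 1:jar_index + 1] = ["-D", prop]
    return args


command = with_fast_committer([
    "hadoop", "jar", "hadoop-streaming.jar",
    "-D", "mapreduce.job.reduces=0",
    "-input", "gs://inputbucket/in",
    "-output", "gs://outputbucket/out",
    "-mapper", "mymapper.py",
])
print(" ".join(command))
```

The bucket paths and jar name here are placeholders; the returned list can be handed to `subprocess.run` or joined into a shell command.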

Accessing read-only Google Storage buckets from Hadoop

I am trying to access Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only.
What I am doing:
Deploy a cluster with
bdutil deploy -e datastore_env.sh
On the master:
vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
java.io.IOException: Multiple IOExceptions.
java.io.IOException: Multiple IOExceptions.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
Is it possible to use read-only buckets as MR job input somehow?
I use gsutil to upload files. Can it be forced to create these empty objects on file upload?
Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.
For example, I have run this job many times:
./hadoop-install/bin/hadoop \
jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
-mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
-output gs://big-data-roadshow/output
This accesses the same read-only bucket you mention in your example above.
The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.
I recommend you use gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects) and once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".

Resources