Spark yarn-cluster mode - read file passed with --files - hadoop

I'm running my Spark application with the yarn-cluster master.
What does the app do?
An external service generates a JSON file based on an HTTP request to a REST service.
Spark needs to read this file and do some work after parsing the JSON.
The simplest solution that came to mind was to use --files to ship that file.
In yarn-cluster mode, reading a file means it must be available on HDFS (if I'm right?), and my file is copied to a path like this:
/hadoop_user_path/.sparkStaging/spark_applicationId/myFile.json
I can of course read it there, but I cannot find a way to get this path from any configuration / SparkEnv object, and hardcoding .sparkStaging in Spark code seemed like a bad idea.
Why does this simple snippet:
val jsonStringData = spark.textFile(myFileName)
sqlContext.read.json(jsonStringData)
fail to read the file passed with --files and throw a FileNotFoundException? Why does Spark look for files only in the hadoop_user_folder?
My solution which works for now:
Just before running Spark, I copy the file to the proper HDFS folder, pass the filename as a Spark argument, process the file from the known path, and after the job is done I delete the file from HDFS.
I thought passing the file with --files would let me forget about saving and deleting it myself. Something like pass-process-and-forget.
How do you read a file passed with --files, then? Is the only solution to build the path by hand, hardcoding the ".sparkStaging" folder path?

The question is written somewhat ambiguously, but from what I can tell you want to read a file from an arbitrary location on your local OS file system, and not just from HDFS.
Spark uses URIs to identify paths, and when a valid Hadoop/HDFS environment is available, it defaults to HDFS. In that case, to point to your local OS file system (on UNIX/Linux, for example), you can use something like:
file:///home/user/my_file.txt
If you are reading this file into an RDD, running in yarn-cluster mode, or accessing the file within a task, you will need to take care of copying and distributing that file manually to all nodes in your cluster, using the same path. That is what makes it easier to first put it on HDFS, or that is what the --files option is supposed to do for you.
See more info on Spark, External Datasets.
For any files that were added through the --files option, or were added through SparkContext.addFile, you can get information about their location using the SparkFiles helper class.
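For illustration, here is a minimal Java sketch of resolving such a file through SparkFiles; the file name myFile.json and the class name are assumptions, and the file must have been shipped with --files (or registered with SparkContext.addFile):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.spark.SparkFiles;
import org.apache.spark.sql.SparkSession;

public class ReadDistributedFile {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // "myFile.json" must match the name of the file passed to
        // spark-submit with --files.
        String localPath = SparkFiles.get("myFile.json");

        // The file has been localized by Spark, so plain Java IO works here.
        String jsonString = new String(Files.readAllBytes(Paths.get(localPath)));
        System.out.println(jsonString);

        spark.stop();
    }
}

Whether SparkFiles resolves --files paths on the driver can vary with the Spark version and deploy mode, so treat this as a starting point rather than a guaranteed recipe.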

Answer from #hartar worked for me. Here is the complete solution.
Add the required files during spark-submit using --files:
spark-submit --name "my_job" --master yarn --deploy-mode cluster --files /home/xyz/file1.properties,/home/xyz/file2.properties --class test.main /home/xyz/my_test_jar.jar
Get the Spark session inside the main method:
SparkSession ss = SparkSession.builder().getOrCreate();
Since I am interested only in .properties files, I filter for them; if you already know the name of the file you wish to read, it can be used directly in FileInputStream.
spark.yarn.dist.files stores the files as file:/home/xyz/file1.properties,file:/home/xyz/file2.properties, so the string is split on (,) and (/) to discard everything except the file names.
// Requires java.io.FileInputStream, java.util.Properties, java.util.regex.Pattern
Properties props = new Properties();
String[] files = Pattern.compile("/|,")
        .splitAsStream(ss.conf().get("spark.yarn.dist.files"))
        .filter(s -> s.contains(".properties"))
        .toArray(String[]::new);
// load every matching file into the Properties object
for (String f : files) {
    props.load(new FileInputStream(f));
}

I had the same problem as you. The key point is that when you submit an executable together with files, they end up at the same level, so inside your executable it is enough to use just the file name to access it, since the executable runs from its own working directory.
You do not need SparkFiles or any other helper class, just something like readFile("myFile.json");
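As a rough sketch of that idea (the file name myFile.json and the class name are placeholders, and the file must have been shipped with --files so that YARN localizes it into the container's working directory):

import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadLocalizedFile {
    public static void main(String[] args) throws Exception {
        // A relative name is enough because --files entries are localized
        // into the same working directory the driver/executor runs from.
        String json = new String(Files.readAllBytes(Paths.get("myFile.json")));
        System.out.println(json);
    }
}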

I have come across an easy way to do it.
We are using Spark 2.3.0 on YARN in pseudo-distributed mode. We need to query a Postgres table from Spark, whose connection settings are defined in a properties file.
I passed the properties file using the --files option of spark-submit. To read the file in my code I simply used the java.util.Properties class.
I just need to ensure that the path I specify when loading the file is the same as the one passed in the --files argument.
E.g. if the spark-submit command looked like:
spark-submit --class <main_class> --master yarn --deploy-mode client --files test/metadata.properties myjar.jar
Then my code to read the file will look like:
Properties props = new Properties();
props.load(new FileInputStream(new File("test/metadata.properties")));
Hope you find this helpful.

Related

Running oozie job using a modified hadoop config file to support S3 to HDFS

Hello, I am trying to copy a file from my S3 bucket into HDFS using the cp command.
I do something like
hadoop --config config fs -cp s3a://path hadooppath
This works well when my config is on my local machine.
However, now I am trying to set it up as an Oozie job, and I am unable to pass the configuration files present in the config directory on my local system. Even if they are in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option in Hadoop and passed name/value pairs, but it still throws an error. It works only from my local system.
Did you try DistCp in Oozie? Hadoop 2.7.2 supports S3 as a data source. You can schedule it with coordinators; just pass the credentials to the coordinators either via the REST API or in properties files. It's an easy way to copy data periodically (in a scheduled manner).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/

Output Folders for Amazon EMR

I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the arguments, I get a FileAlreadyExists error. If I don't specify it, then I do not know where the output will land. I want to be able to see the output from every job in the chain of MapReduce jobs.
Thanks in advance. Please help!
You are likely getting the "FileAlreadyExists" error because the output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs. A sketch of doing this programmatically is shown below.
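A hedged Java sketch of clearing the output directory before submitting a job; the path is a placeholder and this is just one way to avoid the FileAlreadyExists error:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; in a chain of jobs each job would get its own
        // output directory, which becomes the input of the next job.
        Path outputDir = new Path("/user/hadoop/job1-output");

        // Delete the directory recursively if a previous run left it behind,
        // otherwise the job submission fails.
        if (fs.exists(outputDir)) {
            fs.delete(outputDir, true);
        }
    }
}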
Good practice is to take the output path from the command line, as it increases the flexibility of your code and you only need to compile your jar once, provided the changes are limited to your paths.
For EMR, once you launch your cluster and compile your jar, for example:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it.
dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder you can either copy it to the local file system or cat it, e.g.:
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}

Make files available locally on Elastic MapReduce

The Hadoop documentation states it's possible to make files available locally by use of the -file option.
How can I do this using the Elastic MapReduce Ruby CLI?
You could use the DistributedCache with EMR to do this.
With the ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the DistributedCache. It lets you specify the location (s3n or hdfs) of the file followed by its name as referenced in the current working directory of the application, and will place the file locally on your task nodes on the directory identified by mapred.local.dir (I think).
You can then access the files in your Mapper/Reducer tasks easily. I believe you can directly access it just like any normal file, but you may have to do something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks.
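For the task side, here is a rough Java sketch of reading a cached file in a mapper's setup method, using the older DistributedCache API mentioned above (the mapper types and the stop-word-list idea are assumptions based on the example below):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Returns the local paths of all files placed in the DistributedCache.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached == null) {
            return;
        }
        for (Path p : cached) {
            // Each path points at a local copy on the task node, e.g. the
            // file registered as stop-word-list in the --cache option.
            try (BufferedReader reader = new BufferedReader(new FileReader(p.toString()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // load stop words, configuration values, etc.
                }
            }
        }
    }
}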
An example to do this in the Ruby client taken from Amazon's forums:
./elastic-mapreduce --create --stream --input s3n://your_bucket/wordcount/input --output s3n://your_bucket/wordcount/output --mapper s3n://your_bucket/wordcount/wordSplitter.py --reducer aggregate --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list

Run a Local file system directory as input of a Mapper in cluster

I gave input to the mapper from a local filesystem. It runs successfully from Eclipse, but not from the cluster, where it is unable to find the local input path, saying: input path does not exist. Can anybody please help me pass a local file path to a mapper so that it can run in the cluster and I can get the output in HDFS?
This is a very old question, but I recently faced the same issue.
I am not sure how correct this solution is, but it worked for me. Please point out any drawbacks. Here's what I did.
Reading a solution from the mail archives, I realised that if I modify fs.default.name from hdfs://localhost:8020/ to file:/// the job can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local folder (the same one from which I submit my MR jar via hadoop jar).
In my MR driver class I added:
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job then takes its input from the local system and writes the output to HDFS.
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to run your job again, giving it the path of the data in HDFS.
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't Hadoop work similarly, using this syntax instead of s3n:
file:///input file:///output
?
But empirically, this seems to fail in an interesting way: I see that Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the files are not found (or accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as the local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you need to copy the data to HDFS first and then run the job.
The bottom line is that you need to push the data to HDFS first. There are several ways to perform the transfer depending on the data source; from the local file system, for example, you would use the following command:
$ hadoop fs -copyFromLocal <source_file_or_path> <hdfs_destination_path>
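If you prefer to do the copy from code instead of the shell, a small Java sketch along these lines should work; both paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths.
        Path localSource = new Path("file:///home/user/input/data.txt");
        Path hdfsTarget = new Path("/user/hadoop/input/data.txt");

        // Equivalent to: hadoop fs -copyFromLocal <local> <hdfs>
        fs.copyFromLocalFile(localSource, hdfsTarget);
    }
}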
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///the/directory/on/your/local/filesystem"));
If you give the file:// prefix, it can access files from the local system.
I have tried the following code and got the solution...
Please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return the path. Since we need to pass a local filesystem path (there is no other way to hand this to the InputFormat), I've used makeQualified, which indeed returns only the local file system path.
The code is shown below.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late. It worked fine for me, and I believe it does not need any configuration changes.
You might want to try this by setting the configuration as follows:
Configuration conf=new Configuration();
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name","file:///");
After this you can set the FileInputFormat with the local path and you are good to go.

How to make your mapper write on local file system in hadoop

I wish to write a file and create a directory in my local file system through my MapReduce code. Also, if I create a directory in the working directory during job execution, how can I move it to my local file system before cleanup?
As your mapper runs on some/any machine in your cluster, of course you can use basic Java file operations to write files. You can use org.apache.hadoop.hdfs.DFSClient to access any files on the HDFS to copy to a local file (I'd suggest you copy inside the HDFS and fetch any files from it after the jobs are finished).
Of course your local files will be local to the client machine (I assume separate machines), so something like NFS will be needed to make the written files available to you on any client. Watch out for concurrency problems.
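A hedged sketch of the suggested approach: write the side output to HDFS from inside the task, then pull it to the client machine once the job finishes. All paths, the mapper types, and the task-attempt naming are assumptions:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideOutputMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());

        // Use the task attempt id so concurrent tasks do not overwrite each other.
        Path sideFile = new Path("/user/hadoop/side-output/"
                + context.getTaskAttemptID().toString() + ".log");

        try (FSDataOutputStream out = fs.create(sideFile)) {
            out.writeUTF("debug output from this task");
        }
        // After the job completes, fetch the files to the client machine with:
        //   hadoop fs -copyToLocal /user/hadoop/side-output /local/dir
    }
}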
I'm also interested in writing files locally on the datanode. For that, I used java.io.FileWriter and java.io.BufferedWriter:
FileWriter fstream = new FileWriter("log.out",true);
BufferedWriter bout = new BufferedWriter(fstream);
bout.append(build.toString());
bout.close();
It only creates the file when executed through Eclipse. When run as a .jar with the following command:
hadoop jar jarFile.jar Mainclass
it doesn't create anything. I don't know whether it is a problem of mis-execution, misconfiguration, or just that something is missing.
Actually this is only to create a log file for debugging. The actual files I want the datanode to write locally are created through Runtime.getRuntime(). However, the same thing happens: if the execution is carried out through Eclipse it's fine, but outside Eclipse no file is ever created.
Before doing it on a cluster it should work on a single node, so the whole thing is done on a single computer for now.
