Rename files created in Hadoop - Spark [duplicate]

This question already has an answer here:
HDFS: move multiple files using Java / Scala API
(1 answer)
Closed 5 years ago.
Files created in HDFS via a Spark write follow their own naming convention (e.g. part-00000). One option for changing this to a custom name is a script that uses hadoop fs -mv oldname newname.
Is there any other option available in Spark/Hadoop to give the created file a custom name?

Apache Spark does not provide any API for file system operations in HDFS, but you can always use the Hadoop FileSystem API to rename the file in HDFS. See the Hadoop FileSystem javadoc for the full list of available operations. For renaming, the following works:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fileSystem = FileSystem.get(conf)
fileSystem.mkdirs(new Path(newhdfs_dirPath))
fileSystem.rename(new Path(existinghdfs_dirpath, oldname), new Path(newhdfs_dirPath, newname))
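If you are doing this from inside a Spark application, a minimal Java sketch is shown below; the SparkSession setup and the part-file/report names are hypothetical, and only fs.rename does the actual work:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

public class RenameOutput {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("rename-example").getOrCreate();
        // Reuse the Hadoop configuration Spark already carries, so fs.defaultFS
        // is picked up from the cluster's core-site.xml.
        Configuration conf = spark.sparkContext().hadoopConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical paths: rename the part file Spark wrote to a custom name.
        boolean ok = fs.rename(new Path("/data/output/part-00000"),
                               new Path("/data/output/report.csv"));
        System.out.println("rename succeeded: " + ok);
        spark.stop();
    }
}
Note that rename returns a boolean rather than throwing on most failures, so check the result.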

Related

How do I get the working directory of a Spark executor in Java? [duplicate]

This question already exists:
Copy files (config) from HDFS to local working directory of every spark executor
Closed 5 years ago.
I need to know the current working directory URI/URL of a Spark executor so that I can copy some dependencies there before the job executes. How do I get it in Java? Which API should I call?
The working directory is application specific, so you won't be able to get it before the application starts. It is best to use the standard Spark mechanisms (see the sketch after this list):
--jars / spark.jars - for JAR files.
--py-files / spark.submit.pyFiles - for Python dependencies.
SparkFiles / --files / --archives - for everything else.
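For example, a minimal sketch of the --files route is shown below; the file name app.conf and the application class are hypothetical:
$ spark-submit --class com.example.MyApp --files /local/path/app.conf my-app.jar
Then, in code running on the executors (e.g. inside a map function), resolve the local copy with SparkFiles:
import org.apache.spark.SparkFiles;

// Returns the absolute path of the shipped file inside this executor's working directory.
String confPath = SparkFiles.get("app.conf");
System.out.println("app.conf was copied to: " + confPath);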

What is the function in the Google file system similar to the DistributedCache of the Hadoop Distributed File System?

I have deployed a 6-node Hadoop Cluster in Google Compute Engine.
I am using the Google file system (GFS) instead of the Hadoop Distributed File System (HDFS).
So, I want to access files in GFS the same way the DistributedCache mechanism does in HDFS.
Please tell me a way to access files this way.
When running Hadoop on Google Compute Engine with the Google Cloud Storage connector for Hadoop as the "default filesystem", the GCS connector can be treated exactly the same way HDFS is treated, including for use with the DistributedCache. So, to access files in Google Cloud Storage, you use them exactly the same way you would use HDFS; nothing needs to change. For example, if you had deployed your cluster with the GCS connector's CONFIGBUCKET set to foo-bucket, and you had local files you wanted to place in the DistributedCache, you'd do:
# Copies mylib.jar into gs://foo-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
And in your Hadoop job:
JobConf job = new JobConf();
// Retrieves gs://foo-bucket/myapp/mylib.jar as a cached file.
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
If you want to access files in a different bucket than your CONFIGBUCKET, you just need to specify a full path, using gs:// instead of hdfs://:
# Copies mylib.jar into gs://other-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar gs://other-bucket/myapp/mylib.jar
And then in Java:
JobConf job = new JobConf();
// Retrieves gs://other-bucket/myapp/mylib.jar as a cached file.
DistributedCache.addFileToClassPath(new Path("gs://other-bucket/myapp/mylib.jar"), job);

Hadoop: I want to know the path to HDFS

I want to open a file in the Hadoop file system using a Java program. I would like to know what a path to HDFS looks like and how to specify it in a Java program.
To work with HDFS, its files, and their contents from your Java code, use the Hadoop FileSystem API:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html
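An HDFS path is just a URI of the form hdfs://<namenode-host>:<port>/absolute/path. A minimal sketch of opening such a file (the namenode address and file name below are hypothetical):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the namenode; the host and port come from your cluster's core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000/"), conf);
        // Open an absolute HDFS path and print its first line.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/hadoop/data.txt"))))) {
            System.out.println(reader.readLine());
        }
    }
}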

How to move Word and PDF documents to Hadoop HDFS?

I want to copy/upload some files from a local system (a system not in the Hadoop cluster) onto HDFS. The local system can be a Windows system too.
I tried a Flume spooling directory source. It works fine with text files, but for other document types the MIME type gets corrupted.
Please let me know about different approaches to load files into HDFS.
hadoop fs -copyFromLocal <localsrc> URI
Check Hadoop documentation: copyFromLocal
Keep in mind that Apache Flume wasn't designed simply to copy files around.
You can also use hadoop fs -put <localsrcpath> <hdfspath>
This is an alternative to copyFromLocal.
In Hadoop 2.0 (YARN) you can transfer local files to HDFS as follows:
hdfs dfs -put "localsrcpath" "hdfspath"
where hdfs is the command located in the bin directory.
Java code can do this easily; you don't need any extra tools. Here is a piece of code that worked:
Configuration conf = new Configuration();
try {
    conf.set("fs.defaultFS", <<namenode>>); // something like hdfs://server:9000, or copy the value from core-site.xml
    FileSystem fileSystem = FileSystem.get(conf);
    System.out.println("Uploading, please wait...");
    // args[0] = local source (e.g. C://file-or-dir), args[1] = HDFS destination (e.g. /imported)
    fileSystem.copyFromLocalFile(false, new Path(args[0]), new Path(args[1].trim()));
} catch (IOException e) {
    e.printStackTrace();
}
Build a jar out of this and run it on any OS. Keep in mind that you do not need Hadoop running on the machine where you run this code. If you need any help, add comments.
Don't forget to add host name resolution entries on the machine where you run this code. On Windows, open C:\Windows\System32\drivers\etc\hosts and add lines of the form <ip-address> <hostname>:
ip-address   hadoopnamenode
ip-address   slavenode
First you need to transfer the documents from your Windows machine to a Linux machine using FileZilla or another tool.
Then you need to use:
hadoop fs -put localsrcpath hdfspath
The following command will also work:
hadoop fs -copyFromLocal localsrcpath hdfspath

Run a local file system directory as the input of a Mapper in a cluster

I gave the mapper an input from the local file system. It runs successfully from Eclipse, but not from the cluster, where it cannot find the local input path and fails with "input path does not exist". Can anybody please tell me how to give a local file path to a mapper so that it can run in the cluster and I can get the output in HDFS?
This is a very old question, but I recently faced the same issue.
I am not sure how correct this solution is, but it worked for me; please point out any drawbacks. Here is what I did.
Reading a solution from the mail archives, I realised that if I change fs.default.name from hdfs://localhost:8020/ to file:/// the job can access the local file system. However, I didn't want this for all my MapReduce jobs, so I made a copy of core-site.xml in a local folder (the same one from which I submit my MR jar via hadoop jar).
In my MR driver class I added:
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR job then takes its input from the local file system and writes its output to HDFS.
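A minimal driver sketch of that setup is shown below; the input directory, namenode address, and output path are hypothetical, and the copied core-site.xml is assumed to set fs.default.name to file:/// as described above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
// Local copy of core-site.xml whose fs.default.name points at file:///
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));

Job job = Job.getInstance(conf, "local-input-example");
// An unqualified path now resolves against the local file system.
FileInputFormat.addInputPath(job, new Path("/my/local/input/dir"));
// A fully qualified hdfs:// URI keeps the output on the cluster.
FileOutputFormat.setOutputPath(job, new Path("hdfs://namenode:8020/user/me/output"));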
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to run your job again, giving it the path of the data in HDFS.
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What happens here is that the mappers read records directly from S3.
If this can be done with S3, why shouldn't Hadoop work similarly with the local file system, using file:///input file:///output instead of the s3n:// syntax?
But empirically, this seems to fail in an interesting way: Hadoop gives a file-not-found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the input directory on my local disk, but when it comes time to open them to read the records, the file is not found (or not accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as a local file system, a network path, or a web-based store (such as Azure Blob Storage or Amazon block storage), you would need to copy the data to HDFS first and then run the job.
The bottom line is that you need to push the data to HDFS first; how you perform the transfer depends on the data source. For example, from a local file system you would use the following command:
$ hadoop fs -copyFromLocal <SourceFileOrStoragePath> <HDFS-destination-path>
Try setting the input path like this:
FileInputFormat.addInputPath(conf, new Path("file:///the/directory/on/your/local/filesystem"));
If you give the file:// prefix, it can access files from the local system.
I tried the following code and it solved the problem; please try it and let me know.
You need to get a FileSystem object for the local file system and then use the makeQualified method to return a path. Since we need to pass a local file system path to the input format (there is no other way to pass it), I used makeQualified, which indeed returns a fully qualified local file system path.
The code is shown below:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, even though it's posted very late. It worked fine for me, and I believe it does not need any configuration changes.
You might want to try setting the configuration as follows:
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");
After this you can set the FileInputFormat with the local path and you are good to go.
