How to make your mapper write on local file system in hadoop - hadoop

I wish to write a file and create a directory in my local file system through m MapReduce code. Also if I create a directory in the working directory during the job execution, how can I move it to my local file system before the cleanup.

As your mapper runs on some/any machine in your cluster, of course you can use basic Java file operations to write files. You can use org.apache.hadoop.hdfs.DFSClient to access any files on the HDFS to copy to a local file (I'd suggest you copy inside the HDFS and fetch any files from it after the jobs are finished).
Of course your local files will be local to the client-machine (I assume separate machines), so something like NFS will be needed to let the written files be available to you on any client. Watch out for concurreny problems.

I'm interested as well on writing files locally on the datanode. For that, I used java.io.FileWriter and java.io.BufferedWriter:
FileWriter fstream = new FileWriter("log.out",true);
BufferedWriter bout = new BufferedWriter(fstream);
bout.append(build.toString());
bout.close();
It only creates the file when is executed through eclipse. When run as a .jar with the next command:
hadoop jar jarFile.jar Mainclass
it doesn't create anything. I don't know whether it is a problem of a misexecution, misconfiguration or just that sth is missing
Actually this is only to create a log file for debugging. The actual files I want the datanode to write locally are created through Runtime.getRuntime(). However, the same thing happens. If the execution is carried out through eclipse it's ok. Outside eclipse, it seems fine but no file is ever created.
Before doing it on a cluster it should work on a single node, so the whole thing is donde on a single computer for now.

Related

hdfs or hadoop command to sync the files or folder between local to hdfs

I have a local files which gets added daily so I want to sync these newly added files to hdfs.
I tried below command but all are complete copy, I want some command which copies only newly added files
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
Its Alex's custom package and perhaps useful on a dev box but could be hard to deploy on production environment. I am looking for a similar solution but for now this seems to be closest. Other option is to write your own shell script to compare source/target file times and then overwrite newer files only.

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far -
Setting up the Hadoop Single Node cluster referring the below link.
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Write the word count problem referring the below link
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
Problem is when I execute the last line to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note I also tried the following directory sturcture in the jar command.
No avail! :/
I would really appreciate if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running pseudo distributed mode, HDFS has been setup, and /usr does not exist in HDFS unless you explicitly created it there.
Based on the stacktrace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or the permissions for it are not allowing your current user to run commands against that path
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so it's better to get familiar with that rather than mess around with having to install Hadoop yourself from scratch, in my opinion

How do I run map/reduce on files in local file system?

How do I run a Java map/reduce job on files available in local file system? For instance, I have a 3 node cluster, and all the nodes have a log file in their local file system, say /home/log/log.txt.
How do I run a job on these files? Do I need to combine them and transfer it to HDFS before running the job?
Thanks.
You can upload all the individual files under one folder and provide that folder path as input path to your map reduce program. Your Map Reduce runs on all the files in that folder.

Output Folders for Amazon EMR

I want to jun a custom jar, whose main class a chain of map reduce jobs, with the output of the first job going as the input of the second jar, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the argument, I get the error FileAlraedy exists. If I don't specify, then I do not know where the ouput will land. I want to be able to see the ouput from every job of the chained mapreduce jobs.
Thanks in adv. Pls help!
You are likely getting the "FileAlraedy exists" error because that output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs.
Good practice is to take output from command line as it will increase flexibility of your code And you will compile your jar only once provided the changes are related to your path.
for EMR if you launch your cluster and compile your jar
For eg.
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note : you have to create dfs_ip_folder and store input data inside it.
dfs_op_folder will be created automatically on HDFS not on local file system
To access the HDFS op folder either you can copy it to local file system or you can do cat.
eg.
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}

Run a Local file system directory as input of a Mapper in cluster

I gave an input to the mapper from a local filesystem.It is running successfully from eclipse,But not running from the cluster as it is unable to find the local input path saying:input path does not exist.Please can anybody help me how to give a local file path to a mapper so that it can run in the cluster and i can get the output in hdfs
This is a very old question. Recently faced the same issue.
I am not aware of how correct this solution is it worked for me though. Please bring to notice if there are any drawbacks of this.Here's what I did.
Reading a solution from the mail-archives, I realised if i modify fs.default.name from hdfs://localhost:8020/ to file:/// it can access the local file system. However, I didnt want this for all my mapreduce jobs. So I made a copy of core-site.xml in a local system folder (same as the one from where I would submit my MR jar to hadoop jar).
and in my Driver class for MR I added,
Configuration conf = new Configuration();
conf.addResource(new Path("/my/local/system/path/to/core-site.xml"));
conf.addResource(new Path("/usr/lib/hadoop-0.20-mapreduce/conf/hdfs-site.xml"));
The MR takes input from local system and writes the output to hdfs:
Running in a cluster requires the data to be loaded into distributed storage (HDFS). Copy the data to HDFS first using hadoop fs -copyFromLocal and then try to trun your job again, giving it the path of the data in HDFS
The question is an interesting one. One can have data on S3 and access this data without an explicit copy to HDFS prior to running the job. In the wordcount example, one would specify this as follows:
hadoop jar example.jar wordcount s3n://bucket/input s3n://bucket/output
What occurs in this is that the mappers read records directly from S3.
If this can be done with S3, why wouldn't hadoop similarly, using this syntax instead of s3n
file:///input file:///output
?
But empirically, this seems to fail in an interesting way -- I see that Hadoop gives a file not found exception for a file that is indeed in the input directory. That is, it seems to be able to list the files in the put directory on my local disk but when it comes time to open them to read the records, the file is not found (or accessible).
The data must be on HDFS for any MapReduce job to process it. So even if you have a source such as local File System or a network path or a web based store (such as Azure Blob Storage or Amazon Block stoage), you would need to copy the data at HDFS first and then run the Job.
The bottom line is that you would need to push the data first to to HDFS and there are several ways depend on data source, you would perform the data transfer from your source to HDFS such as from local file system you would use the following command:
$hadoop -f CopyFromLocal SourceFileOrStoragePath _HDFS__Or_directPathatHDFS_
Try setting the input path like this
FileInputFormat.addInputPath(conf, new Path(file:///the directory on your local filesystem));
if you give the file extension, it can access files from the localsystem
I have tried the following code and got the solution...
Please try it and let me know..
You need to get FileSystem object for local file system and then use makequalified method to return path.. As we need to pass path of local filesystem(no other way to pass this to inputformat), i ve used make qualified, which in deed returns only local file system path..
The code is shown below..
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path inputPath = fs.makeQualified(new Path("/usr/local/srini/")); // local path
FileInputFormat.setInputPaths(job, inputPath);
I hope this works for your requirement, though it's posted very late.. It worked fine for me.. It does not need any configuration changes i believe..
U might wanna try this by setting the configuration as
Configuration conf=new Configuration();
conf.set("job.mapreduce.tracker","local");
conf.set("fs.default.name","file:///");
After this u can set the fileinputformat with the local path and u r good to go

Resources