How do I run a Java MapReduce job on files that live in the local file system? For instance, I have a 3-node cluster, and every node has a log file in its local file system, say /home/log/log.txt.
How do I run a job on these files? Do I need to combine them and transfer them to HDFS before running the job?
Thanks.
You can upload all the individual files under one folder and provide that folder path as the input path to your MapReduce program. The job will then run on all the files in that folder.
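A minimal sketch of that approach, assuming you run something like this on each node (the local path, the HDFS folder and the per-node file name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadLog {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Local log on this node and a shared input folder on HDFS (placeholders).
        Path localLog = new Path("/home/log/log.txt");
        Path hdfsCopy = new Path("/user/hadoop/logs/log-node1.txt");

        // delSrc=false keeps the local file, overwrite=true replaces an older copy.
        fs.copyFromLocalFile(false, true, localLog, hdfsCopy);
        fs.close();
    }
}

Give each node's copy a distinct name (log-node1.txt, log-node2.txt, ...), then point FileInputFormat.addInputPath at /user/hadoop/logs/ so the job picks up all of them. A plain hadoop fs -put works just as well as the Java snippet.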
Related
I have local files which get added daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it copies everything each time; I want a command that copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package, perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script that compares source/target file times and then overwrites only the newer files.
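If you go the do-it-yourself route, here is a rough sketch of the same idea in Java (the class name and paths are made up; it only copies files that do not yet exist on HDFS and does not compare timestamps):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncNewFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        File localDir = new File("/home/user/files"); // source on the local disk
        Path hdfsDir  = new Path("/data/files");      // target directory on HDFS

        File[] files = localDir.listFiles();
        if (files == null) return;                    // source directory does not exist
        for (File f : files) {
            Path target = new Path(hdfsDir, f.getName());
            // Copy only files that are not on HDFS yet.
            if (f.isFile() && !fs.exists(target)) {
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
            }
        }
        fs.close();
    }
}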
I want to create an Oozie workflow to transfer an HDFS file from an HDFS cluster to another server.
Since Oozie can run commands or scripts on any node in the system, is it possible to run a shell script or SFTP on one of the nodes and transfer the file to the destination server?
I think this task can easily be done by performing, from the remote server, an HTTP GET (the open operation) on the HDFS file (you can use curl for that).
Anyway, if you want to do it through Oozie, I think you can create a script in charge of moving the desired file from HDFS to the local file system, and then perform an scp to move the file from the local file system to the remote server.
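A hedged sketch of that second option in Java (all paths, the user name and the host are placeholders, and the scp step assumes passwordless key-based ssh is already set up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExportToRemote {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder paths: file on HDFS and a staging location on the local disk.
        Path hdfsFile  = new Path("/data/export/report.txt");
        Path localCopy = new Path("/tmp/report.txt");

        // 1. Copy from HDFS to the local file system of the node running this action.
        fs.copyToLocalFile(hdfsFile, localCopy);

        // 2. Push the local copy to the remote server (assumes key-based ssh auth).
        Process scp = new ProcessBuilder(
                "scp", "/tmp/report.txt", "user@remote-server:/incoming/")
                .inheritIO()
                .start();
        int rc = scp.waitFor();
        if (rc != 0) {
            throw new RuntimeException("scp failed with exit code " + rc);
        }
    }
}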
All the jar files required by my MapReduce job are present on the local file system of the Hadoop nodes (/home/ubuntu/libs/).
During execution they are looked up as hdfs://namenode-ip-address:namenode-port-number/home/ubuntu/libs and not on the local file system of the node.
How can I resolve this?
I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir as an argument, I get a FileAlreadyExists error. If I don't specify it, then I do not know where the output will land. I want to be able to see the output from every job in the chained MapReduce jobs.
Thanks in advance. Please help!
You are likely getting the FileAlreadyExists error because that output directory already exists before the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise those jobs will not run.
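A minimal sketch of a driver that does this for two chained jobs (the class name, the mapper/reducer slots and the paths are invented for illustration): each stage gets its own output directory taken from the command line, any pre-existing output is deleted first, and both directories stay on HDFS so you can inspect the output of every stage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input  = new Path(args[0]);   // original input
        Path stage1 = new Path(args[1]);   // output of job 1, input of job 2
        Path stage2 = new Path(args[2]);   // final output

        FileSystem fs = FileSystem.get(conf);
        // Delete old output directories so the jobs don't fail with "already exists".
        fs.delete(stage1, true);
        fs.delete(stage2, true);

        Job job1 = Job.getInstance(conf, "stage 1");
        job1.setJarByClass(ChainDriver.class);
        // job1.setMapperClass(...); job1.setReducerClass(...);  // your classes here
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, stage1);
        if (!job1.waitForCompletion(true)) System.exit(1);

        Job job2 = Job.getInstance(conf, "stage 2");
        job2.setJarByClass(ChainDriver.class);
        // job2.setMapperClass(...); job2.setReducerClass(...);
        FileInputFormat.addInputPath(job2, stage1);  // chain: stage 1 output is stage 2 input
        FileOutputFormat.setOutputPath(job2, stage2);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}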
Good practice is to take the input and output paths from the command line, as it increases the flexibility of your code and you compile your jar only once, provided the only changes are to your paths.
For example, on EMR, once you have launched your cluster and compiled your jar:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it.
dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder you can either cat it or copy it to the local file system, e.g.:
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}
I wish to write a file and create a directory in my local file system through my MapReduce code. Also, if I create a directory in the working directory during job execution, how can I move it to my local file system before the cleanup?
As your mapper runs on some (any) machine in your cluster, you can of course use basic Java file operations to write files there. You can use org.apache.hadoop.hdfs.DFSClient to access files on HDFS and copy them to a local file (I'd suggest you copy within HDFS and fetch the files from it after the jobs are finished).
Of course your local files will be local to the client machine (I assume separate machines), so something like NFS will be needed to make the written files available to you on any client. Watch out for concurrency problems.
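A rough sketch of the "copy inside HDFS" route from within a task (the HDFS output directory is a placeholder): collect whatever you want to write, flush it to a per-task file on HDFS in cleanup(), and pull it to your machine afterwards with hadoop fs -get.

import java.io.PrintWriter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final StringBuilder log = new StringBuilder();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        log.append("saw line: ").append(value).append('\n');
    }

    @Override
    protected void cleanup(Context context) throws java.io.IOException {
        // Write the collected log to HDFS; one file per task attempt so tasks don't collide.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/tmp/job-logs/" + context.getTaskAttemptID() + ".log");
        try (PrintWriter writer = new PrintWriter(fs.create(out, true))) {
            writer.print(log);
        }
    }
}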
I'm also interested in writing files locally on the datanode. For that, I used java.io.FileWriter and java.io.BufferedWriter:
// "build" is a StringBuilder assembled earlier in the task
FileWriter fstream = new FileWriter("log.out", true); // relative path: resolves against the JVM's working directory
BufferedWriter bout = new BufferedWriter(fstream);
bout.append(build.toString());
bout.close();
It only creates the file when it is executed through Eclipse. When run as a .jar with the following command:
hadoop jar jarFile.jar Mainclass
it doesn't create anything. I don't know whether it is a problem of mis-execution, misconfiguration, or just that something is missing.
Actually, this is only to create a log file for debugging. The actual files I want the datanode to write locally are created through Runtime.getRuntime(). However, the same thing happens: if the execution is carried out through Eclipse it's fine; outside Eclipse it seems to run, but no file is ever created.
Before doing it on a cluster it should work on a single node, so the whole thing is done on a single computer for now.
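For what it's worth, a minimal variant of the same snippet that writes to an absolute node-local path (the path is a placeholder), so the file ends up in a known location regardless of the task's working directory:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class LocalDebugLog {
    public static void append(String message) throws IOException {
        // Absolute, node-local path (placeholder) so the file lands in a known
        // location no matter where the JVM's working directory is.
        try (BufferedWriter bout = new BufferedWriter(
                new FileWriter("/tmp/mapreduce-debug.log", true))) {
            bout.append(message).append('\n');
        }
    }
}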