hdfs or hadoop command to sync files or folders between local and hdfs - hadoop

I have local files which get added daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it always does a complete copy; I want a command which copies only the newly added files.
$ hdfs dfs -cp /home/user/files/* /data/files/*

You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package and perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script that compares source/target file times and then overwrites only the newer files, as sketched below.
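A minimal sketch of that shell-script approach, assuming the source directory is /home/user/files and the HDFS target is /data/files as in the question; instead of comparing per-file timestamps against HDFS, it uses a local marker file to pick up whatever was added or changed since the last run:
#!/bin/bash
SRC=/home/user/files
DST=/data/files
MARKER=/home/user/.last_hdfs_sync
# Find files newer than the marker (all files on the first run)
if [ -f "$MARKER" ]; then
  NEW_FILES=$(find "$SRC" -type f -newer "$MARKER")
else
  NEW_FILES=$(find "$SRC" -type f)
fi
for f in $NEW_FILES; do
  hdfs dfs -put -f "$f" "$DST/"   # -f overwrites an existing copy in HDFS
done
touch "$MARKER"
(This assumes filenames without spaces; hdfs dfs -put -f is available on Hadoop 2.x and later.)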

Related

How does the hadoop directory differ from hadoop-x.x.x?

I am new to Hadoop, and recently, when I was running MapReduce jobs on an OpenStack Hadoop cluster and cd'd into a directory on a datanode machine, I found there are two hadoop folders: one is called "hadoop" while the other is named "hadoop-2.7.1". Obviously, the latter makes more sense, as it tells the Hadoop version. The two folders contain the same sub-directories, but how do these two differ from each other? And if I'd like to disable HDFS permission checking on this machine, which one should I go to?
Here is a screenshot
As colors in the screenshot suggest, hadoop is not a separate directory but is just a symbolic link, obviously pointing to hadoop-2.7.1. Run ls -l to check this.
You should cd into the hadoop directory. It exists intentionally, so that you can avoid writing the Hadoop version explicitly. When a new version of Hadoop is deployed, a new versioned directory is created and the hadoop symbolic link is changed to point to the latest versioned directory, like this:
hadoop-2.7.1
hadoop-2.7.2
hadoop-2.7.3
hadoop -> hadoop-2.7.3
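A quick way to check this on your machine, and the way the link is typically switched after a new deployment (the parent directory here is whatever your installation path happens to be):
ls -l                        # shows: hadoop -> hadoop-2.7.1
ln -sfn hadoop-2.7.3 hadoop  # after deploying 2.7.3, repoint the link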

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount problem.
Things I have done so far:
Set up a Hadoop single-node cluster following the link below:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program following the link below:
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is when I execute the final command to run the program:
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note, I also tried the following directory structure in the jar command.
To no avail! :/
I would really appreciate it if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
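A hedged illustration of both options, reusing the paths from the question (the HDFS destination under /user is an assumption, not something your setup necessarily has):
# Option 1: read the input from the local filesystem explicitly
hadoop jar wordcount.jar file:///usr/local/hadoop/input file:///usr/local/hadoop/output
# Option 2: copy the input into HDFS first and use HDFS paths
hdfs dfs -mkdir -p /user/$(whoami)/input
hdfs dfs -put /usr/local/hadoop/input/* /user/$(whoami)/input/
hadoop jar wordcount.jar /user/$(whoami)/input /user/$(whoami)/output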
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
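If it's the latter, a sketch of the usual fix, assuming /app/hadoop is the prefix shown in your trace (adjust to the exact path it reports):
hdfs dfs -mkdir -p /app/hadoop
hdfs dfs -chown -R $(whoami) /app/hadoop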
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so in my opinion it's better to get familiar with those than to mess around with installing Hadoop yourself from scratch.

Basic issue in copying files from hive or hadoop to local directory due to wrong nomenclature

I'm trying to copy a file that is hosted both within Hive and on HDFS onto my local computer, but I can't seem to figure out the right call/terminology to use to refer to my local machine. All the online explanations describe it solely as "path to local". I'm trying to copy the file into a folder at C/Users/PC/Desktop/data and have tried the following:
In HDFS
hdfs dfs -copyToLocal /user/w205/staging /C/Users/PC/Desktop/data
In Hive
INSERT OVERWRITE LOCAL DIRECTORY 'c/users/pc/desktop/data/' SELECT * FROM lacountyvoters;
How should I be invoking the local repository in this case?
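For what it's worth, a local path is always resolved on the machine where the command runs, never inside HDFS. A sketch assuming the hdfs client is running on the Windows machine itself:
hdfs dfs -copyToLocal /user/w205/staging "C:/Users/PC/Desktop/data"
If the client actually runs on a remote Linux gateway, copy to a path on that machine first and then pull the files over with scp or WinSCP; the same reasoning applies to the path in Hive's LOCAL DIRECTORY clause.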

How to deploy & run Samza job on HDFS?

I want to get a Samza job running on a remote system, with the Samza job being stored on HDFS. The example (https://samza.apache.org/startup/hello-samza/0.7.0/) for running a Samza job on a local machine involves building a tar file, then unzipping it, then running a shell script located within it.
The example here for HDFS is not really well documented at all (https://samza.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html). It says to copy the tar file to HDFS and then to follow the other steps in the non-HDFS example.
That would imply that the tar file now residing on HDFS needs to be untarred within HDFS, and then a shell script run on that unpacked tar file. But you can't untar an HDFS tar file with the hadoop fs shell...
Without untarring the tar file, you don't have access to run-job.sh to initiate the Samza job.
Has anyone managed to get this to work please?
We deploy our Samza jobs this way: we have the Hadoop libraries in /opt/hadoop, the Samza shell scripts in /opt/samza/bin, and the Samza config files in /opt/samza/config. In the config file there is this line:
yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz
When we want to deploy a new version of our Samza job, we just create the tgz archive, move it (without untarring) to /deploy/samza/ on HDFS, and run /opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/$CONFIG_NAME.properties
The only downside is that the config files in the archive are ignored. If you change the config in the archive, it does not take effect; you have to change the config files in /opt/samza/config. On the other hand, we are able to change the config of our Samza job without deploying a new tgz archive. The shell scripts under /opt/samza/bin remain the same with every build, so you don't need to untar the archive package just to get at the shell scripts.
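A sketch of that flow end to end, using the paths and archive name from above (CONFIG_NAME stands for whichever properties file configures your job):
# The build produces samzajobs-dist.tgz; push it to the location named by yarn.package.path
hdfs dfs -mkdir -p /deploy/samza
hdfs dfs -put -f samzajobs-dist.tgz /deploy/samza/
# YARN fetches and unpacks the archive itself, so nothing is untarred by hand
/opt/samza/bin/run-job.sh \
  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
  --config-path=file:///opt/samza/config/$CONFIG_NAME.properties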
Good luck with Samzing! :-)

How to make your mapper write to the local file system in Hadoop

I wish to write a file and create a directory on my local file system from my MapReduce code. Also, if I create a directory in the working directory during job execution, how can I move it to my local file system before the cleanup?
As your mapper runs on some/any machine in your cluster, you can of course use basic Java file operations to write files. You can use org.apache.hadoop.hdfs.DFSClient to access any files in HDFS and copy them to a local file (I'd suggest you write inside HDFS and fetch the files from it after the jobs are finished).
Of course your local files will be local to the client machine (I assume these are separate machines), so something like NFS will be needed for the written files to be available to you on any client. Watch out for concurrency problems.
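A minimal sketch of that "write into HDFS, fetch afterwards" pattern from the shell; the jar name and the HDFS/local paths here are purely illustrative:
# Run the job (it writes its extra output somewhere under HDFS)
hadoop jar myjob.jar MyJob
# Then pull that output down to the local filesystem of the client machine
hdfs dfs -get /user/me/job-side-output /home/me/job-side-output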
I'm interested as well in writing files locally on the datanode. For that, I used java.io.FileWriter and java.io.BufferedWriter:
// Appends to log.out in the task's current working directory,
// i.e. on the local disk of whichever node runs this code (not in HDFS)
FileWriter fstream = new FileWriter("log.out", true);
BufferedWriter bout = new BufferedWriter(fstream);
bout.append(build.toString());
bout.close();
It only creates the file when it is executed through Eclipse. When run as a .jar with the following command:
hadoop jar jarFile.jar Mainclass
it doesn't create anything. I don't know whether it is a problem of mis-execution, misconfiguration, or just that something is missing.
Actually this is only to create a log file for debugging. The actual files I want the datanode to write locally are created through Runtime.getRuntime(). However, the same thing happens: if the execution is carried out through Eclipse it's fine, but outside Eclipse it seems to run yet no file is ever created.
Before doing it on a cluster it should work on a single node, so the whole thing is done on a single computer for now.
