Copy from Hadoop to local machine - hadoop

I can ssh to our box, do a hadoop fs -ls /theFolder, and browse the files in there, but that's about all I know so far :) My goal is to copy one of those files - they are Avro - onto my local home folder.
How can I do this? I also found a get command, but I'm not sure how to use that either.

First, use hadoop fs -get /theFolder to copy it out of HDFS into the current directory you are ssh'ed into on your box.
Then you can use either scp or (my preference) rsync to copy the files between your box and your local system. Here's how I'd use rsync after the -get, still in the same directory:
rsync -av ./theFolder username@yourlocalmachine:/home/username
This will copy theFolder from the local fs on your box into your home folder on your machine's fs. Be sure to replace username with your actual username in both places, and yourlocalmachine with your machine's hostname or IP address.
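Putting the two steps together, a minimal sketch (alice, mybox, and the paths are placeholders for your own username, the Hadoop box's hostname, and your folders):

```shell
# Step 1 - on the Hadoop box (over ssh): pull the folder out of HDFS
# onto the box's ordinary file system
hadoop fs -get /theFolder ./theFolder

# Step 2 - from your local machine: pull the folder down over ssh
scp -r alice@mybox:theFolder ~/theFolder
# or, incrementally (handy if the transfer gets interrupted):
rsync -av alice@mybox:theFolder/ ~/theFolder/
```

Pulling with scp/rsync from your local machine, rather than pushing from the box as above, also works and avoids the box needing network access back to your machine.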

Using hadoop's get you can copy the files from HDFS to your box's file system. Read more about using get here.
Then, using scp (this is similar to doing ssh) you may copy those files to your local system. Read more about using scp here.

hadoop fs -get theFolder
works well, just like the previous answer says.
For syncing with your local machine, I think you could also set up git. That's easy as well.

Related

How to copy file from local directory in another drive to HDFS in Apache Hadoop?

I'm new to Apache Hadoop and I'm trying to copy a simple text file from my local directory to HDFS on Hadoop, which is up and running. However, Hadoop is installed in D: while my file is in C:.
If I use the -put or copyFromLocal command in cmd with the file in the aforementioned drive, it doesn't allow me to do that. However, if I place the text file in the same D: drive, the file is correctly uploaded to Hadoop and can be seen on Hadoop localhost. The code that works with the file and Hadoop in the same drive is as follows:
hadoop fs -put /test.txt /user/testDirectory
If my file is in a separate drive, I get the error '/test.txt': No such file or directory. I've tried variations of /C/pathOfFile/test.txt but to no avail, so in short, I need to know how to access a local file in another directory, specifically with respect to the -put command. Any help for this probably amateurish question will be appreciated.
If your current cmd session is in D:\, then your command will look for the file at the root of that drive.
You could try prefixing the path with a file URI:
file:/C:/test.txt
Otherwise, cd to the path containing your file first, then just -put test.txt or -put .\test.txt
Note: HDFS doesn't know about the difference between C and D unless you actually set fs.defaultFS to be something like file:/D:/hdfs
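A quick sketch of both workarounds from a cmd session whose current drive is D: (test.txt sitting at the root of C: is a placeholder path):

```
REM Option 1: give -put a fully qualified local URI, so the session's current drive doesn't matter
hadoop fs -put file:/C:/test.txt /user/testDirectory

REM Option 2: switch to the file's drive and directory first, then use a plain relative path
C:
cd \
hadoop fs -put test.txt /user/testDirectory
```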
From your question I assume that you have installed Hadoop in a Virtual Machine (VM) on a Windows installation. Please provide more details if that assumption is incorrect. The issue is that your VM treats drive D: as the local directory, which is where -put and -copyFromLocal look for files. C: is currently not visible to these commands.
You need to mount drive C: in your VM in order to make its files available to Hadoop as local files. There are guides for this depending on your VM. I'd advise care while doing it, so as not to mishandle any Windows installation files.

Hadoop copyFromLocal: '.': No such file or directory

I use Windows 8 with a cloudera-quickstart-vm-5.4.2-0 virtual box.
I downloaded a text file as words.txt into the Downloads folder.
I changed directory to Downloads and used hadoop fs -copyFromLocal words.txt
I get the no such file or directory error.
Can anyone explain why this is happening and how to solve it?
Someone told me this error occurs when Hadoop is in safe mode, but I have made sure that the safe mode is OFF.
It's happening because hdfs:///user/cloudera doesn't exist.
Running hdfs dfs -ls probably gives you a similar error.
Without a specified destination folder, it looks for ., the current HDFS directory for the UNIX account running the command.
You must run hdfs dfs -mkdir "/user/$(whoami)" before your current UNIX account can use HDFS, or you can specify some other existing HDFS location to copy to.
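On the quickstart VM, for example, the fix is two commands (cloudera here stands for whatever whoami prints for you):

```shell
hdfs dfs -mkdir -p /user/cloudera     # create your HDFS home directory once
hdfs dfs -copyFromLocal words.txt     # "." now resolves to /user/cloudera
```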

How to edit txt file inside the HDFS in terminal?

Is there any way to modify the txt file inside HDFS directly via terminal?
Assume I have "my_text_file.txt", and I would like to modify it inside HDFS using the command below.
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know what "XXXX" would be, if such a command exists.
Please note that I don't want to make modification in local and then copy it to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS works on the principle of "write once, read many". So if you want to edit a file, make the changes in a local copy and then move it back to HDFS.
Currently, as explained by @BruceWayne, it's not possible. It would be very difficult to edit files stored in HDFS because the files are distributed across the cluster, and the terminal hdfs commands don't include an edit operation.
You could edit the data by locating the block files on each datanode in the cluster, but that would be troublesome.
Alternatively, you can install HUE. With HUE you can edit files in HDFS through a web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". But nowadays you can edit a file using the Hue file browser in Cloudera.
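Since in-place edits aren't supported, the usual workaround is a round trip through the local file system (a sketch; user/my_text_file.txt is the path from the question, and any editor works in step 2):

```shell
hdfs dfs -get user/my_text_file.txt .                      # 1. copy the file out of HDFS
vi my_text_file.txt                                        # 2. edit the local copy
hdfs dfs -put -f my_text_file.txt user/my_text_file.txt    # 3. overwrite the original (-f forces it)
```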

Can't put file from local directory to HDFS

I have created a file with name "file.txt" in the local directory , now I want to put it in HDFS by using :-
]$ hadoop fs -put file.txt abcd
I am getting a response like
put: 'abcd': no such file or directory
I have never worked on Linux. Please help me out: how do I put the file "file.txt" into HDFS?
If you don't specify an absolute path in hadoop (HDFS or whatever other file system is used), it will prepend your user directory to create an absolute path.
By default, your home folder in HDFS should be /user/<user name>.
Then in your case you are trying to create the file /user/<user name>/abcd and put the content of your local file.txt inside it.
The user name is your operating system user on your local machine. You can get it using the whoami command.
The problem is that your user folder doesn't exist in HDFS, and you need to create it.
BTW, according to the hadoop documentation, the correct command to work with HDFS is hdfs dfs instead of hadoop fs (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html). But for now both work.
Then:
If you don't know your user name on your local operating system, open a terminal and run the whoami command.
Execute the following command, replacing <user name> with your user name:
hdfs dfs -mkdir -p /user/<user name>
And then you should be able to execute your -put command.
NOTE: The -p parameter also creates the /user folder if it doesn't exist.
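The steps above, collected into one copy-paste sketch (the mkdir/put lines need a running cluster, so they are guarded on the hadoop client being installed; file.txt and abcd are the names from the question):

```shell
HDFS_USER="$(whoami)"            # step 1: your local user name
HDFS_HOME="/user/$HDFS_USER"     # the folder hadoop prepends to relative paths

# steps 2-3 need a running cluster, so only run them if the client exists
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$HDFS_HOME"   # step 2: create your HDFS home folder
  hdfs dfs -put file.txt abcd       # step 3: now resolves to /user/$HDFS_USER/abcd
fi
echo "$HDFS_HOME"
```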

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder which is the folder before "bin" so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed), you have to make sure hadoop's bin and other misc parameters are in your path. In linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is installed, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the hadoop services with the start-all.sh command, you should be good to go:
In local (standalone) mode, hadoop simply uses your ordinary file system, so you can reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - everything is implicitly done inside your HDFS user dir. Also, if you're using a client machine to run commands against a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file(s) that contain text. I used some random files from the Project Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can contain one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local (standalone) mode, there's no need to copy anything - just point it at the proper input and output dirs. Make sure the wc dir doesn't exist before running, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
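The whole flow, condensed into one sketch (hadoopuser, the input path, and the examples jar location are placeholders - the jar's exact path and name vary between hadoop versions):

```shell
# 1. stage some text input in HDFS
hadoop fs -mkdir -p /user/hadoopuser/data
hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/

# 2. run the bundled wordcount job (the output dir must not exist yet)
hadoop jar /path/to/hadoop-*-examples.jar \
  wordcount /user/hadoopuser/data /user/hadoopuser/output/wc

# 3. inspect the result
hadoop fs -cat /user/hadoopuser/output/wc/part-r-00000
```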
