How do you transfer files onto the Hadoop FS (HDFS) from the Windows command line without Cygwin?

I have zero experience with Hadoop, but suddenly have to use it at work with Spark on Windows. My question, which has been asked a few times here without me ever quite getting the syntax I need, is this: I'm trying to transfer a simple file called gensortText.txt, which let's say is at c:\gensortText.txt.
I know you can use hadoop fs -copyFromLocal. I've tried these things:
hadoop fs -copyFromLocal C:\gensortText.txt hdfs://0.0.0.0:19000
ERROR: Relative path in absolute URI.
hadoop fs -copyFromLocal C:\gensortOutText.txt \tmp\hadoop-Administrator\dfs
ERROR: copyFromLocal: `tmphadoop-Administratordfs': No such file or directory
and a number of other variations with hdfs: and using the tmp directory which all returned similar errors.
I have Hadoop in c:\deploy as suggested in the Hadoop2Windows guide (which works and allowed me to run Hadoop; I can access the web GUI and all that). Hadoop has created my new HDFS at c:\temp. Please, someone, help me figure out how to transfer files into the system. It could even be done manually if that's possible, but that doesn't seem to work, as the file doesn't show up in the web GUI when I go to "Utilities->Browse the Filesystem". Nothing shows up there at all, actually.
Can someone please help? I can provide any relevant information, but I'm so new to this I don't really know what would be helpful. I think it's just my syntax for the command-line tool. Can someone give me a concrete example of how to use hadoop fs -copyFromLocal, or another simple way to do this? Sorry for my ignorance on the subject, and thanks for any help.

To be able to run hadoop commands on Windows you need to have winutils installed and visible to the hadoop process.
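With that in place, here is a minimal sketch of a working transfer from the Windows command line, assuming (as in the question) that Hadoop lives in c:\deploy with winutils.exe in c:\deploy\bin, and that fs.defaultFS in core-site.xml points at hdfs://0.0.0.0:19000; the /user/Administrator target directory is just an example:

rem make Hadoop and winutils visible to the shell
set HADOOP_HOME=c:\deploy
set PATH=%PATH%;%HADOOP_HOME%\bin

rem create a target directory in HDFS, then copy the file into it
hadoop fs -mkdir -p /user/Administrator
hadoop fs -copyFromLocal c:\gensortText.txt /user/Administrator/gensortText.txt
hadoop fs -ls /user/Administrator

The same copy with an explicit namenode URI would be hadoop fs -copyFromLocal c:\gensortText.txt hdfs://0.0.0.0:19000/user/Administrator/gensortText.txt; my guess is that the bare hdfs://0.0.0.0:19000 destination failed because that URI has no path component, so give the destination an actual directory or file path.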

Related

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far -
Set up the Hadoop single-node cluster following the link below.
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program following the link below.
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is when I execute the last command to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get (the stack trace is attached as a screenshot) -
The directory seems to be present.
The file is also present in the directory, with contents.
Finally, as a side note, I also tried the following directory structure in the jar command.
To no avail! :/
I would really appreciate it if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr.
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running in pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
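If the goal is to run against HDFS, here is a hedged sketch of creating the input there first, assuming (as the command in the question suggests) that the text files live under /usr/local/hadoop/input on the local disk; note that the output directory must not already exist:

# copy the local input files into a matching directory in HDFS
hdfs dfs -mkdir -p /usr/local/hadoop/input
hdfs dfs -put /usr/local/hadoop/input/* /usr/local/hadoop/input/
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output

# or skip HDFS entirely and read/write the local filesystem
hadoop jar wordcount.jar file:///usr/local/hadoop/input file:///usr/local/hadoop/output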
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or from its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so in my opinion it's better to get familiar with that rather than mess around with installing Hadoop yourself from scratch.

Cloudera Hadoop: File reading/writing in HDFS

I have Scala and Java code running in Spark on the Cloudera platform whose simple task is to perform a word count on files in HDFS. My question is: what's the difference between reading the file with this code snippet -
sc.textFile("hdfs://quickstart.cloudera:8020/user/spark/InputFile/inputText.txt")
as opposed to reading it from the local drive on the Cloudera platform?
sc.textFile("/home/cloudera/InputFile/inputText.txt")
Isn't it the case that in both instances the file is saved using HDFS, so it would make no difference reading/writing either way? Do these both read/write to HDFS? I referred to this thread, but got no clue:
Cloudera Quickstart VM illegalArguementException: Wrong FS: hdfs: expected: file:
Could you please tell me at least one case where using hdfs:// means something different?
Thank you!
As per my knowledge:
In sc.textFile("hdfs://quickstart.cloudera:8020/user/spark/InputFile/inputText.txt"), the hdfs://quickstart.cloudera:8020 part points at the HDFS namenode, and /user/spark/InputFile/inputText.txt is the HDFS directory or file being read.
In sc.textFile("/home/cloudera/InputFile/inputText.txt"), '/home/cloudera/InputFile/inputText.txt' refers to your local Unix/Linux file system, assuming the default filesystem is the local one; if fs.defaultFS points at HDFS, a path without a scheme is resolved against HDFS instead, so use file:// when you mean the local disk.
So if you want to read from or write to a file in HDFS, you need to use hdfs://namenodeHost:port as set in your Hadoop configuration.
Hope this clarifies your doubt!
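A short sketch of the distinction, reusing the paths from the question (quickstart.cloudera:8020 is the Cloudera QuickStart default namenode address; where a scheme-less path ends up depends on fs.defaultFS):

// Fully qualified HDFS URI: always read through the namenode at quickstart.cloudera:8020
val fromHdfs = sc.textFile("hdfs://quickstart.cloudera:8020/user/spark/InputFile/inputText.txt")

// Explicit local URI: always read from the node's local Linux filesystem
val fromLocal = sc.textFile("file:///home/cloudera/InputFile/inputText.txt")

// No scheme: resolved against fs.defaultFS from core-site.xml; on the QuickStart VM that is
// typically hdfs://quickstart.cloudera:8020, so check that setting to know which filesystem
// this one actually hits
val fromDefault = sc.textFile("/user/spark/InputFile/inputText.txt")

// Same word count either way; only the source filesystem differs
fromHdfs.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _).take(5).foreach(println)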

How can I get Spark to access local HDFS on Windows?

I have installed both Hadoop and Spark locally on a Windows machine.
I can access HDFS files in hadoop, e.g.,
hdfs dfs -tail hdfs:/out/part-r-00000
works as expected. However, if I try to access the same file from the spark shell, e.g.,
val f = sc.textFile("hdfs:/out/part-r-00000")
I get an error that the file does not exist. Spark can access files in the Windows file system using the file:/... syntax, though.
I have set the HADOOP_HOME environment variable to c:\hadoop, which is the folder containing the Hadoop install (in particular, winutils.exe, which seems to be necessary for Spark, is in c:\hadoop\bin).
Because it seems that HDFS data is stored in the c:\tmp folder, I was wondering whether there would be a way to let Spark know about this location.
Any help would be greatly appreciated. Thank you.
If you are getting "file doesn't exist", that means your Spark application (code snippet) is able to connect to HDFS.
The HDFS file path that you are using seems wrong.
This should solve your issue:
val f = sc.textFile("hdfs://localhost:8020/out/part-r-00000")
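If 8020 turns out not to be the right port for a local Windows install, here is a quick sketch (run from the spark-shell) for finding the namenode address Spark is actually configured with:

// Prints something like hdfs://localhost:9000 (or file:/// if no HDFS config was picked up);
// the hdfs:// URI passed to textFile has to match this host and port
println(sc.hadoopConfiguration.get("fs.defaultFS"))

// Building the path from that value avoids hard-coding the port
val f = sc.textFile(sc.hadoopConfiguration.get("fs.defaultFS").stripSuffix("/") + "/out/part-r-00000")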

Error: -copyFromLocal: java.net.UnknownHostException

I am new to Java, Hadoop, etc.
I am having a problem when trying to copy a file to HDFS.
It says: "-copyFromLocal: java.net.UnknownHostException: quickstart.cloudera (...)"
How can I solve this? It is an exercise. You can see the problem in the images below.
Image with the problem
Image 2 with the error
Thank you very much.
As the error output indicates, you need to supply an HDFS folder path as the destination. So the command should look like:
hadoop fs -copyFromLocal words.txt /HDFS/Folder/Path
Almost all errors that you get while working in Hadoop are Java errors, as MapReduce was mostly written in Java. But that doesn't mean there is a Java error in your code.
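Separately, since the exception names the host quickstart.cloudera, it can also help (a guess, as the full stack trace is only in the screenshots) to confirm that the namenode hostname configured in fs.defaultFS actually resolves on the machine running the command:

# show which namenode URI the client is configured to talk to
hdfs getconf -confKey fs.defaultFS
# check that the hostname in that URI resolves
ping quickstart.cloudera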

HDFS: FileSystem.exists(path) returns false on existing resource?

I am having difficulties accessing files in a locally running Hadoop HDFS (my workstation is the name/data node).
In my HDFS I have a file located in the folder "/huser/data.txt"
I can confirm with hdfs dfs -ls /huser that the file exists.
I create the FileSystem by calling FileSystem.get(uri, config), with uri being hdfs://localhost:9000.
If I call the exists method of org.apache.hadoop.fs.FileSystem, I always get false as the return value.
I tried various parameter combinations, but I am wondering what I am doing wrong:
fs.exists(new Path("hdfs:/huser/data.txt"))
fs.exists(new Path("hdfs://huser/data.txt"))
Neither works.
I also tried using a MiniDFSCluster to provide a minimal working example, but unfortunately it works there. I seem to have an issue with a live HDFS and accessing its files (Hadoop 2.6).
Maybe I'm a bit late, but I stumbled upon your question while googling. For anyone having this issue, the solution is to not include the hdfs scheme in your path:
fs.exists(new Path("/huser/data.txt"))
This should work.
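For completeness, a minimal sketch of the whole pattern, written here in Scala against the same Hadoop API (the hdfs://localhost:9000 URI and the /huser/data.txt path are the ones from the question):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The scheme, host and port belong to the FileSystem handle, not to every Path
val fs = FileSystem.get(new URI("hdfs://localhost:9000"), new Configuration())

// Plain absolute path, no "hdfs:" prefix
println(fs.exists(new Path("/huser/data.txt")))

// A fully qualified path also works, but only if scheme and authority match the FileSystem
println(fs.exists(new Path("hdfs://localhost:9000/huser/data.txt")))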
