Hadoop filesystem reads Linux filesystem instead of HDFS? - hadoop

I have a strange thing happening: when I read the Hadoop filesystem, it shows me the Linux filesystem, not the Hadoop one. Is anyone familiar with this issue?
Thanks,
Mika

This will happen if a valid Hadoop configuration is not found.
For example, if you run:
hadoop fs -ls
and no configuration is found at the default location, then you will see the Linux filesystem. You can test this by pointing the hadoop command at your configuration directory with the --config option, e.g.
hadoop --config <path-to-conf-dir> fs -ls
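If you want to check programmatically which filesystem a given configuration resolves to, here is a minimal sketch (the class name is only illustrative; it assumes the standard Hadoop client classes are on the classpath). When no usable core-site.xml is found, fs.defaultFS keeps its built-in default of file:///, which is exactly the "Linux filesystem instead of HDFS" symptom:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFs {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml etc. from the classpath, if present.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Prints hdfs://<namenode>:<port> when a cluster config was found,
        // and file:/// when Hadoop fell back to the local filesystem.
        System.out.println("Default filesystem: " + fs.getUri());
    }
}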

Related

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode, even though I'm able to read it in standalone mode. In cluster mode, it's looking to read from HDFS. So I put the file in HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, both on the node's disk. I believe you misunderstand the functionality of the hadoop commands.
Instead, you should try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/) and then loading it using the hdfs://... prefix, as in the sketch after these points.
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things, I hope they can serve you well.
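As a sketch of the loading step (the host, port 9000, and path below are assumptions; use the NameNode address from the persistent HDFS core-site.xml on your cluster), the driver would reference the file with a fully qualified hdfs:// URI instead of a bare path:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleApp {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SimpleApp"));
        // Fully qualified URI, so every executor reads from the same HDFS
        // regardless of which node the driver happens to run on.
        JavaRDD<String> logData =
                sc.textFile("hdfs://<master-ip>:9000/wordcount/input/input.txt").cache();
        System.out.println("Lines read: " + logData.count());
        sc.stop();
    }
}

You would submit it with the same spark-submit line as before; only the path inside the program changes.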

hadoop file system change directory command

I was going through the Hadoop fs commands list. I am a little perplexed not to find any "cd" command in hadoop fs.
Why is that? It might sound like a silly question to Hadoop users, but as I am a beginner I cannot understand why there is no cd command at the hadoop fs level.
Think about it like this:
Hadoop has a special file system called HDFS which runs on top of an existing file system, say the Linux file system. There is no concept of a current or present working directory, a.k.a. pwd.
Let's say we have the following structure in HDFS:
d1/
  d2/
    f1
  d3/
    f2
  d4/
    f3
You could use cd in your Linux file system to move from one directory to the other, but do you think changing directories in Hadoop would make sense? HDFS is like a virtual file system, and you don't interact with HDFS directly except via the hadoop command or the JobTracker.
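To see what "no pwd" means in practice, here is a small sketch against the Java API (the class name and the d3/f2 path are only illustrative, mirroring the layout above): a relative path is always qualified against your HDFS home directory, typically /user/<name>, because there is no current directory to change:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NoCwd {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // There is no cd: a relative path is qualified against the home
        // directory; an absolute path is taken from the filesystem root.
        System.out.println("Home directory: " + fs.getHomeDirectory());
        System.out.println("d3/f2 resolves to: " + fs.makeQualified(new Path("d3/f2")));
        System.out.println("/d1/d3/f2 resolves to: " + fs.makeQualified(new Path("/d1/d3/f2")));
    }
}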
HDFS provides various features that make accessing HDFS (the Hadoop filesystem) easy on local machines or edge nodes. You have the option to mount HDFS using any of the following methods. Once the Hadoop file system is mounted on your machine, you may use the cd command to browse through it (it's like mounting a remote network filesystem such as NAS):
Fuse-DFS (available from Hadoop 0.20 onwards)
NFSv3 gateway access to HDFS data (available from Hadoop 2.2.0 onwards)

error while running any hadoop hdfs file system command

I am very new to Hadoop. I am referring to the "Hadoop for Dummies" book.
I have set up a VM with the following specs:
hadoop version 2.0.6-alpha
bigtop
OS: CentOS
The problem is that while running any HDFS file system command I am getting the following error.
Example command: hadoop hdfs dfs -ls
Error: Could not find or load main class hdfs
Please advise.
Regards,
Try running:
hadoop fs -ls
or
hdfs dfs -ls
What do they return?
fs and dfs are the same commands.
Difference between `hadoop dfs` and `hadoop fs`
By typing hadoop hdfs dfs -ls you make the hadoop launcher look for a Java class named hdfs, which is why it reports "Could not find or load main class hdfs". Remove either hadoop or hdfs and the command should run.

How to move Word and PDF documents to Hadoop HDFS?

I want to copy/upload some files from a local system (a system not in the Hadoop cluster) onto Hadoop HDFS. The local system can be a Windows system too.
I tried the Flume spooling directory source. It works fine with text files. For other docs, the MIME type is getting corrupted.
Please let me know of different approaches to load files into HDFS.
hadoop fs -copyFromLocal <localsrc> URI
Check Hadoop documentation: copyFromLocal
Keep in mind, Apache Flume wasn't created for copying files around; it is meant for streaming log and event data into Hadoop.
You can also use hadoop fs -put <localsrcpath> <hdfspath>
This is an alternative to copyFromLocal.
In Hadoop 2.0 (YARN) you can do the following to transfer local files to HDFS:
hdfs dfs -put "localsrcpath" "hdfspath"
where hdfs is the command located in the bin directory.
Java code can do this easily; you don't need any extra tools for it. Here is the piece of code that worked:
// Requires java.io.IOException plus org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path
Configuration conf = new Configuration();
try {
    conf.set("fs.defaultFS", "<<namenode>>"); // something like hdfs://server:9000, or copy it from core-site.xml
    FileSystem fileSystem = FileSystem.get(conf);
    System.out.println("Uploading, please wait...");
    // args[0] = local source, e.g. C://file-or-dir; args[1] = HDFS destination, e.g. /imported
    fileSystem.copyFromLocalFile(false, new Path(args[0]), new Path(args[1].trim()));
} catch (IOException e) {
    e.printStackTrace();
}
Build a jar out of this and run it on any OS. Keep in mind that you do not need to have Hadoop running on the machine where you run this code. If you need any help, add comments.
Don't forget to add name-resolution entries for the cluster nodes on the machine where you run this code. On Windows, open C:\Windows\System32\drivers\etc\hosts and add lines like:
ip-address hadoopnamenode
ip-address slavenode
First you need to copy the docs from your Windows machine to the Linux machine using FileZilla or another tool.
Then you need to use:
hadoop fs -put localsrcpath hdfspath
The following command will also work:
hadoop fs -copyFromLocal localsrcpath hdfspath

Hadoop dfs -ls returns list of files in my hadoop/ dir

I've set up a single-node Hadoop configuration running via Cygwin under Win7. After starting Hadoop by bin/start-all.sh I run bin/hadoop dfs -ls, which returns a list of files in my hadoop directory. Then I run bin/hadoop datanode -format and bin/hadoop namenode -format, but -ls still returns the contents of my hadoop directory. As far as I understand it should return nothing (an empty folder). What am I doing wrong?
Did you edit the core-site.xml and mapred-site.xml under the conf folder?
It seems like your Hadoop cluster is in local mode.
I know this question is quite old, but the directory structure in Hadoop has changed a bit (version 2.5).
Jeroen's command in the current version would be:
hdfs dfs -ls hdfs://localhost:9000/users/smalldata
Also, just for information: the use of start-all.sh and stop-all.sh has been deprecated; instead one should use start-dfs.sh and start-yarn.sh.
I had the same problem and solved it by explicitly specifying the URL to the NameNode.
To list all directories in the root of your HDFS space, do the following:
./bin/hadoop dfs -ls hdfs://<ip-of-your-server>:9000/
The documentation says something about a default HDFS URI in the configuration, but I cannot find it. If someone knows what they mean, please enlighten us.
This is where I got the info: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#Overview
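For what it is worth, the "default HDFS URI" is presumably the fs.default.name property (later renamed fs.defaultFS) in core-site.xml; programmatically you can bypass it by handing the NameNode URI to the FileSystem API directly. A minimal sketch, with the host and port as placeholders matching the command above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        // An explicit URI overrides whatever default filesystem the
        // configuration carries, just like spelling it out on the command line.
        FileSystem fs = FileSystem.get(URI.create("hdfs://<ip-of-your-server>:9000/"),
                new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}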
Or you could just do:
1) Run stop-all.sh
2) Remove the dfs data and name directories
3) Run hadoop namenode -format
4) Run start-all.sh
