How to save data in HDFS with Spark? - hadoop

I want to use Spark Streaming to retrieve data from Kafka. Now, I want to save my data to a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm->{
frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode data directory created when I installed Hadoop (I don't know whether I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir

The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.
The directory myNewFolder will be created when the Spark job executes.
The datanode data directory, which is given by dfs.datanode.data.dir in hdfs-site.xml, is where the blocks of the files you store in HDFS are kept; it should never be referenced as an HDFS directory path.
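As a minimal sketch along those lines (the namenode host and folder name are placeholders, port 9000 is taken from the question, and myDStream is assumed to be a JavaDStream<String>):
myDStream.foreachRDD((rdd, time) -> {
    if (!rdd.isEmpty()) {
        // Write each batch to its own subdirectory, since saveAsTextFile
        // fails if the target directory already exists.
        rdd.saveAsTextFile("hdfs://namenode_ip:9000/myNewFolder/batch-" + time.milliseconds());
    }
});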

Related

How do you create an HDFS data directory?

Every time my Hadoop server reboots, I have to format the namenode to start Hadoop. This removes all of the files in my Hadoop installation.
I need to move my Hadoop HDFS location from /tmp to a permanent location, so that whenever the server reboots I don't have to format the namenode.
I am very new to Hadoop.
How do I create the HDFS data directories in another location?
How do I reference this data directory in the config file so that I don't have to format the namenode?
These two properties in hdfs-site.xml determine where HDFS stores its data on the local file system.
The defaults are under /tmp:
dfs.namenode.name.dir
dfs.datanode.data.dir
You typically have to format a namenode only when the HDFS processes fail to terminate correctly (for example, after a power failure or forced shutdown). It is encouraged to run a standby NameNode to guard against these scenarios.
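As a sketch, pointing them at a permanent location might look like this in hdfs-site.xml (the /data/hadoop paths are placeholders; any directory that survives reboots will do):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/hadoop/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop/datanode</value>
</property>
After changing these values, format the namenode once with hdfs namenode -format and restart HDFS; the metadata and blocks will then persist across reboots.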

Loading data into Hive Table from HDFS in Cloudera VM

When using the Cloudera VM, how can you access information in the HDFS? I know there isn't a direct path to the HDFS, but I also don't see how to dynamically access it.
After creating a Hive table through the Hive CLI, I attempted to load some data from a file located in HDFS:
load data inpath '/test/student.txt' into table student;
But then I just get this error:
FAILED: SemanticException Line 1:17 Invalid path ''/test/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/test/student.txt
I also tried to load data that is not in HDFS into a Hive table, like so:
load data inpath '/home/cloudera/Desktop/student.txt' into table student;
However, that just produced this error:
FAILED: SemanticException Line 1:17 Invalid path ''/home/cloudera/Desktop/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/home/cloudera/Desktop/student.txt
Once again I see it trying to access data with the root of hdfs://quickstart.cloudera:8020 and I'm not sure what that is, but it doesn't seem to be the root directory for the HDFS.
I'm not sure what I'm doing wrong but I made sure the file is located in the HDFS so I don't know why this error is coming up or how to fix it.
how can you access information in the HDFS
Well, you certainly don't need to use Hive to do it. hdfs dfs commands are how you interact with HDFS.
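For example, from a terminal inside the VM:
hdfs dfs -ls /                    # list the HDFS root
hdfs dfs -ls /test                # list the directory you loaded the file from
hdfs dfs -cat /test/student.txt   # print the file's contents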
I'm not sure what that is, but it doesn't seem to be the root directory for the HDFS
It is the root of HDFS. quickstart.cloudera is the hostname of the VM. Port 8020 is the HDFS port.
Your exceptions come from the use (or omission) of the LOCAL keyword.
What you're doing
LOAD DATA INPATH <hdfs location>
VS what you seem to be wanting
LOAD DATA LOCAL INPATH <local file location>
If the file really is in HDFS, it's not clear how you put it there, but HDFS definitely doesn't have a /home folder or a Desktop, so the second error at least makes sense.
Anyway, hdfs dfs -put /home/cloudera/Desktop/student.txt /test/ is one way to upload your file, assuming the hdfs:///test folder already exists. Otherwise, hdfs dfs -put /home/cloudera/Desktop/student.txt /test renames your file to /test on HDFS.
Note: you can create an EXTERNAL TABLE over an HDFS directory; then you don't need the LOAD DATA command at all.
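For example, a sketch of the EXTERNAL TABLE approach, assuming the file holds tab-separated id and name columns (the schema and delimiter here are assumptions):
CREATE EXTERNAL TABLE student (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/test/';
Hive will then read whatever files sit under /test/ in place, without moving or copying them.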

Hadoop cluster uses HDFS as the default FS; client wants to use a file on the local file system

My Hadoop cluster is using HDFS as the default FS, but on the client side the input file is located on the local file system (for some reason, I don't want to move it into HDFS). If I give the MapReduce job a URI like 'file:///opt/myDoc.txt', I get a file-does-not-exist error. How can I access the local file system in this case?
This is the place where you give the input path:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
Instead of args[0], give the local path.
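A minimal sketch of that, using the same old mapred API as above (MyJob is a hypothetical driver class, and the file must exist at that path on the node(s) where the tasks actually run):
JobConf conf = new JobConf(MyJob.class);
// The explicit file:// scheme resolves the path against the local file
// system even though fs.defaultFS points at HDFS.
FileInputFormat.setInputPaths(conf, new Path("file:///opt/myDoc.txt"));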

Moving data to HDFS using the copyFromLocal switch

I don't know what's going on here, but I am trying to copy a simple file from a directory in my local filesystem to the directory specified for HDFS.
In my hdfs-site.xml I have specified that the directory for HDFS will be /home/vaibhav/Hadoop/dataNodeHadoopData, using the following properties:
<name>dfs.data.dir</name>
<value>/home/vaibhav/Hadoop/dataNodeHadoopData/</value>
and
<name>dfs.name.dir</name>
<value>/home/vaibhav/Hadoop/dataNodeHadoopData/</value>
I am using the following command -
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data /home/vaibhav/Hadoop/dataNodeHadoopData
to copy the file u.data from its local filesystem location to the directory that I specified as the HDFS directory. But when I do this, nothing happens - no error, nothing. And no file gets copied to HDFS. Am I doing something wrong? Could there be a permissions issue?
Suggestions needed.
I am using pseudo distributed single node mode.
Also, on a related note: in my MapReduce program I have set the configuration to point to the inputFilePath /home/vaibhav/ml-100k/u.data. Would it not automatically copy the file from the given location to HDFS?
I believe dfs.data.dir and dfs.name.dir have to point to two different, existing directories. Furthermore, make sure you have formatted the namenode FS after changing the directories in the configuration.
While copying to HDFS you're incorrectly specifying the target. The correct syntax for copying a local file to HDFS is:
bin/hadoop dfs -copyFromLocal <local_FS_filename> <target_on_HDFS>
Example:
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data my.data
This would create a file my.data in your user's home directory in HDFS.
Before copying files to HDFS, make sure you have mastered listing directory contents and creating directories first.
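For example (the myInput directory name is just a placeholder; relative paths resolve against your HDFS home directory /user/<username>):
bin/hadoop dfs -mkdir myInput                  # create a directory under your HDFS home
bin/hadoop dfs -ls                             # confirm it exists
bin/hadoop dfs -copyFromLocal /home/vaibhav/ml-100k/u.data myInput/
bin/hadoop dfs -ls myInput                     # u.data should show up here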

Does a file need to be in HDFS in order to use it in distributed cache?

I get
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/path/to/my.jar, expected: hdfs://ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
if I try to add a local file to the distributed cache in Hadoop. When the file is on HDFS, I don't get this error (obviously, since it's using the expected FS). Is there a way to use a local file in the distributed cache without first copying it to HDFS? Here is a code snippet:
Configuration conf = job.getConfiguration();
FileSystem fs = FileSystem.getLocal(conf);
Path dependency = fs.makeQualified(new Path("/local/path/to/my.jar"));
DistributedCache.addArchiveToClassPath(dependency, conf);
Thanks
It has to be in HDFS first. I'm going to go out on a limb here, but I think it is because the file is "pulled" to the local distributed cache by the slaves, not pushed. Since the files are pulled, the slaves have no way to access that local path.
No, I don't think you can put anything on the distributed cache without it being in HDFS first. All Hadoop jobs use input/output paths in relation to HDFS.
The file can be on the local file system, in HDFS, in S3, or even on another cluster. You need to specify it as
-files hdfs:// if the file is in HDFS;
by default it assumes the local file system.
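As a sketch of how that is typically passed, assuming the driver goes through ToolRunner/GenericOptionsParser (which is what interprets -files), and with placeholder jar, class, and path names:
# -files ships the listed file into each task's distributed cache.
hadoop jar myjob.jar com.example.MyDriver \
    -files hdfs:///apps/deps/my.jar \
    /input /output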
