I'm running dataFrame.rdd.saveAsTextFile("/home/hadoop/test") in an attempt to write a DataFrame to disk. This executes with no errors, but the folder is not created. Furthermore, when I run the same command again (in the shell), an exception is thrown:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/feet already exists
Any idea why this is? Is there a nuance of the submission mode (client vs. cluster) that affects this?
EDIT:
I have permission to create directories in /home/hadoop but I cannot create directories inside any of the dirs/sub-dirs created by rdd.saveAsTextFile("file:/home/hadoop/test"). The structure looks like this:
/home/hadoop/test/_temporary/0
How are _temporary and 0 being created if I do not have permission to create directories inside test from the command line? Is there a way to change the permission of these created directories?
EDIT 2:
In the end I wrote to S3 instead, using rdd.coalesce(1).saveAsTextFile("s3://..."). This is only viable if you have a very small output, because coalesce(n) collapses the RDD into n partitions, so all further processing happens on only n workers. In my case I chose n = 1 so the file would be generated by a single worker. This gave me a folder containing one part-00000 file with all of my data.
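Roughly, a sketch of what I ran (the bucket path is a placeholder):
// collapse to a single partition so one worker writes one part-00000 file
dataFrame.rdd.coalesce(1).saveAsTextFile("s3://my-bucket/output")  // "my-bucket/output" stands in for the real path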
Since SPARK-1100 (https://spark-project.atlassian.net/browse/SPARK-1100), saveAsTextFile should never silently overwrite an already existing folder.
If you receive a java.io.IOException: Mkdirs failed to create file:..., it probably means you have permission problems when trying to write to the output path.
If you give more context, the answers could be more helpful.
For example: are you running a local shell or a cluster shell? What type of cluster?
EDIT: I think you are hitting that error because all executors are trying to write to the same local path, which isn't available on all of them.
saveAsTextFile works. It writes to the default file system (configured by fs.default.name in your core-site.xml). In this case the default file system is hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/.
If you want to write to local disk, use saveAsTextFile("file:/home/hadoop/test"). If you have more than one node in the Spark cluster, the results will be mostly unusable: each node will write some parts of the RDD to its own local disk. But for testing this may be okay.
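For example, from the same spark-shell session, the difference in a nutshell (hostname and paths are the ones from the question):
// check what the default file system actually is
sc.hadoopConfiguration.get("fs.defaultFS")   // older configs use fs.default.name

// no scheme: goes to the default file system, i.e. HDFS here
dataFrame.rdd.saveAsTextFile("/home/hadoop/test")

// explicit file: scheme: each worker writes its part files to its own local disk
dataFrame.rdd.saveAsTextFile("file:/home/hadoop/test")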
Related
I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, transferring it into HDFS using the put command, and then performing operations on it.
But there might be situations where the file is very large, and downloading it to the local file system first is not the right approach. So I want the file to be downloaded directly into HDFS, without using the local file system at all.
Can anyone suggest which method would be the best way to proceed?
If there are any errors in my question, please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept, for example, an input file containing a list of files to be downloaded, then download them and stream the results out to HDFS. Note that this requires your cluster to have open access to the internet, which is generally not desirable.
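For illustration only, here is a rough sketch of that idea using Spark (from spark-shell, where sc is available) instead of a hand-written map-only MapReduce job; the URL list path and output directory are made up, and each task pulls one URL and streams it straight into HDFS:
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// hypothetical file with one URL per line
sc.textFile("hdfs:///tmp/url-list.txt").foreach { url =>
  val fs = FileSystem.get(new Configuration())                       // default FS, i.e. HDFS on the cluster
  val out = fs.create(new Path("/data/downloads/" + url.split('/').last))  // hypothetical output dir
  val in = new URL(url).openStream()
  IOUtils.copyBytes(in, out, 4096, true)                             // copies and closes both streams
}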
I have a requirement where I have to merge the output of the mappers over a directory into a single file. Let's say I have a directory A which contains 3 files:
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
Or:
Can I have only one mapper process all the files under a directory?
If you set up FUSE to mount your HDFS to a local directory, then your output can go to the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use FUSE to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.
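For reference, a minimal sketch (not taken from that post) of pointing a job at CombineTextInputFormat, the text-oriented concrete subclass, which packs many small files into a single split so one mapper can handle them; the input path and split size are illustrative:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{CombineTextInputFormat, FileInputFormat}

val job = Job.getInstance()
job.setInputFormatClass(classOf[CombineTextInputFormat])
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024)  // cap each combined split at ~128 MB
FileInputFormat.addInputPath(job, new Path("/A"))              // the directory holding 1.txt, 2.txt, 3.txt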
We have a full HDFS backup using distcp that takes a long time to run, and some of the data on HDFS is "moving", that is, it is created and deleted while the copy runs. This results in mappers failing with java.io.FileNotFoundException: No such file or directory. Such files are unimportant; we just want the backup to do the best it can.
Now it seems that -i ("ignore failures") is not quite what we want, because it ignores failures at the map level rather than the file level: if a map task fails, all files associated with that map task are skipped. What we want is for just the missing file to be ignored.
I am looking for a way to push log data from a read-only folder to HDFS using Flume. As far as I know, Flume's spoolDir source needs write access to rename each file once it is done, so I wanted to create a temp folder as the spoolDir and use rsync to copy files into it.
But, as far as I know, once Flume renames a file in the destination folder (myfile.COMPLETED), the next rsync run will copy it again, right?
Any other solution?
An alternative source is the ExecSource. You can run a tail command on a single read-only file and start processing the data. Nevertheless, you must take into account that this is an unreliable source, since there is no way to recover from an error while putting the data into the agent's channel.
Basically, what is the major difference between moveFromLocal and copyToLocal versus using the put and get commands in the Hadoop CLI?
moveFromLocal: Similar to put command, except that the source localsrc is deleted after it’s copied.
copyToLocal: Similar to get command, except that the destination is restricted to a local file reference.
Source.