Is there any way to modify a txt file inside HDFS directly from the terminal?
Assume I have "my_text_file.txt" and I would like to modify it inside HDFS using a command like the one below.
$ hdfs dfs -XXXX user/my_text_file.txt
I would like to know what "XXXX" could be, if such a command exists.
Please note that I don't want to make the modification locally and then copy the file to HDFS.
You cannot edit files that are already in HDFS; in-place editing is not supported. HDFS follows the "write once, read many" model, so if you want to edit a file, make the changes in a local copy and then move it back to HDFS.
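A minimal sketch of that round trip with the standard HDFS shell commands (the /tmp path and /user/... destination are just examples, adjust them to your layout):

hdfs dfs -get /user/my_text_file.txt /tmp/my_text_file.txt
(edit /tmp/my_text_file.txt with any local editor)
hdfs dfs -put -f /tmp/my_text_file.txt /user/my_text_file.txt

If all you need is to add lines at the end, hdfs dfs -appendToFile - /user/my_text_file.txt reads from stdin (type or pipe the new lines into it) without needing a local copy, assuming append is enabled on your cluster.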
Currently, as explained by @BruceWayne, it's not possible. Editing files stored in HDFS from the terminal would be very difficult because the files are split into blocks and distributed across the cluster, and the HDFS shell only exposes the standard file system commands, none of which edit a file in place.
You could, in principle, edit them by locating the underlying block data on each datanode in the cluster, but that would be troublesome.
Alternatively, you can install Hue. With Hue you can edit files in HDFS using a web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". Nowadays, though, you can edit a file using the Hue File Browser in Cloudera.
Related
I am learning Hadoop and I came across a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work either from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce job to use a single reducer, for example by setting the number of reduce tasks to one or by keying all the results to a single key.
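As a minimal sketch, assuming the job's driver goes through ToolRunner/GenericOptionsParser (the jar name, class name and paths below are placeholders), the reducer count can be forced from the command line:

hadoop jar my_job.jar MyDriver -D mapreduce.job.reduces=1 /input_dir /output_dir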
However, this defeats the purpose of having a distributed file system and multiple processors. Any Hadoop job should be able to read a directory of files rather than being restricted to processing a single file.
If you need to download a single file from HDFS, then you should use getmerge.
There is no easy way to do this directly in HDFS, but the trick below works. It is not a scalable solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
I'm new to Apache Hadoop and I'm trying to copy a simple text file from my local directory to HDFS on Hadoop, which is up and running. However, Hadoop is installed in D: while my file is in C:.
If I use the -put or copyFromLocal command in cmd with the file in the aforementioned drive, it doesn't allow me to do that. However, if I place the text file in the same D: drive, the file is correctly uploaded to Hadoop and can be seen on the Hadoop localhost. The command that works when the file and Hadoop are on the same drive is as follows:
hadoop fs -put /test.txt /user/testDirectory
If my file is in a separate drive, I get the error '/test.txt': No such file or directory. I've tried variations of /C/pathOfFile/test.txt but to no avail. In short, I need to know how to access a local file on another drive, specifically with the -put command. Any help with this probably amateurish question will be appreciated.
If your current cmd session is in D:\, then your command would look at the root of that drive
You could try prefixing the path
file:/C:/test.txt
Otherwise, cd to the path containing your file first, then just -put test.txt or -put .\test.txt
Note: HDFS doesn't know about the difference between C and D unless you actually set fs.defaultFS to be something like file:/D:/hdfs
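For example (a sketch; the C: path below is hypothetical), a full file: URI passed to -put should work regardless of which drive the cmd session is on:

hadoop fs -put file:///C:/path/to/test.txt /user/testDirectory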
From your question I assume that you have installed Hadoop in a virtual machine (VM) on a Windows installation. Please provide more details if this assumption is incorrect. The issue is that your VM treats drive D: as the local directory, which is where -put and -copyFromLocal look for files. Drive C: is currently not visible to these commands.
You need to mount drive C: in your VM in order to make its files available as local files for Hadoop. There are guides for this depending on your VM. I advise care while doing it, so as not to mishandle any Windows installation files.
I need some help. I am downloading a file from a web page using Python code, placing it in the local file system, transferring it into HDFS with the put command, and then performing operations on it.
But there might be situations where the file is very large and downloading it to the local file system first is not the right approach. I want the file to be downloaded directly into HDFS without using the local file system at all.
Can anyone suggest which method would be the best way to proceed?
If there are any errors in my question please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept e.g. an input file containing a list of files to be downloaded and then download them and stream out the results. Note that this will require your cluster to have open access to the internet which is generally not desirable.
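A rough sketch of that idea with a map-only Hadoop Streaming job (everything below is an assumption rather than your setup: the jar path, the input list /user/me/url_list.txt, the target directory /data/downloads, and the requirement that curl and the hdfs client are available on every node). Each map task reads a slice of the URL list from stdin and pipes every download straight into HDFS:

mapper.sh (hypothetical helper script):
while read -r url; do
  name=$(basename "$url")
  curl -sS "$url" | hdfs dfs -put -f - /data/downloads/"$name"
done

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -files mapper.sh \
  -input /user/me/url_list.txt \
  -output /user/me/download_log \
  -mapper "bash mapper.sh"

The -output directory only collects empty part files and job logs here; the real results land in /data/downloads.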
I have a requirement where I have to merge the output of the mappers of a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper to process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us
Can I have only one mapper to process all the files under a directory?
Have you looked into CombinedFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.
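A minimal sketch of the idea with Hadoop Streaming, under the assumption that your Hadoop version ships org.apache.hadoop.mapred.lib.CombineTextInputFormat and honors mapreduce.input.fileinputformat.split.maxsize for it (the paths and the 256 MB cap are illustrative): the combined input format packs many small files into few splits, so if the total input fits under the cap you get a single map task and therefore a single output file, with no reducer involved:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=0 \
  -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
  -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
  -input /A \
  -output /A_merged \
  -mapper /bin/cat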
I tried using the Flume spooling directory source and the HDFS sink, but this does not serve my purpose because the files are read by Flume and then written to HDFS as part files that can be rolled by size/time (please correct me if I've got this wrong). Is there a tool that continuously does something like an HDFS put on every file that is dropped into the spool directory?
If I got your question correctly, you have a directory, files keep arriving in it, and you want to move those files to HDFS without reading them. HDFS copyFromLocal will solve your issue; you just need some logic that finds the recent files in the directory and runs the copyFromLocal command to copy them into HDFS.
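A minimal sketch of that logic as a polling shell loop (the paths /data/spool, /user/data/incoming and the done/ subdirectory are hypothetical; an inotify-based watcher or a scheduler like cron would be more robust in production):

mkdir -p /data/spool/done
while true; do
  for f in /data/spool/*; do
    [ -f "$f" ] || continue
    hdfs dfs -put "$f" /user/data/incoming/ && mv "$f" /data/spool/done/
  done
  sleep 10
done

Each file is only moved to done/ after the put succeeds, so a failed upload is retried on the next pass.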