How to delete some data from an HDFS file in Hadoop

I uploaded 50 GB of data to a Hadoop cluster, but now I want to delete the first row of the data file.
Removing that row manually and then uploading the file to HDFS again would be very time consuming. Is there a better way?

HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can make the change programmatically with an M/R job that does a near-identity transformation, e.g. a streaming job running a shell script that does sed, but the gist of it is that you need to create new files; HDFS files cannot be edited.
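For illustration, here is a minimal Spark sketch of such a near-identity transformation; the input and output paths are hypothetical, and zipWithIndex runs an extra pass over the data to assign line numbers:

// Hedged sketch: drop the first line of a large HDFS text file by writing a new copy.
// The input and output paths are placeholders.
val lines = sc.textFile("hdfs:///data/input/bigfile.txt")

val withoutFirstRow = lines
  .zipWithIndex()                          // attach a global line index
  .filter { case (_, idx) => idx != 0 }    // keep everything except the first line
  .map { case (line, _) => line }

withoutFirstRow.saveAsTextFile("hdfs:///data/output/bigfile-without-first-row")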

Related

Small files in Hadoop

I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a way to achieve this simply? For example, on day one receive 100 small files and combine them into one file, then on day two add/append more files into the previously created file, etc...
If the files all have the same "schema", say CSV or JSON, then you're welcome to write a very basic Pig or Spark job to read a whole folder of tiny files and write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
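A rough Spark sketch of that approach (the paths and the target number of output files are assumptions; coalesce controls how many part files get written):

// Hedged sketch: merge a folder of small text/CSV files into fewer, larger files.
// The paths and the partition count are placeholders.
val small = sc.textFile("hdfs:///data/incoming/day1/*")

small
  .coalesce(4)   // fewer partitions means fewer, larger output files
  .saveAsTextFile("hdfs:///data/merged/day1")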
You've also mentioned Hive, so use an external table for the small files, and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
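A sketch of that CTAS approach with hypothetical table names, shown here through Spark SQL for consistency with the other snippets (it assumes a SparkSession named spark with Hive support; the same HQL can be run directly in Hive):

// Hedged sketch: compact an external table of small files into a new consolidated table.
// Table names and the storage format are placeholders.
spark.sql("""
  CREATE TABLE events_compacted
  STORED AS PARQUET
  AS SELECT * FROM events_small_files_external
""")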
IMO, if possible, the optimal solution is to setup a system "upstream" of Hadoop, which will batch your smaller files into larger files, and then dump them out to HDFS. Apache NiFi is a useful tool for this purpose.

Hadoop or Spark read tar.bzip2

How can I read a tar.bzip2 file in Spark in parallel?
I have created a custom Java Hadoop reader that reads the tar.bzip2 file, but it takes too much time to read the file because only one core is used, and after some time the application fails because a single executor gets all the data.
As we know, bzip2 files are splittable, so when reading a bzip2-compressed file into an RDD the data will get distributed across the partitions. However, the underlying tar file will also get distributed across the partitions, and tar is not splittable, so if you try to perform an operation on a partition you will just see a lot of binary data.
To solve this I simply read the bzip2-compressed data into an RDD with a single partition. I then wrote this RDD out to a directory, so that there is only a single file containing all the tar file data. I then pulled this tar file from HDFS down to my local file system and untarred it.
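A minimal sketch of that workaround with hypothetical paths; note that textFile/saveAsTextFile treats the archive as newline-delimited text, so this only round-trips cleanly when the tar contents contain no carriage returns:

// Hedged sketch of the single-partition workaround described above; paths are placeholders.
// bzip2 is decompressed on read; coalesce(1) keeps the opaque tar payload in one output file.
val raw = sc.textFile("hdfs:///data/archive.tar.bz2").coalesce(1)
raw.saveAsTextFile("hdfs:///data/archive-decompressed")

// Afterwards, outside Spark:
//   hdfs dfs -get /data/archive-decompressed/part-00000 archive.tar
//   tar -xf archive.tar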

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop and everything that comes with them. My overall goal is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files. So if you give it the path "hdfs:///user/NAME/saveLocation", a folder called saveLocation will be created, filled with part files. You should be able to load this into HBase simply by passing the directory name to HBase (a directory of part files is the standard output layout in Hadoop).
I do recommend you look into saving as Parquet, though; Parquet files are much more useful than plain text files.
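If you specifically need the generated file names, a hedged sketch using the Hadoop FileSystem API would be the following (the rdd value and the output path are assumptions):

import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch: save the RDD, then list the part files Spark generated.
// "rdd" and the output path are placeholders.
rdd.saveAsTextFile("hdfs:///user/NAME/saveLocation")

val outputDir = new Path("hdfs:///user/NAME/saveLocation")
val fs = outputDir.getFileSystem(sc.hadoopConfiguration)
val partFiles = fs.listStatus(outputDir)
  .map(_.getPath.toString)
  .filter(_.contains("part-"))

partFiles.foreach(println)   // these are the file names you can point Hive at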
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("your HDFS location where you saved your tweets").map(_._1)
wholeTextFiles pairs each file's path with its content, so mapping to _._1 keeps just the paths. This gives you an RDD of the file names, on which you can run your operations. I'm a newbie to Hadoop too, but anyway, hope that helps.

Update file in HDFS through Hue

I know that HDFS is a write-once, read-many-times type of file system. As far as I know, it's not possible to update a file (randomly) in HDFS, because a file is stored in a distributed environment (as blocks), with each block replicated on other nodes, which would make it difficult for the data nodes to update even those replicated blocks.
But my question is: is it possible to update files in HDFS using the Hue tool? I ask because I've updated many files (stored in HDFS) using the Hue tool and run map-reduce jobs on them. So, how is it possible for Hue to update files in HDFS? Does Hue do something in the background? Are the updates made through Hue really applied to the same file, or does Hue delete the file and re-write the whole file (including the new data we want to add)?
Hue deletes and re-writes the whole file, as HDFS does not support edits. You will notice that Hue currently limits editing to small files only.
Here is a blog post to learn more about the HDFS Filebrowser.

Backing up source data files in hadoop

I expect hundreds of data files in XML, Excel, and delimited formats, which I am converting to Avro on a weekly basis. Would you suggest maintaining a backup of the source files in their original format in HDFS under a backup folder, or in a folder on the local file system outside HDFS?
The files are sourced from FTP.
Once the conversion process is successful, HDFS takes care of backup for the Avro files it hosts, assuming you set the replication factor according to your needs. At that point, keeping the source files in HDFS is unnecessary; a tape backup may be what is optimal there.
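For completeness, a hedged sketch of checking and adjusting the replication factor on an Avro output file through the Hadoop FileSystem API (the path and factor are assumptions; hdfs dfs -setrep does the same from the command line):

import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch: inspect and raise the replication factor on an Avro file (placeholder path).
val avroFile = new Path("hdfs:///data/avro/weekly/part-00000.avro")
val fs = avroFile.getFileSystem(sc.hadoopConfiguration)

println(fs.getFileStatus(avroFile).getReplication)   // current replication factor
fs.setReplication(avroFile, 3.toShort)               // request 3 replicas for this file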
