Force update fsimage in Hadoop? - hadoop

I have a homework assignment that asks us to convert an fsimage in XML format to TSV. I want to set up a certain file structure in my HDFS and observe the differences in the file exported through OIV; however, I can't seem to find a way to force HDFS to dump a new fsimage. How can I force my namenode to dump an updated version of the fsimage?
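For reference, one way to force a fresh checkpoint is hdfs dfsadmin -saveNamespace (the NameNode has to be in safe mode while the image is written). A minimal sketch of the same thing through the HDFS client API, assuming HDFS superuser privileges and that the default filesystem is HDFS:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.hdfs.DistributedFileSystem
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction

    // Equivalent of:
    //   hdfs dfsadmin -safemode enter && hdfs dfsadmin -saveNamespace && hdfs dfsadmin -safemode leave
    val conf = new Configuration()                 // picks up core-site.xml / hdfs-site.xml from the classpath
    val dfs  = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

    dfs.setSafeMode(SafeModeAction.SAFEMODE_ENTER) // the namespace must be read-only while the image is saved
    dfs.saveNamespace()                            // forces the NameNode to write a fresh fsimage checkpoint
    dfs.setSafeMode(SafeModeAction.SAFEMODE_LEAVE)

The new fsimage in the NameNode's storage directory can then be exported with the offline image viewer, e.g. hdfs oiv -p XML -i <fsimage file> -o fsimage.xml.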

Related

Hive doesn't read latest HDFS file data until I close OutputStream

I use Spark Streaming to append data to an HDFS file. For each batch, I write the data and then call hsync to make sure it is persisted.
When I run hadoop fs -cat, I can see that my update is persistent; however, if I run select count(*) from table in Hive, the total count does not change until I close the OutputStream.
I also noticed that when I run hadoop fs -ls, the file's last modification time and size do not change until the OutputStream is closed.
Why does this happen? Is there a way to let Hive read the new data without closing the OutputStream?
Edit (2018-08-11):
I guess the problem is that before the file is closed, its size is not synced to the namenode. When the MR job is launched, it gets the file size from the namenode, which is not the actual size, and the job reads only that many bytes, so the newly appended data is not read.
Is there a config property to avoid this?
The reason is that when the MR job is initialized, the client gets the file length from the namenode and calculates the file splits. While the file is not closed, its length is not synced to the namenode, so the client doesn't know there is new data. I customized a file input format to verify this: https://github.com/yantzu/hendeavour
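As an illustration of that length-sync point (not necessarily a complete fix for the Hive behaviour), the HDFS append stream has an hsync variant that also updates the file length at the NameNode; a minimal sketch, with a made-up path:

    import java.util.EnumSet
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hdfs.client.HdfsDataOutputStream
    import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag

    val fs  = FileSystem.get(new Configuration())             // assumes the default filesystem is HDFS
    val out = fs.append(new Path("/data/stream/part-00000"))  // hypothetical file being appended to

    out.write("new batch\n".getBytes("UTF-8"))
    out match {
      case h: HdfsDataOutputStream =>
        h.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH))           // persists the data and syncs the length to the NameNode
      case _ =>
        out.hsync()                                           // plain hsync leaves the NameNode length stale until close()
    }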

Read a tar.bzip2 file in Hadoop or Spark

How can I read a tar.bzip2 file in Spark in parallel?
I have created a custom Hadoop reader in Java that reads the tar.bzip2 file, but it takes too long because only one core is used, and after some time the application fails because a single executor ends up with all the data.
As we know, bzip2-compressed files are splittable, so when reading a bzipped file into an RDD the data will be distributed across the partitions. However, the underlying tar archive is not splittable and will also be spread across the partitions, so if you try to perform an operation on a single partition you will just see a chunk of binary tar data.
To solve this I simply read the bzipped data into an RDD with a single partition. I then wrote this RDD out to a directory, so that there is only a single file containing all the tar data. I then pulled this tar file from HDFS down to my local file system and untarred it.
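The RDD round trip described above pushes raw tar bytes through a line-oriented API, so here is a byte-safe sketch of the same idea using Hadoop's codec classes instead: decompress the archive into a single uncompressed .tar on HDFS, then pull that down and untar it locally. The paths are made up:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils
    import org.apache.hadoop.io.compress.CompressionCodecFactory

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    val src  = new Path("/data/archive.tar.bz2")
    val dst  = new Path("/data/archive.tar")

    val codec = new CompressionCodecFactory(conf).getCodec(src)  // resolves BZip2Codec from the .bz2 suffix
    val in    = codec.createInputStream(fs.open(src))            // decompressing stream over the HDFS file
    val out   = fs.create(dst, true)
    IOUtils.copyBytes(in, out, conf, true)                       // streams the bytes and closes both ends

    // then, outside the JVM:
    //   hdfs dfs -get /data/archive.tar && tar -xf archive.tar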

How do I get the generated filename when calling the Spark saveAsTextFile method

I'm new to Spark, Hadoop and everything that comes with them. My overall goal is to build a real-time application that gets tweets and stores them on HDFS in order to build a report based on HBase.
I'd like to get the generated filename when calling the saveAsTextFile RDD method, in order to import it into Hive.
Feel free to ask for further information, and thanks in advance.
saveAsTextFile will create a directory of part files. So if you give it the path "hdfs://user/NAME/saveLocation", a folder called saveLocation will be created, filled with part files. You should be able to load this into HBase simply by passing the directory name to HBase (a directory of part files is the standard output layout in Hadoop).
I do recommend you look into saving as Parquet though; it is much more useful than plain text files.
From what I understand, you saved your tweets to HDFS and now want the file names of those saved files. Correct me if I'm wrong.
val filenames = sc.wholeTextFiles("Your hdfs location where you saved your tweets").map(_._1)
This gives you an RDD of the file names (wholeTextFiles returns (path, content) pairs), on which you can do your operations. I'm a newbie to Hadoop too, but anyway... hope that helps.
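If what you need are the concrete part-file names that saveAsTextFile produced (for example to register them in Hive), here is a small sketch using the Hadoop FileSystem API; the tweets RDD and the output path are made up:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val outDir = "hdfs:///user/NAME/saveLocation"
    tweets.saveAsTextFile(outDir)                       // tweets is a hypothetical RDD[String]

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val partFiles = fs.listStatus(new Path(outDir))     // everything Spark wrote into the output directory
      .map(_.getPath.toString)
      .filter(_.contains("part-"))                      // skip the _SUCCESS marker
    partFiles.foreach(println)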

Update file in HDFS through Hue

I know that HDFS is a write-once, read-many type of file system. As far as I know, it is not possible to update a file (randomly) in HDFS, because a file is stored in a distributed environment (as blocks), with each block replicated on other nodes, which would make it difficult for the data nodes to update all of these replicated blocks.
But my question is: is it possible to update files in HDFS using the Hue tool? I've updated many files (stored in HDFS) through Hue and run map-reduce jobs on them. So how is Hue able to update files in HDFS? Does Hue do something in the background? Are the updates made through Hue really applied to the same file, or does Hue delete the file and rewrite the whole file (including the new data we want to add)?
Hue deletes and rewrites the whole file, as HDFS does not support edits. You will notice that Hue currently limits editing to small files only.
Here is a blog post to learn more about the HDFS Filebrowser.
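For illustration, a minimal sketch of that delete-and-rewrite pattern with the Hadoop FileSystem API (an assumption about the general approach, not Hue's actual code; the path and content are made up):

    import java.nio.charset.StandardCharsets
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs     = FileSystem.get(new Configuration())
    val target = new Path("/user/demo/notes.txt")
    val tmp    = new Path("/user/demo/.notes.txt.tmp")

    val out = fs.create(tmp, true)                      // write the complete edited content to a temp file
    out.write("the whole edited file\n".getBytes(StandardCharsets.UTF_8))
    out.close()

    fs.delete(target, false)                            // drop the old file...
    fs.rename(tmp, target)                              // ...and move the rewritten copy into its place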

How to delete some data from an HDFS file in Hadoop

I uploaded 50 GB of data to a Hadoop cluster, but now I want to delete the first row of the data file.
Removing that data manually and re-uploading the file to HDFS would be time consuming.
Please advise.
HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can make the change programmatically with an M/R job that does a near-identity transformation, e.g. a streaming job running a shell script that does sed, but the gist of it is that you need to create new files; HDFS files cannot be edited.
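The answer suggests an M/R streaming job; as a sketch of the same near-identity rewrite using Spark instead (the paths are made up):

    val in  = "hdfs:///data/big.tsv"                    // hypothetical input
    val out = "hdfs:///data/big-no-first-row.tsv"       // new output; the original stays untouched

    sc.textFile(in)
      .mapPartitionsWithIndex { (idx, lines) =>
        if (idx == 0) lines.drop(1) else lines          // drop the first line of the first partition only
      }
      .saveAsTextFile(out)

Once the new output looks right, the original file can be deleted.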
