I expect hundreds of data files in XML, Excel, and delimited formats, which I am converting to Avro on a weekly basis. Would you suggest maintaining a backup of the source files in their original format in HDFS under a backup folder, or in a folder on the local file system, outside HDFS?
The files are sourced from FTP.
Once the conversion process has succeeded, HDFS is hosting the Avro files and takes care of backup, assuming you set the replication factor according to your needs. At that point, keeping the source files in HDFS is unnecessary; a tape backup of the sources is probably what is optimal.
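If you want to be explicit about that replication, hdfs dfs -setrep is the usual way to set it; the snippet below is only a rough sketch of the same check through the Hadoop FileSystem API (Scala), with an assumed output directory and replication factor.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val files = fs.listFiles(new Path("/data/avro"), true)   // assumed Avro output directory
while (files.hasNext) {
  val status = files.next()
  if (status.getReplication < 3)                          // e.g. require three copies
    fs.setReplication(status.getPath, 3.toShort)
}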
I am new to HDFS and have come across an HDFS question.
We have an HDFS file system, with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005 respectively). The original data comes from a Flume application whose sink is HDFS. Flume writes the original data (txt files) to the datanodes on servers 0004 and 0005.
So the original data is replicated twice and saved on two servers. The system worked well for some time until one day there was a power outage. When the servers were restarted, the datanode servers (0004 and 0005) came up before the namenode server (0002). In this case, the original data was still saved on the 0004 and 0005 servers, but the metadata on the namenode (0002) was lost and the block information became corrupt. The question is how to fix the corrupt blocks without losing the original data.
For example, when we check on the namenode
hadoop fsck /wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv -files -blocks -locations
We find the filename on the datanode, but the block is corrupt. The corresponding file name is:
blk_1090579409_16840906
When we go to the datanode (e.g. 0004) server, we can search the location of this file by
find ./ -name "*blk_1090579409*"
We have found the file corresponding to the csv file under the HDFS virtual path "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv". The file is saved under the folder "./subdir0/subdir235/", and we can open it and see that it is in the correct format. The corresponding .meta file is in binary form and we cannot read it directly.
./subdir0/subdir235/blk_1090579409
The question is, given that we have found the original file (blk_1090579409), how can we restore the corrupt HDFS system using these correct original files, without losing them?
After some research, I found a solution, which may not be efficient but works. If someone comes up with a better solution, please let me know.
The whole idea is to copy all the files out of HDFS, arrange them by year/day/hour/minute into different folders, and then upload those folders back onto HDFS.
I have two datanodes (0004 and 0005) where data is stored. The total data size is on the order of 10+ terabytes. The folder structure is the same as in the question (block files under subdirectories such as ./subdir0/subdir235/, shown in the question once on Linux and once on Windows).
The replication factor is set to 2, which means (if no mistake happens) that each datanode has one and only one copy of each original file. Therefore, we only scan the folders/files on one datanode (server 0004, about 5+ terabytes). Based on the modification date and the timestep in each file, we copy the files into a new folder on a backup server/drive. Luckily, timestep information is available in the original files, e.g. 2020-03-02T09:25. I round the time to the nearest five minutes, with one parent folder per day (e.g. ~/20190102/).
The code that scans the datanode and copies the files into the new five-minute folders is written in PySpark, and it takes about two days to run all the operations (I leave the code running overnight).
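The PySpark code itself is not reproduced here; the following is only a rough Scala sketch of the placement logic, with the timestep parsing (and the later renaming step) left as assumptions about the actual file format.

import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object RearrangeBlocks {
  // ASSUMPTION: each block file is plain text whose first line contains a timestep
  // such as 2020-03-02T09:25; adapt the parsing to the real file format.
  private val tsPattern = """\d{4}-\d{2}-\d{2}T\d{2}:\d{2}""".r
  private val dayFmt  = DateTimeFormatter.ofPattern("yyyyMMdd")
  private val slotFmt = DateTimeFormatter.ofPattern("HHmm")

  def extractTimestep(file: Path): Option[LocalDateTime] = {
    val src = scala.io.Source.fromFile(file.toFile)
    try tsPattern.findFirstIn(src.getLines().take(1).mkString).map(s => LocalDateTime.parse(s))
    finally src.close()
  }

  // Round down to the enclosing five-minute boundary.
  def toFiveMinuteSlot(t: LocalDateTime): LocalDateTime =
    t.withSecond(0).withNano(0).withMinute((t.getMinute / 5) * 5)

  def main(args: Array[String]): Unit = {
    val srcRoot = Paths.get(args(0))   // block directory on the datanode, e.g. ./subdir0
    val dstRoot = Paths.get(args(1))   // folder on the backup drive

    Files.walk(srcRoot)
      .filter(p => Files.isRegularFile(p) && !p.toString.endsWith(".meta"))
      .forEach { blockFile =>
        extractTimestep(blockFile).map(toFiveMinuteSlot).foreach { slot =>
          // one parent folder per day, one sub-folder per five-minute slot
          val dest = dstRoot.resolve(slot.format(dayFmt)).resolve(slot.format(slotFmt))
          Files.createDirectories(dest)
          Files.copy(blockFile, dest.resolve(blockFile.getFileName),
            StandardCopyOption.REPLACE_EXISTING)
        }
      }
  }
}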
Then I can update the folders on HDFS for each day. On HDFS, the folder structure is of the form ~/year=2019/month=1/day=2/.
The created folders have the same structure as on HDFS, and the naming convention is the same (in the copy step, I rename each copied file to match the convention on HDFS).
In the final step, I wrote Java code to perform the operations on HDFS. After some testing, I am able to update the data of each day on HDFS: the code deletes, for example, the data under the folder ~/year=2019/month=1/day=2/ on HDFS and then uploads all the folders/files under the newly created folder ~/20190102/ to ~/year=2019/month=1/day=2/ on HDFS. I do this for each day. The corrupt blocks then disappear, and the right files are uploaded to the correct paths on HDFS.
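The original code for this step is in Java and is not shown here; below is only a rough sketch of the same per-day FileSystem calls, written in Scala like the other snippets in this thread, with placeholder paths.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DailyHdfsUpdate {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // e.g. args(0) = "/wimp/contract-snapshot/year=2019/month=1/day=2"
    //      args(1) = "/backup/20190102" (the rebuilt files for that day, on the local disk)
    val hdfsDay  = new Path(args(0))
    val localDay = new Path(args(1))

    // Drop the day's folder (and with it the references to the corrupt blocks) ...
    if (fs.exists(hdfsDay)) fs.delete(hdfsDay, true)

    // ... then upload the rebuilt files for that day to the same HDFS path.
    fs.copyFromLocalFile(false, true, localDay, hdfsDay)

    fs.close()
  }
}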
According to my research, it may also be possible to locate the corrupt blocks by using the Hadoop fsimage file from before the power outage, but with that approach I risked corrupting further blocks on HDFS after the outage. Therefore, I decided on the described approach: delete the corrupt blocks while keeping the original files, and upload them back to HDFS.
If anyone has a better or more efficient solution, please share!
I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a way to achieve this simply? For example, on day one receive 100 small files and combine them into one file, then on day two add/append more files to the previously created file, and so on.
If the files all have the same "schema" (say, CSV or JSON), then you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files and write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
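As a minimal sketch of that in Spark (assuming CSV files and made-up paths; the output file count is up to you):

import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

    // Read the whole folder of tiny files ...
    val df = spark.read.csv("hdfs:///data/incoming/day=1/")

    // ... and write it back out as a handful of larger files.
    // coalesce(1) produces a single file; pick a count that keeps files near the block size.
    df.coalesce(1).write.mode("overwrite").csv("hdfs:///data/compacted/day=1/")

    spark.stop()
  }
}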
You've also mentioned Hive, so use an external table for the small files, and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
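A rough sketch of that CTAS idea, with made-up table names, schema, and location; the statements are issued through spark.sql here only to keep the snippets in this thread in one language — in Hive you would run the same HiveQL directly.

import org.apache.spark.sql.SparkSession

object CompactViaCtas {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-via-ctas").enableHiveSupport().getOrCreate()

    // External table over the folder of small files (schema and location are assumptions).
    spark.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS small_files_ext (id STRING, value STRING)
                 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                 LOCATION 'hdfs:///data/incoming/'""")

    // CTAS into a new table; the result is rewritten as fewer, larger files.
    spark.sql("""CREATE TABLE compacted_history STORED AS ORC
                 AS SELECT * FROM small_files_ext""")

    spark.stop()
  }
}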
IMO, if possible, the optimal solution is to set up a system "upstream" of Hadoop which batches your smaller files into larger ones and then dumps them out to HDFS. Apache NiFi is a useful tool for this purpose.
I am working on a Spark application that has to read multiple directories (i.e. multiple paths) from an S3 bucket and from HDFS. I read that the new Hadoop API (newAPIHadoopFile) provides a great way to read Lzo-compressed / indexed files performantly. But how do we read multiple folder paths / directories containing several Lzo files and index files into an RDD using this API?
The folder structure is like a Hive table partitioned on two columns.
For example, as below, partitioned on date and batch:
/rootDirectory/date=20161002/batch=5678/001_0.lzo
/rootDirectory/date=20161002/batch=5678/001_0.lzo.index
/rootDirectory/date=20161002/batch=5678/002_0.lzo
/rootDirectory/date=20161002/batch=5678/002_0.lzo.index
/rootDirectory/date=20161002/batch=8765/001_0.lzo
/rootDirectory/date=20161002/batch=8765/001_0.lzo.index
/rootDirectory/date=20161002/batch=8765/002_0.lzo
/rootDirectory/date=20161002/batch=8765/002_0.lzo.index
..... and so on.
Now I use the code below to read data from S3. It treats both the .lzo and .lzo.index files as input, which crashes my application, since I don't want to read the .lzo.index files, but only the .lzo files, using the index for speed.
val impInput = sparkSession.sparkContext.newAPIHadoopFile("s3://my-bucket/myfolder/*/*", classOf[NonSplittableTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val impRDD = impInput.map(_._2.toString)
Could anyone please help me understand how I can do the following?
1). Read all (multiple) folders under the root for the Lzo files using the new Hadoop API, so that I can utilize the .index files for my benefit.
2). Read the data from HDFS in a similar fashion.
Adding a suffix to your path glob may help, so that only the .lzo files are picked up:
val impInput = sparkSession.sparkContext.newAPIHadoopFile("s3://my-bucket/myfolder/*/*.lzo", classOf[NonSplittableTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
I want to move some files from one location to another location [both the locations are on HDFS] and need to verify that the data has moved correctly.
In order to compare the data moved, I was thinking of calculating a hash code on both files and then comparing them. If they are equal, I would consider the data movement correct; otherwise the movement has not happened correctly.
But I have a couple of questions regarding this.
Do I need to use the hash-code technique at all in the first place? I am using the MapR distribution, and I read somewhere that data movement, when done, implements hashing of the data in the backend to make sure it has been transferred correctly. So is it guaranteed that when data is moved inside HDFS, it will be consistent and no anomaly will be introduced during the move?
Is there any other way that I can use in order to make sure that the data moved is consistent across locations?
Thanks in advance.
You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
# sample usage
$hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I use to check the source and destination files is to compare the number of files and the size of each file. This can be done by generating a manifest at the source, then checking both the count and the sizes at the destination.
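For instance, a small sketch (Scala, Hadoop FileSystem API) that prints one manifest line per file with its size; run it against the source and the destination and diff the outputs. The path argument is a placeholder.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object Manifest {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val files = fs.listFiles(new Path(args(0)), true)   // recursive listing of the given directory
    var count = 0L
    var bytes = 0L
    while (files.hasNext) {
      val status = files.next()
      // relative path and size: one manifest line per file
      println(s"${status.getPath.toUri.getPath.stripPrefix(args(0))}\t${status.getLen}")
      count += 1
      bytes += status.getLen
    }
    println(s"files=$count totalBytes=$bytes")
  }
}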
In HDFS, a move doesn't physically move the data (blocks) across the datanodes; it only changes the namespace in the HDFS metadata. For copying data from one HDFS location to another, we have two ways:
1). copy
2). parallel copy (distcp)
A general copy doesn't check the integrity of blocks. If you want data integrity while copying a file from one location to another in the same HDFS cluster, use checksums, either by modifying the FsShell.java class or by writing your own class using the HDFS Java API.
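As a sketch of the "write your own class" route (Scala, placeholder paths): FileSystem.getFileChecksum returns an MD5-of-CRC checksum that can be compared between the two files, provided they were written with the same block size and checksum settings.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CompareChecksums {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val src = new Path(args(0))   // e.g. /data/source/file.csv
    val dst = new Path(args(1))   // e.g. /data/target/file.csv

    val srcSum = fs.getFileChecksum(src)
    val dstSum = fs.getFileChecksum(dst)

    if (srcSum != null && srcSum.equals(dstSum))
      println(s"Checksums match: $srcSum")
    else
      println(s"Checksums differ: $srcSum vs $dstSum")
  }
}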
In the case of DistCp, HDFS checks data integrity while copying data from one HDFS cluster to another.
I'd uploaded 50 GB of data to a Hadoop cluster, but now I want to delete the first row of the data file.
Removing that row and making the change manually, then uploading the file again to HDFS, is time consuming.
Please advise.
HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can make the change programmatically with an M/R job that does a near-identity transformation, e.g. running a streaming shell script that does sed; but the gist of it is that you need to create new files, because HDFS files cannot be edited.
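If Spark is available, the same near-identity transformation can be sketched like this (placeholder paths; assumes a single text input file), writing a new output rather than editing in place:

import org.apache.spark.sql.SparkSession

object DropFirstRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("drop-first-row").getOrCreate()
    val input = spark.sparkContext.textFile("hdfs:///data/bigfile.txt")

    // Partition 0 starts at byte 0 of the single input file, so dropping its first
    // record removes the file's first row; every other partition passes through.
    val withoutFirstRow = input.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }

    withoutFirstRow.saveAsTextFile("hdfs:///data/bigfile_without_first_row")
    spark.stop()
  }
}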