I know that HDFS follows a write-once, read-many model. As far as I know, it's not possible to update a file (at an arbitrary position) in HDFS, because a file is stored in a distributed fashion (as blocks), with each block replicated on other nodes, which would make it difficult for the datanodes to update all of those replicated blocks.
But my question is: is it possible to update files in HDFS using the Hue tool? I've edited many files (stored in HDFS) through Hue and then run MapReduce jobs on them. So how is Hue able to update files in HDFS? Does Hue do something in the background? Are the changes made through Hue really applied to the same file, or does Hue delete the file and re-write the whole thing (including the new data we want to add)?
Hue deletes and re-writes the whole file, as HDFS does not support in-place edits. You will notice that Hue limits editing to small files only, for now.
Here is a blog post to learn more about the HDFS Filebrowser.
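For context, here is a rough sketch of the equivalent pattern outside Hue, i.e. what "editing" an HDFS file amounts to: fetch the file, change it locally, then overwrite the original in one shot. The path is a placeholder, and this uses the standard hdfs CLI from Python.
import os, subprocess, tempfile

# Hypothetical example: "edit" an HDFS file by re-writing it entirely,
# which is essentially what Hue's File Browser does behind the scenes.
src = "/user/demo/notes.txt"                      # assumed HDFS path
local = os.path.join(tempfile.mkdtemp(), "notes.txt")

subprocess.run(["hdfs", "dfs", "-get", src, local], check=True)        # copy to local disk
with open(local, "a") as f:                                            # modify locally
    f.write("appended line\n")
subprocess.run(["hdfs", "dfs", "-put", "-f", local, src], check=True)  # overwrite the HDFS file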
I am new to HDFS and I have come across an HDFS question.
We have an HDFS file system with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005 respectively). The original data comes from a Flume application whose sink is HDFS, so Flume writes the original data (txt files) to the datanodes on servers 0004 and 0005.
So the original data is replicated twice and stored on the two servers. The system worked well for some time until one day there was a power outage. When the servers were restarted, the datanode servers (0004 and 0005) came up before the namenode (0002). The original data is still present on servers 0004 and 0005, but the metadata on the namenode (0002) is lost and the block information has become corrupt. The question is: how do we fix the corrupt blocks without losing the original data?
For example, when we check on the namenode
hadoop fsck /wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv -files -blocks -locations
The file is found, but its block is reported as corrupt. The corresponding block name is:
blk_1090579409_16840906
When we go to the datanode server (e.g. 0004), we can search for the location of this block file with:
find ./ -name "*blk_1090579409*"
We have found the file corresponding to the csv file under the virtual HDFS path "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv". The file is saved under the folder "./subdir0/subdir235/", and we can open it and see that it is in the correct format. The corresponding .meta file is in binary form and we cannot read it directly.
./subdir0/subdir235/blk_1090579409
The question is: given that we have found the original block file (blk_1090579409), how can we restore the corrupt HDFS system using these correct original files, without losing them?
After some research, I found a solution that may not be efficient but works. If someone comes up with a better solution, please let me know.
The whole idea is to copy all the files out of HDFS, arrange them by year/day/hour/minute into different folders, and then update these folders on HDFS.
I have two datanodes (0004 and 0005) where the data is stored. The total data size is on the order of 10+ terabytes. The folder structure is as follows (it is the same as in the question, one view displayed on Linux and the other on Windows):
The replication factor is set to 2, which means (if nothing goes wrong) that each datanode holds one and only one copy of each original file. Therefore, we only need to scan the folders/files on one datanode (server 0004, about 5+ terabytes). Based on the modification date and the timestamp inside each file, we copy the files into a new folder on a backup server/drive. Luckily, timestamp information is available in the original files, e.g. 2020-03-02T09:25. I round the time down to the nearest five minutes, with one parent folder per day, so the newly created folder structure is:
The code that scans the datanode and copies the files into the new per-five-minute folders is written in PySpark, and it takes about 2 days (I leave the code running in the evenings) to complete the whole operation.
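For illustration, here is a minimal sketch of that scan-and-bucket step in plain Python rather than the author's PySpark job. The directory names are placeholders, and it assumes a timestamp appears near the start of each block file, as described above.
import os, re, shutil
from datetime import datetime

# Hedged sketch: walk the datanode's block directories, read the first timestamp
# found in each block file, round it down to 5 minutes, and copy the file into
# /backup/<day>/<HHMM>/. Paths are illustrative, not the author's real layout.
SRC_ROOT = "/data/dfs/dn/current"      # assumed datanode data directory
DST_ROOT = "/backup"                   # assumed backup drive
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")

for dirpath, _, filenames in os.walk(SRC_ROOT):
    for name in filenames:
        if not name.startswith("blk_") or name.endswith(".meta"):
            continue                                    # skip .meta files
        path = os.path.join(dirpath, name)
        with open(path, errors="ignore") as f:
            m = TS_RE.search(f.read(64 * 1024))         # assume a timestamp appears early in the file
        if not m:
            continue
        ts = datetime.strptime(m.group(0), "%Y-%m-%dT%H:%M")
        bucket = ts.replace(minute=ts.minute - ts.minute % 5)   # round down to 5 minutes
        dst_dir = os.path.join(DST_ROOT, bucket.strftime("%Y%m%d"), bucket.strftime("%H%M"))
        os.makedirs(dst_dir, exist_ok=True)
        shutil.copy2(path, dst_dir)                     # copy2 preserves the modification time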
Then I can update the folders on HDFS for each day. On HDFS, the folder structure is as follows:
The created folders have the same structure as on HDFS, and the naming convention is also the same (in the copy step, I rename each copied file to match the convention on HDFS).
In the final step, I wrote Java code to perform the operations on HDFS. After some testing, I was able to update the data for each day on HDFS, i.e. it deletes e.g. the data under the folder ~/year=2019/month=1/day=2/ on HDFS and then uploads all the folders/files under the newly created folder ~/20190102/ to ~/year=2019/month=1/day=2/ on HDFS. I repeat this operation for each day. Afterwards the corrupt blocks disappear and the right files are uploaded to the correct paths on HDFS.
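A minimal sketch of that per-day "delete then re-upload" step, using the hdfs CLI from Python instead of the author's Java code; the paths are illustrative only.
import glob
import subprocess

# Hedged sketch: replace one day partition on HDFS with the re-assembled local files.
def replace_day(local_day_dir, hdfs_day_dir):
    # remove the corrupt day partition on HDFS (ignore the error if it is already gone)
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", hdfs_day_dir], check=False)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_day_dir], check=True)
    # upload all re-assembled local files for that day into the recreated partition
    local_files = glob.glob(local_day_dir + "/*")
    subprocess.run(["hdfs", "dfs", "-put"] + local_files + [hdfs_day_dir], check=True)

replace_day("/backup/20190102", "/wimp/contract-snapshot/year=2019/month=1/day=2")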
According to my research, it is also possible to restore the block information from before the power outage by using the fsimage file on Hadoop, but this means I might corrupt the blocks written to HDFS after the power outage. Therefore, I decided to use the described approach: delete the corrupt blocks while keeping the original files, and update them on HDFS.
If anyone has a better or more efficient solution, please share!
We have a Hadoop-based solution (CDH 5.15) where new files land in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITIONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition (sketched below).
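For illustration, this is roughly what that per-batch DDL looks like when issued programmatically; a sketch using the impyla client rather than our Hive JDBC path, with host, table and partition names as placeholders.
from impala.dbapi import connect   # impyla, used here only to illustrate the statements

# Illustrative sketch of the current per-batch DDL; all names are placeholders.
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()
cur.execute("ALTER TABLE table1 RECOVER PARTITIONS")
for p1, p2 in [("X", "Y"), ("X2", "Y2")]:              # one REFRESH per newly written partition
    cur.execute(f"REFRESH table1 PARTITION (partition1='{p1}', partition2='{p2}')")
cur.close()
conn.close()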
Right now, these DDL statements are taking a bit too long and are getting queued in our system, hurting the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation, it only refreshes new partitions.
Trying to use REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
Batching the queries, but the Hive JDBC driver does not support batching queries.
Shall we try to do those updates in parallel given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: The way we know which partitions need to be refreshed is by using HDFS events, since with Spark Structured Streaming we don't know exactly when the files are written.
Note #2: Also, the files written to HDFS are sometimes small, so it would be great if it were possible to merge those files at the same time.
Since nobody seems to have the answer to my problem, I would like to share the approach we took to make this processing more efficient. Comments are very welcome.
We discovered (the documentation is not very clear on this) that part of the information stored in the Spark "checkpoints" in HDFS is a set of metadata files describing when each Parquet file was written and how big it was:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
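Each of those entries is one JSON document per line (after the v1 header), so it can be parsed directly. A minimal sketch, assuming one batch file has already been fetched from HDFS to the local disk:
import json

# Minimal sketch: parse one _spark_metadata batch file (already fetched locally).
with open("3250") as f:
    next(f)                                   # skip the "v1" version header
    entries = [json.loads(line) for line in f if line.strip()]

for e in entries:
    if e["action"] == "add":
        print(e["path"], e["size"])           # file path and size in bytes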
So, what we did was (a rough sketch follows the list):
Build a Spark Streaming Job polling that _spark_metadata folder.
We use a fileStream since it allows us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder they belong to (which maps to an Impala partition).
For each folder:
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the dataframe to the desired number of partitions and write it back to HDFS
Execute the DDL REFRESH TABLE myTable PARTITION ([partition keys derived from the new folder])
Finally, delete the source files
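A rough sketch of the per-folder compaction step described above, in PySpark. Table and path names, the target block size, and the way the metadata lines reach the function are placeholders; checkpointing, the Impala client call and the cleanup of the source files are only indicated in comments.
import json
from pyspark.sql import SparkSession

# Hedged sketch of the per-folder compaction; names and sizes are placeholders.
TARGET_BLOCK_SIZE = 128 * 1024 * 1024          # target bytes per output file

spark = SparkSession.builder.appName("compact-and-refresh").getOrCreate()

def process_batch(metadata_lines):
    # metadata_lines: the JSON lines collected from _spark_metadata for this batch
    entries = [json.loads(l) for l in metadata_lines if l.startswith("{")]
    by_folder = {}
    for e in entries:
        folder = e["path"].rsplit("/", 1)[0]   # parent folder == Impala partition directory
        by_folder.setdefault(folder, []).append(e)

    for folder, files in by_folder.items():
        paths = [f["path"] for f in files]
        total = sum(f["size"] for f in files)
        num_out = max(1, total // TARGET_BLOCK_SIZE)       # how many files to write
        df = spark.read.parquet(*paths)                    # read only the targeted files
        df.coalesce(int(num_out)).write.mode("append").parquet(folder)   # write compacted files back
        # then: issue REFRESH myTable PARTITION (...) through the Impala client
        # (derived from `folder`), and finally delete the original small files in `paths`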
What we achieved is:
Limit the DDLs by doing one refresh per partition and batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibility of applying some data repartitioning while doing this process, to get the partitions as close as possible to the optimum size.
I have been seeing an intense amount of disk usage on HDFS in the last 10 days. As I see on the DataNode hosts in the Hosts tab of Cloudera Manager and in the Disk Usage charts of the HDFS service, usage has almost tripled, from ~7TB to ~20TB. At first I thought the reason was something I did wrong in the upgrade of CM and CDH I performed on the 6th of those 10 days, but then I realized it had started before that.
I checked the File Browser in Cloudera Manager first, but saw no difference between the size numbers there and before. I also have disk usage reports for the last 4 days; they show no increase.
Running hdfs dfsadmin -report also returns the same.
The dfs folders on Linux confirm the increasing usage, but I can't tell what has changed because there are millions of files and I don't know how to find the last modified files across thousands of nested folders. Even if I find them, I can't tell which HDFS files they correspond to.
Then, just recently, I was informed that another HDFS user has been splitting their large files. They own nearly 2/3 of all the data. Could it cause this much of an increase if they split them into many more files that are smaller than the HDFS block size? If so, why can't I see it in the Browser/Reports?
Is there any way to check which folders and files have been modified recently in HDFS, or other things I can check/do? Any suggestion or comment is appreciated.
For checking HDFS activity, Cloudera Navigator provides excellent information about all the events that were logged in HDFS.
After logging into Navigator, check the Audits tab. It also allows us to filter the activities by things such as delete operations, IP address, username and more.
The normal search page also lets us filter by block size (whether < 256 MB or > 256 MB), whether it is a file or directory, the source type, the path, the replication count and much more.
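If Navigator is not available, a rough CLI-based alternative is to walk the tree with hdfs dfs -ls -R and sort by modification date. A sketch (the path is a placeholder, and on millions of files this is slow):
import subprocess

# Rough alternative sketch: list everything under a path recursively and print
# the most recently modified entries. Slow on millions of files.
out = subprocess.run(["hdfs", "dfs", "-ls", "-R", "/user"],          # placeholder path
                     capture_output=True, text=True, check=True).stdout

rows = []
for line in out.splitlines():
    parts = line.split(None, 7)          # perms, repl, owner, group, size, date, time, path
    if len(parts) == 8:
        perms, repl, owner, group, size, date, time, path = parts
        rows.append((date + " " + time, size, path))

for modified, size, path in sorted(rows, reverse=True)[:50]:          # 50 newest entries
    print(modified, size, path)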
I want to move some files from one location to another location [both the locations are on HDFS] and need to verify that the data has moved correctly.
In order to compare the moved data, I was thinking of calculating a hash code on both files and then comparing them. If they are equal, I would consider the data movement correct; otherwise, the movement has not happened correctly.
But I have a couple of questions regarding this.
Do I need to use the hash-code technique at all in the first place? I am using the MapR distribution and I read somewhere that data movement, when performed, applies hashing of the data in the backend to make sure it has been transferred correctly. So is it guaranteed that when data is moved inside HDFS it will be consistent and no anomaly will be introduced during the move?
Is there any other way that I can use in order to make sure that the data moved is consistent across locations?
Thanks in advance.
You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
#sample example
$hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting.
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I used to check the source and destination files was to check the number of files and the size of each file. This can be done by generating a manifest at the source, then checking both the count and the sizes at the destination.
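A minimal sketch of that manifest check, assuming both locations are on HDFS and reusing the /foo/bar and /bar/foo paths from the DistCp example above; it only compares file counts and per-file sizes.
import subprocess

# Minimal sketch: compare file counts and per-file sizes between a source and
# a destination HDFS directory. Paths are the placeholders from the example above.
def manifest(hdfs_dir):
    out = subprocess.run(["hdfs", "dfs", "-ls", "-R", hdfs_dir],
                         capture_output=True, text=True, check=True).stdout
    files = {}
    for line in out.splitlines():
        parts = line.split(None, 7)
        if len(parts) == 8 and not parts[0].startswith("d"):       # skip directories
            size, path = int(parts[4]), parts[7]
            files[path[len(hdfs_dir):]] = size                     # key by relative path
    return files

src, dst = manifest("/foo/bar"), manifest("/bar/foo")
assert len(src) == len(dst), "file counts differ"
for rel_path, size in src.items():
    assert dst.get(rel_path) == size, f"size mismatch for {rel_path}"
print("manifest check passed:", len(src), "files")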
In HDFS, a move doesn't physically move the data (blocks) across the datanodes; it only changes the namespace in the HDFS metadata. Whereas for copying data from one HDFS location to another HDFS location, we have two ways:
copy
parallel copy (distcp)
A general copy doesn't check the integrity of the blocks. If you want data integrity while copying a file from one location to another in the same HDFS cluster, use the checksum concept, either by modifying the FsShell.java class or by writing your own class using the HDFS Java API.
In the case of DistCp, HDFS checks the data integrity while copying data from one HDFS cluster to another HDFS cluster.
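As an illustration of the checksum idea without writing a custom Java class, the hdfs CLI already exposes file checksums. A hedged sketch comparing two HDFS files (the paths are placeholders, and the checksums are only comparable when the block size and checksum settings match on both sides):
import subprocess

# Hedged sketch: compare the HDFS-reported checksums of a source and a target file.
def hdfs_checksum(path):
    out = subprocess.run(["hdfs", "dfs", "-checksum", path],
                         capture_output=True, text=True, check=True).stdout
    return out.split()[-1]                      # last field is the checksum value

src = "/data/source/file1.csv"                  # placeholder paths
dst = "/data/target/file1.csv"
if hdfs_checksum(src) == hdfs_checksum(dst):
    print("checksums match")
else:
    print("checksums differ; the copy may be corrupt")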
I uploaded 50 GB of data to a Hadoop cluster, but now I want to delete the first row of the data file.
It is time-consuming to remove that data and make the change manually, and then upload it to HDFS again.
Please advise.
HDFS files are immutable (for all practical purposes).
You need to upload the modified file(s). You can make the change programmatically with an M/R job that does a near-identity transformation, e.g. running a streaming shell script that does sed, but the gist of it is that you need to create new files; HDFS files cannot be edited.
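For example, a minimal sketch of such a near-identity transformation in PySpark rather than a raw M/R streaming job (the paths are placeholders): read the file, drop the line with global index 0, and write the result to a new HDFS location.
from pyspark.sql import SparkSession

# Minimal sketch: drop the first line of a file and write the result to a new location.
spark = SparkSession.builder.appName("drop-first-row").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input/bigfile.txt")
# zipWithIndex assigns a global, order-preserving line index; keep everything but index 0
cleaned = lines.zipWithIndex().filter(lambda pair: pair[1] > 0).keys()
cleaned.saveAsTextFile("hdfs:///data/output/bigfile_noheader")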