I have been seeing an intense increase in disk usage on HDFS over the last 10 days. According to the DataNode hosts on the Hosts tab in Cloudera Manager and the Disk Usage charts on the HDFS service, usage has almost tripled, from ~7 TB to ~20 TB. At first I thought the cause was something I did wrong in the CM and CDH upgrade I performed on the 6th of those 10 days, but then I realized the growth had started before that.
I checked the File Browser in Cloudera Manager first, but the size numbers there are no different from before. I also have disk usage reports for the last 4 days, and they show no increase.
Running hdfs dfsadmin -report returns the same numbers.
The dfs folders on Linux confirm the increased usage, but I can't tell what has changed because there are millions of files and I don't know how to find the most recently modified ones across thousands of nested folders. Even if I find them, I can't tell which HDFS files they correspond to.
Just recently I was informed that another HDFS user has been splitting their large files. They own nearly 2/3 of all the data. Could it cause this much of an increase if they split those files into many more files that are smaller than the HDFS block size? If so, why can't I see it in the File Browser or the reports?
Is there any way to check which folders and files have been modified recently in HDFS, or anything else I can check or do? Any suggestion or comment is appreciated.
For checking HDFS activity, Cloudera Navigator provides excellent information about all the events logged in HDFS.
After logging into Navigator, check the Audits tab. It also lets you filter activity by things such as operation (e.g. delete), IP address, and username.
The normal search page also lets you filter by block size (< 256 MB or > 256 MB), whether the entry is a file or a directory, the source type, the path, the replication count, and more.
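If Navigator is not available, one rough alternative is to parse a recursive listing and sort it by modification time. Below is a minimal Python sketch under that assumption; note that hdfs dfs -ls -R over millions of files is heavy, so it is best pointed at a specific subtree rather than /.

import subprocess
from datetime import datetime, timedelta

def recently_modified(root="/", days=7):
    """Return (modification_time, size, path) for files changed in the last `days` days."""
    cutoff = datetime.now() - timedelta(days=days)
    listing = subprocess.run(["hdfs", "dfs", "-ls", "-R", root],
                             capture_output=True, text=True, check=True).stdout
    recent = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8 or line.startswith("d"):
            continue  # skip directories and any header/noise lines
        size, date, time = fields[4], fields[5], fields[6]
        path = " ".join(fields[7:])
        mtime = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
        if mtime >= cutoff:
            recent.append((mtime, int(size), path))
    return sorted(recent, reverse=True)

# Print the 100 most recently modified files under a given subtree.
for mtime, size, path in recently_modified("/user", days=7)[:100]:
    print(mtime, size, path)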
I am new to HDFS and have come across an HDFS question.
We have an HDFS file system, with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005 respectively). The original data comes from a Flume application whose sink is HDFS, so Flume writes the original data (txt files) onto the datanodes on servers 0004 and 0005.
So the original data is replicated twice and stored on the two servers. The system worked well for some time, until one day there was a power outage. When the servers were restarted, the datanode servers (0004 and 0005) came back up before the namenode server (0002). In this case, the original data is still stored on servers 0004 and 0005, but the metadata on the namenode (0002) was lost and the block information became corrupt. The question is how to fix the corrupt blocks without losing the original data.
For example, when we run the following check on the namenode:
hadoop fsck /wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv -files -blocks -locations
We can see that the block is corrupt. The corresponding block file name is:
blk_1090579409_16840906
When we go to the datanode server (e.g. 0004), we can search for the location of this block file with:
find ./ -name "*blk_1090579409*"
We have found the block file corresponding to the csv file at the HDFS virtual path "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv". The block file is stored under the folder "./subdir0/subdir235/", and when we open it we find it is in the correct format. The corresponding .meta file is in binary form(?) and we cannot read it directly.
./subdir0/subdir235/blk_1090579409
The question is: given that we have found the original block file (blk_1090579409), how can we restore the corrupt HDFS system using these correct original files, without losing them?
After some research, I found a solution which may not be efficient but works. If someone comes up with a better solution, please let me know.
The whole idea is to copy all the files out of the datanode storage, arrange them by year/day/hour/minute into different folders, and then upload these folders back onto HDFS.
I have two datanodes (0004 and 0005) where the data is stored. The total data size is on the order of 10+ terabytes. The folder structure is as follows (it is the same as in the question, one displayed on Linux and the other on Windows):
The replication factor is set to 2, which means (if nothing went wrong) that each datanode holds one and only one copy of each original file. Therefore, we only need to scan the folders/files on one datanode (server 0004, about 5+ terabytes). Based on the modification date and the timestamp in each file, we copy the files into new folders on a backup server/drive. Luckily, timestamp information is available inside the original files, e.g. 2020-03-02T09:25. I round the time to the nearest five minutes, with a parent folder per day, so the newly created folder structure is:
The code that scans the datanode and copies the files into the new five-minute folders is written in PySpark, and it takes about 2 days (I leave it running overnight) to complete the whole operation.
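The PySpark code itself is not included in the post. Purely as an illustration of the idea, a minimal sketch could look like the following; the block directory, the backup mount, the timestamp format and the local-mode setup are all assumptions:

import os
import re
import shutil
from datetime import datetime
from pyspark.sql import SparkSession

# Hypothetical paths: the datanode's finalized block directory and the backup mount.
BLOCK_ROOT = "/dfs/dn/current/BP-xxxxxxxx/current/finalized"
BACKUP_ROOT = "/backup/recovered"
TS_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")

def bucket_and_copy(path):
    """Read one block file, derive its five-minute bucket and copy it into the backup tree."""
    with open(path, errors="ignore") as f:
        match = TS_PATTERN.search(f.read(4096))
    if not match:
        return (path, None)
    ts = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M")
    bucket = ts.replace(minute=ts.minute - ts.minute % 5)
    dest_dir = os.path.join(BACKUP_ROOT, bucket.strftime("%Y%m%d"), bucket.strftime("%H%M"))
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy(path, dest_dir)
    return (path, dest_dir)

# Run in local mode on the datanode itself, so the block files and the backup mount are visible.
spark = SparkSession.builder.master("local[8]").appName("block-recovery-sketch").getOrCreate()
block_files = [os.path.join(root, name)
               for root, _, names in os.walk(BLOCK_ROOT)
               for name in names
               if name.startswith("blk_") and not name.endswith(".meta")]
spark.sparkContext.parallelize(block_files, 200).map(bucket_and_copy).collect()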
Then I can update the folders on HDFS for each day. On HDFS, the folder structure is as follows:
The created folders have the same structure as on HDFS, and the naming convention is also the same (in the copy step, I rename each copied file to match the convention on HDFS).
In the final step, I write Java code to perform the operations on HDFS. After some testing, I am able to update the data for each day on HDFS: the code deletes, for example, the data under the folder ~/year=2019/month=1/day=2/ on HDFS and then uploads all the folders/files under the newly created folder ~/20190102/ to ~/year=2019/month=1/day=2/ on HDFS. I do this for each day. Afterwards the corrupt blocks disappear, and the correct files sit at the correct paths on HDFS.
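The Java code is not shown in the post. Purely as an illustration of that per-day replace step, the same operations can be driven through the HDFS CLI, for example from Python; the paths below are illustrative stand-ins for the "~/..." shorthand above:

import subprocess

def replace_day_on_hdfs(local_day_dir, hdfs_day_dir):
    """Delete a day's corrupt partition folder on HDFS and re-upload the recovered local copy."""
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", hdfs_day_dir], check=False)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_day_dir], check=True)
    # Shell globbing uploads the folder's contents rather than the folder itself.
    subprocess.run(f"hdfs dfs -put {local_day_dir}/* {hdfs_day_dir}/", shell=True, check=True)

replace_day_on_hdfs("/backup/recovered/20190102",
                    "/wimp/contract-snapshot/year=2019/month=1/day=2")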
According to my research, it is also possible to identify the corrupt blocks by using the Hadoop fsimage file from before the power outage, but this means I might still corrupt blocks on HDFS after the outage. Therefore, I decided to use the described approach: delete the corrupt blocks while keeping the original files, and upload them back onto HDFS.
If anyone has a better or more efficient solution, please share!
We have a Hadoop-based solution (CDH 5.15) where we receive new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).
Right now, we run some DDL statements as soon as the files are written to HDFS:
ALTER TABLE table1 RECOVER PARTITIONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition.
Right now, these DDL statements are taking a bit too long and are getting queued up in our system, hurting the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation, it only refreshes new partitions.
Using REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
Batching the queries, but the Hive JDBC driver does not support batching queries.
Should we try to run those updates in parallel, given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: the way we know which partitions need to be refreshed is by using HDFS events, since with Spark Structured Streaming we don't know exactly when the files are written.
Note #2: the files written to HDFS are sometimes small, so it would be great if it were possible to merge those files at the same time.
Since nobody seems to have the answer to my problem, I would like to share the approach we took to make this processing more efficient; comments are very welcome.
We discovered (the documentation is not very clear on this) that part of what Spark stores in its "checkpoints" in HDFS is a set of metadata files describing when each Parquet file was written and how big it was:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r-- 3 hdfs 68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
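Each such line is plain JSON and easy to parse. As a small illustration (the helper below is ours, not part of Spark):

import json

def parse_metadata_file(lines):
    """Yield (path, size) for every 'add' entry in one _spark_metadata file."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip the version header line, e.g. "v1"
        entry = json.loads(line)
        if entry.get("action") == "add":
            yield entry["path"], entry["size"]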
So, what we did was:
Build a Spark Streaming Job polling that _spark_metadata folder.
We use a fileStream since it allows us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder they belong to (which maps to an Impala partition).
For each folder (a rough sketch of this per-folder step is shown right after this list):
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the dataframe to the desired number of partitions and write it back to HDFS
Execute the DDL REFRESH TABLE myTable PARTITION ([partition keys derived from the new folder])
Finally, delete the source files
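Here is a rough sketch of that per-folder step. The run_impala_ddl helper, the 256 MB target and the table/partition names are assumptions, not the actual implementation; spark is an existing SparkSession.

import math
import subprocess

TARGET_FILE_SIZE = 256 * 1024 * 1024  # illustrative target size per output file

def run_impala_ddl(statement):
    # Hypothetical helper: submit the DDL to Impala, e.g. through impala-shell.
    subprocess.run(["impala-shell", "-q", statement], check=True)

def compact_partition(spark, entries, partition_dir, partition_spec, table="myTable"):
    """entries: (file_path, size_in_bytes) tuples parsed from _spark_metadata for one folder."""
    paths = [path for path, _ in entries]
    total_size = sum(size for _, size in entries)
    num_files = max(1, math.ceil(total_size / TARGET_FILE_SIZE))

    # Read only the files listed in the metadata, so files still being written are not touched.
    df = spark.read.parquet(*paths)

    # Rewrite them as a small number of larger files alongside the originals.
    df.coalesce(num_files).write.mode("append").parquet(partition_dir)

    # One REFRESH per partition and batch, instead of one per incoming file.
    run_impala_ddl(f"REFRESH {table} PARTITION ({partition_spec})")

    # Finally, delete the original small files.
    subprocess.run(["hdfs", "dfs", "-rm", "-skipTrash"] + paths, check=True)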
What we achieved is:
Limit the DDLs by doing one refresh per partition and batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibility of applying some data repartitioning while doing this process, to get partitions as close as possible to the optimum size.
I'm running a complex query in Hive which, when run, starts using a huge amount of local disk space in the /tmp folder and eventually fails with a space error as /tmp fills up completely with the query's intermediate map-reduce results (/tmp is on a separate partition with 100 GB of free space). While running, it says:
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
As you can see above, Hive is somehow running in local mode. After doing some research on the net, I checked a few relevant parameters; the results are below:
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive> set mapred.job.tracker;
mapred.job.tracker=local
hive> set mapred.local.dir;
mapred.local.dir=/tmp/hadoop-hive/mapred/local
So I have two questions regarding this:
Can this be the reason why the map-reduce jobs are consuming space on the local disk instead of the HDFS /tmp folder, as is typically the case with Pig scripts?
How can I make Hive run in distributed mode, given the current settings? Please note that I'm using MRv2 in the cluster, but the above options are confusing as they seem to be relevant to MRv1. I could be wrong here, being a newbie.
Any help will be much appreciated!
It turns out I was missing the bare essentials. After setting HADOOP_MAPRED_HOME to /usr/lib/hadoop-mapreduce on all the nodes, all the issues were fixed.
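For anyone hitting the same thing, here is a small sanity check using the same hive set mechanism shown above; on an MRv2/YARN cluster, mapreduce.framework.name should come back as yarn:

import os
import subprocess

# Confirm the environment variable is set on this node.
print("HADOOP_MAPRED_HOME =", os.environ.get("HADOOP_MAPRED_HOME"))

# Ask Hive which execution framework it will actually use.
out = subprocess.run(["hive", "-e", "set mapreduce.framework.name; set mapred.job.tracker;"],
                     capture_output=True, text=True)
print(out.stdout)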
I know that HDFS follows a write-once, read-many model. As far as I know, it's not possible to update a file (at arbitrary offsets) in HDFS, because a file is stored in a distributed fashion (as blocks), with each block replicated on other nodes, which would make it difficult for the datanodes to update all of those replicated blocks.
But my question is: is it possible to update files in HDFS using the Hue tool? I've updated many files (stored in HDFS) using Hue and run map-reduce jobs on them. So how is it possible for Hue to update files in HDFS? Does Hue do something in the background? Are the updates made through Hue really applied to the same file, or does Hue delete the file and re-write the whole file (including the new data we want to add)?
Hue deletes and re-writes the whole file, as HDFS does not support edits in place. You will notice that Hue limits editing to small files only for now.
Here is a blog post to learn more about the HDFS Filebrowser.
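To illustrate what that means in practice, an "edit" outside Hue boils down to the same pattern: fetch the file, change it locally, and overwrite the whole thing. A small sketch (the path and the transformation are made up):

import os
import subprocess
import tempfile

def rewrite_hdfs_file(hdfs_path, transform):
    """Download hdfs_path, apply transform(text) -> text, and replace the file on HDFS."""
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "work")
        subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local], check=True)
        with open(local) as f:
            content = f.read()
        with open(local, "w") as f:
            f.write(transform(content))
        # -f overwrites: the old file is replaced wholesale, not patched in place.
        subprocess.run(["hdfs", "dfs", "-put", "-f", local, hdfs_path], check=True)

rewrite_hdfs_file("/user/demo/notes.txt", lambda text: text.replace("foo", "bar"))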
We have a small HBase cluster on EC2 with 6 region servers. Lately we found that the data in one of the column families is really not that useful for us and decided to get rid of it. This particular column family takes up more than 50 percent of the space on disk.
We altered the table, removed the column family, and ran a major compaction.
We also ran major compaction on the '-ROOT-' and the '.META.' tables.
But there is still no reduction in the total DFS file size. Are we missing something here?
Any help/pointers would be greatly appreciated.
regards.
Just to add another thing to check: in HBase 0.90.4 at least, dropping a table removes its files from HDFS, but the contents of the .logs directory are not necessarily removed.
For example, run hadoop fs -du /yourHbaseDirInDFS and you will see the .logs directory still holding a chunk of data. This does not seem to go away until the HBase cluster is restarted. Alternatively, I guess you could delete the log files manually, but it seems better to me to let HBase do it.
Got it!
It was a bug in HBase: it was not deleting the files from HDFS. We had to find and delete the files from the Hadoop file system ourselves.