How does Namenode reconstruct the full block information after restart? - hadoop

I am trying to understand the Namenode. I have referred to online material and I am also referring to the book Hadoop: The Definitive Guide.
I understand that the Namenode has concepts like "edit logs" and "fsimage", and I can see the following files in my Namenode:
========================================================================
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 23 22:53 edits_0000000000000000001-0000000000000000001
-rw-r--r-- 1 root root 1048576 Nov 23 23:42 edits_0000000000000000002-0000000000000000002
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 00:07 edits_0000000000000000003-0000000000000000003
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 21:03 edits_0000000000000000004-0000000000000000004
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 22:59 edits_0000000000000000005-0000000000000000005
-rw-r--r-- 1 root root 1048576 Nov 24 23:00 edits_0000000000000000006-0000000000000000006
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 25 21:15 edits_0000000000000000007-0000000000000000007
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 25 21:34 edits_0000000000000000008-0000000000000000008
-rw-r--r-- 1 root root 1048576 Nov 26 02:13 edits_inprogress_0000000000000000009
-rw-rw-r-- 1 vevaan24 vevaan24 355 Nov 25 21:15 fsimage_0000000000000000006
-rw-rw-r-- 1 vevaan24 vevaan24 62 Nov 25 21:15 fsimage_0000000000000000006.md5
-rw-r--r-- 1 root root 355 Nov 26 00:12 fsimage_0000000000000000008
-rw-r--r-- 1 root root 62 Nov 26 00:12 fsimage_0000000000000000008.md5
-rw-r--r-- 1 root root 2 Nov 26 00:12 seen_txid
-rw-rw-r-- 1 vevaan24 vevaan24 201 Nov 26 00:12 VERSION
In that book it is mentioned that the fsimage doesn't store block locations.
I have the following questions:
1) Do the edit logs store block locations as well (for new transactions)?
2) When the Namenode and Datanodes are restarted, how does the Namenode get the block addresses? My doubt is: the NN reads the fsimage to reconstruct the filesystem info, but the fsimage doesn't have the block locations, so how is this information reconstructed?
3) Is it true that the fsimage stores the BLOCK ID only, and if so, is the BLOCK ID unique across Datanodes? Is the BLOCK ID the same as the BLOCK address?

Block locations, i.e. the datanodes on which the blocks are stored, are persisted neither in the fsimage file nor in the edit log. The Namenode keeps this mapping only in memory.
It is the responsibility of each datanode to hold the information about the list of blocks it is storing.
During a restart, the Namenode loads the fsimage file into memory and applies the edits from the edit log. The missing block-location information is obtained from the datanodes as they check in with their block reports; from these reports the Namenode constructs the mapping of blocks to locations in its memory.
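You can watch this mechanism from the command line. As a sketch (the datanode host:port is a placeholder for your cluster), fsck prints the block-to-datanode mapping the Namenode holds in memory, and dfsadmin can ask a datanode to resend its block report:
hdfs fsck / -files -blocks -locations                         # show files, blocks and their datanode locations
hdfs dfsadmin -triggerBlockReport <datanode_host:ipc_port>    # ask one datanode to report its blocks now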
The fsimage holds more than the Block IDs. It holds information such as the blocks belonging to each file, the block size, replication factor, access time, modification time, and file permissions, but not the locations of the blocks.
Yes, Block IDs are unique (cluster-wide, not just per datanode). A "block address" would refer to the addresses of the datanodes on which the block resides.
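You can verify this yourself with the offline image viewer, run against one of the fsimage files from the listing above; the XML dump shows inodes, permissions, and block IDs with sizes, but no datanode addresses:
hdfs oiv -p XML -i fsimage_0000000000000000008 -o fsimage.xml    # dump the image as readable XML
less fsimage.xml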

Related

Retiring the once-only volume, holding important-looking files

/volume1 was once my only volume, and it has now been joined by /volume2 in preparation for retiring /volume1.
Having relocated all my content, I can see lots of files I cannot explain. Unusually, they are all prefixed with #, e.g.
/volume1$ ls -als
total 430144
0 drwxr-xr-x 1 root root 344 May 2 16:19 .
4 drwxr-xr-x 24 root root 4096 May 2 16:18 ..
0 drwxr-xr-x 1 root root 156 Jun 29 15:57 #appstore
0 drwx------ 1 root root 0 Apr 11 04:03 #autoupdate
0 drwxr-xr-x 1 root root 14 May 2 16:19 #clamav
332 -rw------- 1 root root 339245 Jan 23 13:50 #cnid_dbd.core.gz
0 drwxr-xr-x 1 admin users 76 Aug 19 2020 #database
0 drwx--x--x 1 root root 174 Jun 29 15:57 #docker
0 drwxrwxrwx+ 1 root root 24 Jan 23 15:27 #eaDir
420400 -rw------- 1 root root 430485906 Jan 4 05:06 #G1.core.gz
0 drwxrwxrwx 1 root root 12 Jan 21 13:47 #img_bkp_cache
0 drwxr-xr-x 1 root root 14 Dec 29 18:45 #maillog
0 drwxr-xr-x 1 root root 60 Dec 29 18:39 #MailScanner
0 drwxrwxr-x 1 root root 106 Oct 7 2018 #optware
7336 -rw------- 1 root root 7510134 Jan 24 01:33 #Plex.core.gz
0 drwxr-xr-x 1 postfix root 166 Oct 12 2020 #postfix
2072 -rw------- 1 root root 2118881 Jan 17 03:47 #rsync.core.gz
0 drwxr-xr-x 1 root root 88 May 2 16:19 #S2S
0 drwxr-xr-x 1 root root 0 Jan 23 13:50 #sharesnap
0 drwxrwxrwt 1 root root 48 Jun 29 15:57 #tmp
I have two questions:
what does the # prefix signify, and
how can I move/remove them, given that something is presumably going to miss these files?
From experimentation it seems the answers are:
Nothing - they're a convention used by the Synology packaging system, it appears.
With one exception, I didn't need to consider the consequences of removing the file system on which these stood. The #appstore directory clearly holds the installed Synology packages, and after pulling /volume1 they showed up in the Package Center as "needing repair". Once they were repaired, the same #-prefixed directories appeared on the new volume - and the configuration was retained - so it appears these directories hold only the immutable software components.
The exception: I use ipkg mostly for fetchmail. I took a listing of the installed packages as well as the fetchmailrc, and then reinstalled the same packages once "Easy Bootstrap Installer" was ready for use (repair didn't work on this, but uninstall and reinstall worked fine).
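For reference, that listing can be captured with something like the following (a sketch; list_installed is ipkg's listing subcommand, and the fetchmailrc path is an assumption):
ipkg list_installed > ipkg-packages.txt    # record the installed packages before wiping the volume
cp ~/.fetchmailrc fetchmailrc.bak          # path assumed; copy from wherever your fetchmailrc lives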

How can I segment/split a file in NiFi and get all the small pieces?

Good night
I have 5 files:
[azureuser@ibpoccloudera output]$ pwd
/home/azureuser/logs_auditoria/output
[azureuser@ibpoccloudera output]$ ls -lrth
total 5.1G
-rw-r--r-- 1 nifi nifi 1.2G Oct 6 00:38 auditoria_20200928.txt
-rw-r--r-- 1 nifi nifi 433M Oct 6 00:38 auditoria_20200927.txt
-rw-r--r-- 1 nifi nifi 1.5G Oct 6 00:38 auditoria_20200929.txt
-rw-r--r-- 1 nifi nifi 1.6G Oct 6 00:38 auditoria_20200925.txt
-rw-r--r-- 1 nifi nifi 427M Oct 6 00:38 auditoria_20200926.txt
And I want to split them into smaller pieces and put them in another directory using NiFi. I use this processor sequence:
GetFile -> SegmentContent -> PutFile
But when I check my output directory (PutFile), I only get the last segment that SegmentContent produced.
Is there any option to get something like the Linux split command?
[azureuser@ibpoccloudera output]$ split -b 524288000 auditoria_20200929.txt auditoria_20200929
[azureuser@ibpoccloudera output]$ ls -lrth
total 6.5G
-rw-r--r-- 1 nifi nifi 1.2G Oct 6 00:38 auditoria_20200928.txt
-rw-r--r-- 1 nifi nifi 433M Oct 6 00:38 auditoria_20200927.txt
-rw-r--r-- 1 nifi nifi 1.5G Oct 6 00:38 auditoria_20200929.txt
-rw-r--r-- 1 nifi nifi 1.6G Oct 6 00:38 auditoria_20200925.txt
-rw-r--r-- 1 nifi nifi 427M Oct 6 00:38 auditoria_20200926.txt
-rw-rw-r-- 1 azureuser azureuser 500M Oct 6 00:54 auditoria_20200929aa
-rw-rw-r-- 1 azureuser azureuser 500M Oct 6 00:55 auditoria_20200929ab
-rw-rw-r-- 1 azureuser azureuser 500M Oct 6 00:55 auditoria_20200929ac
-rw-rw-r-- 1 azureuser azureuser 14M Oct 6 00:55 auditoria_20200929ad
I solved the problem using SplitText and UpdateAttribute.
I use SplitText because I have a JSON file, so if I use SegmentContent it sometimes cuts a record in half and I get errors.
With UpdateAttribute I change the name of each file to a UUID, so I am sure that I don't get repeated filenames.
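For anyone repeating this, here is a minimal sketch of the two processors' settings (the split size is an example value; filename is a dynamic property on UpdateAttribute using the NiFi Expression Language):
SplitText
    Line Split Count: 100000          (lines per output flowfile)
    Header Line Count: 0
UpdateAttribute
    filename: ${UUID()}.txt           (unique name, so PutFile never overwrites an earlier split)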

Logstash Persistent Queues Not Creating Tail Files

I have just started playing with logstash 5.4.0 persistent queues.
I have configured Logstash to use persistent queues, though it always writes to the head page and never rolls the head over to a tail page.
My logstash.yml is as follows:
queue.checkpoint.writes: 1                          # checkpoint after every written event
queue.type: persisted                               # use the on-disk persistent queue
path.queue: /usr/share/logstash/persisted-queues    # directory for page and checkpoint files
queue.page_capacity: 1000mb                         # page file size; a head page only becomes a tail page once it fills up
And it creates
-rw-r--r-- 1 root root 1048576000 Feb 23 14:14 page.1
-rw-r--r-- 1 root root 34 Feb 23 14:14 checkpoint.head
A few minutes later I get
-rw-r--r-- 1 root root 1048576000 Feb 23 14:15 page.1
-rw-r--r-- 1 root root 34 Feb 23 14:14 checkpoint.head
The size of the file remains the same, and when I cat the page file I can see its contents changing.

What information Namenode stores in Hard disk and in memory?

I am trying to understand the Namenode. I have referred to online material and I am also referring to the book Hadoop: The Definitive Guide.
I understand that the Namenode has concepts like "edit logs" and "fsimage", and I can see the following files in my Namenode:
========================================================================
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 23 22:53 edits_0000000000000000001-0000000000000000001
-rw-r--r-- 1 root root 1048576 Nov 23 23:42 edits_0000000000000000002-0000000000000000002
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 00:07 edits_0000000000000000003-0000000000000000003
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 21:03 edits_0000000000000000004-0000000000000000004
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 24 22:59 edits_0000000000000000005-0000000000000000005
-rw-r--r-- 1 root root 1048576 Nov 24 23:00 edits_0000000000000000006-0000000000000000006
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 25 21:15 edits_0000000000000000007-0000000000000000007
-rw-rw-r-- 1 vevaan24 vevaan24 1048576 Nov 25 21:34 edits_0000000000000000008-0000000000000000008
-rw-r--r-- 1 root root 1048576 Nov 26 02:13 edits_inprogress_0000000000000000009
-rw-rw-r-- 1 vevaan24 vevaan24 355 Nov 25 21:15 fsimage_0000000000000000006
-rw-rw-r-- 1 vevaan24 vevaan24 62 Nov 25 21:15 fsimage_0000000000000000006.md5
-rw-r--r-- 1 root root 355 Nov 26 00:12 fsimage_0000000000000000008
-rw-r--r-- 1 root root 62 Nov 26 00:12 fsimage_0000000000000000008.md5
-rw-r--r-- 1 root root 2 Nov 26 00:12 seen_txid
-rw-rw-r-- 1 vevaan24 vevaan24 201 Nov 26 00:12 VERSION
=========================================================================
As expected, I see all these files in my Namenode. However, I haven't fully understood the concept, so I have the following questions. Can anyone please help me understand?
Q1) What are the fsimage files? Why are many fsimage files present?
Q2) What are the edits_000 files? Why are many edits_000 files present?
Q3) What are the .md5 files? What purpose do they serve?
I also read that the Namenode keeps some data in memory and some data on hard disk, but it is a bit confusing to understand which kind of information is stored on hard disk and which remains in memory.
Q4) Does the Namenode's memory hold information taken from the fsimage, the edit log, or both?
Q5) When the Namenode and Datanodes are restarted, how is the metadata constructed (that is, which file is stored on which datanode, in which blocks, etc.)?
OK, I'll try to explain:
EditLog
The EditLog is a transactional log that records every change that occurs to the file system metadata. For example, creating a new file or renaming a file will always generate an entry in the EditLog.
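You can inspect these transactions with the offline edits viewer, run against one of the edits files from the listing in the question; each record in the dump is one metadata transaction:
hdfs oev -p XML -i edits_0000000000000000001-0000000000000000001 -o edits.xml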
FsImage
This file contains the entire file system namespace, including the mapping of blocks to files and file system properties - so, which file consists of which blocks, along with attributes such as permissions and replication. Note that it does not contain the block locations (which datanode holds which block); those live only in the Namenode's memory.
When you start your NameNode, Hadoop loads the complete FsImage file into memory. After that it applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out to a new FsImage on disk. This only happens once (on startup). After that, Hadoop works only with the in-memory representation; the FsImage on your hard disk is not touched again until the next checkpoint.
Some of your Questions
Q1) Why are many fsimage files present?
As explained above, the FsImage is loaded, the EditLog is merged in, and then a new version is saved; older versions are kept around for a while.
Q2) Why are many edits_000 files present?
After Hadoop has merged the EditLog and persisted a new version of the FsImage, it starts a new EditLog. This is called a checkpoint in Hadoop.
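A checkpoint can also be forced by hand, which is an easy way to watch a new fsimage file appear (the Namenode must be in safe mode for this):
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace    # merge the current EditLog into a new on-disk FsImage
hdfs dfsadmin -safemode leave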
Q3) What are there .md5 files? What purpose do they serve?
They contain an MD5 hash used to check that the corresponding FsImage file is not corrupted.
Q5) When Namenode and Datanode is restarted, how is the meta-data constructed (that is, which file stored in which datanode, block etc.).
The file-to-block mapping is persisted in the FsImage (and EditLog), but the block locations are not; they are rebuilt from the block reports the datanodes send when they register with the Namenode after the restart.
I hope I could help.

hive script file not found exception

I am running the command below. The file is in my local directory, but I am getting the following error while running it.
[hdfs@ip-xxx-xxx-xx-xx scripts]$ ls -lrt
total 28
-rwxrwxrwx. 1 root root 17 Apr 1 15:53 hive.hive
-rwxrwxrwx 1 hdfs hadoop 88 May 7 11:53 shell_fun
-rwxrwxrwx 1 hdfs hadoop 262 May 7 12:23 first_hive
-rwxrwxrwx 1 root root 88 May 7 16:59 311_cust_shell
-rwxrwxrwx 1 root root 822 May 8 20:29 script_1
-rw-r--r-- 1 hdfs hadoop 31 May 8 20:30 script_1.log
-rwxrwxrwx 1 hdfs hdfs 64 May 8 22:07 hql2.sql
[hdfs@ip-xxx-xxx-xx-xx scripts]$ hive -f hql2.sql
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties
Could not open input file for reading.
(File file:/home/ec2-user/scripts/hive/scripts/hql2.sql does not exist)
[hdfs@ip-xxx-xxx-xx-xx scripts]$
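A guess based only on the error text: Hive seems to resolve the relative name against /home/ec2-user/scripts/hive/scripts rather than the directory the ls was run in, so passing an absolute path should sidestep the problem:
hive -f "$(pwd)/hql2.sql"    # absolute path to the script in the current directory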
