We have an HDFS cluster with a capacity of 900 TB. As the amount of stored data keeps growing, it is difficult to keep track of what is useful and what could be deleted.
I want to analyze HDFS usage for the following patterns so that the capacity can be used optimally:
Frequently accessed data.
Data not touched/accessed for a long time (possible candidates for deletion).
Data usage distribution by users.
Active users.
You can derive that data from:
(1) HDFS audit log (access patterns per user/IP)
(2) fsimage (access times per file, data not accessed)
(1) Do you have the HDFS audit log enabled? Read more here.
(2) To get started with the fsimage, read this - it includes an example of finding "data not being touched/accessed for a long time" (a similar scan via the FileSystem API is sketched below).
You may also want to consider HAR to archive the data (instead of deleting it), thus reducing the number of files and the precious memory used on the NameNode.
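If you want a quick, rough pass without processing the fsimage, a small sketch along these lines walks the namespace with the Hadoop FileSystem API and prints files whose last access time is older than a cutoff, plus owner and size so you can also build a per-user distribution. This assumes access-time tracking is on (dfs.namenode.accesstime.precision not set to 0) and uses a hypothetical /data root:

    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StaleFileScan {
        public static void main(String[] args) throws Exception {
            // Anything not accessed in the last 180 days is a candidate for deletion/archiving.
            long cutoff = System.currentTimeMillis() - 180L * 24 * 60 * 60 * 1000;
            FileSystem fs = FileSystem.get(new Configuration());

            Deque<Path> dirs = new ArrayDeque<>();
            dirs.push(new Path("/data"));              // hypothetical root to scan
            while (!dirs.isEmpty()) {
                for (FileStatus st : fs.listStatus(dirs.pop())) {
                    if (st.isDirectory()) {
                        dirs.push(st.getPath());
                    } else if (st.getAccessTime() > 0 && st.getAccessTime() < cutoff) {
                        // Owner and size also let you build the per-user usage distribution.
                        System.out.println(st.getOwner() + "\t" + st.getLen()
                                + "\t" + st.getPath());
                    }
                }
            }
        }
    }

For a namespace of this size, dumping the fsimage offline (as the linked post describes) is gentler on the NameNode than a live recursive listing, but the idea is the same.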
I am setting up a TimesTen In-Memory database and I am looking for guidance on the storage and location that I should use for the database's persistence files.
A TimesTen database consists of two types of file: checkpoint files (two) and transaction log files (always at least one, usually many).
There are 3 criteria to consider:
a) Data safety and availability (regular storage versus RAID). The database files are critical to the operation of the database; if they become inaccessible or are lost or damaged, your database will become inoperable and you will likely lose data. One way to protect against this is to use TimesTen's built-in replication to implement high availability, but even if you do that you may also want to protect your database files using some form of RAID storage. For performance reasons, RAID-1 is preferred over RAID-5 or RAID-6. Use of NFS storage is not recommended for database files.
b) Capacity. Both checkpoint files are located in the same directory (the DataStore attribute) and hence in the same filesystem. Each file can grow to a maximum size of PermSize + ~64 MB. Normally the space for these files is pre-allocated when the files are created, so it is less likely you will run out of space for them. By default, the transaction log files are also located in the same directory as the checkpoint files, though you can (and almost always should) place them in a different location by use of the LogDir attribute. The filesystem where the transaction logs are located should have enough space such that you never run out. If the database is unable to write data to the transaction logs it will stop processing transactions and operations will start to receive errors.
c) Performance. If you are using traditional spinning magnetic media, then I/O contention is a significant concern. The checkpoint files and the transaction log files should be stored on separate devices and separate from any other files that are subject to high levels of I/O activity. I/O loading and contention is less of a consideration for SSD storage and usually irrelevant for PCIe/NVMe flash storage.
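As a rough illustration of b) and c), a DSN entry in sys.odbc.ini could keep the two file types apart along these lines (the DSN name, paths and sizes are made up; DataStore, LogDir and PermSize are the attributes discussed above):

    [sampledb]
    Driver=/opt/TimesTen/instance1/lib/libtten.so
    DataStore=/disk1/tt_checkpoints/sampledb
    LogDir=/disk2/tt_logs/sampledb
    PermSize=4096
    DatabaseCharacterSet=AL32UTF8

Here the checkpoint files land under /disk1 and the transaction logs under /disk2, which should ideally be separate physical devices (or at least separate filesystems, with the log filesystem sized so it can never fill up).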
When saving a file to HDFS, the file is split into blocks and stored accordingly, and the change is recorded in the edit log - that much is clear.
My question is: when I request a read operation from the NameNode, where does it look up the DataNode details?
In the fsimage or in the edit log?
If it looks in the fsimage, a new fsimage is only generated at one-hour intervals.
What happens if I make the read request before that interval has elapsed?
Let's break down where each bit of information about the filesystem is stored on the NameNode.
The filesystem namespace (hierarchy of directories and files) is stored entirely in memory on the NameNode. There is no on-disk caching. Everything is in memory at all times. The FsImage is used only for persistence in case of failure. It is read only on startup. The EditLog stores changes to the FsImage; again, the EditLog is read only on startup. The active NameNode will never read an FsImage or EditLog during normal operation. However, a BackupNode or Standby NameNode (depending on your configuration) will periodically combine new EditLog entries with an old FsImage to produce a new FsImage. This is done to make startup more rapid and to reduce the size of on-disk data structures (if no compaction was done, the size of the EditLog would grow indefinitely).
The namespace discussed above includes the mapping from a file to the blocks contained within that file. This information is persisted in the FsImage / EditLog. However, the location of those blocks is not persisted into the FsImage. This information lives only transiently in the memory of the NameNode. On startup, the location of the blocks is reconstructed using the block reports received from all of the DataNodes. Each DataNode essentially tells the NameNode, "I have block ID AAA, BBB, CCC, ..." and so on, and the NameNode uses these reports to construct the location of all blocks.
To answer your question simply, when you request a read operation from the NameNode, all information is read from memory. Disk I/O is only performed on a write operation, to persist the change to the EditLog.
Primary Source: HDFS Architecture Guide; also I am a contributor to HDFS core code.
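To make that concrete, here is a minimal client-side sketch (the path /data/example.txt is hypothetical): on a read, the client first asks the NameNode for block locations, which the NameNode answers entirely from its in-memory block map, and only then talks to DataNodes for the actual bytes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationLookup {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example.txt");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers this from memory; no fsimage or edit log
            // is consulted for the lookup.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }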
Why can't the metadata be stored in HDFS with 3x replication? Why is it stored on the local disk?
Because resolving each request would take the NameNode more time, due to the extra I/O operations involved. So it is better to keep the metadata in the NameNode's memory.
There are multiple reasons:
(1) If the metadata were stored on HDFS, every lookup would involve network I/O, which would be slower.
(2) The NameNode would depend on the DataNodes for its own metadata.
(3) Metadata about the metadata would in turn be required by the NameNode, so that it could find where on HDFS the metadata itself is stored - a circular dependency.
Metadata is data about the data, such as which rack a block is stored in, so that the block can be located. If the metadata were stored in HDFS and the DataNodes holding it failed, you would lose all your data, because you would no longer know how to reach the blocks where your data was stored.
Even if you kept a higher replication factor, every change on a DataNode would have to be made in the replica DataNodes as well as in the NameNode's edit log.
Now, since there would be 3 replicas of the NameNode's metadata, every change on a DataNode would first have to be applied to:
1. its own replica blocks, and
2. the NameNode's metadata and each of its replicas (the edit log would be written 3 times).
This would mean writing much more data than before. But storage is not the only or even the main problem; the main problem is the time required to carry out all of these operations.
That is why the NameNode's metadata is instead backed up to a remote disk, so that even if your whole cluster fails (the chances are small) you can still recover your data.
To protect against NameNode failure, Hadoop comes with:
Primary NameNode -> holds the namespace image and edit logs.
Secondary NameNode -> merges the namespace image and edit logs so that the edit logs do not grow too large.
I am trying to understand the HBase architecture. I can see two different terms being used for the same purpose.
Write-Ahead Log and MemStore: both are used to store new data that hasn't yet been persisted to permanent storage.
What's the difference between WAL and MemStore?
Update:
WAL - used to recover not-yet-persisted data in case a server crashes.
MemStore - stores updates in memory as sorted KeyValues.
That seems like a lot of duplication of the data before it is ever written to disk.
The WAL is for recovery, NOT for data duplication (see also my answer here).
Please go through the points below to understand more.
An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family of a table for a given region.
The Write-Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck.
The WAL can be disabled to avoid this performance bottleneck.
This is done through the HBase client API, by calling on the Mutation (e.g. a Put):
Mutation.setDurability(Durability.SKIP_WAL) (older client versions used setWriteToWAL(false))
General note: it is common practice to disable the WAL while bulk-loading data, to gain speed. The side effect is that if you disable the WAL, you cannot replay the data if a RegionServer crashes before the MemStore is flushed.
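For example, with the current Java client API a single put can skip the WAL roughly like this (the table name, row key and column family are made up; the point is only the Durability.SKIP_WAL setting):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SkipWalExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("bulk_table"))) { // hypothetical table
                Put put = new Put(Bytes.toBytes("row-00001"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                        Bytes.toBytes("value"));
                // Skip the WAL: the edit goes only to the MemStore, so it is lost
                // if the RegionServer crashes before the next flush.
                put.setDurability(Durability.SKIP_WAL);
                table.put(put);
            }
        }
    }

If you use this for a bulk load, make sure the source data can be re-loaded, since any edits still in the MemStore at crash time are gone.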
Moreover, if you use Solr + HBase + Lily (i.e. Lily Morphline NRT indexing with HBase), the indexer works off the WAL. If you disable the WAL for performance reasons, Solr NRT indexing will not work, since Lily relies on the WAL.
Please have a look at the HBase architecture section.
I am new to big data technologies and I have a question about how HBase is integrated with Hadoop. What is meant by "HBase sits on top of HDFS"? My understanding is that HDFS is a collection of structured and unstructured data distributed across multiple nodes, and HBase is structured data.
How is HBase integrated with Hadoop to provide real-time access to the underlying data? Do we have to write special jobs to build indexes and such? In other words, is there an additional layer between HBase and HDFS that holds the data in a structure HBase understands?
HDFS is a distributed filesystem; one can do most regular FS operations on it, such as listing files in a directory, writing a regular file, reading a part of a file, etc. It's not simply "a collection of structured or unstructured data" any more than your EXT4 or NTFS filesystems are.
HBase is an in-memory key-value store which may persist to HDFS (it isn't a hard requirement; you can run HBase on any distributed filesystem). For any read of a key requested of HBase, it will first check its runtime memory caches to see if it has the value cached, and otherwise visit its stored files on HDFS to seek to and read out the specific value. There are various configuration options in HBase to control the way the cache is utilised, but HBase's speed comes from a combination of caching and indexed persistence (fast, seek-based file reads).
HBase's file-based persistence on HDFS does the key indexing automatically as it writes, so there is no manual indexing needed by its users. These files are regular HDFS files, but specialised in format for HBase's usage, known as HFiles.
These articles are slightly dated, but are still very reflective of the architecture HBase uses: http://blog.cloudera.com/blog/2012/06/hbase-write-path/ and http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/, and should help if you want to dig deeper.
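To see that there is no extra indexing layer you have to build yourself, note that a point read is just a client call; the RegionServer serves it from the MemStore/block cache or by an indexed seek into its HFiles on HDFS, with no MapReduce job involved. A minimal sketch (table name, row key, and column/family names are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
                Get get = new Get(Bytes.toBytes("user-42"));
                get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"));
                // No MapReduce job involved: the RegionServer answers from its
                // caches or by an indexed seek into the relevant HFile on HDFS.
                Result result = table.get(get);
                byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
                System.out.println(email == null ? "(no value)" : Bytes.toString(email));
            }
        }
    }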
HDFS is a distributed file system, and HBase is a NoSQL database that depends on the HDFS filesystem to store its data.
You should read up on these technologies, since your structured/unstructured comparison is not correct.
Update
You should check out the Google File System, MapReduce, and Bigtable papers if you are interested in the origins of these technologies.
Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google File System." ACM SIGOPS Operating Systems Review, Vol. 37, No. 5. ACM, 2003.
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
Chang, Fay, et al. "Bigtable: A Distributed Storage System for Structured Data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
It's easy to understand:
HDFS is a distributed filesystem and provides reads and writes through an append-only model.
HBase is a NoSQL database that builds on the HDFS filesystem and depends on it.
This can be read about here: Apache HBase documentation.