How can we design a file retrieval structure, that is, a data structure that maps file blocks to disk blocks, for simple and staggered striping of multimedia data?
File blocks can be mapped to disk blocks by maintaining, for each file, an array of file blocks in which each entry records the location of the corresponding disk block. This is how the blocks of one file are mapped to blocks on different disks. Every file is also mapped to the starting address of its first disk block. The disk blocks themselves form a second data structure, a linked list: the end of each block stores the address of the next disk block, so the multimedia data can be retrieved in order under either simple or staggered striping. Together, the two structures make it easy to locate the pieces of a file on different disks.
With this type of data structure, the multimedia server can retrieve a single file that is stored in multiple disk blocks. This lets the server provide distributed retrieval of a file from multiple disks.
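A minimal sketch of the structure described above, in Python. The class and parameter names (StripedStore, start_disk, stride) are made up for illustration, and the placement rule is an assumption: file block i goes to disk (start_disk + i * stride) mod D, where stride=1 stands in for simple round-robin striping and a larger stride stands in for a staggered layout. A real multimedia server would have its own placement policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

DiskAddr = Tuple[int, int]  # (disk number, block number on that disk)


@dataclass
class DiskBlock:
    data: bytes
    next_addr: Optional[DiskAddr]  # linked-list pointer to the file's next block


class StripedStore:
    """Toy model: a file table plus linked disk blocks spread over several disks."""

    def __init__(self, num_disks: int) -> None:
        self.disks: List[List[DiskBlock]] = [[] for _ in range(num_disks)]
        self.file_table: Dict[str, DiskAddr] = {}  # file name -> first disk block

    def write(self, name: str, blocks: List[bytes],
              start_disk: int = 0, stride: int = 1) -> List[DiskAddr]:
        # Plan every address first so each block can store its successor's address.
        next_free = [len(d) for d in self.disks]
        addrs: List[DiskAddr] = []
        for i in range(len(blocks)):
            disk = (start_disk + i * stride) % len(self.disks)  # assumed placement rule
            addrs.append((disk, next_free[disk]))
            next_free[disk] += 1
        for i, data in enumerate(blocks):
            nxt = addrs[i + 1] if i + 1 < len(addrs) else None
            self.disks[addrs[i][0]].append(DiskBlock(data, nxt))
        if addrs:
            self.file_table[name] = addrs[0]
        return addrs  # this list is the "array of file blocks" for the file

    def read(self, name: str) -> List[bytes]:
        # Start at the file's first block and follow the next pointers.
        addr = self.file_table.get(name)
        out: List[bytes] = []
        while addr is not None:
            disk, blk = addr
            block = self.disks[disk][blk]
            out.append(block.data)
            addr = block.next_addr
        return out
```

The array returned by write allows random access to any file block, while the next pointers allow sequential playback starting from the file table entry alone.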
Related
I want to use a Lucene MMapDirectory as a primary file store. Each file would be stored in a separate document as a byte array in a StoredField. All file properties that should be searchable, like file name, size etc., would be stored in indexable fields in the same document.
My questions would be:
What are the drawbacks of using Lucene directories for storing files, especially with regard to indexing and search performance and memory (RAM) consumption?
If this is not a "no-go", is there a better/faster way of storing files in the directory than as a byte array?
Short Answer
I really love Lucene and consider it to be the best open-source library, but I'm afraid that it's not a good decision to use it as a primary file store due to:
high CPU/memory overhead
slow index/query performance
high HDD utilization and doubled index size
weak recovery capabilities
Long Answer
Under the hood, Lucene uses the following files to keep all stored fields in one segment:
the fields index file (.fdx),
the fields data file (.fdt).
You can read more about how it works in Lucene50StoredFieldsFormat’s docs.
Because all stored files in a segment share these two files, it is almost impossible to restore any individual file after an I/O issue.
In order to return one file, Lucene has to read and decompress binary data from the disk block by block. This means high CPU overhead for decompression and a high memory footprint, because the whole file is kept in the Java heap space. Streaming is also not available, unlike with file and network storage.
The maximum document size is limited by the codec implementation: 2 GB per document.
Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents. Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This architecture becomes a big problem for file storage:
you cannot delete a file, only mark it as unavailable
a merge operation requires 2x disk space and consumes a lot of resources and disk throughput: it creates a new .fdt file and copies the content of the other .fdt files through Java code and Java heap memory
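To make the two points above concrete, here is a toy model in Python. It is not Lucene's API or file format; ToySegmentStore and its methods are invented for illustration. It only shows the general behaviour of write-once segments: a delete is just a marker, and a merge has to write the new segment while the old ones still exist.

```python
class ToySegmentStore:
    """Toy write-once segment store (illustrative only, not Lucene)."""

    def __init__(self):
        self.segments = []    # each segment is an immutable tuple of (doc_id, payload)
        self.deleted = set()  # delete markers; the payload bytes stay on "disk"

    def flush(self, docs):
        # New documents always go into a new segment; old segments never change.
        self.segments.append(tuple(docs))

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def bytes_on_disk(self):
        return sum(len(payload) for seg in self.segments for _, payload in seg)

    def merge_all(self):
        # Copy every live document into one new segment, then drop the old ones.
        merged = tuple(
            (doc_id, payload)
            for seg in self.segments
            for doc_id, payload in seg
            if doc_id not in self.deleted
        )
        # Old and new segments coexist until the merge finishes.
        peak = self.bytes_on_disk() + sum(len(p) for _, p in merged)
        self.segments = [merged]
        self.deleted.clear()
        return peak
```

In this model the space of a deleted file is reclaimed only by merge_all, and the peak usage during the merge approaches twice the live data size when few documents are deleted.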
So you won't be using MMapDirectory but an actual Lucene index.
I have had good experiences using Lucene as the primary data store for some projects.
Just be sure to also include a generated/natural unique ID, because the document IDs are not constant or reliable.
Also make sure you use a Directory implementation that fits your use case. I have switched to the normal RandomAccess implementation in the low-load case, since it uses less memory and is almost as fast.
I am trying to understand the NiFi data flow mechanism. I read that NiFi has a flow file, which holds content and metadata (flow file attributes).
So I wanted to understand: if I have 1 TB of data placed on an edge node and would like to pass it to NiFi processors, is it going to load everything into memory to be used by the processors?
FlowFiles (herein referred to as FF) are analogous to HTTP data in that they are composed of content and attributes (metadata), as you highlight. However, within the NiFi framework the metadata resides in memory (up to a configured limit per connection), and the content portion of the FF is actually a pointer to the content on disk. That is, once the content is received into NiFi, it is no longer held in memory at any point in time; this pass-by-reference approach allows NiFi to handle arbitrarily large files. The only thing stored in memory is the metadata of FFs, and the number held there is configurable on a per-connection basis.
When a processor needs to make a change, NiFi uses a copy-on-write approach for the modification.
In general, processors use a streaming approach for reading/writing data to/from the content repository. To that end, the included processors avoid storing the totality of a FF's content in memory, as it could prove prohibitive. Simple routing and movement of data for an arbitrarily large file should be no issue and avoids excess pressure on heap memory. When looking at doing transformations/modifications on such files, the answer is that it is typically okay, but it depends on the specifics of the data type.
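The streaming idea can be sketched in a few lines of Python. This is not NiFi code and does not use any NiFi API; stream_transform and chunk_size are made-up names. It only illustrates why a chunk-wise transformation keeps memory use bounded by the chunk size rather than by the file size.

```python
import io


def stream_transform(src, dst, transform, chunk_size=64 * 1024):
    """Copy src to dst through transform, holding at most one chunk in memory."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        dst.write(transform(chunk))
        total += len(chunk)
    return total


# Example with in-memory streams; real use would pass open binary files.
source = io.BytesIO(b"abc" * 1000)
sink = io.BytesIO()
copied = stream_transform(source, sink, bytes.upper, chunk_size=7)
```

A transformation that needs to see the whole content at once (for example, parsing one huge document into a tree) cannot be written this way, which is the "depends on the data type" caveat.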
I am working on HBase. I have a query regarding how HBase stores data in sorted order with an LSM tree.
As per my understanding, HBase uses an LSM tree for data transfer in large-scale data processing. When data comes from the client, it is first stored sequentially in memory, and then sorted and stored as a B-tree in a store file. Then the store file is merged with the on-disk B-tree (of keys). Is that correct? Am I missing something?
If yes, then in a cluster environment there are multiple RegionServers that take client requests. In that case, how do all the HLogs (one per RegionServer) merge with the on-disk B-tree (as existing keys are spread across all the DataNode disks)?
Is it that an HLog only merges its data with the HFiles of the same RegionServer?
You can take a look at these two articles, which describe exactly what you want:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
http://blog.cloudera.com/blog/2012/06/hbase-write-path/
In brief:
The client sends data to the region server that is responsible for handling the key
(.META. contains the key ranges for each region)
The user operation (e.g. put) is written to the Write-Ahead Log (WAL, the HLog)
(The log is used just for "safety": if the region server crashes, the log is replayed to recover data not yet written to disk)
After writing to the log, the data is also written to the MemStore
...once the MemStore reaches a threshold (conf property)
The MemStore is flushed to disk, creating a single HFile
...when the number of HFiles grows too large (conf property), compaction kicks in (merge)
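The steps above can be sketched as a toy model in Python. This is not HBase code: ToyRegion, flush_threshold and compaction_threshold are invented names, thresholds count entries and files rather than bytes, and everything lives in memory. It only mirrors the order of operations: log, MemStore, flush, compaction.

```python
class ToyRegion:
    """Toy model of the write path of a single region (illustrative only)."""

    def __init__(self, flush_threshold=3, compaction_threshold=2):
        self.flush_threshold = flush_threshold            # max MemStore entries
        self.compaction_threshold = compaction_threshold  # max number of HFiles
        self.wal = []       # write-ahead log of edits not yet flushed
        self.memstore = {}  # most recent edits, kept in memory
        self.hfiles = []    # immutable sorted lists of (key, value), oldest first

    def put(self, key, value):
        self.wal.append((key, value))  # 1. write to the log first
        self.memstore[key] = value     # 2. then to the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the MemStore out as one new sorted, immutable file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed edits no longer need the log
        if len(self.hfiles) > self.compaction_threshold:
            self.compact()

    def compact(self):
        # 4. merge all files into one; newer files win on duplicate keys
        merged = {}
        for hfile in self.hfiles:
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None
```

Note that the log is never merged with anything: each region server only flushes its own MemStore into its own files, and the log is discarded once those edits are on disk.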
In terms of the on-disk data structure:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
The article above covers the HFile format...
It's an append-only format, and can be seen as a B+ tree. (Keeping in mind that this B+ tree cannot be modified in place.)
The HLog is only used for "safety"; once the data is written to the HFiles, the logs can be thrown away.
According to the LSM-tree model, data in HBase consists of two parts: an in-memory tree, which contains the most recent updates to the data, and a disk store tree, which arranges the rest of the data in the form of an immutable sequential B-tree located on the hard drive. From time to time the HBase service decides that it has enough changes in memory to flush them into file storage. In that case it performs a rolling merge of the data from memory to disk, executing an operation similar to the merge step of the merge sort algorithm.
In the HBase infrastructure this data model is based on several components, which organize all data across the cluster as a collection of LSM-trees located on slave servers and driven by the main master service. The system is driven by the following components:
HMaster - the primary HBase service, which maintains the correct state of the slave Region Server nodes by managing and balancing the data among them. It also drives changes to metadata in the storage, such as table or column creations and updates.
ZooKeeper - a distributed cache used by HBase services and their clients to store reconciled, up-to-date information about naming and configuration.
Region servers - HBase worker nodes which manage and store pieces of the information in LSM-tree fashion
HDFS - used by Region servers behind the scenes for the actual storage of the data
At a low level, most of HBase's functionality is located within the Region server, which performs the read-write work on the tables. Every table can technically be distributed across different Region servers as a collection of separate pieces called HRegions. A single Region server node can hold several HRegions of one table. Each HRegion holds a certain range of rows, shared between memory and disk space and sorted by key attribute. These ranges do not intersect between different regions, so we can rely on their sequential behavior across the cluster. An individual Region server HRegion includes the following parts:
Write Ahead Log (WAL) file - the first place where data is persisted on every write operation, before it gets into memory. As I've mentioned earlier, the first part of the LSM-tree is kept in memory, which means that it can be affected by external factors such as a power loss. Keeping a log file of such operations in a separate place allows this part to be restored easily without any losses.
Memstore - keeps a sorted collection of the most recent updates in memory. It is the actual implementation of the first part of the LSM-tree structure described earlier. It periodically performs rolling merges into the store files, called HFiles, on the local hard drives
HFile - represents a small piece of data received from the Memstore and saved in HDFS. Each HFile contains a sorted KeyValue collection and a B+ tree index, which allows seeking the data without reading the whole file. Periodically HBase performs merge sort operations on these files to make them fit the configured size of a standard HDFS block and avoid the small files problem
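To illustrate how an index over sorted data lets a reader seek without scanning the whole file, here is a small Python sketch. It is not the real HFile format: ToyHFile is an invented name, and a single-level index of first keys per block stands in for the multi-level index of an actual HFile.

```python
import bisect


class ToyHFile:
    """Sorted key/value blocks plus an index of each block's first key."""

    def __init__(self, sorted_items, block_size=2):
        # Split the sorted items into fixed-size blocks.
        self.blocks = [
            sorted_items[i:i + block_size]
            for i in range(0, len(sorted_items), block_size)
        ]
        self.index = [block[0][0] for block in self.blocks]  # first key per block
        self.blocks_read = 0  # counts how many blocks lookups had to touch

    def get(self, key):
        # Binary search the index for the only block that could hold the key.
        pos = bisect.bisect_right(self.index, key) - 1
        if pos < 0:
            return None  # key sorts before the first block
        self.blocks_read += 1
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None
```

Each lookup touches at most one data block, regardless of how many blocks the file contains.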
You can walk through these elements manually by pushing the data and passing it through the whole LSM-tree process. I described how to do it in my recent article:
https://oyermolenko.blog/2017/02/21/hbase-as-primary-nosql-hadoop-storage/
I would like to provide a way to recognize when a large file is fragmented to a certain extent, and alert the user when they should perform a defragmentation. In addition, I'd like to show them a visual display demonstrating how the file is actually broken into pieces across the disk.
I don't need to know how to calculate how fragmented it is, or how to make the visual display. What I need to know is two things: 1) how to identify the specific clusters on any disk which contain pieces of any particular given file, and 2) how to identify the total number of clusters on that disk. I would essentially need a list of all the clusters which contain pieces of this file, and where on the disk each of those clusters is located.
Most defragmentation utilities have a visual display showing how the files are spread across the disk. My display will show how one particular file is split up into different areas of a disk. I just need to know how I can retrieve the necessary data to tell me where the file's clusters/sectors are located on the disk, so I can further determine how fragmented it is.
You can use the DeviceIoControl function with the FSCTL_GET_RETRIEVAL_POINTERS control code.
The FSCTL_GET_RETRIEVAL_POINTERS operation retrieves a variably sized
data structure that describes the allocation and location on disk of a
specific file. The structure describes the mapping between virtual
cluster numbers (VCN offsets within the file or stream space) and
logical cluster numbers (LCN offsets within the volume space).
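Once you have the returned structure, turning it into a list of cluster runs is simple arithmetic. The Python sketch below assumes you have already called DeviceIoControl and unpacked the result into a starting VCN and a list of (NextVcn, Lcn) pairs; the function names are made up for illustration, and the code does not call the Windows API itself. It also assumes that an LCN of -1 marks a run with no clusters allocated on disk (for example, a sparse region), which is skipped.

```python
def extents_to_runs(starting_vcn, extents):
    """Convert (next_vcn, lcn) pairs into (vcn, lcn, length_in_clusters) runs."""
    runs = []
    vcn = starting_vcn
    for next_vcn, lcn in extents:
        length = next_vcn - vcn  # each extent ends where the next one begins
        if lcn != -1:            # assumed: -1 means nothing allocated on disk
            runs.append((vcn, lcn, length))
        vcn = next_vcn
    return runs


def count_fragments(runs):
    """Count runs that do not start right after the previous run on disk."""
    fragments = 0
    expected_lcn = None
    for _, lcn, length in runs:
        if lcn != expected_lcn:
            fragments += 1
        expected_lcn = lcn + length
    return fragments


# Hypothetical data: three extents, the first two adjacent on disk.
example_runs = extents_to_runs(0, [(10, 1000), (25, 1010), (30, 5000)])
example_fragments = count_fragments(example_runs)
```

Each run gives you the first cluster on the volume (lcn) and how many consecutive clusters the file occupies from there, which is the data a visual display needs.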
I know there have been similar posts on here but I can't find one that really has a solid answer.
We have a Hadoop cluster loaded with binary files. These files can range in size from a few hundred KB to hundreds of MB.
We are currently processing these files using a custom record reader that reads the entire contents of the file into each map. From there we extract the appropriate metadata we want and serialize it into JSON.
The problem we are foreseeing is that we might eventually reach a size that our namenode can't handle. There is only so much memory to go around and having a namenode with a couple terabytes of memory seems ridiculous.
Is there a graceful way to process large binary files like this? Especially those which can't be split because we don't know what order the reducer will put them back together?
So not an answer as such, but I have so many questions that a list of comments would be more difficult to convey, so here goes:
You say you read the entire contents into memory for each map. Are you able to elaborate on the actual binary input format of these files:
Do they contain logical records i.e. does a single input file represent a single record, or does it contain many records?
Are the files compressed (after-the-fact or some internal compression mechanism)?
How are you currently processing this file-at-once, and what's your overall ETL logic to convert to JSON?
Do you actually need to read the entire file into memory before processing can begin, or can you process once you have a buffer of some size populated (DOM vs SAX XML parsing, for example)?
My guess is that you can migrate some of your mapper logic to the record reader, and possibly even find a way to 'split' the file between multiple mappers. This would then allow you to address your scalability concerns.
To address some points in your question:
The NameNode only requires memory to store information about the blocks (names, blocks[size, length, locations]). Assuming you assign it a decent memory footprint (GBs), there is no reason you can't have a cluster that holds petabytes of data in HDFS storage (assuming you have enough physical storage).
The NameNode doesn't have anything to do with either storage or processing. You should concentrate on your DataNodes and TaskTrackers instead. Also, I am not sure whether you are trying to address the storage or the processing of your files here. If you are dealing with lots of binary files, it is worth having a look at Hadoop SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs, and is therefore extensively used in MapReduce as an input/output format. For a detailed explanation you can visit this page:
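A rough back-of-the-envelope estimate, as a Python sketch. The figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb, not a measured value for your cluster, and the 128 MB block size is also an assumption; treat both parameters as placeholders to replace with your own numbers.

```python
def namenode_heap_estimate(num_files, avg_file_size,
                           block_size=128 * 1024**2,  # assumed HDFS block size
                           bytes_per_object=150):     # assumed rule of thumb
    """Estimate NameNode heap in bytes: one object per file plus one per block."""
    blocks_per_file = max(1, -(-avg_file_size // block_size))  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# Hypothetical workload: 10 million files of 50 MB each (500 TB of data).
estimate = namenode_heap_estimate(10_000_000, 50 * 1024**2)
```

Under these assumptions the estimate is 3,000,000,000 bytes, i.e. about 3 GB of heap for roughly 500 TB of data. The estimate grows with the number of files and blocks, not with the number of bytes stored, which is why many small files hurt more than a few large ones.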
http://wiki.apache.org/hadoop/SequenceFile
When you have large binary files, use the SequenceFile format as the input format and set the mapred input split size accordingly. You can set the number of mappers based on the total input size and the split size you have set. Hadoop will take care of splitting the input data.
If you have binary files compressed in some format, then Hadoop cannot do this split. So the binary format has to be SequenceFile.