NiFi | FlowFile Memory dependency - apache-nifi

I am trying to understand the NiFi data flow mechanism. I read that NiFi has a flow file, which holds content and metadata (flow file attributes).
So I wanted to understand: if I have 1 TB of data placed on an edge node and would like to pass it to NiFi processors, is NiFi going to load everything into memory for the processors to use?

FlowFiles (referred to here as FFs) are analogous to HTTP data in that they are composed of content and attributes (metadata), as you point out. Within the NiFi framework, however, the metadata resides in memory (up to a configured limit per connection), while the content portion of the FF is actually a pointer to the content on disk. That is, once the content has been received into NiFi, it is no longer held in memory; this pass-by-reference approach allows NiFi to handle arbitrarily large files. The only thing kept in memory is the FF metadata, and the number of FFs held this way is configurable on a per-connection basis.
When a processor needs to change the content, NiFi uses a copy-on-write approach: the modification is written as new content rather than applied in place.
In general, processors use a streaming approach for reading and writing data from and to the content repository. To that end, the included processors avoid holding the entirety of a FF's content in memory, as that could prove prohibitive. Simple routing and movement of an arbitrarily large file should be no issue and should not put excess pressure on heap memory. Transformations and modifications of such files are typically fine as well, but that depends on the specifics of the data type.
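To illustrate why streaming keeps memory use bounded, here is a minimal Python sketch of a chunked copy. It is not NiFi code; the function name `stream_copy` and the chunk size are assumptions made for illustration only. The point is that at most one chunk is held in memory at a time, regardless of how large the source is.

```python
import io


def stream_copy(source, sink, chunk_size=8192):
    """Copy source to sink chunk by chunk.

    Returns (total_bytes_copied, largest_chunk_held_in_memory).
    """
    total = 0
    largest = 0
    while True:
        chunk = source.read(chunk_size)  # at most chunk_size bytes in memory
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
        largest = max(largest, len(chunk))
    return total, largest


if __name__ == "__main__":
    # In-memory streams stand in for files on disk in this demo.
    src = io.BytesIO(b"x" * 1_000_000)
    dst = io.BytesIO()
    total, largest = stream_copy(src, dst, chunk_size=64 * 1024)
    print(f"copied {total} bytes, largest chunk held: {largest} bytes")
```

With real files, the same loop would work on file objects opened in binary mode, and the peak memory for the copy would stay near the chunk size.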

Related

Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

I have a use case where the Apache Flink process must integrate near real-time data streams (events) from multiple sources, but because the different systems lack uniform keys I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options.
Local map
If the surrogate key is never changing, you could just load it in RichMapFunction#open and perform the lookup. That of course means that you will have to adjust the memory settings such that Flink doesn't try to take all memory for its own operations.
Some quick math: assume both keys are strings of length 10. At 2 bytes per char, the two keys together need 40 bytes of chars in memory. With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager are not loading the same map twice. I'd also choose a format that is suited to loading the data as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
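The back-of-the-envelope estimate above can be reproduced with a few lines of Python. The function names are made up for this sketch, and the 10 bytes of per-entry object overhead and the 20% hash map headroom are assumptions chosen to match the rough figures in the answer, not measured values.

```python
def estimate_entry_bytes(key_len, bytes_per_char=2, overhead_per_entry=10):
    """Bytes for one (natural key, surrogate key) pair of equal-length strings."""
    chars = 2 * key_len * bytes_per_char  # two keys per entry
    return chars + overhead_per_entry     # assumed object overhead


def estimate_total_bytes(entries, key_len):
    """Raw size of all entries, before hash map overhead."""
    return entries * estimate_entry_bytes(key_len)


def with_headroom(raw_bytes):
    """Add an assumed 20% for hash map overhead (integer arithmetic)."""
    return raw_bytes * 6 // 5


if __name__ == "__main__":
    raw = estimate_total_bytes(50_000_000, 10)
    print(f"raw entries:   {raw / 1e9:.1f} GB")
    print(f"with headroom: {with_headroom(raw) / 1e9:.1f} GB")
```

Real JVM object sizes depend on the JVM version and settings, so treat the result as an order-of-magnitude guide and measure before sizing a cluster.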
State-based
If memory is an issue or the data is changing, you can also model the lookup data as map state. I'd add a second input for the lookup data and use a KeyedCoProcessFunction. Then feed whatever comes from the second input into the map state. The state should use a RocksDB backend, so that the data effectively resides on disk.
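The two-input idea can be sketched in plain Python. This is a toy stand-in, not Flink API: the class `LookupEnricher`, its method names, and the choice to buffer events that arrive before their mapping are all assumptions made for illustration. In Flink, the dictionary would be keyed map state and the two methods would correspond to the two inputs of the co-process function.

```python
class LookupEnricher:
    """Toy two-input operator: one input updates the lookup, one is enriched."""

    def __init__(self):
        self.lookup_state = {}  # natural key -> surrogate key
        self.pending = {}       # events seen before their mapping arrived

    def on_lookup(self, natural_key, surrogate_key):
        """Second input: store the mapping and release any buffered events."""
        self.lookup_state[natural_key] = surrogate_key
        buffered = self.pending.pop(natural_key, [])
        return [(surrogate_key, payload) for payload in buffered]

    def on_event(self, natural_key, payload):
        """First input: emit the event with its surrogate key if known."""
        surrogate_key = self.lookup_state.get(natural_key)
        if surrogate_key is None:
            # Assumed policy: hold the event until the mapping shows up.
            self.pending.setdefault(natural_key, []).append(payload)
            return []
        return [(surrogate_key, payload)]


if __name__ == "__main__":
    op = LookupEnricher()
    print(op.on_event("cust-7", "order A"))  # mapping not known yet
    print(op.on_lookup("cust-7", 1001))      # releases the buffered event
    print(op.on_event("cust-7", "order B"))  # mapping known
```

Whether to buffer, drop, or side-output events with an unknown key is a design decision the answer leaves open; buffering is only one option.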
Joining data
A lookup can also be modeled as a join. If you are already using Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStream with Tables.

What are the drawbacks of using a Lucene directory as a primary file store?

I want to use a Lucene MMapDirectory as a primary file store. Each file would be stored in a separate document as a byte array in a StoredField. All file properties that should be searchable, like file name, size etc., would be stored in indexable fields in the same document.
My questions would be:
What are the drawbacks of using Lucene directories for storing files, especially with regards to indexing and search performance and memory (RAM) consumption?
If this is not a "no-go", is there a better/faster way of storing files in the directory than as a byte array?
Short Answer
I really love Lucene and consider it to be the best open-source library, but I'm afraid it's not a good decision to use it as a primary file store due to:
high CPU/memory overhead
slow index/query performance
high HDD utilization and doubled index size
weak recovery capabilities
Long Answer
Under the hood, Lucene uses the following files to keep all stored fields in one segment:
the fields index file (.fdx),
the fields data file (.fdt).
You can read more about how it works in Lucene50StoredFieldsFormat’s docs.
This means that in case of any I/O issue it is almost impossible to restore any file.
In order to return one file, Lucene has to read and decompress binary data from disk block by block. This means high CPU overhead for decompression and a high memory footprint to keep the whole file in Java heap space. No streaming is available either, unlike with file and network storage.
The maximum document size is limited by the codec implementation: 2 GB per document.
Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents. Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This architecture becomes a big problem for file storage:
you cannot delete a file, only mark it as deleted
a merge operation requires 2x disk space and consumes a lot of resources and disk throughput: it creates a new .fdt file and copies the content of the other .fdt files through Java code and Java heap memory
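The two consequences above can be seen in a toy model of write-once segments. This is a simplified Python sketch, not Lucene code: the class `ToySegmentedStore` and its methods are invented for illustration, and it ignores compression, codecs, and merge policies.

```python
class ToySegmentedStore:
    """Append-only segments; deletes are marks, merges copy live data."""

    def __init__(self):
        self.segments = []    # each segment is an immutable tuple of (id, bytes)
        self.deleted = set()  # ids marked as deleted, data still on "disk"

    def add_segment(self, docs):
        self.segments.append(tuple(docs))

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # no bytes are reclaimed here

    def bytes_on_disk(self):
        return sum(len(data) for seg in self.segments for _, data in seg)

    def merge(self):
        """Merge all segments into one; return peak bytes on disk during merge."""
        merged = tuple(
            (doc_id, data)
            for seg in self.segments
            for doc_id, data in seg
            if doc_id not in self.deleted
        )
        # Old segments and the new one coexist until the merge completes.
        peak = self.bytes_on_disk() + sum(len(data) for _, data in merged)
        self.segments = [merged]
        self.deleted.clear()
        return peak


if __name__ == "__main__":
    store = ToySegmentedStore()
    store.add_segment([(1, b"a" * 100)])
    store.add_segment([(2, b"b" * 100)])
    store.delete(1)
    print(store.bytes_on_disk())  # space is not reclaimed by delete alone
    print(store.merge(), store.bytes_on_disk())
```

When nothing has been deleted, the peak during the merge is twice the live data size, which is the 2x disk space requirement mentioned above.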
So you won't be using MMapDirectory but an actual Lucene index.
I have had good experiences with using Lucene as the primary data store for some projects.
Just be sure to also include a generated/natural unique ID, because the document IDs are not constant or reliable.
Also make sure you use a Directory implementation that fits your use case. I have switched to the normal RandomAccess implementation in the low-load case, since it uses less memory and is almost as fast.

Log Structured Merge Tree in HBase

I am working on HBase. I have a query regarding how HBase stores data in sorted order with an LSM tree.
As per my understanding, HBase uses an LSM tree for data transfer in large-scale data processing. When data comes from the client, it is first stored sequentially in memory, then sorted and stored as a B-Tree in a store file. Then the store file is merged with the on-disk B-Tree (of keys). Is that correct? Am I missing something?
If yes, then in a cluster environment there are multiple RegionServers that take client requests. In that case, how are all the HLogs (one per RegionServer) merged with the on-disk B-Tree (given that existing keys are spread across the disks of all DataNodes)?
Is it that an HLog only merges data with the HFiles of the same RegionServer?
You can take a look at these two articles, which describe exactly what you want:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
http://blog.cloudera.com/blog/2012/06/hbase-write-path/
In brief:
The client sends data to the region server that is responsible for handling the key
(.META. contains the key ranges for each region)
The user operation (e.g. put) is written to the Write-Ahead Log (WAL, the HLog)
(The log is used just for "safety": if the region server crashes, the log is replayed to recover data not yet written to disk)
After writing to the log, the data is also written to the MemStore
...once the memstore reaches a threshold (conf property)
The memstore is flushed to disk, creating a single hfile
...when the number of hfiles grows too large (conf property), compaction kicks in (merge)
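The steps above can be sketched as a toy single-region store in Python. This is not HBase code: the class `ToyRegionStore`, its thresholds, and the simplification of clearing the log on every flush are assumptions for illustration. Compaction uses a k-way merge of the sorted files, keeping the newest value for each key.

```python
import heapq


class ToyRegionStore:
    """Toy write path: WAL -> memstore -> flush to sorted file -> compaction."""

    def __init__(self, flush_threshold=3, compaction_threshold=3):
        self.flush_threshold = flush_threshold
        self.compaction_threshold = compaction_threshold
        self.wal = []       # append-only log, used only for recovery
        self.memstore = {}  # most recent updates, held in memory
        self.hfiles = []    # immutable sorted files, oldest first

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log first, for safety
        self.memstore[key] = value     # 2. then update the memstore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))  # one new sorted file
        self.memstore.clear()
        self.wal.clear()  # flushed data no longer needs the log (simplified)
        if len(self.hfiles) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        # Tag entries so that, for equal keys, the newest file sorts first.
        tagged = [
            [(key, -age, value) for key, value in hfile]
            for age, hfile in enumerate(self.hfiles)
        ]
        merged = {}
        for key, _, value in heapq.merge(*tagged):
            merged.setdefault(key, value)  # first seen is the newest
        self.hfiles = [list(merged.items())]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):  # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None
```

A real store would use an index to seek within a file instead of scanning it, and would roll and archive log files rather than clearing a list.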
In terms of on disk data structure:
http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
The article above covers the hfile format...
it's an append-only format and can be seen as a b+tree. (Keeping in mind that this b+tree cannot be modified in place)
The HLog is only used for "safety"; once the data is written to the hfiles, the logs can be thrown away
According to the LSM-tree model, the data in HBase consists of two parts: an in-memory tree, which contains the most recent updates to the data, and an on-disk tree, which arranges the rest of the data in the form of an immutable sequential B-tree located on the hard drive. From time to time the HBase service decides that it has enough changes in memory to flush them to file storage. In that case it performs a rolling merge of the data from memory to disk, executing an operation similar to the merge step of the merge sort algorithm.
In the HBase infrastructure this data model is based on several components, which organize all data across the cluster as a collection of LSM-trees located on slave servers and driven by the main master service. The system is driven by the following components:
HMaster - the primary HBase service, which maintains the correct state of the slave Region Server nodes by managing and balancing the data among them. It also drives changes to metadata in the storage, like table or column creations and updates.
Zookeeper - represents a distributed cache used by HBase services and its clients to store reconciled up-to-date information about naming and configurations.
Region Servers - HBase worker nodes which perform the management and storage of pieces of the information in LSM-tree fashion
HDFS - used by Region Servers behind the scenes for the actual storage of the data
At a low level, most of the HBase functionality is located within the Region Server, which performs the read-write work on the tables. Every table can technically be distributed across different Region Servers as a collection of separate pieces called HRegions. A single Region Server node can hold several HRegions of one table. Each HRegion holds a certain range of rows, shared between memory and disk space and sorted by the key attribute. These ranges do not intersect between different regions, so we can rely on their sequential behavior across the cluster. An individual Region Server HRegion includes the following parts:
Write Ahead Log (WAL) file - the first place where data is persisted on every write operation, before getting into memory. As I've mentioned earlier, the first part of the LSM-tree is kept in memory, which means that it can be affected by external factors such as a power loss. Keeping a log file of such operations in a separate place allows this part to be restored easily without any losses.
Memstore - keeps a sorted collection of the most recent updates in memory. It is the actual implementation of the first part of the LSM-tree structure described earlier. Periodically performs rolling merges into the store files, called HFiles, on the local hard drives
HFile - represents a small piece of data received from the Memstore and saved in HDFS. Each HFile contains a sorted KeyValue collection and a B+Tree index, which allows seeking the data without reading the whole file. Periodically HBase performs merge sort operations on these files to make them fit the configured size of a standard HDFS block and avoid the small files problem
You can walk through these elements manually by pushing the data and passing it through the whole LSM-tree process. I described how to do it in my recent article:
https://oyermolenko.blog/2017/02/21/hbase-as-primary-nosql-hadoop-storage/

What does "local caching of data" mean in the context of this article?

The following paragraphs (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large files that are read sequentially are not suitable for local caching, but I don't understand what "local" means here.
I can think of two interpretations: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory so that clients can access it quickly. Can anyone explain more? Thanks a lot.
But while HDFS is very scalable, its high performance design also restricts it to a
particular class of applications; it is not as general-purpose as NFS. There are a large
number of additional decisions and trade-offs that were made with HDFS. In particular:
Applications that use HDFS are assumed to perform long sequential streaming reads from
files. HDFS is optimized to provide streaming read performance; this comes at the expense of
random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files
after they have already been closed are not supported. (An extension to Hadoop will provide
support for appending new data to the ends of files; it is scheduled to be included in
Hadoop 0.19 but is not available yet.)
Due to the large size of files, and the sequential nature of reads, the system does
not provide a mechanism for local caching of data. The overhead of caching is great enough
that data should simply be re-read from HDFS source.
Individual machines are assumed to fail on a frequent basis, both permanently and
intermittently. The cluster must be able to withstand the complete failure of several
machines, possibly many happening at the same time (e.g., if a rack fails all together).
While performance may degrade proportional to the number of machines lost, the system as a
whole should not become overly slow, nor should information be lost. Data replication
strategies combat this problem.
Any real MapReduce job is probably going to process GBs (10s/100s/1000s) of data from HDFS.
Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB, depending on your configuration) in a sequential manner (it will read the file / block in its entirety from start to end).
It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the immediate future. Moreover, multiple mapper instances will be processing data alongside this mapper in any one TaskTracker (hopefully with a fair few being 'local' to the actual physical location of the data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).
With all this in mind, caching the data read from HDFS is probably not going to gain you much: you'll most probably not get a cache hit on that data before another block is requested and ultimately replaces it in the cache.
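The argument can be checked with a small simulation. The sketch below counts hits in a least-recently-used (LRU) cache; the function name `count_cache_hits` and the choice of LRU as the eviction policy are assumptions for illustration, since neither the tutorial nor the answer names a specific policy.

```python
from collections import OrderedDict


def count_cache_hits(access_sequence, capacity):
    """Return the number of hits an LRU cache of the given capacity would see."""
    cache = OrderedDict()
    hits = 0
    for block in access_sequence:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits


if __name__ == "__main__":
    one_pass = list(range(100))            # each block read once, in order
    two_passes = one_pass * 2              # the same scan repeated
    print(count_cache_hits(one_pass, capacity=10))
    print(count_cache_hits(two_passes, capacity=10))
    print(count_cache_hits([0, 1, 0, 1], capacity=2))
```

A single sequential scan never revisits a block, and a repeated scan over more blocks than the cache holds evicts each block before it is needed again, so both produce no hits. Only a small working set that is accessed repeatedly benefits from the cache.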

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100 GB of data?

I have learned that Cassandra uses Bloom filters for performance, and that it stores the filter data in physical memory.
1) Where does Cassandra store these filters? (In heap memory?)
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
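As a convenience, the check described above can be scripted. This is a minimal sketch: the function names are made up, the data directory must be supplied by you (its location depends on your installation), and the demo at the bottom runs against a throwaway directory with placeholder files rather than real SSTables.

```python
import tempfile
from pathlib import Path


def bloom_filter_sizes(data_dir):
    """Map each *-Filter.db file under data_dir to its size in bytes."""
    return {
        str(path): path.stat().st_size
        for path in Path(data_dir).rglob("*-Filter.db")
    }


def total_filter_bytes(data_dir):
    """Sum of the sizes of all *-Filter.db files under data_dir."""
    return sum(bloom_filter_sizes(data_dir).values())


if __name__ == "__main__":
    # Placeholder files only; point the functions at your own data directory.
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp) / "keyspace" / "table"
        table_dir.mkdir(parents=True)
        (table_dir / "demo-1-Filter.db").write_bytes(b"\0" * 2048)
        (table_dir / "demo-1-Data.db").write_bytes(b"\0" * 4096)
        for name, size in bloom_filter_sizes(tmp).items():
            print(name, size)
        print("total:", total_filter_bytes(tmp))
```

The on-disk size is only an indication of how much memory the loaded filters need while the node is running.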

Resources