IIB 9 taking huge memory for the product feed flows - ibm-integration-bus

We have a few product feed flows that receive data from a database. As soon as the data is retrieved from the database and stored in a Row variable, the data flow engine consumes a huge amount of memory (around 4 GB for 170 MB of data). To understand the problem, we created a simple flow with an MQInput node and one Compute node that returns the result set and stores it in a Row variable. For this simple operation the memory consumption is 4.8 GB, while the actual data is 148 MB.

Related

Nifi Hbase data insertion taking more space than original data

I am doing real-time data transformation using NiFi, and after processing the data is stored in HBase. I am using PutHBaseJSON to store the data in HBase, with a UUID as the row key/id. The original size of a single JSON record, as reported by NiFi data provenance or by an online tool, is 390 bytes. But 15 million records take up 55 GB, which works out to about 3.9 KB per record.
So I do not understand how the data is stored, why the data stored in HBase is larger than the original data, and how I can reduce or optimize this in both HBase and NiFi (if any changes are required).
JSON:
{"_id":"61577d7aba779647060cb4e9","index":0,"guid":"c70bff48-008d-4f5b-b83a-f2064730f69c","isActive":true,"balance":"$3,410.16","picture":"","age":40,"eyeColor":"green","name":"Delia Mason","gender":"female","company":"INTERODEO","email":"deliamason#interodeo.com","phone":"+1 (892) 525-3498","address":"682 Macon Street, Clinton, Idaho, 3964","about":"","registered":"2019-09-03T06:00:32 -06:-30"}
Steps to reproduce in NiFi:
GenerateFlowFile ---> PutHBaseJSON (UUID row key)
Update1:
data stored in hbase:
I think the main thing that may be surprising you is that HBase stores each column of a table as an individual record.
Suppose your UUID is 40 characters on average, fields 1, 2 and 3 are each 5 characters on average, and a timestamp of length 15 is added.
Originally you would then have 40+5+5+5+15 = 70 characters of data.
After storing it per column, as in your screenshot, three columns would take 3*(40+5+15) = 180, and this effect grows if you have smaller fields or more of them.
I got this understanding from your screenshot but also from this article: https://dzone.com/articles/how-to-improve-apache-hbase-performance-via-data-s
The obvious way forward, if you want to reduce your footprint, is to reduce the overhead. I believe the article recommends serialization, but it may also be possible to simply put the entire JSON body into one column, depending on how you plan to access it.
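The arithmetic in this answer can be sketched in a few lines of Python. This is only a simplified model using the answer's assumed lengths (40 for the key, 5 per field, 15 for the timestamp); the function names are made up for illustration, and real HBase cells carry further overhead (column family, qualifier, length fields) that is ignored here.

```python
def row_oriented_size(key_len, field_lens, ts_len):
    """Size if the row key and timestamp are stored once per row."""
    return key_len + sum(field_lens) + ts_len


def per_cell_size(key_len, field_lens, ts_len):
    """Size if every column repeats the row key and timestamp."""
    return sum(key_len + field_len + ts_len for field_len in field_lens)


# Assumed lengths taken from the answer, not measured values.
KEY_LEN, TS_LEN = 40, 15
FIELDS = [5, 5, 5]

flat = row_oriented_size(KEY_LEN, FIELDS, TS_LEN)           # 40+5+5+5+15
cells = per_cell_size(KEY_LEN, FIELDS, TS_LEN)              # 3*(40+5+15)
one_column = per_cell_size(KEY_LEN, [sum(FIELDS)], TS_LEN)  # whole body in one cell

print(flat, cells, one_column)
```

Under this model, putting the whole body into a single column brings the footprint back to the row-oriented figure, because the key and timestamp are repeated only once.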

Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

I have a use case where an Apache Flink process must integrate near-real-time data streams (events) from multiple sources, but because the different systems lack uniform keys I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?
There are a few options.
Local map
If the surrogate key never changes, you could just load the data set in RichMapFunction#open and perform the lookup. That of course means that you will have to adjust the memory settings such that Flink doesn't try to take all memory for its own operations.
Some quick math: assume both keys are strings of length 10. They will each need 40 bytes of chars in memory. With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.
Of course, you need to ensure that different tasks of the same task manager are not loading the same map twice. I'd also choose a format that is suited to loading the data as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
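The quick math for the local map can be written out as a small Python sketch. The ~50 bytes per entry and the 3 GB planning figure are the answer's own estimates, not measurements; the variable and function names are hypothetical.

```python
def lookup_map_bytes(entries, bytes_per_entry):
    """Raw payload of an in-memory lookup map, before hash map overhead."""
    return entries * bytes_per_entry


ENTRIES = 50_000_000            # 50M surrogate keys
BYTES_PER_ENTRY = 50            # estimate from the answer
PLANNED_BYTES = 3_000_000_000   # 3 GB planning figure from the answer

raw_bytes = lookup_map_bytes(ENTRIES, BYTES_PER_ENTRY)
headroom_bytes = PLANNED_BYTES - raw_bytes

print(f"raw payload: {raw_bytes / 1e9:.1f} GB")
print(f"headroom for hash map overhead: {headroom_bytes / 1e9:.1f} GB")
```

Changing BYTES_PER_ENTRY to match the real key lengths is the quickest way to see whether the map still fits next to Flink's own memory.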
State-based
If memory is an issue or the data is changing, you can also model the lookup data as a map state. I'd add a second input for that lookup data and use a KeyedCoProcessFunction. Then feed whatever comes from the second input into the map state. The state should use a RocksDB backend, such that the data effectively resides on disk.
Joining data
A lookup can also be modeled as a join. If you are already using Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStream with Tables.

How does HDFS stores single data which is larger than the block size?

How will Hadoop split the data if a single record is larger than the block size?
E.g. the data I am storing (talking about a single record) is 80 MB in size and the block size is 64 MB, so how does Hadoop manage such a scenario?
If we use a 64 MB block size, the data will be loaded into only two blocks (64 MB and 16 MB). Hence the size of the metadata is decreased.
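As a rough illustration of the 64 MB + 16 MB figure, here is a minimal Python sketch of cutting a file into fixed-size blocks. The function name is hypothetical and this is not HDFS code; it only mirrors the arithmetic.

```python
def split_into_blocks(file_mb, block_mb):
    """Return the sizes (in MB) of the blocks a file of file_mb occupies."""
    full_blocks, remainder = divmod(file_mb, block_mb)
    sizes = [block_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(80, 64))   # the example from the question
print(split_into_blocks(80, 128))  # same record with a larger block size
```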
Edit:
The Hadoop framework divides a large file into blocks (64 MB or 128 MB) and stores them on the slave nodes. HDFS is unaware of the content of a block. While writing data into a block, it may happen that a record crosses the block limit, so that part of the record is written to one block and the rest to another.
Hadoop tracks this split of data through a logical representation of the data known as an input split. When the MapReduce client calculates the input splits, it checks whether the entire record resides in the same block. If the record overflows and part of it is written to another block, the input split captures the location of the next block and the byte offset of the data needed to complete the record. This usually happens with multi-line records, as Hadoop is intelligent enough to handle the single-line record scenario.
Usually the input split size is configured to be the same as the block size, but consider the case where the input split is larger than the block size. The input split represents the amount of data that goes to one mapper. Consider the example below:
• Input split = 256MB
• Block size = 128 MB
Then the mapper will process two blocks that can be on different machines, which means the mapper will have to transfer data between machines to process them. Hence, to avoid unnecessary data movement (that is, to preserve data locality), we usually keep the input split the same size as the block.
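The example above can be checked with a short Python sketch that lists which blocks a given input split touches. The function name and the offset/length parameters are assumptions made for illustration; Hadoop computes splits internally and this is not its API.

```python
def blocks_for_split(offset_mb, length_mb, block_mb):
    """Return the indices of the blocks covered by one input split."""
    first = offset_mb // block_mb
    last = (offset_mb + length_mb - 1) // block_mb
    return list(range(first, last + 1))


# Input split = 256 MB, block size = 128 MB: one mapper reads two blocks.
print(blocks_for_split(0, 256, 128))
# Input split = block size: each mapper reads exactly one block.
print(blocks_for_split(0, 128, 128))
print(blocks_for_split(128, 128, 128))
```

When a split covers more than one block index, at least one of those blocks may live on a different machine than the mapper, which is the data movement the answer warns about.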

What makes Spark fast if data size exceeds available memory?

Everywhere I try to understand Spark, it says it is fast because it keeps data in memory, as opposed to MapReduce. Let's take this example:
I have a 5-node Spark cluster with 100 GB of RAM on each node. Let's say I have 500 TB of data to run a Spark job against. The total data that Spark can keep in memory is 100*5 = 500 GB. If it can keep at most 500 GB of data in memory at any point in time, what makes it lightning fast?
Spark isn't magical and can't change fundamental principles of computing. Spark uses memory as a progressive enhancement and will fall back to disk I/O for huge datasets that can not be kept in memory. In a scenario where tables must be scanned from disks, spark performance should be comparable to other parallel solutions involving table scanning from disk.
Suppose only 0.1% of the 500 TB is "interesting". For instance, in a marketing funnel there are a lot of ad impressions, fewer clicks, even fewer sales, and fewer still repeat sales. A program can filter through a huge dataset and tell Spark to cache in memory a smaller, filtered and corrected dataset needed for further processing. Working from a cached, smaller filtered dataset is obviously much faster than repeated disk table scans and repeated processing of the larger raw data.
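To put numbers on this, a back-of-the-envelope Python sketch using the question's cluster (5 nodes with 100 GB each) and the 0.1% figure above. Decimal units (1 TB = 1000 GB) are assumed, and in practice only part of each node's RAM is available to Spark for caching, so this is an order-of-magnitude comparison only.

```python
TOTAL_TB = 500
GB_PER_TB = 1000               # decimal units, an assumption
INTERESTING_PER_THOUSAND = 1   # 0.1% expressed as 1 in 1000
NODES, RAM_PER_NODE_GB = 5, 100

raw_gb = TOTAL_TB * GB_PER_TB
filtered_gb = raw_gb * INTERESTING_PER_THOUSAND // 1000
cluster_ram_gb = NODES * RAM_PER_NODE_GB

print(f"raw data:       {raw_gb} GB")
print(f"filtered data:  {filtered_gb} GB")
print(f"aggregate RAM:  {cluster_ram_gb} GB")
```

The raw data is a thousand times larger than the cluster's memory, while the filtered set is of the same order as the aggregate RAM, which is where caching starts to pay off.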

ScaleOut Software In Memory DataGrid Using Hadoop

I have been doing some reading on real time processing using hadoop and stumbled upon this http://www.scaleoutsoftware.com/hserver/
From what the documentation says, it looks like they implemented an in-memory data grid using the Hadoop worker/slave nodes. I have a couple of questions here:
From my understanding, if I have 100 GB of data, I would need at least 100 GB of RAM across all nodes in my cluster just for the data, plus additional RAM for the task tracker and data node daemons, plus additional RAM for the hServer service that would run on all these nodes. Is my understanding correct?
The software claims they can do real-time data processing by improving the latency issues in Hadoop. Is that because it allows us to write data to the in-memory grid instead of HDFS?
I am new to Big Data technologies. Apologies if some of the questions are naive.
[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]
In-memory data grids create a replica of every object to ensure high availability in case of failures. The aggregate amount of memory required is the memory used to store the objects plus the memory used to store the object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
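The sizing rule in this paragraph can be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes one replica per object and an even spread of data across servers, as in the example above.

```python
def grid_memory_gb(data_gb, replicas_per_object, servers):
    """Return (total grid memory, memory per server) in GB."""
    total = data_gb * (1 + replicas_per_object)
    return total, total / servers


total_gb, per_server_gb = grid_memory_gb(data_gb=100, replicas_per_object=1, servers=4)
print(total_gb, per_server_gb)
```

This covers only the memory for objects and replicas; RAM for the Hadoop daemons on each node comes on top of it.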
With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Resources