NiFi HBase data insertion taking more space than original data - hadoop

I am transforming data in real time with NiFi and storing the processed data in HBase using PutHBaseJSON, with a UUID as the row key. According to NiFi data provenance (and an online size tool), a single JSON record is 390 bytes. However, 15 million records take up 55 GB in HBase, which works out to about 3.9 KB per record.
I don't understand how the data is stored: why is the size in HBase so much larger than the original data, and how can I reduce or optimize it in HBase and NiFi (if any changes are required there)?
JSON:
{"_id":"61577d7aba779647060cb4e9","index":0,"guid":"c70bff48-008d-4f5b-b83a-f2064730f69c","isActive":true,"balance":"$3,410.16","picture":"","age":40,"eyeColor":"green","name":"Delia Mason","gender":"female","company":"INTERODEO","email":"deliamason#interodeo.com","phone":"+1 (892) 525-3498","address":"682 Macon Street, Clinton, Idaho, 3964","about":"","registered":"2019-09-03T06:00:32 -06:-30"}
Steps to reproduce in NiFi:
GenerateFlowFile ---> PutHBaseJSON (UUID row key)
Update1:
data stored in hbase:

I think the main thing that may be surprising you is that HBase stores each column of a row as an individual record, and each of those records carries the row key and a timestamp.
Suppose your UUID is 40 characters on average, fields 1, 2 and 3 are each 5 characters on average, and a timestamp of length 15 is added.
Originally you would have 40+5+5+5+15 = 70 characters of data.
After storing it per column as in your screenshot, with three columns it becomes 3*(40+5+15) = 180, and the effect grows as your fields get smaller or more numerous.
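The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the simplified model in this answer (key, field and timestamp lengths in characters), not of HBase's real on-disk format, and the function names are made up for illustration.

```python
def original_size(key_len, field_lens, ts_len):
    # One record: the key and timestamp are stored once, next to all fields.
    return key_len + sum(field_lens) + ts_len


def per_column_size(key_len, field_lens, ts_len):
    # One entry per column: every field repeats the key and the timestamp.
    return sum(key_len + field_len + ts_len for field_len in field_lens)


print(original_size(40, [5, 5, 5], 15))    # 70
print(per_column_size(40, [5, 5, 5], 15))  # 180
```

With more fields of the same size, the second number keeps growing by 60 per field while the first grows by only 5.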
I got this understanding from your screenshot but also from this article: https://dzone.com/articles/how-to-improve-apache-hbase-performance-via-data-s
The obvious way to reduce your footprint is to reduce this overhead. I believe the article recommends serialization, but depending on how you plan to access the data, it may also be possible to simply put the entire JSON body into one column.
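To get a rough feel for the difference between one column per field and a single column holding the whole JSON, here is a hedged estimate based on the uncompressed HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row key, family, qualifier and value bytes). The family name "cf", the qualifier "json", and the assumption that every top-level field becomes one cell with its value stored as text are illustrative. The estimate ignores compression, block encoding, tags, the WAL and HDFS replication, so real numbers will differ.

```python
import json

# Fixed bytes per KeyValue: key len + value len + row len + family len + timestamp + type
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate uncompressed size of one HBase cell."""
    return KV_FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)


def per_field_size(record: dict, row_key: str, family: str = "cf") -> int:
    """One cell per top-level JSON field (assumed layout)."""
    total = 0
    for name, value in record.items():
        text = value if isinstance(value, str) else json.dumps(value)
        total += cell_size(row_key.encode(), family.encode(), name.encode(), text.encode())
    return total


def single_cell_size(record: dict, row_key: str, family: str = "cf",
                     qualifier: str = "json") -> int:
    """The whole record serialized into one cell (hypothetical qualifier name)."""
    body = json.dumps(record, separators=(",", ":")).encode()
    return cell_size(row_key.encode(), family.encode(), qualifier.encode(), body)


# A few fields taken from the JSON in the question.
record = {"_id": "61577d7aba779647060cb4e9", "index": 0, "isActive": True,
          "age": 40, "name": "Delia Mason"}
row_key = "c70bff48-008d-4f5b-b83a-f2064730f69c"

print(per_field_size(record, row_key))
print(single_cell_size(record, row_key))
```

The trade-off is that a single JSON column can no longer be read or filtered field by field on the HBase side.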

Related

BigQuery table data performance

In BigQuery I have a table storing 237 GB of data. I don't have any column on which I can create a partition, as the table does not store any date fields.
When I use it in a query, the estimate says 77 GB of data will be processed, but under bytes shuffled I see 7 GB.
What is the actual amount of data processed here?
Is there any way I could restructure this table?
BigQuery operates column-wise. If you select only the columns you really need in a query, you are already optimizing cost. Traditional databases operate row-wise, so this can be a bit counter-intuitive.
There's also this great blog article on optimizing for costs.
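As a rough illustration of the column-wise behaviour, the sketch below models bytes processed as the total size of the columns a query references, regardless of how many rows match. The column names and per-column sizes are hypothetical, chosen only so that they add up to the 237 GB and 77 GB figures from the question. Bytes shuffled is a separate metric (data moved between query stages), so under this model the 77 GB figure is the amount processed.

```python
# Hypothetical per-column sizes in GB; together they make up the 237 GB table.
COLUMN_SIZES_GB = {
    "customer_id": 12,
    "event_type": 25,
    "country": 40,
    "payload": 160,
}


def scanned_gb(selected_columns, column_sizes=COLUMN_SIZES_GB):
    """Estimate data processed as the sum of the referenced columns' sizes."""
    return sum(column_sizes[name] for name in selected_columns)


print(scanned_gb(COLUMN_SIZES_GB))                              # 237, like SELECT *
print(scanned_gb(["customer_id", "event_type", "country"]))     # 77
```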

All else held equal, which is the faster querying option: Milvus, RocksDB, or Apache HBase

I have a requirement to store billions of records (with capacity up to one trillion records) in a database (total size is in terms of petabytes). The records are textual fields with about 5 columns representing transactional information.
I want to be able to query data in the database incredibly quickly, so I was researching Milvus, Apache HBase, and RocksDB. Based on my research, all three are incredibly fast and work well with large amounts of data. All else equal, which of these three is the fastest?
What type of data are you storing in the database?
Milvus is used for vector storage and computation.
If you want to search by the semantics of the text, Milvus is the fastest option.
HBase and RocksDB are both key-value databases.
If you want to search by key columns, these two would be faster.

Create Input Format of Elasticsearch using Flink Rich InputFormat

We are using Elasticsearch 6.8.4 and Flink 1.0.18.
We have an index with 1 shard and 1 replica in Elasticsearch, and I want to create a custom input format to read and write data in Elasticsearch using the Apache Flink DataSet API with more than one input split, in order to achieve better performance. Is there any way I can achieve this?
Note: each document is large (almost 8 MB), so I can read only 10 documents at a time because of the size constraint, and per read request we want to retrieve 500k records.
As I understand it, the parallelism should equal the number of shards/partitions of the data source. However, since we store only a small amount of data, we have kept the number of shards at 1, and the data is static, growing only very slightly per month.
Any help or example source code will be much appreciated.
You need to be able to generate queries to ES that effectively partition your source data into relatively equal chunks. Then you can run your input source with a parallelism > 1, and have each sub-task read only part of the index data.
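One possible way to produce such partitioned queries is Elasticsearch's sliced scroll, where every request carries a "slice" object with its own id and the total number of slices. The sketch below only builds the request bodies, one per sub-task; the helper name and the default match_all query are assumptions, and sending the bodies (for example to the index's _search endpoint with a scroll parameter) is left out.

```python
def sliced_scroll_bodies(num_slices, query=None, page_size=10):
    """Build one search body per parallel reader using sliced scroll."""
    if num_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    if query is None:
        query = {"match_all": {}}
    return [
        {
            "slice": {"id": slice_id, "max": num_slices},  # this reader's share
            "query": query,
            "size": page_size,  # 10 documents per page, as in the question
        }
        for slice_id in range(num_slices)
    ]


for body in sliced_scroll_bodies(4):
    print(body)
```

Each Flink input split would then map to one slice id, so the parallelism is no longer tied to the single shard.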

Indexing process in Hadoop

Could anybody please explain what is meant by the indexing process in Hadoop?
Is it something like the traditional indexing of data that we do in an RDBMS, so that, by analogy, in Hadoop we index the data blocks and store the physical addresses of the blocks in some data structure?
If so, it would take up additional space in the cluster.
I googled this topic but could not find anything satisfactory and detailed.
Any pointers will help.
Thanks in advance
Hadoop stores data in files, and does not index them. To find something, we have to run a MapReduce job going through all the data. Hadoop is efficient where the data is too big for a database. With very large datasets, the cost of regenerating indexes is so high you can't easily index changing data.
However, we can index data in HDFS in two ways: file-based indexing and InputSplit-based indexing.
Let's assume that we have 2 files to store in HDFS for processing. The first one is 500 MB and the second one is around 250 MB. With InputSplits of up to 128 MB each, we'll have 4 InputSplits on the 1st file and 2 InputSplits on the 2nd file.
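A quick check of the split counts, assuming a 128 MB split size and plain ceiling division (a simplification that ignores the small slop Hadoop allows on the last split): the 500 MB file yields 4 splits and the 250 MB file yields 2.

```python
import math


def num_splits(file_mb, split_mb=128):
    """Number of InputSplits for a file, using simple ceiling division."""
    return math.ceil(file_mb / split_mb)


print(num_splits(500))  # 4
print(num_splits(250))  # 2
```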
We can apply 2 types of indexing in this case:
1. With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full scan query.
2. With InputSplit-based indexing, you will end up with only the InputSplits that contain the value you are looking for (4 of the 7 available in the linked article's example). The performance should definitely be better than doing a full scan query.
To implement an InputSplit-based index, we need to perform the following steps:
1. Build the index from your full data set. This can be achieved by writing a MapReduce job that extracts the value we want to index and outputs it together with the MD5 hash of its InputSplit.
2. Get the InputSplit(s) for the indexed value you are looking for. The output of the MapReduce program will be reduced files (containing indices based on InputSplits), which are stored in HDFS.
3. Execute your actual MapReduce job on the indexed InputSplits only. Hadoop can do this because it retrieves the InputSplits to be used through FileInputFormat.class. We create our own IndexFileInputFormat class extending the default FileInputFormat.class and override its getSplits() method. You have to read the file created in the previous step, add all your indexed InputSplits to a list, and then compare this list with the one returned by the superclass. You return to the JobTracker only the InputSplits that were found in your index.
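The logic of these three steps can be sketched without Hadoop. The Python below is only an illustration of the idea, not the linked article's code: splits are modelled as (path, start, length) tuples, the function names are hypothetical, and records_by_split stands in for what the indexing MapReduce job would read from each split.

```python
import hashlib
from collections import defaultdict


def split_id(path, start, length):
    """MD5 hash identifying one InputSplit (format of the hashed string is assumed)."""
    return hashlib.md5(f"{path}:{start}+{length}".encode()).hexdigest()


def build_index(records_by_split):
    """Steps 1 and 2: map every indexed value to the hashes of the splits containing it."""
    index = defaultdict(set)
    for split, values in records_by_split.items():
        for value in values:
            index[value].add(split_id(*split))
    return index


def filter_splits(all_splits, index, wanted_value):
    """Step 3: keep only the splits listed in the index, as an overridden getSplits() would."""
    wanted_ids = index.get(wanted_value, set())
    return [split for split in all_splits if split_id(*split) in wanted_ids]


splits = [("/data/f1", 0, 128), ("/data/f1", 128, 128), ("/data/f2", 0, 128)]
records = {splits[0]: ["alice", "bob"], splits[1]: ["carol"], splits[2]: ["bob"]}

index = build_index(records)
print(filter_splits(splits, index, "bob"))
# [('/data/f1', 0, 128), ('/data/f2', 0, 128)]
```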
In the driver class, we then need to use our custom IndexFileInputFormat by setting it as the input format class:
job.setInputFormatClass(IndexFileInputFormat.class);
For a code sample and other details, refer to:
https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/
We can identify 2 different levels of granularity for creating indices: an index based on file URI or an index based on InputSplit. Let's take 2 different examples of a data set.
First example:
The 2 files in your data set fit in 25 blocks and have been identified as 7 different InputSplits. The target you are looking for (grey highlighted) is available in file #1 (blocks #2, #8 and #13) and in file #2 (block #17).
With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full scan query.
With InputSplit-based indexing, you will end up with 4 of the 7 available InputSplits. The performance should definitely be better than doing a full scan query.
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (grey highlighted) is now available in file #1 (blocks #1, #2, #3 and #4).
With file-based indexing, you will end up with only 1 file from your data set.
With InputSplit-based indexing, you will end up with 1 of the 7 available InputSplits.
For this specific study, I decided to use a custom InputSplit-based index. I believe this approach is a good balance between the effort it takes to implement, the added value it might bring in terms of performance optimization, and its expected applicability regardless of the data distribution.

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a plain Java application that reads a file (8 GB in size) from the Hadoop file system and applies some rules to that data. After applying the rules we get a Java HashMap (which is huge) and keep it in a cache or buffer. At the same time we fetch data from Hive with a query and build another Java HashMap, which is again huge. We then compare the data in both HashMaps to prepare a final report that checks data accuracy.
Because we are using a plain Java program for this, we are facing the following problems:
1. Processing this much data takes ages. The input file contains tens of millions of records, and we need to apply rules to each row to extract the data, so the job takes days to complete. Hive contains the same amount of data, and the query takes too long to return it.
2. Since we keep the data in a buffer, we run into memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in MapReduce?
How can I increase the application's performance by using MapReduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large data sets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization, watch Join Strategies in Hive.
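The core idea behind a sort-merge comparison can be shown in a few lines. This is a plain single-process sketch, not Hive's SMB join implementation: it assumes both inputs are already sorted by key with unique keys, and it walks them in step so that neither side has to be held in a HashMap. The function name and the shape of the report are made up for illustration.

```python
def merge_compare(left, right):
    """Compare two key-sorted streams of (key, value) pairs.

    Returns keys found only on the left, only on the right,
    and keys present on both sides with different values.
    """
    only_left, only_right, mismatched = [], [], []
    left_iter, right_iter = iter(left), iter(right)
    l, r = next(left_iter, None), next(right_iter, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            if l[1] != r[1]:
                mismatched.append(l[0])
            l, r = next(left_iter, None), next(right_iter, None)
        elif l[0] < r[0]:
            only_left.append(l[0])
            l = next(left_iter, None)
        else:
            only_right.append(r[0])
            r = next(right_iter, None)
    # Drain whichever side still has rows.
    while l is not None:
        only_left.append(l[0])
        l = next(left_iter, None)
    while r is not None:
        only_right.append(r[0])
        r = next(right_iter, None)
    return only_left, only_right, mismatched


file_rows = [(1, "a"), (2, "b"), (4, "d")]
hive_rows = [(2, "b"), (3, "c"), (4, "x")]
print(merge_compare(file_rows, hive_rows))  # ([1], [3], [4])
```

Because only one row per side is in memory at a time, the inputs can be generators reading from files or query results.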
