BigQuery table data performance

In BigQuery I have a table storing 237 GB of data. I don't have any column on which I can create a partition, as the table does not store any date fields.
When I use it in a query, the estimate says 77 GB of data will be processed, but under bytes shuffled I see 7 GB.
What is the actual amount of data processed here?
Is there any way I could restructure this table?

BigQuery operates column-wise. If you only select the columns you really need in a query, then you're already optimizing cost. Traditional databases operate row-wise, so this can be a bit counter-intuitive.
There's also this great blog article on optimizing for costs.
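As a minimal sketch of how to check this yourself, a dry run with the google-cloud-bigquery client reports the bytes that would be billed before anything runs; the project, dataset, and column names below are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # A dry run makes BigQuery estimate bytes processed without running the query.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    # Only the selected columns' storage is scanned, so narrowing this
    # column list directly lowers the bytes-processed estimate (and the cost).
    query = """
        SELECT order_id, amount
        FROM `my-project.my_dataset.big_table`
        WHERE amount > 100
    """

    job = client.query(query, job_config=job_config)
    print(f"This query will process {job.total_bytes_processed / 1e9:.1f} GB.")

In other words, the 77 GB estimate (the columns the query reads) is what drives on-demand cost; the 7 GB shuffled is an execution statistic about data moved between workers, not a billing figure.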

Related

All else held equal, which is the fastest querying option: Milvus, RocksDB, or Apache HBase?

I have a requirement to store billions of records (with capacity up to one trillion records) in a database (total size is in terms of petabytes). The records are textual fields with about 5 columns representing transactional information.
I want to be able to query data in the database incredibly quickly, so I was researching Milvus, Apache HBase, and RocksDB. Based on my research, all three are incredibly fast and work well with large amounts of data. All else equal, which of these three is the fastest?
What type of data are you storing in the database?
Milvus is used for vector storage and computation.
If you want to search by the semantics of the text, Milvus is the fastest option.
HBase and RocksDB are both key-value databases.
If you want to search by key columns, these two will be faster.
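To make the access-pattern difference concrete, here is a minimal key-value sketch using the python-rocksdb bindings (assumed installed); the keys and payload are made up. A store like this answers "give me the row for this key" in one lookup, but cannot search by the content's meaning, which is where Milvus differs:

    import rocksdb  # assumed: python-rocksdb bindings installed

    db = rocksdb.DB("txns.db", rocksdb.Options(create_if_missing=True))

    db.put(b"txn:0001", b'{"amount": 42, "currency": "USD"}')
    print(db.get(b"txn:0001"))  # fast point read by key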

NiFi HBase data insertion taking more space than original data

I am doing real-time data transformation using NiFi, and after processing the data is stored in HBase. I am using PutHBaseJSON to store the data, with a UUID as the row key/id. But while the original size of a single JSON record, per NiFi data provenance or an online tool, is 390 bytes, 15 million records take up 55 GB, which works out to about 3.9 KB per record.
So I am not seeing how the data is stored, why the size stored in HBase is so much larger than the original data, and how I can reduce or optimize it in both HBase and NiFi (if any changes are required).
JSON:
{"_id":"61577d7aba779647060cb4e9","index":0,"guid":"c70bff48-008d-4f5b-b83a-f2064730f69c","isActive":true,"balance":"$3,410.16","picture":"","age":40,"eyeColor":"green","name":"Delia Mason","gender":"female","company":"INTERODEO","email":"deliamason#interodeo.com","phone":"+1 (892) 525-3498","address":"682 Macon Street, Clinton, Idaho, 3964","about":"","registered":"2019-09-03T06:00:32 -06:-30"}
Steps to reproduce in NiFi:
GenerateFlowFile ---> PutHBaseJSON (UUID row key)
Update 1: [screenshot of the data as stored in HBase]
I think the main thing you are being surprised by is that HBase stores each column of a table as an individual cell.
Suppose your UUID is 40 characters on average, fields 1, 2 and 3 are each 5 characters on average, and HBase adds a timestamp of length 15.
Originally you would have an amount of data of size 40+5+5+5+15 = 70.
After storing it cell by cell as in your screenshot, with three columns it becomes 3*(40+5+15) = 180, and this effect gets stronger if you have smaller or more fields.
I got this understanding from your screenshot, but also from this article: https://dzone.com/articles/how-to-improve-apache-hbase-performance-via-data-s
Now the obvious way forward, if you want to reduce your footprint, is to reduce the overhead. I believe the article recommends serialization, but perhaps it would also simply be possible to put the entire JSON body into one column, depending on how you plan to access it.
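A minimal sketch of the one-column variant, using the happybase client (assumed: an HBase Thrift server reachable on localhost; the table and column family names are hypothetical):

    import json
    import uuid

    import happybase  # assumed: HBase Thrift server running locally

    connection = happybase.Connection("localhost")
    table = connection.table("events")  # hypothetical table with one family "d"

    record = {"_id": "61577d7aba779647060cb4e9", "age": 40, "isActive": True}

    # One cell per row: the row key, column name, and timestamp overhead
    # is paid once per record instead of once per JSON field.
    table.put(
        uuid.uuid4().bytes,  # 16-byte row key instead of a 36-character UUID string
        {b"d:json": json.dumps(record).encode("utf-8")},
    )

The trade-off is that you can no longer filter or update individual fields server-side; this layout only makes sense if you always read and write the record whole.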

How do we optimise the Spark job if the base table has 130 billion records

We are joining multiple tables and doing complex transformations and enrichments.
The base table has around 130 billion records. How can we optimise the Spark job, given that Spark filters all the records, keeps them in memory, and does the enrichments via left outer joins with the other tables? Currently the Spark job runs for more than 7 hours; can you suggest some techniques?
Here is what you can try; a short PySpark sketch of some of these points follows this list.
Partition your base tables on the columns you use during joins, such as a department or date column. If the underlying table is Hive, you can also try bucketing.
Try optimised joins that suit your requirement, such as sort-merge join or hash join.
File format: use the Parquet file format, as it is much faster than ORC for analytical queries, and it also stores data in a columnar format.
If your query has multiple steps and some steps are reused, try caching, as Spark supports both memory and disk caching.
Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how the data is distributed, and try various configurations to see what works best for you.
Spark may perform poorly if there is large skew in the data; if that is the case, you may need further optimisation to handle it.
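A minimal PySpark sketch of the filter-early, broadcast-join, and caching points above; the paths, table sizes, and column names are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("enrichment-job")
        # Raise shuffle parallelism above the default of 200 for a huge join.
        .config("spark.sql.shuffle.partitions", "2000")
        .getOrCreate()
    )

    base = spark.read.parquet("/data/base")         # hypothetical ~130B-row table
    dept = spark.read.parquet("/data/departments")  # small dimension table

    # Filter before the join so less data is shuffled.
    base = base.filter(F.col("status") == "ACTIVE")

    # Broadcasting a small table turns the join into a map-side hash join,
    # avoiding a shuffle of the 130B-row side entirely.
    enriched = base.join(F.broadcast(dept), on="dept_id", how="left")

    # Cache only when the intermediate result feeds several later steps.
    enriched.cache()
    enriched.write.mode("overwrite").parquet("/data/enriched")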
Apart from the techniques mentioned above, you can try the options below as well to optimise your job.
1. You can partition your data by inspecting your data fields. The most common columns used for partitioning are date columns, region ID, country code, etc. Once the data is partitioned, you can call df.explain() on your DataFrame and check whether it is using a PartitioningAwareFileIndex.
2. Try tuning the Spark settings and cluster configuration to scale with the input data volume (a sketch of these two points follows this answer):
Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains by changing this parameter.
Use an appropriate number of executors, cores, and executor memory based on the compute need.
Try analysing the Spark history UI to identify the stages and jobs which consume significant time. This would be a good point to start debugging your job.
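For illustration, a minimal sketch of both checks, assuming a Parquet dataset written out partitioned by a hypothetical country_code column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tuning-check").getOrCreate()

    # Raise the input split size from the 128 MB default to 256 MB.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

    # Hypothetical dataset, partitioned on disk by country_code.
    df = spark.read.parquet("/data/base_partitioned")

    # The physical plan should show a partition-aware file index with a
    # PartitionFilters entry, confirming the filter prunes partitions
    # instead of scanning all of them.
    df.filter(df.country_code == "US").explain()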

How to deal with Hive partitioning for performance versus over-partitioning

We have a very large Hadoop dataset having more than a decade of historical transaction data - 6.5B rows and counting. We have partitioned it on year and month.
Performance is poor for a number of reasons. Nearly all of our queries can be further qualified by customer_id, as well, but we have 500 customers and growing quickly. If we narrow the query to a given month, we still need to scan all records just to find the records for one customer. The data is stored as Parquet now, so the main performance issues are not related to scanning all of the contents of a record.
We hesitated to add a partition on customer, because if we have 120 year-month partitions and 500 customers in each, that makes 60K partitions, which is more than the Hive metastore can effectively handle. We also hesitated to partition only on customer_id, because some customers are huge and others tiny, so we have a natural data skew.
Ideally, we would be able to partition historical data, which is used far less frequently, using one rule (perhaps year + customer_id), and current data using another (like year/month + customer_id). We have considered using multiple datasets, but managing this over time seems like more work, more changes, and so on.
Are there strategies, or capabilities of Hive that provide a way to handle a case like this where we "want" lots of partitions for performance, but are limited by the metastore?
I am also confused about the benefit of bucketing. A suitable bucketing scheme based on customer_id, for example, would seem to help in a similar way to partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation of why). Several other pages suggest bucketing is useful for sampling. Another good discussion of bucketing from Hortonworks indicates that Hive cannot do pruning with buckets the same way it can with partitions.
We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).
In reality, 60K partitions is not a big problem for Hive. I have experience with about 2MM partitions for one Hive table and it works pretty fast. You can find some details at https://andr83.io/1123. Of course, you need to write your queries carefully. I can also recommend using the ORC format with its index and bloom filter support.
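For illustration, a hedged sketch of the month-partitioned, customer-bucketed layout discussed above; the table and column names are hypothetical, and the DDL is issued through a Hive-enabled Spark session:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("partition-layout")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Partition on year_month for pruning; bucket on customer_id so one
    # customer's rows cluster together without exploding the partition
    # count. Note: some Spark versions reject Hive bucketed DDL; if so,
    # run the same statement from beeline / the Hive CLI instead.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS transactions (
            txn_id      BIGINT,
            customer_id INT,
            amount      DECIMAL(18, 2)
        )
        PARTITIONED BY (year_month STRING)
        CLUSTERED BY (customer_id) INTO 64 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('orc.bloom.filter.columns' = 'customer_id')
    """)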

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a normal Java application where we read a file (8 GB in size) from the Hadoop file system and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge) and keep that data in a cache or buffer. At the same time we get data from Hive by running a query against it and prepare another Java HashMap, which is again huge. Now we compare the two HashMaps to prepare a final report that checks the data accuracy.
In the above process, since we are using a normal Java program to do the work, we are facing the problems below.
Processing this huge data takes ages: the input file contains tens of millions of records, and we need to apply rules to each row to extract the data, so the job takes days to complete. At the same time Hive contains the same amount of data, and the query takes too much time to return it.
Since we are keeping the data in a buffer, we are facing memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in MapReduce?
How can I increase the application performance by using MapReduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
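As a sketch of the same idea, here is the Spark analogue of an SMB join applied to the comparison task (not the Hive mechanics themselves); the table, key, and payload names are hypothetical:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dataset-compare")
        .enableHiveSupport()
        .getOrCreate()
    )

    file_df = spark.read.parquet("/data/file_records")  # hypothetical inputs
    hive_df = spark.table("source_db.hive_records")

    # Write both sides bucketed and sorted on the join key; Spark can then
    # sort-merge join them without re-shuffling either side, the same idea
    # as Hive's SMB join.
    for df, name in [(file_df, "file_data"), (hive_df, "hive_data")]:
        (df.write
           .bucketBy(256, "record_key")
           .sortBy("record_key")
           .mode("overwrite")
           .saveAsTable(name))

    report = spark.sql("""
        SELECT f.record_key,
               CASE WHEN h.record_key IS NULL   THEN 'missing_in_hive'
                    WHEN f.payload <> h.payload THEN 'mismatch'
                    ELSE 'ok' END AS status
        FROM file_data f
        LEFT JOIN hive_data h ON f.record_key = h.record_key
    """)
    report.groupBy("status").count().show()

This replaces the two in-memory HashMaps with a distributed join, so nothing has to fit in one JVM's heap.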
