Looking for Delta table indexes as my primary key - performance

I am brand new to Databricks, and Delta tables were presented to me as having 4 main features (QRPG):
Quality
Reliability through ACID transactions
Performance through indexing
Governance through table ACLs and Unity Catalog
I want to be able to use these generated indexes at least as my primary key, but so far I could not find any way to see or access the indexes that are supposed to improve performance.
Please kindly help

It really depends on what you mean by "able to use these generated indexes". There are a few Delta features that fit under the umbrella of "indexing":
Data skipping - the ability to store min/max statistics in the Delta table transaction log, so that when you read the data, Delta can skip files that don't contain a specific value. In combination with OPTIMIZE ... ZORDER BY it allows better skipping, because related data is stored close together. Data skipping works best with numeric and date/time columns, and short strings. But it may not help when the value you search for falls inside a file's min/max range. For example, if a file has a min of 0 and a max of 10, and you search for the value 5, data skipping won't help, and you still need to read the file to find out whether it contains a row with the value 5.
Bloom filters - this is closer to "traditional indexing": for each file there is an additional data structure that lets you check whether your value is definitely not in the file, or maybe is in the file. Bloom filters allow file reads to be skipped more efficiently because they check for specific values.
I believe that in your case bloom filters could be the best fit if you search by a "primary key" (see the sketch below).
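As a rough sketch (meant for a Databricks notebook where spark is already defined; the table and column names are made up for illustration), Z-ordering plus a bloom filter index on the column you treat as a primary key could look like this:

    # Databricks notebook sketch -- table/column names are hypothetical.
    # Cluster related values together so min/max data skipping works better:
    spark.sql("OPTIMIZE my_db.events ZORDER BY (customer_id)")

    # Add a bloom filter index on the "primary key" column, so point lookups
    # can skip files that definitely don't contain the value:
    spark.sql("""
        CREATE BLOOMFILTER INDEX ON TABLE my_db.events
        FOR COLUMNS (customer_id OPTIONS (fpp = 0.1, numItems = 50000000))
    """)

    # A lookup like this can then benefit from both mechanisms:
    spark.sql("SELECT * FROM my_db.events WHERE customer_id = '12345'").show()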

Related

Filter a Data Source from a Different Data Source

I have two chart tables, each with a different data source. I want one table to act as the filter for the other table.
Here is the problem...
I tried a custom query for my data source which used the email parameter to filter the data source.
The problem is every time a user changes a filter on any page a query is executed in BigQuery, slowing the results and exponentially increasing my BigQuery monthly charges.
I tried blending the two tables.
The problem is the blended data feature only allows for 10 dimensions to be added to the resulting blended data source and is very slow.
I tried creating a control filter using a custom field on the "location" column on each table sharing the same "Field Id".
The problem is that the results table returns all the stores until you click on a location in the control list. And I cannot let a user see other locations.
Here is a link to a Data Studio sample report where you can clearly see what I am trying to do.
https://datastudio.google.com/reporting/dd33be45-ab13-4881-8a3b-cabafa8c0dbb
Thanks
One solution I can recommend to overcome your first challenge (i.e. high cost): you can reduce cost by using GCP Memorystore, depending on how frequently the data is updated.
Moreover, BigQuery also caches results for a query if you are not using wildcards on tables or time-partitioned tables. So try to optimize your solution for analysis cost if that is feasible. BigQuery partitioning and clustering may also help you reduce BQ analysis cost (a sketch follows below).
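For reference, a minimal sketch of creating a time-partitioned, clustered table with the BigQuery Python client (the project, dataset, table and field names below are just placeholders for your report's data source):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder schema -- adapt to the data behind your Data Studio report.
    table = bigquery.Table(
        "my-project.my_dataset.store_events",
        schema=[
            bigquery.SchemaField("location", "STRING"),
            bigquery.SchemaField("email", "STRING"),
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("value", "FLOAT"),
        ],
    )
    # Partition by date and cluster on the columns you filter on most often,
    # so each dashboard query scans (and bills for) less data.
    table.time_partitioning = bigquery.TimePartitioning(field="event_date")
    table.clustering_fields = ["location", "email"]
    client.create_table(table)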

HBase time-series data format: using a composite key vs. using versioning with timestamps

I would like to store the log of byte counters for 10 million LAN devices.
Each device reports its byte counter value every 15 minutes (96 samples/day), and each data sample has 500 columns. Each device is identified by its device serial, dev_sn.
At the end of the day, I will process the data (compute the total bytes per device) for all devices and store the results in Hive format.
The raw data would look like this (e.g. devices sn1, sn2, and sn3 report values at t1, t2, and t3):
Method 1: Use both dev_sn and timestamp as the composite row-key.
Method 2: Use dev_sn as the row-key and store each data as the version update of the existing values.
To find the total bytes,
Method 1: Scan by the sn1 prefix of the composite key, sort by time, and process the data
Method 2: Look up sn1, pull all the versions, and process the data
I think Method 2 is the better solution as it will create fewer row keys, but I am not sure if that is really the better approach. Some advice would be really helpful.
This is subjective, but I always opt for a composite row key over versioning, for the following reasons:
With a composite key you can store an unlimited number of "versions" per device. With versioning, the number of versions kept is limited (as set in configuration).
It's much easier to retrieve entries from specific timestamps/time ranges with an HBase command. Prefix scans are much easier to work with than the version API.
There's no reason for you to want to reduce the number of row keys - HBase is designed specifically to store huge numbers of row keys.
What if you need to delete last Tuesday's data? With versioning that's difficult, with composite keys it's a small piece of code.
As an aside, be sure to pre-split your table into regions so that the dev_sn values distribute evenly across region servers.
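As an illustration of the composite-key approach, here is a minimal sketch using the happybase Python client (the host, table name, column family and key format are assumptions, not something prescribed by HBase):

    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = connection.table("byte_counters")

    # Composite row key: <dev_sn>_<timestamp>, e.g. "sn1_20181231T0400".
    # A fixed-width timestamp keeps the keys sorted lexicographically by time.
    table.put(b"sn1_20181231T0400",
              {b"d:bytes_in": b"12345", b"d:bytes_out": b"678"})

    # End-of-day processing: a prefix scan pulls all of sn1's samples,
    # already time-ordered because of how the key sorts.
    total = 0
    for row_key, data in table.scan(row_prefix=b"sn1_"):
        total += int(data[b"d:bytes_in"])
    print(total)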

Load data from S3 to sort and allow timeline analysis

I'm currently trying to find out the best architecture approach for my use case:
I have two S3 buckets (totally separate) which contain data stored in JSON format. Data is partitioned by year/month/day prefixes, and inside a particular day I can find, for example, hundreds of files for that date
(example: s3://mybucket/2018/12/31/file1,
s3://mybucket/2018/12/31/file2, s3://mybucket/2018/12/31/file..n)
Unfortunately, inside a particular prefix for a single day, across those tens or hundreds of files the JSONs are not ordered by exact timestamp - so, following this example:
s3://mybucket/2018/12/31/
I can find:
file1 - which contains JSON about object "A" with timestamp "2018-12-31 18:00"
file100 - which contains JSON about object "A" with timestamp "2018-12-31 04:00"
What's even worse: I have the same scenario with my second bucket.
What I want to do with this data?
Gather my events from both buckets, grouped by the object's "ID" and sorted by timestamp, to visualize them in a timeline as the last step (which tools and how is out of scope).
My doubts are more about how to do it:
In a cost-efficient way
Cloud native (in AWS)
With the smallest possible maintenance
What I was thinking of:
Not sure about this, but: load every new file that arrives on S3 into DynamoDB (using a Lambda trigger). AFAIK, creating the table the proper way - with my ID as the hash key and the timestamp as the range key - should work for me, correct? (See the sketch after this list.)
Every new row inserted would be partitioned by its particular ID and already ordered correctly - but I'm not an expert.
Use Logstash to load the data from S3 into Elasticsearch - again, AFAIK everything in ES can be indexed, and therefore sorted. Timelion will probably allow me to do the fancy analysis I need. But again... not sure if ES will perform as I want... the price... the volume is big, etc.
??? No other ideas
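To make option 1 concrete, here is a rough boto3 sketch of the table layout I have in mind (table and attribute names are only placeholders, not tested):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")

    # "ID" as the partition (hash) key and "Timestamp" as the sort (range) key,
    # so all items for one ID come back already ordered by time.
    dynamodb.create_table(
        TableName="events",
        KeySchema=[
            {"AttributeName": "ID", "KeyType": "HASH"},
            {"AttributeName": "Timestamp", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "ID", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "S"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )

    # Front end: fetch one object's events, sorted by the range key.
    table = dynamodb.Table("events")
    response = table.query(KeyConditionExpression=Key("ID").eq("A"))
    items = response["Items"]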
To help you understand my need and show a bit of the data structure, I prepared this: :)
example of workflow
Volume of data?
Around +/- 200,000 events - each event is a JSON with 4 features (ID, Event_type, Timestamp, Price)
To summarize:
I need to put the data somewhere effectively, minimizing cost, and sorted, so that in the next step a front end can present how events change over time - filtered by a particular "ID".
Thanks, and I appreciate any good advice, best practices, or solutions I can rely on! :)
#John Rotenstein - you are right, I absolutely forgot to add those details. Basically I don't need any SQL functionality, as the data will not be updated. The only scenario is that a new event for a particular ID arrives, so only new incremental data. Based on that, the only operation I will do on this dataset is "select". That's why I would prefer speed and instant answers. People will look at this mostly per "ID" - so using filtering. Data arrives on S3 every 15 minutes (new files).
#Athar Khan - thanks for the good suggestion!
As far as I understand this, I would choose the second option of Elasticsearch, with Logstash loading the data from S3, and Kibana as the tool to investigate, search, sort and visualise.
Having a Lambda push data from S3 to DynamoDB would probably work, but might be less efficient and cost more, as you are running a compute process on each event while pushing to Dynamo in small/single-item batches. Logstash, on the other hand, would read the files one by one and process them all. It also depends on how often you plan to load fresh data to S3, but both solutions should fit.
The fact that the timestamps are not ordered within the files is not an issue for Elasticsearch: you can index them in any order, and you will still be able to visualise and search them in Kibana in time-sorted order.
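For example, once Logstash has indexed the events, a time-sorted lookup per ID is a simple query. A sketch with the Elasticsearch Python client (8.x-style arguments; the index name and field mapping are assumptions about how your documents would land):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

    # All events for one object ID, sorted by timestamp, regardless of the
    # order in which the source files were indexed.
    response = es.search(
        index="events-*",
        query={"term": {"ID.keyword": "A"}},
        sort=[{"Timestamp": {"order": "asc"}}],
        size=1000,
    )
    for hit in response["hits"]["hits"]:
        print(hit["_source"]["Timestamp"], hit["_source"]["Price"])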

HBASE versus HIVE: What is more suitable for data that is uniquely defined by multiple fields?

We are building a DB infrastructure on top of Hadoop systems. We will be paying a vendor to do that, and I do not think we are getting the right answers from the first vendor. So I need help from some experts to validate whether I am right or I am missing something.
1. We have about 1600 fields in the data. A unique record is identified by those 1600 fields.
2. We want to be able to search records in a particular timeframe
(i.e., records for a given time frame).
3. There are some fields that change over time (monthly).
The vendor stated that the best way to go is HBase and that they have two choices: (1) optimize the search for machine learning, or (2) make ad hoc queries.
Option (1) will require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
I do not think this is correct.
1. We do not need to use HBase. We can use Hive.
2. We do not need to concatenate field names. We can translate those into a number and use a numeric key.
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what your use case is. In simpler terms, Hive alone is not good when it comes to interactive queries; however, it is one of the best when it comes to analytics.
HBase, on the other hand, is really good for interactive queries; however, doing analytics is not as easy as with Hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 fields
HBase:
HBase is a NoSQL, columnar database which stores information in a map (dictionary) like format, where each row needs one value that uniquely identifies it. This is called the row key.
You can also have a key that is a combination of multiple columns if you don't have a single column that uniquely identifies the row, and then you can search records using a partial key. However, this will affect performance (compared to having a single-column key).
Hive:
Hive has a SQL-like language (HQL) to query data on HDFS, which you can use for analytics. However, it doesn't require any primary key, so you can insert duplicate records if required.
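For instance, an analytical timeframe query in Hive needs no key at all. A minimal sketch with the pyhive client (the HiveServer2 host and the table/column names are made up):

    from pyhive import hive

    # Hypothetical HiveServer2 endpoint and table/column names.
    conn = hive.Connection(host="hive-server", port=10000, database="default")
    cursor = conn.cursor()

    # Records for a given time frame -- no primary key required.
    cursor.execute(
        "SELECT * FROM records WHERE record_date BETWEEN '2023-01-01' AND '2023-01-31'"
    )
    for row in cursor.fetchall():
        print(row)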
The vendor stated that the best way to go is HBase and that they have two choices: (1) optimize the search for machine learning, or (2) make ad hoc queries. Option (1) will require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
In a way your vendor is correct, as I explained earlier.
We do not need to use HBase. We can use Hive. 2. We do not need to concatenate field names. We can translate those into a number and use a numeric key. 3. I do not think we need to choose one or the other.
Whether you should use HBase or Hive depends on your use case. However, if you are planning to use Hive, then you don't even need to generate a pseudo key (the row numbers you are talking about).
There is one more option if you have a Hortonworks deployment: consider Hive for analytics and Hive LLAP for interactive queries.

HBase filters - do they perform well?

In my case, we defined the row key for the initial set of queries; we are querying against the row key and leaving the column family and columns alone.
e.g. the row key is something like:
%userid%_%timestamp%
We are doing queries like:
select columnFamily{A,B,C} from userid=blabla and blabla < timestamp < blabla
The performance is pretty OK, because that's what HBase is built for - row key lookups.
But as new requirements build up, we will need to query against more fields - the columns - like:
select * from userid=blabla and blabla < timestamp < blabla and A=blabla and B=blabla and C=blabla
We started using HBase filters. We tried an EqualFilter on one of the columns (A), and it works OK from a functionality point of view.
I have a general concern here: given the row key we have,
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
2) How dramatic is the impact if there is one?
3) Can somebody explain to me how we should use HBase filters best from a performance perspective?
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
Yes, you can do this. It will affect performance depending on the size of the data set and which filters you are using.
2) how dramatic is the impact if there is one?
The less data you return the better. You don't want to fetch data that you don't need. Filters help you return only the data that you need.
3) Can somebody explain to me how we should use the best of hbase filters from performance perspective?
It is best to use filters such as prefix filters, filters that match a specific value exactly (or a specific qualifier, column, etc.), or ones that do a greater-than/less-than comparison on the data. These types of filters do not need to look at all the data in each row or table to return the proper results. Avoid regex filters, because the regex expression must be evaluated on every piece of data the filter looks at, and that can be taxing over a large data set.
Also, Lars George, the author of the HBase book, mentioned that people are moving more toward coprocessors than filters. You might also want to look at coprocessors.
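As a rough sketch of what this looks like in practice, using the happybase Python client and the HBase filter-string syntax (the host, table, column family and values are placeholders):

    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = connection.table("events")

    # The row-key range does the heavy lifting (userid + time window); the
    # SingleColumnValueFilter then drops rows where column A != 'blabla'.
    rows = table.scan(
        row_start=b"user42_1546300800",
        row_stop=b"user42_1546387200",
        filter="SingleColumnValueFilter('cf', 'A', =, 'binary:blabla')",
    )
    for key, data in rows:
        print(key, data)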
1) Can we just keep adding filters against all columns A, B, C to meet different query needs? Does the number of filters added to the HBase query slow down read performance?
Yes, you can add filters for all columns, but it will surely affect the performance of your query if you have a huge amount of data stored.
Try to avoid column filters, because whenever you add a column filter you are ultimately increasing the number of comparisons made on column values.
2) how dramatic is the impact if there is one?
Filters help you reduce your result set, so you fetch only the data you need.
3) Can somebody explain to me how we should use the best of hbase filters from performance perspective?
In HBase, row filters (which include prefix filters) are the most efficient filters because they don't need to look at every record. So design your row key so that it includes the components you need to query on frequently.
Value filters are the most inefficient filters because they have to compare the values of the columns.
With HBase filters, the sequence of filters matters: if you have multiple filters to add to a FilterList, then the order in which they are added will have an impact on performance.
I will explain with an example:
If you need three different filters added to a query, then once the first filter is applied, the next filter has a smaller data set to query against, and the same goes for the third one.
So try to add the most efficient filters first, i.e. row-key-related filters, and the others after that.
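To illustrate the ordering point with the same filter-string syntax (assuming the parsed filter list keeps the written order; all names are placeholders), put the cheap row-level filter before the column-value comparison:

    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
    table = connection.table("events")

    # The cheap PrefixFilter narrows the rows first; the more expensive
    # SingleColumnValueFilter only compares values on what is left.
    rows = table.scan(
        filter="PrefixFilter('user42_') AND "
               "SingleColumnValueFilter('cf', 'A', =, 'binary:blabla')"
    )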

Resources