Match massive numbers of data records against thousands of queries - Hadoop

I have a lot of data records (about 1.5 billion) and a lot of queries (about 10 thousand).
Each record can match multiple queries (whether it matches can be determined by evaluating the query against the record).
The records are stored in a distributed database. Each record has a field that stores the IDs of the queries matching it.
I can scan all the records in about 15 minutes (doing nothing with the data).
For each record, I want to mark it with the IDs of the queries it matches, without a big delay (e.g., one hour). Is there a good algorithm for this? Evaluating every query against every record one at a time is not a solution; I think some kind of indexing is needed. Please help! Thanks!

Apache Pig has multiquery switched on by default. If your queries share the same data source, then Pig will execute them in parallel, so that the input data is read only once.
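If you are not on Pig, the same read-the-data-once idea can be sketched in plain MapReduce. The mapper below is a minimal, hypothetical sketch: it compiles all queries once per mapper and evaluates each record against all of them in a single pass. The two hard-coded substring predicates are placeholders for whatever your real query evaluation looks like.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: tag each record with the IDs of all queries it matches,
    // reading the record set only once (the same property Pig's
    // multiquery optimization gives you).
    public class QueryTagMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final List<String> queryIds = new ArrayList<>();
        private final List<Predicate<String>> predicates = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            // In a real job the ~10k queries would be loaded from the
            // distributed cache and compiled once per mapper; these two
            // hard-coded substring predicates are placeholders.
            queryIds.add("q1");
            predicates.add(r -> r.contains("error"));
            queryIds.add("q2");
            predicates.add(r -> r.contains("timeout"));
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Evaluate every compiled query against this record.
            StringBuilder matches = new StringBuilder();
            for (int i = 0; i < predicates.size(); i++) {
                if (predicates.get(i).test(record.toString())) {
                    if (matches.length() > 0) matches.append(',');
                    matches.append(queryIds.get(i));
                }
            }
            // Emit only records that matched at least one query.
            if (matches.length() > 0) {
                context.write(record, new Text(matches.toString()));
            }
        }
    }

Per record, the cost is then dominated by evaluating ~10k predicates; this is where the indexing idea from the question would pay off, e.g. grouping queries that share predicates so one test can rule out many of them at once.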

Related

DynamoDB total record count for pagination

I am planning to leverage AWS DynamoDB for one of our legacy applications. I have done the data modelling for persisting the data in DDB, and I have come up with a single table, as that is proving effective in my use case.
But there is one requirement where I need to show the total qualified record count for a query, for pagination.
Apart from scanning the whole table, is there anything out of the box to get the total qualified record count?
Thanks
You can use the DescribeTable API for that. It will return several JSON values, including the ItemCount you need.
This value may not be 100% up to date because of DynamoDB's NoSQL nature; it is updated roughly every six hours. If you need a live count, you have to scan the entire table, but Scan is also an eventually consistent operation.
If your question is about a count based on some condition, then no: you have to use Scan or Query, depending on how you want to implement the conditions.
More details:
https://docs.aws.amazon.com/cli/latest/reference/dynamodb/describe-table.html
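For reference, a minimal sketch of that call with the AWS SDK for Java v2; the table name is a placeholder:

    import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
    import software.amazon.awssdk.services.dynamodb.model.DescribeTableRequest;
    import software.amazon.awssdk.services.dynamodb.model.TableDescription;

    public class TableCount {
        public static void main(String[] args) {
            try (DynamoDbClient dynamoDb = DynamoDbClient.create()) {
                // DescribeTable returns table metadata, including ItemCount.
                TableDescription table = dynamoDb.describeTable(
                        DescribeTableRequest.builder()
                                .tableName("MyTable") // placeholder table name
                                .build())
                        .table();
                // ItemCount is refreshed roughly every six hours, so this
                // is an approximate count, not a live one.
                System.out.println("Approximate item count: " + table.itemCount());
            }
        }
    }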

How do we define an HBase row key so that records are retrieved efficiently when there are millions of records in the table

I have 30 million records in a table, but when I try to find one record it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I use an auto-increment ID (1, 2, 3, and so on) as the row key. What steps should I take to improve performance? Let me know your concerns.
Generally, when we tune a SQL structured table for performance, we follow some basic/general steps: apply proper indexes to the columns used in queries, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory for complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from context switching between hard disk and buffer, and between different servers. You need to make sure you reduce context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query structure and try to improve performance.
An integer row key gives the best performance, but always design the row key/index at the start, when the table is created; retrofitting it later kills performance.
When creating external tables in Hive/Impala against HBase tables, map the HBase row key to a string column in Hive/Impala. If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table; use BETWEEN or =, <, >= (a sketch follows after these notes).
If you are not using a filter against the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
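To make the row-key notes concrete, here is a minimal sketch with the HBase 2.x Java client. The table name and key values are placeholders, and the zero-padded keys are an assumption so that lexicographic order matches numeric order:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyLookups {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("my_table"))) {

                // Exact lookup by row key: the fast path HBase is built for.
                Result one = table.get(new Get(Bytes.toBytes("00001234")));
                System.out.println("found: " + !one.isEmpty());

                // BETWEEN-style access: a bounded scan reads only the
                // row-key range [00001000, 00002000), never the whole table.
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("00001000"))
                        .withStopRow(Bytes.toBytes("00002000"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }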

Hive bucketing - how to run a Hive query for a specific bucket

I have a Hive query which reads 5 large tables and outputs the records to the next process. All these tables are partitioned on proc_dt and bucketed on user_id (5 buckets). Joins are done on user_id and filtering on proc_dt.
How can I run this query for a specific bucket of all the tables? For example, I want to run the query for just the first bucket of each table.
The reason for doing this is that once I complete the query for the first bucket, I can send the output data to the next process. While the next process is running, I can complete the query for the next bucket, and so on. This way the next process is not waiting for the entire query to finish.
If I had one more column holding user_id mod 5, I would have gone for partitioning, but there is no such column and I cannot add one.
Could anyone please suggest a solution for this? Any suggestions will be really helpful.
I found the answer: you can specify the bucket number in the join query. Check the link below for more detail.
https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/
You can specify partitions within query statements, but not buckets. Buckets are used for optimization purposes, e.g. faster sampling and map-side joins, but they are not visible to SQL statements.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
So here is the documentation example:
CLUSTERED BY(user_id) INTO 256 BUCKETS;
This clearly does not permit access to individual buckets by value/name.

Compound rowkey in Azure Table storage

I want to move some of my Azure SQL tables to Table storage. As far as I understand, I can save everything in the same table, separating it using PartitionKey and keeping it unique within each partition using RowKey.
Now, I have a table with a compound key:
ParentId: (uniqueidentifier)
ReportTime: (datetime)
I also understand that RowKeys have to be strings. Will I need to combine these into a single string? Or can I combine multiple keys some other way? Do I need to create a new key, perhaps?
Any help is appreciated.
UPDATE
My idea is to take data from several (three for now) database tables and put it in the same storage table, separating them with the partition key.
I will query using the ParentId and a WeekNumber (another column). This table has about 1 million rows that are deleted weekly from the DB. My two other tables have about 6 million and 3.5 million rows.
This question is pretty broad and there is no single right answer.
The specific question - can you use compound keys with Azure Table storage - yes, you can. But this involves manually serializing/deserializing your object's properties. You can achieve that by overriding TableEntity's ReadEntity and WriteEntity methods. Check this detailed blog post on how to override these methods to use your own custom serialization/deserialization.
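To illustrate the combine-into-one-string approach with your two key parts, here is a small self-contained sketch. The underscore delimiter and the zero-padded epoch-millisecond encoding are my assumptions, chosen so that RowKeys within a partition sort chronologically:

    import java.time.Instant;
    import java.util.UUID;

    // Sketch of a compound RowKey: Azure Table RowKeys must be strings,
    // so ParentId and ReportTime are serialized into one value and split
    // apart again when reading. The underscore delimiter is an assumption;
    // a UUID never contains one, so the split is unambiguous.
    public class CompoundRowKey {

        private static final String DELIMITER = "_";

        static String toRowKey(UUID parentId, Instant reportTime) {
            // Zero-padded epoch millis keep lexicographic order equal to
            // chronological order within a partition.
            return parentId + DELIMITER
                    + String.format("%019d", reportTime.toEpochMilli());
        }

        static UUID parentIdOf(String rowKey) {
            return UUID.fromString(rowKey.substring(0, rowKey.indexOf(DELIMITER)));
        }

        static Instant reportTimeOf(String rowKey) {
            return Instant.ofEpochMilli(
                    Long.parseLong(rowKey.substring(rowKey.indexOf(DELIMITER) + 1)));
        }

        public static void main(String[] args) {
            String rowKey = toRowKey(UUID.randomUUID(), Instant.now());
            System.out.println(rowKey);
            System.out.println(parentIdOf(rowKey) + " at " + reportTimeOf(rowKey));
        }
    }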
I will go further and discuss my view on your broader question.
First of all, why do you want to put data from three (SQL) tables into one Azure table? Just have three Azure tables.
Second, as Fabrizio points out, think about how you are going to query the records. The Windows Azure Table service has only one index: the PartitionKey + RowKey pair of properties (columns). If you are sure you will mostly query data by a known PartitionKey and RowKey, then Azure Tables suits you perfectly. You say your RowKey combination is ParentId + WeekNumber; that means a record is uniquely identified by this combination. If that is true, you are even more ready to go.
Next, you say you are going to delete records every week. You should know that the Delete operation acts on a single entity. You can use entity group transactions to delete multiple entities at once, but with limits: (a) all entities in a batch operation must have the same PartitionKey, (b) the maximum number of entities per batch is 100, and (c) the maximum size of a batch operation is 4 MB. Say you have 1M records, as you state. To delete them, you first have to retrieve them in groups of 100, then delete them in groups of 100. In the best possible case that is 10k operations for retrieval and another 10k for deletion. Even if it only costs 0.002 USD, think about the time taken to execute that many operations against a REST API.
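A minimal sketch of that retrieve-then-delete loop, using entity group transactions via the azure-data-tables client for Java; the connection string, table name, and filter are placeholders:

    import java.util.ArrayList;
    import java.util.List;

    import com.azure.data.tables.TableClient;
    import com.azure.data.tables.TableClientBuilder;
    import com.azure.data.tables.models.ListEntitiesOptions;
    import com.azure.data.tables.models.TableEntity;
    import com.azure.data.tables.models.TableTransactionAction;
    import com.azure.data.tables.models.TableTransactionActionType;

    // Sketch: delete one week's entities in batches of 100. All entities
    // in one transaction must share a PartitionKey, which is assumed here
    // by filtering on a single partition.
    public class WeeklyDelete {
        public static void main(String[] args) {
            TableClient table = new TableClientBuilder()
                    .connectionString("<connection-string>") // placeholder
                    .tableName("Reports")                    // placeholder
                    .buildClient();

            // Retrieve the week's entities (one partition assumed).
            ListEntitiesOptions options = new ListEntitiesOptions()
                    .setFilter("PartitionKey eq 'week-2024-01'"); // placeholder

            List<TableTransactionAction> batch = new ArrayList<>();
            for (TableEntity entity : table.listEntities(options, null, null)) {
                batch.add(new TableTransactionAction(
                        TableTransactionActionType.DELETE, entity));
                if (batch.size() == 100) {     // batch limit is 100 entities
                    table.submitTransaction(batch);
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                table.submitTransaction(batch);
            }
        }
    }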
Since you have to delete entities on a regular basis, fixed to a WeekNumber let's say, I suggest that you create your tables dynamically and include the week number in the table name. That way you achieve:
Even better partitioning of information
Easier and more granular information backup / delete
Deleting millions of entities requires just one operation: delete the table.
There is no unique solution to your problem. Yes, you can use ParentId as PartitionKey and ReportTime as RowKey (or invert the assignment). But the two big questions are: how do you query your data, and with what conditions? And how much data do you store - 1,000 items, 1 million, 1,000 million? The total storage usage is important, but it is also very important to consider the number of transactions you will generate against the storage.

HBase filters - do they perform well?

In my case, we defined the row key for the initial set of queries; we query against the row key and leave the column family and columns alone.
E.g., the row key is something like:
%userid%_%timestamp%
and we run queries like:
select columnFamily{A,B,C} from userid=blabla and blabla < timestamp < blabla
The performance is pretty OK, because that is what HBase is built for: row-key lookups.
But as new requirements build up, we need to query against more fields - the columns - like:
select * from userid=blabla and blabla < timestamp < blabla and A=blabla and B=blabla and C=blabla
We started using HBase filters. We tried an EqualFilter on one of the columns, A, and it works OK from a functionality point of view.
I have a general concern here, given the row key we have:
1) Can we just keep adding filters against columns A, B, C to meet different query needs? Does the number of filters added to an HBase query slow down read performance?
2) How dramatic is the impact, if there is one?
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
1) Can we just keep adding filters against columns A, B, C to meet different query needs? Does the number of filters added to an HBase query slow down read performance?
Yes, you can do this. It will affect performance depending on the size of the data set and which filters you are using.
2) How dramatic is the impact, if there is one?
The less data you return, the better: you don't want to fetch data that you don't need, and filters help you return only the data you need.
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
It is best to use filters such as prefix filters, filters that match a specific value exactly (for a qualifier, column, etc.), or ones that do a greater-than/less-than comparison on the data. These types of filters do not need to look at all the data in each row or table to return the proper results. Avoid regex filters, because the regular expression must be evaluated against every piece of data the filter looks at, and that can be taxing over a large data set.
Also, Lars George, the author of the HBase book, has mentioned that people are moving more toward coprocessors than filters. You might also want to look at coprocessors.
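As a sketch of those recommendations, assuming the %userid%_%timestamp% row key from the question and a column family cf with qualifier A (all names and values are placeholders): the bounded row-key range does the cheap narrowing, and an exact-match value filter does the rest.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangePlusValueFilter {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {
                // The row-key range (userid + timestamp window) does the
                // heavy lifting; the value filter then narrows on cf:A.
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("user123_1600000000"))
                        .withStopRow(Bytes.toBytes("user123_1700000000"));
                SingleColumnValueFilter aEquals = new SingleColumnValueFilter(
                        Bytes.toBytes("cf"), Bytes.toBytes("A"),
                        CompareOperator.EQUAL, Bytes.toBytes("blabla"));
                aEquals.setFilterIfMissing(true); // exclude rows without cf:A
                scan.setFilter(aEquals);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }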
1) Can we just keep adding filters against columns A, B, C to meet different query needs? Does the number of filters added to an HBase query slow down read performance?
Yes, you can add filters for all the columns, but doing so will certainly affect query performance if you have a huge amount of data stored.
Try to avoid column value filters: every column filter you add increases the number of column-based comparisons.
2) How dramatic is the impact, if there is one?
Filters help you reduce your result set, so that you fetch only the data you need.
3) Can somebody explain how to make the best use of HBase filters from a performance perspective?
In HBase, row filters (including prefix filters) are the most efficient, because they do not need to look at every record. Design your row key so that it includes the components you query on most frequently.
Value filters are the least efficient, because they have to compare the values of the columns.
In HBase, the sequence of filters matters: if you add multiple filters to a FilterList, the order in which they are added has an impact on performance.
To explain with an example: if a query needs three different filters, then once the first filter is applied, the second filter has a smaller data set to work on, and the same holds for the third.
So add the most efficient filters first, i.e. row-key-related filters, and the others after them, as the sketch below shows.
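A minimal sketch of that ordering, again assuming the row key and columns from the question; a FilterList with MUST_PASS_ALL evaluates its filters in the order they were added, so the cheap row-key filter goes first:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OrderedFilters {

        private static SingleColumnValueFilter columnEquals(String qualifier,
                                                            String value) {
            SingleColumnValueFilter f = new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes(qualifier),
                    CompareOperator.EQUAL, Bytes.toBytes(value));
            f.setFilterIfMissing(true); // drop rows that lack the column
            return f;
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("events"))) {
                // Cheap row-key filter first, expensive value filters after.
                FilterList filters =
                        new FilterList(FilterList.Operator.MUST_PASS_ALL);
                filters.addFilter(new PrefixFilter(Bytes.toBytes("user123_")));
                filters.addFilter(columnEquals("A", "blabla"));
                filters.addFilter(columnEquals("B", "blabla"));

                Scan scan = new Scan().setFilter(filters);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result row : scanner) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }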
