Increase scan performance in Apache HBase - hadoop

I am working on a use case and need help improving scan performance.
Visits by customers to our website are recorded as logs, which we process with Apache Pig. The Pig output is inserted directly into an HBase table (test) using HBaseStorage. This runs every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family)
As of now I generate a random number for each row and insert it as the row key for that table. For example, I have the following data to be inserted into the table:
1725|xxx|www.something.com|127987834 | india |zzzz
1726|yyy|www.some.com|128389478 | UK | yyyy
In that case I use 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same ID is repeated on different days, so I chose a random number as the row key.
When I query the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1 to 2 seconds, since I am using it for real-time analytics.
Thanks

Based on the query you mentioned, I assume you need to fetch records by Customer ID. If that is correct, you should use the Customer ID as the row key to improve performance, because HBase stores rows sorted by row key and can seek straight to a key range instead of checking every row against a column filter.
However, there can be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. The unique number could be the timestamp. It depends on your requirements.
To scan the data in this case, use a PrefixFilter on the row key. This will give you better performance.
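To illustrate why the prefix approach is cheap, here is a minimal Python sketch that models the table as a sorted list of row keys. It is not HBase client code: `make_row_key`, `prefix_scan`, the zero-padded timestamp and the extra customer 17250 are assumptions made up for the example.

```python
from bisect import bisect_left


def make_row_key(customer_id: str, timestamp: int) -> str:
    # Hypothetical key layout: CustomerID|zero-padded timestamp, so that
    # keys for one customer sort chronologically.
    return f"{customer_id}|{timestamp:010d}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, touching only the matching range."""
    start = bisect_left(sorted_keys, prefix)  # seek to the first candidate key
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing further can match
        result.append(key)
    return result


# Rows are kept sorted by key, as HBase does for row keys.
rows = sorted([
    make_row_key("1725", 127987834),
    make_row_key("1726", 128389478),
    make_row_key("1725", 128400000),
    make_row_key("17250", 128500000),  # made-up customer sharing the digits 1725
])

# Including the delimiter in the prefix keeps customer 17250 out of the result.
matches = prefix_scan(rows, "1725|")
print(matches)
```

In the shell, the corresponding filter would be along the lines of scan 'test', {FILTER => "PrefixFilter('1725|')"}. It is probably worth also setting the start row to the same prefix so the scan begins at the matching range rather than at the start of the table.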
Hope this helps.

Related

What happens when two updates for the same record come in one file while loading into a DB using Informatica?

Suppose I have a table xyz:
id | name | add   | city | act_flg | start_dtm | end_dtm
1  | amit | abc,z | pune | Y       | 21012018  | null
This table is loaded from a file using Informatica with SCD2.
Suppose one file contains two records with id=2, i.e.
2 | vipul | abc,z | mumbai
2 | vipul | asdf  | bangalore
How will this be loaded into the DB?
It depends on how you are doing the SCD type 2. If you are using a lookup with a static cache, both records will be inserted with a null end date, because the cache is not updated while the file is being processed and neither record sees the other.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This ensures that one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
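The behaviour described above can be sketched in plain Python. This is only a model of the logic, not Informatica configuration: `DimRow`, `apply_scd2` and the load date "22012018" are hypothetical, and the `active` dict stands in for the dynamic lookup cache because it is updated as each record is processed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimRow:
    id: int
    name: str
    add: str
    city: str
    act_flg: str
    start_dtm: str
    end_dtm: Optional[str]


def apply_scd2(dimension, incoming, load_dtm):
    """Apply incoming records in order, expiring the active row for each id."""
    # Currently active row per id. It is refreshed after every insert,
    # which is the role a dynamic lookup cache plays.
    active = {row.id: row for row in dimension if row.end_dtm is None}
    for rec in incoming:
        current = active.get(rec["id"])
        if current is not None:
            # Expire the previously active version of this id.
            current.act_flg = "N"
            current.end_dtm = load_dtm
        new_row = DimRow(rec["id"], rec["name"], rec["add"], rec["city"],
                         "Y", load_dtm, None)
        dimension.append(new_row)
        active[rec["id"]] = new_row
    return dimension


dimension = [DimRow(1, "amit", "abc,z", "pune", "Y", "21012018", None)]
# The latest record for id=2 is read last.
incoming = [
    {"id": 2, "name": "vipul", "add": "abc,z", "city": "mumbai"},
    {"id": 2, "name": "vipul", "add": "asdf", "city": "bangalore"},
]
result = apply_scd2(dimension, incoming, load_dtm="22012018")
for row in result:
    print(row)
```

With a static cache the `active` lookup would never see the mumbai row, so both id=2 rows would stay open with a null end date.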
Hmm, one of two possibilities, depending on what you mean. If you mean that you're pulling data from different source systems which sometimes use the same ids, then it's easy: stamp both the natural key (i.e. the id) and a source system value on the dimension row, along with the arbitrary surrogate key that is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracking real-time changes to a single record in the source system and writing these changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate them based on the timestamp of the change and just pick the most recent one, or to create two records, one with its expiry datetime set and the other still open (which is the standard SCD approach; again, read Kimball).

How does HBase handle duplicate records?

I want to understand how HBase internally handles duplicate records from a file.
To experiment with this, I created an EXTERNAL table in Hive with HBase-specific configuration properties such as table properties, SERDE and column family.
I also had to create the table in HBase with the column family, which I did.
I then performed an insert overwrite into this Hive table from a source table which has duplicate records.
By duplicate records I mean something like this:
ID | Name | Surname
1 | Ritesh | Rai
1 | RiteshKumar | Rai
After performing the insert overwrite, I queried my Hive table for id 1 and got the second record as output:
1 RiteshKumar Rai
I want to understand how HBase decides which one wins. Does it simply write the data sequentially, so that the last record overwrites the earlier one and is considered the latest? Or does it work differently?
Thanks in advance.
Regards,
Govind
You are on the right track!
The HBase data model can be seen as a 'multidimensional map', and each cell value is associated with a timestamp (the insertion time by default):
row:column_family:column_qualifier:timestamp:value
NOTE: The timestamp is associated with each single value and not with the entire row (this enables several nice features)!
At read time you get the latest version by default unless you specify otherwise. How many versions are kept is set per column family (VERSIONS); the default was 3 in older releases and is 1 since HBase 0.96. HBase does a 'merge read' and returns the latest cell value for each column of the row.
Please try this from your HBase shell (not really tested before posting):
put 'table_name', '1', 'f:name', 'Ritesh'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:name', 'RiteshKumar'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:other', 'Some other stuff'
# Data on 'disk' (that might just be the memstore for now) will look like this:
# 1:f:name:1234567890:'Ritesh'
# 1:f:surname:1234567891:'Rai'
# 1:f:name:1234567892:'RiteshKumar'
# 1:f:surname:1234567893:'Rai'
# 1:f:other:1234567894:'Some other stuff'
# Now try this, and you will get 'RiteshKumar', 'Rai', 'Some other stuff':
get 'table_name', '1'
# To get the previous versions of the data as well, use the following
# (the column family must be configured to keep at least 2 versions):
get 'table_name', '1', {COLUMN => 'f', VERSIONS => 2}
Don't forget to take a look at the best practices for schema design.
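To make the 'multidimensional map' idea concrete, here is a small Python model of versioned cells. It is a toy, not HBase itself: the `VersionedTable` class, its counter-based timestamps and the `max_versions=3` setting are assumptions chosen to mirror the example above.

```python
from collections import defaultdict
from itertools import count


class VersionedTable:
    """Toy model: (row, column) -> list of (timestamp, value), newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # assumed per-family version limit
        self._cells = defaultdict(list)
        self._clock = count(1234567890)   # fake, strictly increasing timestamps

    def put(self, row, column, value):
        versions = self._cells[(row, column)]
        versions.insert(0, (next(self._clock), value))
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, versions=1):
        # Return up to `versions` newest values for each column of the row.
        return {
            col: [value for _, value in cells[:versions]]
            for (r, col), cells in self._cells.items()
            if r == row
        }


table = VersionedTable()
table.put("1", "f:name", "Ritesh")
table.put("1", "f:surname", "Rai")
table.put("1", "f:name", "RiteshKumar")
table.put("1", "f:surname", "Rai")
table.put("1", "f:other", "Some other stuff")

latest = table.get("1")
history = table.get("1", versions=2)
print(latest)
print(history)
```

The later put for f:name does not erase the earlier one. It simply carries a newer timestamp, so a default read returns it and the older value is only visible when more versions are requested.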

HBase row key design for reads and updates

I'm trying to understand the best way to design the key for my HBase table.
My use case:
Structure right now
PersonID | BatchDate | PersonJSON
When something about a person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, updating the old record. Every 4 hours, all the people who were modified are scanned and pushed to Hadoop for further processing.
If my key is just PersonID, it is great for updating the data. But scan performance is poor, because I have to add a filter on the BatchDate column to find all the rows newer than a given batch date.
If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since a person no longer maps to a single key, and I could no longer update a person in place.
Is a bloom filter on row+col (PersonID+BatchDate) an option?
Any help is appreciated.
Thanks,
Abhishek
In addition to the table with PersonID as the row key, it sounds like you need a dual-write secondary index with BatchDate as the row key.
Another option would be Apache Phoenix, which provides support for secondary indexes.
I usually do this in two steps:
Create table one, whose key is the combination BatchDate+PersonID. The value can be empty.
Create table two as you normally would. The key is PersonID and the value is the whole data.
For a date range query, query table one first to get the PersonIDs, and then use the HBase batch get API to fetch the data in batches. It should be very fast.
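The two-table pattern can be sketched in Python with a dict for the data table and a sorted list for the index table. This is a model of the idea rather than HBase client code: `PersonStore`, its method names and the sortable YYYYMMDDHH batch date format are assumptions made for the example.

```python
from bisect import bisect_left, insort


class PersonStore:
    def __init__(self):
        self.people = {}  # table two: PersonID -> PersonJSON (latest only)
        self.index = []   # table one: sorted "BatchDate|PersonID" keys, no values

    def upsert(self, person_id, batch_date, person_json):
        # Dual write: update the data table and add an index entry.
        self.people[person_id] = person_json
        insort(self.index, f"{batch_date}|{person_id}")

    def modified_since(self, batch_date):
        # Range scan on the index (start row = batch_date), then a batch get.
        start = bisect_left(self.index, batch_date)
        ids = {key.split("|", 1)[1] for key in self.index[start:]}
        return {pid: self.people[pid] for pid in sorted(ids)}


store = PersonStore()
store.upsert("p1", "2018012100", '{"name": "a"}')
store.upsert("p2", "2018012104", '{"name": "b"}')
store.upsert("p1", "2018012108", '{"name": "a2"}')

changed = store.modified_since("2018012104")
print(changed)
```

A person modified several times appears more than once in the index, which is why the ids are collected into a set before the batch get. Old index entries would also need to be cleaned up at some point, for example with a TTL, which this sketch leaves out.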

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. Storing the actual string values in the table would be bad design, since it would mean an atrocious amount of repeated data. Instead, I want to create a primitive lookup table. This table would hold Daily, Monthly and Weekly, and would use Teradata's identity column to derive the primary key. This primary key would then be stored as foreign keys in the table I am creating.
This would work fine for my application, since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table to the primitives table so that it can actually return Daily, Monthly and Weekly.
My concern is performance. I've never created a table with such a large number of foreign key fields, and I am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals, as mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID Type Value
--- ------------ ------------
1 Interval Daily
2 Interval Monthly
3 Interval Weekly
4 TimeFrame 24x7
5 TimeFrame 8x5
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID Type Value
--- ------------ ------------
0 Unknown Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
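Here is a rough illustration of the Unknown-row idea. It uses Python's built-in sqlite3 as a stand-in for Teradata, so it shows only the join semantics, not Teradata performance. The names `primitives`, `main_table`, `item`, `interval_id` and `timeframe_id` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primitives (id INTEGER PRIMARY KEY, type TEXT, value TEXT);
    INSERT INTO primitives VALUES
        (0, 'Unknown', 'Unknown'),
        (1, 'Interval', 'Daily'),
        (2, 'Interval', 'Monthly'),
        (4, 'TimeFrame', '24x7');

    -- Foreign key columns default to 0 (Unknown) instead of allowing NULL.
    CREATE TABLE main_table (
        item TEXT,
        interval_id INTEGER NOT NULL DEFAULT 0,
        timeframe_id INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO main_table (item, interval_id, timeframe_id)
        VALUES ('report_a', 1, 4);
    INSERT INTO main_table (item, interval_id)
        VALUES ('report_b', 2);  -- time frame not known
""")

# Plain inner joins keep every row, because 0 always matches the Unknown row.
rows = conn.execute("""
    SELECT m.item, i.value, t.value
    FROM main_table m
    JOIN primitives i ON i.id = m.interval_id
    JOIN primitives t ON t.id = m.timeframe_id
    ORDER BY m.item
""").fetchall()
print(rows)
```

Had timeframe_id been NULL for report_b, the inner join would have dropped that row, and the view would have needed an outer join to keep it.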

Simplifying a Cascading pipeline used for aggregating sales data

I'm very new to both Cascading and Hadoop, so be gentle... :-D
I think I'm way over-engineering something. Basically, my situation is that I have a pipe-delimited file with 9 fields. I want to compute some aggregated statistics over those 9 fields using different groupings. The result should be 10 fields, of which only 6 are either counts or sums. So far I'm up to 4 Unique pipes, 4 CountBy pipes, 1 SumBy, 1 GroupBy, 1 Every, 2 Each, 5 CoGroups and a couple of others. I need to add another small piece of functionality, and the only way I can see to do it is to add 2 Filters, 2 more CoGroups and 2 more Each pipes. This all seems like overkill just to compute a few aggregated statistics, so I think I'm really misunderstanding something.
My input file looks like this:
storeID | invoiceID | groupID | customerID | transaction date | quantity | price | item type | customer type
Item type is either "I", "S" or "G" for inventory, service or group items. Customers belong to groups. The rest should be self-explanatory.
The result I want is:
project ID | storeID | year | month | unique invoices | unique groups | unique customers | customer visits | inventory type sales | service type sales
Project ID is a constant. Customer visits is the number of days during the month on which the customer came in and bought something.
The setup that I'm using right now uses a TextDelimited Tap as my source to read the file and passes the records to an Each pipe which uses a DateParser to parse the transaction date and adds in year, month and day fields. So far so good. This is where it gets out of control.
I'm splitting the stream from there up into 5 separate streams to process each of the aggregated fields that I want. Then I'm joining all the results together in 5 CoGroup pipes, sending the result through Insert (to insert the project ID) and writing through a TextDelimited sink Tap.
Is there an easier way than splitting into 5 streams like that? The first four streams do almost exactly the same thing, just on different fields. For example, the first stream uses a Unique pipe to get the unique invoiceIDs, then uses a CountBy to count the number of records with the same storeID, year and month. That gives me the number of unique invoices created for each store by year and month. Then there is a stream that does the same thing with groupID and another that does it with customerID.
Any ideas for simplifying this? There must be an easier way.
