I would like to create an optimal row key in Bigtable. I have a table channel_data with three columns: channel_id, date, fan_count.
channel_id  date        fan_count
1           2022-03-01  5000
1           2022-03-02  6000
2           2022-03-01  200
2           2022-03-02  300
3           2022-03-03  1000
Users of our application can set up brands/buckets by adding multiple channels. Users can choose any random channel_id.
I want to design an efficient row key to fetch aggregated fan_count in a date range for a brand.
Let's say a user creates a brand with channel_ids 1 and 3 and wishes to see the sum of all fans for the period 2022-03-01 to 2022-03-03.
The result should be 5000 + 6000 + 1000 = 12000.
You have a few options here. Because you're looking to query by date, you should probably make the date the end part of your row key so you can scope down by channel first. You could also use timestamped cells to store multiple values for each channel, perhaps a week or a month of data per row so it is grouped together that way, but this isn't necessary.
Perhaps a row key like channel_id/yyyy-mm-dd is what you'd want. You can choose to also store the date and channel info as columns, but it isn't necessary since you'd already have that in the row key. You can just treat Bigtable like a key/value store in this instance, which might be more optimal depending on your scenario.
If you choose to store a month of data per row, you would make the row key something like channel_id/yyyy-mm and timestamp each cell value with the day it belongs to.
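For example, a minimal write sketch with the google-cloud-bigtable Python client, assuming the monthly channel_id/yyyy-mm row key and a column family named cf (the project, instance, and table names are placeholders):

from datetime import date, datetime, timezone
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("channel_data")

def write_fan_count(channel_id, day, fan_count):
    # One row per channel per month; the day is carried by the cell timestamp.
    row_key = f"{channel_id}/{day:%Y-%m}".encode()
    row = table.direct_row(row_key)
    row.set_cell(
        "cf",                      # column family (assumed name)
        b"fan_count",
        str(fan_count).encode(),
        timestamp=datetime(day.year, day.month, day.day, tzinfo=timezone.utc),
    )
    row.commit()

Calling write_fan_count(1, date(2022, 3, 1), 5000) would then land in row 1/2022-03 with a cell timestamped 2022-03-01.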
Either way, for your queries, if you need multiple channels you can just do multiple reads or a single scan over multiple prefixes/row ranges.
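As a sketch of that read path (using the daily channel_id/yyyy-mm-dd row key from above; again the column family cf and the project/instance/table names are assumptions), summing fan_count for channels 1 and 3 over 2022-03-01 to 2022-03-03 could look like:

from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("channel_data")

def brand_fan_sum(channel_ids, start_date, end_date):
    # One contiguous row range per channel: channel_id/start .. channel_id/end.
    row_set = RowSet()
    for channel_id in channel_ids:
        row_set.add_row_range_from_keys(
            start_key=f"{channel_id}/{start_date}".encode(),
            end_key=f"{channel_id}/{end_date}".encode(),
            end_inclusive=True,
        )
    total = 0
    for row in table.read_rows(row_set=row_set):
        cell = row.cells["cf"][b"fan_count"][0]  # latest cell for the qualifier
        total += int(cell.value)
    return total

# brand_fan_sum([1, 3], "2022-03-01", "2022-03-03") -> 12000 for the sample data

Let me know if this helps clarify the schema design and if you have more questions.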
For a client I am creating a data warehouse in which we have some slowly changing dimensions (or facts, if that is even a thing?). For example, we want to report the annually recurring revenue (ARR) for subscriptions, and we want to have both the currently active and the expired subscriptions in there, so that we can see the ARR over a timeline.
The data we retrieve looks like this:
subscription_id  account_id  ARR  start_date  end_date
1                1           10   01-01-2022  31-03-2022
2                2           20   01-01-2022  31-12-2022
3                1           5    01-04-2022  31-11-2022
So in this case the same account (account_id 1) renewed a subscription on 01-04-2022. In the report for 2022 we want to see the ARR for all months of 2022. I've looked into slowly changing dimensions, but something I cannot really see in that concept is how to report both the currently active license and the history in a dashboard. If we, for example, want to visualize the ARR per month for all of 2022 in a dashboarding tool, we want to see both subscriptions for account_id 1 over the course of the year, not just the currently active one. This seems to be very tricky to do in most dashboarding tools.
To overcome this I've done the following: I created a calendar table with an interval of 1 month and cross joined it with the table above to generate a fact table. The end result would look like:
timestamp   account_id  ARR
01-01-2022  1           10
01-01-2022  2           20
01-02-2022  1           10
...         ...         ...
01-11-2022  1           10
01-11-2022  2           20
01-11-2022  2           20
This makes it really easy for the user of the reporting tool to filter on a specific month and show the ARR between the dates and over multiple subscriptions. It does, however, generate a lot of extra data, although storage space is not an issue at the moment. It also makes it more of a transactional-style table, even though the ARR is not really a transaction (i.e. it is not really a product sold on a specific date).
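For reference, a minimal sketch of that calendar-spine expansion in pandas (the column names and values come from the example above; the month-start grain is an assumption, and the 31-11-2022 end date is read as 30-11-2022 since November has 30 days):

import pandas as pd

subscriptions = pd.DataFrame({
    "subscription_id": [1, 2, 3],
    "account_id": [1, 2, 1],
    "ARR": [10, 20, 5],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-04-01"]),
    "end_date": pd.to_datetime(["2022-03-31", "2022-12-31", "2022-11-30"]),
})

# Calendar table: the first day of every month in the reporting year.
calendar = pd.DataFrame({"timestamp": pd.date_range("2022-01-01", "2022-12-01", freq="MS")})

# Cross join (pandas >= 1.2), then keep only the months inside each subscription's range.
fact = calendar.merge(subscriptions, how="cross")
fact = fact[(fact["timestamp"] >= fact["start_date"]) & (fact["timestamp"] <= fact["end_date"])]
fact = fact[["timestamp", "account_id", "ARR"]].sort_values(["timestamp", "account_id"])

# Monthly ARR is now a plain group-by for the dashboard.
monthly_arr = fact.groupby("timestamp", as_index=False)["ARR"].sum()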
My question is: Are there better ways of generating a fact table where the source data contains a date range?
I am working on a use case and need help improving scan performance.
Visits by customers to our website are generated as logs, which we process with Apache Pig and insert directly into an HBase table (test) using HBaseStorage. This is done every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family)
As of now I generate a random number for each row and insert it as the row key for that table. For example, suppose I have the following data to insert into the table:
1725|xxx|www.something.com|127987834 | india |zzzz
1726|yyy|www.some.com|128389478 | UK | yyyy
Then I would use 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same Customerid will be repeated on different days, which is why I chose a random number as the row key.
When I query the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1 to 2 seconds, since I am using it for real-time analytics.
Thanks
As per the query you have mentioned, I am assuming you need records based on Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.
However, there could be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. This unique number could be the timestamp; it depends on your requirements.
To scan the data in this case, you then use a prefix scan (PrefixFilter) on the row key, which will give you much better performance.
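For illustration, a prefix scan with the happybase Python client might look like this (just a sketch; the table name matches the question, while the host and the CustomerID|timestamp key layout are assumptions):

import happybase

connection = happybase.Connection("hbase-host")  # placeholder host
table = connection.table("test")

def rows_for_customer(customer_id):
    # Row keys are "CustomerID|timestamp", so all rows for one customer
    # share the prefix "CustomerID|" and sit next to each other on disk.
    prefix = f"{customer_id}|".encode()
    for row_key, data in table.scan(row_prefix=prefix):
        yield row_key, data

# e.g. list(rows_for_customer(1002)) instead of a full-table
# SingleColumnValueFilter scan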
Hope this helps.
I'm new to Redis, and I now have a problem improving my stats application. The current SQL used to generate the statistic is:
SELECT MIN(created_at), MAX(created_at)
FROM (SELECT created_at FROM table ORDER BY id DESC LIMIT 10000) t
It returns the MIN and MAX values of the created_at field over the last 10000 records.
I have read about RANGE and SCORING in Redis, and it seems they can be used to solve this problem, but I am still confused about how SCORING works for the last 10000 records. Can they be used to solve this, or is there another way to do it with Redis?
Regards
Your target appears to be somewhat unclear - are you looking to store all the records in Redis? If so, what other columns does the table table have and what other queries do you run against it?
I'll take your question at face value, but note that in most NoSQL databases (Redis included) you need to store your data according to how you plan on fetching it. Assuming that you want to get the min/max creation dates of the last 10K records, I suggest that you keep them in a Sorted Set. The Sorted Set's members will be the unique id and their scores will be the creation date (use the epoch value), for example, rows with ids 1, 2 & 3 were created at dates 10, 100 & 1000 respectively:
ZADD table 10 1 100 2 1000 3 ...
Getting the minimal creation date is easy now - just do ZRANGE table 0 0 WITHSCORES - and the max is just a ZRANGE table -1 -1 WITHSCORES away. The only "tricky" part is making sure that the Sorted Set is kept updated, so for every new record you'll need to remove the lowest id from the set and add the new one. In pseudo Python code this would look something like the following:
import redis

r = redis.Redis()  # assumes a local Redis instance

def updateMinMaxSortedSet(id, date):
    # member = record id, score = creation date as an epoch value
    r.zadd('table', {id: date})
    # keep only the most recent 10000 entries: drop the lowest-ranked (oldest) ones
    extra = r.zcard('table') - 10000
    if extra > 0:
        r.zremrangebyrank('table', 0, extra - 1)
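Reading the values back is then a small usage sketch with redis-py, continuing with the same r connection as above:

# oldest entry -> minimal creation date
min_id, min_created = r.zrange('table', 0, 0, withscores=True)[0]
# newest entry -> maximal creation date
max_id, max_created = r.zrange('table', -1, -1, withscores=True)[0]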
I have to create a matrix in SSRS to detail the number of users leaving an organisation.
The columns will each represent a time span of 1 week and the rows will each represent a department in the organisation. The detail portion will be a count of the people who left that department in that week.
I have a leaving-date field in the DB but nothing that flags the specific intervals I have been told to use. That means that, as the matrix stands, it counts the users that have left a specific department, but the date-range columns are 1 day wide, not 1 week. Is there a way to force the column headers to respect the week intervals I want, given that they are currently coming from the dataset and are not hard coded?
Firstly, try to shape your data in SQL itself by grouping on the date and making each group a one-week period. That way you can get all the data in your required format.
I don't know what your columns are, so I am just showing one way to get the week groups from the table and the total of people leaving:
SELECT DATEPART(wk, datevaluecolumn) AS weekno
     , SUM(peopleleavingcolumn) AS totalvalue
FROM yourTable
GROUP BY DATEPART(wk, datevaluecolumn)
I am designing a table in Teradata with about 30 columns. These columns are going to need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, Weekly and would use Teradata's identity column to derive the primary key. This primary key would then be stored as foreign keys in the table I am creating.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large amount of foreign key fields and am fairly new to Teradata. Before I go on the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals, as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID Type Value
--- ------------ ------------
1 Interval Daily
2 Interval Monthly
3 Interval Weekly
4 TimeFrame 24x7
5 TimeFrame 8x5
Edit Part 2: Added a new tag to get more exposure to this question.
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID Type Value
--- ------------ ------------
0 Unknown Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
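To make the inner-join point concrete, here is a small illustration using Python's built-in sqlite3 rather than Teradata syntax (the table and column names are made up for the example):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lookup (id INTEGER PRIMARY KEY, type TEXT, value TEXT);
    INSERT INTO lookup VALUES
        (0, 'Unknown',   'Unknown'),
        (1, 'Interval',  'Daily'),
        (2, 'Interval',  'Monthly'),
        (4, 'TimeFrame', '24x7');

    -- the main table stores 0 instead of NULL when no interval is set
    CREATE TABLE main_table (row_id INTEGER PRIMARY KEY, interval_id INTEGER NOT NULL DEFAULT 0);
    INSERT INTO main_table VALUES (10, 1), (11, 4), (12, 0);
""")

# Because every foreign key points at a real lookup row (0 = Unknown),
# a plain inner join never drops rows, so no outer join is needed.
rows = con.execute("""
    SELECT m.row_id, l.value
    FROM main_table m
    JOIN lookup l ON l.id = m.interval_id
""").fetchall()
print(rows)  # e.g. [(10, 'Daily'), (11, '24x7'), (12, 'Unknown')]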