MAX() SQL Equivalent on Redis - performance

I'm new to Redis, and I'm now trying to improve my stats application. The current SQL used to generate the statistics is:
SELECT MIN(created_at), MAX(created_at) FROM table ORDER BY id DESC limit 10000
It should return the MIN and MAX values of the created_at field for the last 10,000 records.
I have read about ranges and scoring in Redis, and it seems they can be used to solve this problem, but I'm still confused about how scoring would work for the last 10,000 records. Can they be used here, or is there another way to solve this with Redis?
Regards

Your target appears to be somewhat unclear - are you looking to store all the records in Redis? If so, what other columns does the table have, and what other queries do you run against it?
I'll take your question at face value, but note that in most NoSQL databases (Redis included) you need to store your data according to how you plan on fetching it. Assuming that you want to get the min/max creation dates of the last 10K records, I suggest that you keep them in a Sorted Set. The Sorted Set's members will be the unique ids and their scores will be the creation dates (use the epoch value). For example, rows with ids 1, 2 & 3 created at dates 10, 100 & 1000 respectively would be added with:
ZADD table 10 1 100 2 1000 3 ...
Getting the minimal creation date is easy now - just do ZRANGE table 0 0 WITHSCORES - and the max is just a ZRANGE table -1 -1 WITHSCORES away. The only "tricky" part is making sure that the Sorted Set is kept updated, so for every new record you'll need to remove the lowest id from the set and add the new one. In pseudo Python code this would look something like the following:
def updateMinMaxSortedSet(id, date):
    # cap the Sorted Set at the last 10K records
    if redis.zcard('table') >= 10000:
        # assumes sequential ids, so the oldest member in the set is id - 10000
        redis.zrem('table', id - 10000)
    # member is the id, score is the creation date (epoch value)
    redis.zadd('table', {id: date})
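Reading the min and max back is then just the two ZRANGE calls mentioned above; here is a small sketch with redis-py (the connection parameters are assumptions):

import redis as redis_lib

redis = redis_lib.Redis(host='localhost', port=6379, decode_responses=True)

# Each result is a list of (member, score) pairs; the score is the creation date (epoch value)
oldest = redis.zrange('table', 0, 0, withscores=True)    # e.g. [('1', 10.0)]
newest = redis.zrange('table', -1, -1, withscores=True)  # e.g. [('3', 1000.0)]
min_created_at = oldest[0][1] if oldest else None
max_created_at = newest[0][1] if newest else None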

Related

I would like to create an efficient Bigtable row key

I would like to create an optimal row key in Bigtable. I have a table channel_data with 3 columns: channel_id, date, and fan_count.
channel_id  date        fan_count
1           2022-03-01  5000
1           2022-03-02  6000
2           2022-03-01  200
2           2022-03-02  300
3           2022-03-03  1000
Users of our application can set up brands/buckets by adding multiple channels. Users can choose any random channel_id.
I want to design an efficient row key to fetch aggregated fan_count in a date range for a brand.
Let's say the user creates a brand with channel_ids 1 and 3 and wishes to see the sum of all fans for the period 2022-03-01 to 2022-03-03.
The result should be 5000 + 6000 + 1000 = 12000.
You have a few options here. Because you're looking to query by date, you should probably make the date the end part of your row key so you can scope down by channel first. You could also use timestamped cells to store multiple values for each channel, perhaps a week or month of data, so it is grouped together that way, but this isn't necessary.
A row key like channel_id/yyyy-mm-dd is probably what you'd want. You can choose to also store the date and channel info as columns in the table, but it isn't necessary since you'd already have them in your row keys. You can just treat Bigtable like a key/value store in this instance, which might be more optimal depending on your scenario.
If you choose to store a month of data per row, you would make the row key something like channel_id/yyyy-mm and timestamp each value for the day.
Either way, for your queries, if you need multiple channels you could just do multiple reads or a multi-prefix scan, as sketched below. Let me know if this helps clarify the schema design and if you have more questions.
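As a rough illustration of the channel_id/yyyy-mm-dd design, here is a sketch with the Python Bigtable client that sums fan_count for channels 1 and 3 over 2022-03-01 to 2022-03-03 by reading one row range per channel. The project/instance/table ids, the cf column family, and the big-endian integer encoding of fan_count are assumptions.

from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")                # placeholder ids
table = client.instance("my-instance").table("channel_data")

row_set = RowSet()
for channel_id in (1, 3):                                     # the brand's channels
    # end_inclusive so the 2022-03-03 rows are part of the scan
    row_set.add_row_range_from_keys(
        start_key=f"{channel_id}/2022-03-01".encode(),
        end_key=f"{channel_id}/2022-03-03".encode(),
        end_inclusive=True,
    )

total_fans = 0
for row in table.read_rows(row_set=row_set):
    cell = row.cells["cf"][b"fan_count"][0]                   # newest cell for the qualifier
    total_fans += int.from_bytes(cell.value, "big")           # assumes big-endian int values
print(total_fans)                                             # 5000 + 6000 + 1000 = 12000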

Faster select when filtering with second or third sorted column

We have a time series table with the following definition
CREATE TABLE timeseries.mytable
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip,ts,src_ip)
SETTINGS index_granularity = 8192
SELECT count(*) FROM timeseries.mytable;
# Elapsed: 0.004 sec. Has 383M records
SELECT count(*) FROM timeseries.mytable WHERE dst_ip = 'a.b.c.d';
# Elapsed: 0.085 sec.
SELECT count(*) FROM timeseries.mytable WHERE src_ip = 'a.b.c.d';
# Elapsed: 53.031 sec.
As can be seen above, filtering the data using the first sorted column (dst_ip) is very quick.
How can I make the select using the third sorted column (src_ip) faster?
Some remarks:
The third query (WHERE src_ip = 'a.b.c.d') is slow because the index is not used and ClickHouse falls back to a full scan. There is no good way to make it faster besides redesigning the primary key or, if the query only calculates aggregates, adding an additional AggregatingMergeTree table.
The use cases you provided look artificial, because counting rows over the whole dataset is not a key use case for time series data. Why is the result not restricted by dst_ip and ts?
Consider using the ClickHouse AggregatingMergeTree approach when you need to calculate aggregated values (such as count in your case); a sketch follows after the primary-key suggestions below.
Designing the primary key requires understanding how ClickHouse uses it in query optimization (see Primary Keys and Indexes in Queries, More Secrets of ClickHouse Query Performance).
These recommend using a monotonic index.
To choose the best index, run a series of tests to find the one that fits your concrete use cases best.
I would suggest the following primary keys:
/* [pretty suspicious suggestion] Remove the date column (this makes all date-range queries with a range smaller than daily much slower). */
ORDER BY (dst_ip, src_ip)
/* Define the date granularity. Instead of toStartOfHour, any interval smaller than 'Daily' can be used (where 'Daily' is defined by the partition key). */
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
/* Move the date to the first position (this speeds up queries that filter on a date range without dst_ip, and gains the monotonic-index advantages). */
ORDER BY (toStartOfHour(ts), dst_ip, src_ip)
For each primary key, you also need to choose the most effective index_granularity value.
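To make the AggregatingMergeTree suggestion above concrete, here is a minimal sketch using the clickhouse-driver Python client; the view name, connection details, and hourly granularity are illustrative assumptions rather than part of the original answer.

from clickhouse_driver import Client

client = Client(host='localhost')  # assumed connection details

# Pre-aggregate the count per src_ip and hour; new inserts into mytable keep the view
# up to date, and POPULATE backfills the rows that already exist.
client.execute("""
    CREATE MATERIALIZED VIEW timeseries.src_ip_hourly
    ENGINE = AggregatingMergeTree()
    PARTITION BY toDate(hour)
    ORDER BY (src_ip, hour)
    POPULATE
    AS SELECT src_ip, toStartOfHour(ts) AS hour, countState() AS cnt
    FROM timeseries.mytable
    GROUP BY src_ip, hour
""")

# The slow count by src_ip now reads the small pre-aggregated table instead of 383M rows
rows = client.execute(
    "SELECT countMerge(cnt) FROM timeseries.src_ip_hourly WHERE src_ip = %(ip)s",
    {'ip': 'a.b.c.d'},
)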
As of 2022, the solution is to use a Data Skipping Index (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) for src_ip.
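A minimal sketch of adding such an index, again via the clickhouse-driver Python client; the index name, the bloom_filter type, and GRANULARITY 4 are illustrative choices, not prescribed by the answer.

from clickhouse_driver import Client

client = Client(host='localhost')  # assumed connection details

# Add a data skipping index on src_ip so equality filters can skip whole granule blocks
client.execute(
    "ALTER TABLE timeseries.mytable "
    "ADD INDEX src_ip_idx src_ip TYPE bloom_filter GRANULARITY 4"
)
# Build the index for the data that is already in the table
client.execute("ALTER TABLE timeseries.mytable MATERIALIZE INDEX src_ip_idx")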
You should test different column orders in the ORDER BY clause, depending on the value cardinality of your columns. In this case, try bringing src_ip before ts in the ORDER BY clause.
In the MergeTree engine, rows are sorted by the ORDER BY keys within each partition.
After that, you can decide on the final arrangement of columns in the ORDER BY clause depending on how your application will query the data most of the time.
You can find a similar discussion here.

Cognos 11 Crosstab - need a value that doesn't have a reference to the column values

Crosstab report works 99%.
About 20 rows, all but one are ok.
5 columns - Company Division.
The rows are things like cost, revenue, revenue 2, etc.
All the rows that work have three attributes I'm using to select them:
Fiscal Year
Period
Solution.
The problem is that there is a table that lists a YTD rate for each period. This table is not division-specific; it's company-wide.
All the tables are linked to the accounting period table that has fiscal year and period. So the overall query limits data to fiscal year (?pFiscalYear?) and period <= ?pPeriod?, based on prompt page results.
The source table has this:
FY_CD PD_NO ACT_CURR_RT ACT_YTD_RT
2018 1 0.36121715 0.36121715
2018 2 0.32471476 0.34255512
2018 3 0.25240906 0.31210183
2018 4 0.33154745 0.31925874
Note the YTD rate is not an average of any of the other numbers.
When I select ACT_YTD_RT as a row, I want the ACT_YTD_RT value that matches the selected period.
What I get is the average if I set the aggregation to average or the lowest if I set it to other aggregations. So sometimes, it looks right (if I run for period 1,2,3, as the rate kept falling), and sometimes it's wrong (period 4
returns .3121 instead of .3192).
I've tried a number of different methods and can generate garbage data (totals, min, max, average) and crossjoins but can't figure out how to get the value I'm looking for.
I want YTD_RT where fiscal year =?pFiscal? and period = ?pPeriod?.
I tried a straight if then clause:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
but I get an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
If I create another query that generates the right response and try to include it, I get a crossjoin error that the query I'm referencing is trying to crossjoin several other items in the crosstab query.
A union doesn't work (different number of columns).
Not sure how a join would work since the division doesn't exist in the rate table.
I maybe could create a view in the database that did a crossjoin of the division table and the rate table, add that to the framework and then I wouldn't have a crossjoin since the solution would be in the rate "table" (really view), but that seems wrong somehow.
If I could just write a freaking parameterized query direct to the database I'd be done. But in Cognos 11 crosstabs I can't find a place for a SQL query object. And that shouldn't be necessary.
I've spent hours and hours chasing this in circles.
Anybody have any ideas?
Thanks
Paul
So the earlier problem was that this:
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
Generated an error like this:
'ACT_YTD_RT' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. (SQLSTATE=42000, SQLERRORCODE=8120)
To fix the above, I had to add a cross join of the division table and the rate table as a view in the database. Then add that to the framework. Then build the data item this way:
total (
if (sourcetable.fiscalYear = ?pFiscalYear?) and (sourcetable.Period = ?pPeriod?) then (ACT_YTD_RT)
)
And now the "total" provides the missing group by. And the crossjoin in the database provides the division information so the crosstab is happy.
I still think there should have been an easier way to do this, but I have a functioning hammer at the moment.

JPA best way to avoid n+1 when I need to make a calculation for each row

My application is used to find places in a city. Each place needs a score to be calculated, and this score cannot be precomputed in advance (stored somewhere) as it is different for each user and changes over time. Here is what I'm doing at the moment, which is TERRIBLY inefficient (15 times slower than if I mock the database call inside the loop):
1. A native SQL query fetches all the places that match the search (I select only the specific columns I need).
2. I loop through the list and, for each POI, make a DB call to get the info needed to calculate the score (I need different values residing in different tables).
3. Make the calculation.
4. Sort by score, descending.
5. Cut the list depending on the pagination settings (no, I cannot put LIMIT directly in the query, as I don't know the scores yet...).
6. Return the list.
Well, this takes 15 seconds in total.
If I remove step 2 and simply mock the DB call, it only takes 600 ms.
My tables look like this:
place_tag_count table:
place_id  tag_id  tag_count
1         100     15
1         200     25
1         300     35

user_tag_score table:
user_id  tag_id  score
1000     100     0.5
1000     200     0.3
As a simplified example, the place score is the sum of the user's tag scores multiplied by the corresponding tag counts found in place_tag_count:
score = 0.5 * 15 + 0.3 * 25 + ... (I won't complicate things, but if a tag score is missing I do another calculation that needs other DB calls)
The query at step 1 returns distinct places, so because the calculation needs all the tag counts of the place and the user's tag scores, I have to make that extra DB call for each POI.
My question is: what would be the BEST way to avoid the n+1 calls in my situation? I have thought of some alternatives (listed below, with a small sketch after them), but I'd prefer the opinion of a more experienced person before going in head first.
Instead of returning distinct places in the query at step 1, I could return one row per place_id, tag_id pair, and in my Java code just loop and, whenever place_id changes, know that I'm processing another place.
Make the query at step 1 a bit more complicated and aggregate all the numbers I need into a comma-separated list (but that requires some kind of sub-select, which might affect the speed of the query).
Another solution?
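For illustration, here is a minimal sketch of the bulk-fetch idea (in Python just to show the data flow, since the bottleneck is the access pattern rather than anything JPA-specific): load the tag counts for all candidate places and the user's tag scores with two set-based queries, then compute every score in memory. The dict shapes, the example values from the tables above, and the 0.0 fallback for missing tag scores are assumptions.

# Results of two bulk queries (or one joined query), keyed for in-memory lookup;
# the values mirror the example tables above.
place_tag_counts = {                    # place_id -> {tag_id: tag_count}
    1: {100: 15, 200: 25, 300: 35},
}
user_tag_scores = {100: 0.5, 200: 0.3}  # tag_id -> score for the current user

def place_score(tag_counts, user_scores):
    # Missing tag scores fall back to 0.0 here; the real application would apply
    # its own fallback calculation instead.
    return sum(user_scores.get(tag_id, 0.0) * count
               for tag_id, count in tag_counts.items())

scores = {pid: place_score(tags, user_tag_scores)
          for pid, tags in place_tag_counts.items()}
print(scores)  # {1: 15.0}, i.e. 0.5 * 15 + 0.3 * 25 (tag 300 has no user score)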

Splunk argmax: get field value corresponding to max value of another field

Let's say on Splunk, I have a table with the fields 'month', 'year', and 'count'. I want the month corresponding to the max count for each year. So, the resulting table should only have one month per year.
I've tried using the stats and chart max functions, but I can't figure out how to use them to get what I want, or if it's even possible.
Is there any way to accomplish this using Splunk?
I ended up using the streamstats command.
Given a table with fields month, year, and count:
<some search>
| streamstats max(count) as mc by year
| sort +year, -count
| streamstats first(mc) as mc
| where count = mc
Essentially, I'm using streamstats to max across each month in each year, storing a running max for each entry as a new column. Then, I sort it so that the largest max count is at the top of each year group, so that I can then select the first one as the max entry.
I also had the same requirement.
I had log data with the fields 'loadtime', 'application', and 'username'.
First, I wanted to compute the maximum value of loadtime for each application. Then, I wanted to create a table/chart containing a single row for each application, with the application name and the maximum load time. The table should also contain the username value corresponding to the maximum loadtime calculated for each application.
Below is the Splunk query I used to achieve this:
search_string
| streamstats max(loadtime) as max_time by application
| sort +application, -loadtime
| streamstats first(max_time) as max_time by application
| where loadtime = max_time
| table application, max_time, username
