I have a table with an enumerated column named "status". I am implementing an endpoint that returns statistics about active and inactive entries. It will return a response like this:
{ "activeCount" : 10, "inactiveCount" : 10 }
There are four possible statuses for each entry (active, inactive, awaitingApproval, suspicious). activeCount is the number of entries with active status;
inactiveCount is the number of entries with inactive, awaitingApproval, or suspicious status.
I am using the controller-service-repository pattern and an H2 in-memory database. I need this to be as fast as possible. Also assume that this table will hold a massive amount of data in the future, so loading all entries into memory and computing the status statistics there is not an option.
What are your best practice suggestions?
Thanks in advance for any help.
Just use a query like select e.status, count(*) from Entity e group by e.status. If this is not fast enough for you, you will have to maintain a current count per group in a dedicated table and query that instead. That obviously requires you to update the count accordingly on every status change, insert, or delete. Usually, this can be done with triggers.
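If you want both numbers in a single round trip, conditional aggregation works too. A minimal sketch, assuming the table is named entry and the status is stored as a string (both names are assumptions):

SELECT
  COUNT(CASE WHEN status = 'active' THEN 1 END)  AS active_count,
  COUNT(CASE WHEN status <> 'active' THEN 1 END) AS inactive_count
FROM entry;

-- Counter-table alternative for very large tables, kept in sync by
-- triggers or by the service layer on every insert/update/delete
-- (note that H2 implements triggers as Java classes):
CREATE TABLE status_counts (
  status VARCHAR(32) PRIMARY KEY,
  cnt    BIGINT NOT NULL
);
SELECT status, cnt FROM status_counts;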
I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of email addresses. It has two other attributes, "Daily Points" and "TimeSpendInTheApp". I need to run a query or scan on the table that gives me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day, by a scheduled AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost matters more than speed or efficiency. Maintaining a global secondary index or a local secondary index on points can be a costly operation, since I would have to provision read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average around 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a filter expression for records above a certain points threshold (5000, for example). If that scan finds 50 or more records, simply sort the values and take the top 50. If the scan returns no or very few results, lower the filter value (to 3000, for example) and scan again. If a lower filter value (2500, for example) returns far too many records, say 5000 or more, raise it again. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table that has 50,000 records?
Any advice or suggestions will be helpful. Thanks in advance.
Firstly, creating indexes doesn't simplify this use case, because indexes alone don't solve the aggregation or sorting problem.
I would export the data to Hive and run the queries there rather than writing code to compute the result, especially as this is a batch job executed only once per day.
Something like the below.
Create the Hive table:
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;
(Note: ORDER BY rather than SORT BY, because SORT BY only orders rows within each reducer and would not guarantee a global top 50; LIMIT 50 keeps only the rows you need.)
Hive Reference
Is there a way to limit the rows returned for a user in Oracle?
We have some users who can query tables with millions of records, degrading database performance, so I would like to know if there is some way to set a maximum number of records per user.
For example, if I have the table APP.HISTORY with 10,000,000 records and the user 'dummy', I would like to configure things so that dummy can only read 10,000 records from it.
For example, if 'dummy' executes:
select * from APP.HISTORY
it would only return 10,000 records, instead of trying to fetch all 10,000,000.
There isn't any built-in functionality to limit the number of results per user.
However, even if you could, that wouldn't necessarily help you resolve your performance concern.
Consider for example a query like:
select *
from (select *
from app.history
order by some_field desc)
where rownum < 2
According to your requirements, user dummy would be able to run this and get back the single result he's interested in. However, assuming some_field is not indexed, then even though this query returns a single record, it still has to sort all 10,000,000 records to produce that single row.
As suggested by OldProgrammer in the comments, consider using Resource Manager consumer groups, which are a very flexible and configurable way of throttling CPU and I/O usage.
Otherwise, if you don't trust user dummy to write smart and efficient queries, then don't give him direct access to the database.
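A minimal Resource Manager sketch, assuming a plan named LIMIT_PLAN and a consumer group named LIMITED_GROUP (all names and percentages are illustrative and would need tuning for your workload):

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'LIMITED_GROUP',
    comment        => 'Ad-hoc users with capped resources');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'LIMIT_PLAN',
    comment => 'Throttle ad-hoc query users');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'LIMIT_PLAN',
    group_or_subplan => 'LIMITED_GROUP',
    comment          => 'Small CPU share for throttled users',
    mgmt_p1          => 10);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'LIMIT_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'Everyone else',
    mgmt_p1          => 90);
  -- Map the user to the throttled group:
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
    attribute      => DBMS_RESOURCE_MANAGER.ORACLE_USER,
    value          => 'DUMMY',
    consumer_group => 'LIMITED_GROUP');
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
-- Activate the plan (and grant the switch privilege via
-- DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP as needed):
ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = 'LIMIT_PLAN';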
If you query for a range of dates plus another column, is having the date column as the leading edge of your index a bad thing?
I'm using PostgreSQL, but assume this applies to all B-tree indexes.
Let's say I query for records where the created date is 2013-01-02 or later and the status is Active. I'm fuzzy on how a B-tree index organizes dates, but here's how I imagine it. If the index were on (created, status), it would be structured roughly like this:
created      status
----------   --------
2013-01-01   Active
2013-01-01   Inactive
2013-01-02   Active     <-- This record is selected
2013-01-02   Inactive
2013-01-03   Active     <-- This non-adjacent record is selected (SLOW)
2013-01-03   Inactive
If the index were on (status, created):
status     created
--------   ----------
Active     2013-01-01
Active     2013-01-02  <-- This record is selected
Active     2013-01-03  <-- This adjacent record is selected (FAST)
Inactive   2013-01-01
Inactive   2013-01-02
Inactive   2013-01-03
So in my mind, if you use a date as the leading edge and query for a range of those dates, the records you want end up scattered through the index, leading to poorer performance. It's even worse with a timestamp.
I think your best bet here is to use a partial index. It sounds like you will mostly be running queries such as:
select * from my_table where status = 'Active' and created > whatever
If that is the case, you will likely see the best performance by creating an index on the creation date, filtered by status:
CREATE INDEX active_status_created_idx ON my_table (created) WHERE status = 'Active';
That results in a significantly smaller index that can be used by any query that includes the WHERE status = 'Active' clause.
See:
https://devcenter.heroku.com/articles/postgresql-indexes and
http://www.postgresql.org/docs/9.2/static/indexes-partial.html
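A quick sketch of when that partial index does and doesn't apply (table and column names taken from the example above; the date literal is arbitrary):

-- Can use active_status_created_idx:
SELECT * FROM my_table
WHERE status = 'Active' AND created > DATE '2013-01-02';

-- Cannot use it (different status value), since the index only
-- contains rows WHERE status = 'Active':
SELECT * FROM my_table
WHERE status = 'Inactive' AND created > DATE '2013-01-02';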
Your assumptions are all correct as far as I can tell. You should pick your index according to the types of queries you're going to run most.
If you're doing a lot of where status = ? order by created limit 10 or order by status, created limit 10, then an index on (status, created) is usually in order.
If you're doing a lot of where created = ? order by status limit 10 or order by created, status limit 10, then you'll typically want an index on (created, status) instead.
Note that Postgres allows explicit sorting for indexes too, e.g. (created, status desc). The docs provide a lengthy discussion on why this is sometimes desirable. (I can't recall where exactly, but I'm sure you've found it already considering how you phrased your question.)
Also note the limit in each case. Whether the index gets used for the ordering clause depends on the number of rows you're retrieving. Fetch enough rows and Postgres may prefer to ignore your carefully created index altogether and do a top-N sort on rows retrieved through other means instead.
Lastly, note that Postgres is quite good, especially in recent versions, at combining multiple independent single-column indexes. In fact, there is a discussion in the manual's chapters on indexes that covers precisely this point.
If you have an index on (created) and another on (status), it will know to do a bitmap index scan for queries such as where status = ? and created = ? when both conditions are selective enough. Along the same lines, it will know to simply use the index on (created) for queries such as where status = ? order by created limit 10, filtering out rows where the status doesn't have the right value.
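A small sketch of that behavior, reusing my_table from above (the actual plans depend on statistics and row counts):

CREATE INDEX my_table_created_idx ON my_table (created);
CREATE INDEX my_table_status_idx  ON my_table (status);

-- May combine both indexes via a bitmap index scan (BitmapAnd):
EXPLAIN SELECT * FROM my_table
WHERE status = 'Active' AND created = DATE '2013-01-02';

-- May walk my_table_created_idx in order and just filter on status:
EXPLAIN SELECT * FROM my_table
WHERE status = 'Active' ORDER BY created LIMIT 10;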
I have a product search engine using ColdFusion 8 and MySQL 5.0.88.
The product search has two display modes: Multiple View and Single View.
Multiple View displays basic record info; Single View requires additional data to be pulled from the database.
Right now, when a user does a search, I'm querying the database for
(a) the total record count and
(b) the records FROM to TO for the current page.
The user always goes to Single View from his current result set, so my idea was to store the current result set for each user and not have to query the database again for (waste a) the overall number of records and (waste b) the single record I already queried before, and then fetch only the detail information I still need for the Single View.
However, I'm getting nowhere with this.
I cannot cache the current result-set query, because it's unique to each user (session).
The queries run inside a CFINVOKE'd method inside a CFC that I'm calling through AJAX, so after the query runs, the CFC and CFINVOKE method are discarded, and I can't use query-of-queries or variables.cfc_storage.
So my idea was to store the current result set in the Session scope, to be updated with every new search the user runs (either pagination or a completely new search). The maximum number of results stored would be the number of results displayed per page.
I can store the query allright, using:
<cfset Session.resultset = query_name>
This stores the whole query with results, like so:
query
CACHED: false
EXECUTIONTIME: 2031
SQL: SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON
...
LEFT JOIN f ON
...
WHERE a.aktiv = "ja"
AND
... 20 conditions ...
SQLPARAMETERS: [array]
1) ... 20+ parameters
RESULTSET:
[Record # 1]
a: true
style: 402
price: 2.3
currency: CHF
...
[Record # 2]
a: true
style: 402abc
...
This would be overwritten every time the user runs a new search. However, if the user wants to see the details of one of these items, I don't need to run the two queries again (total number of records & fetching one record) if I can access the record I need from my temporary storage. This way I would save two database trips worth 2031 ms of execution time each, just to get data I already pulled before.
The tradeoff would be every user having a result set of up to 48 records (the max number of items per page) in the Session scope.
My questions:
1. Is this feasible, or should I re-query the database?
2. If I have a structure/array/object like the above, how do I pick the record I need out of it by style number, i.e. how do I access the result set? I can't just loop over the stored query (I've been trying that for a while now...).
Thanks for any help!
KISS rule. Just re-query the database unless you find that performance is really an issue. With the correct indexes, it should scale pretty well. When it does become an issue, you can simply add query caching there.
QoQ (query of queries) would introduce overhead (on the CF side: memory and computation) and might return stale data (where the query in the session is older than what's in the DB). I only use QoQ when the same query is reused within the same view, not across a whole session's time span.
Feasible? Yes. Depending on how many users there are and how much data this stores in memory, it's probably much better than going to the DB again.
It seems like the best way to get the single record you want is a query of queries. In CF you can create another query that uses an existing query as its data source. It would look like this:
<cfquery name="subQuery" dbtype="query">
SELECT *
FROM Session.resultset
WHERE style = <cfqueryparam value="#SelectedStyleVariable#" cfsqltype="cf_sql_varchar">
</cfquery>
Note that if you are using CFBuilder, it will probably scream "Error" at you for not having a datasource. This is a bug in CFBuilder; you are not required to have a datasource when your dbtype is "query".
Depending on how many records there are, what I would do is store the detail data in the application scope as a structure where the ID is the key. Something like:
APPLICATION.products[product_id].product_name
                                .product_price
                                .product_attribute
Then you would really only need to query for the ID of the item on demand (see the sketch after this list).
And to improve the "on demand" query, you have at least two "in code" options:
1. A query of queries, where you query the entire collection of items once and then query from that for the data you need.
2. Verity or Solr, to index everything so that you only have to hit the database when refreshing your search collection. That would be tons faster than doing all the joins for every single query.
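A minimal sketch of that on-demand detail query, reusing the table names from the query dump above; the join key and the style literal are assumptions, since the original join conditions were elided:

SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON p.art_id = a.id   -- join key assumed; elided in the original
WHERE a.aktiv = 'ja'
  AND a.style = '402abc';        -- one indexed lookup instead of re-running the full search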
I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table, since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, and Weekly, and would use Teradata's identity column to generate the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application, since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with so many foreign key fields and am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
Edit Part 2: Added a new tag to get more exposure to this question.
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to add an additional row to the lookup table, like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then in the main table, instead of leaving those fields null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
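A minimal sketch of the lookup table, the extra row, and the reporting view (all object names and the main table's column names are illustrative assumptions):

CREATE TABLE primitive_lookup (
    id     INTEGER GENERATED BY DEFAULT AS IDENTITY,
    ptype  VARCHAR(20) NOT NULL,
    pvalue VARCHAR(20) NOT NULL
) UNIQUE PRIMARY INDEX (id);

-- Row 0 lets the main table avoid NULL foreign keys entirely:
INSERT INTO primitive_lookup (id, ptype, pvalue) VALUES (0, 'Unknown', 'Unknown');
INSERT INTO primitive_lookup (ptype, pvalue) VALUES ('Interval', 'Daily');
INSERT INTO primitive_lookup (ptype, pvalue) VALUES ('TimeFrame', '24x7');

-- Reporting view: inner joins are safe because every foreign key
-- resolves, including the 0 / Unknown case:
REPLACE VIEW v_main_report AS
SELECT m.id,
       i.pvalue AS interval_value,
       t.pvalue AS time_frame_value
FROM main_table m
JOIN primitive_lookup i ON i.id = m.interval_id
JOIN primitive_lookup t ON t.id = m.time_frame_id;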