I have a kind "Customers" in GCP Datastore and an existing query that returns the list of customers sorted by creation date descending.
Query #1
q := datastore.NewQuery(CUSTOMERS).
    Filter("User =", userKey).
    Filter("Country =", country).
    Filter("CreatedOn >", Zero).
    Order("-CreatedOn")
The composite index used is
- kind: Customer
  properties:
  - name: User
  - name: Country
  - name: CreatedOn
    direction: desc
Query #2
Now I want to write a util to check whether a customer exists or not, something like below:
q := datastore.NewQuery(CUSTOMERS).
    Filter("User =", userKey).
    Filter("Country =", country).
    Filter("CreatedOn >", Zero).
    KeysOnly()
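For reference, a minimal sketch of how I would run this existence check, assuming the cloud.google.com/go/datastore client (CUSTOMERS, userKey, country and the Zero timestamp come from the code above):

import (
    "context"
    "time"

    "cloud.google.com/go/datastore"
)

// customerExists reports whether at least one matching customer entity exists.
func customerExists(ctx context.Context, client *datastore.Client, userKey *datastore.Key, country string, zero time.Time) (bool, error) {
    q := datastore.NewQuery(CUSTOMERS).
        Filter("User =", userKey).
        Filter("Country =", country).
        Filter("CreatedOn >", zero).
        KeysOnly().
        Limit(1)

    // For a keys-only query, GetAll ignores the destination and returns only the keys.
    keys, err := client.GetAll(ctx, q, nil)
    if err != nil {
        return false, err
    }
    return len(keys) > 0, nil
}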
I tested locally and found that the existing index for Query #1 also serves Query #2.
Is it worth adding a new index without a sort direction just for Query #2? I am reluctant because the existing index for Query #1 is already taking 25 GB.
What performance impact would it have?
Related
The purpose is to build an Audit Log service to store change history for an application. The following information will be stored:
- srcId
- srcTable
- changedAt
- changes
- user
For example, a username change for id = 10 in the profile table would be saved like:
srcId = 10,
srcTable = 'profile',
changedAt = timestamp,
changes: { name: { old: old-value, new: new-value } },
user: the-user-id-who-performed-the-change-action
I am trying to build the system on AWS Lambda and DynamoDB.
Initially I thought of using srcId as the sort key and srcTable as the partition key, but there can be multiple entries for the same srcId/srcTable pair, each with a different changedAt. Any suggestion on how the index or primary key should be set up for better performance?
P.S. expected queries
most of the query will be to obtain change list for srcId/srcTable pair [ ~90% ]
finding change history for srcTable [ 5% ~ 7% ]
finding changes made by user [ 3% ~ 5% ]
(the percentage values are rough estimates of the expected behaviour)
It would be best for your table PK to have high cardinality. Based on your examples, you could achieve that by creating a computed attribute from the srcTable:srcId combination (I prefer using : instead of #) and then using your timestamp as the sort key.
This covers the primary query pattern, which is to find the change list for a given srcTable and srcId. Using the timestamp as the sort key also allows you to query the table by date range, which in my experience is a common use case for audit logs.
Table (Get change history for <srcTable>:<srcId>):
PK - srcTable:srcId
SK - changedAt
GSI1 (Get change history for <srcTable>):
PK - srcTable
SK - changedAt
GSI2 (Get changes performed by <user>):
PK - user
SK - changedAt
Your items may look similar to this:
PK (srcTable:srcId)   SK (<changedAt>)           srcTable   srcId   userId   changes
profile:10            2021-11-25T06:02:27.163Z   profile    10      1        {}
user:1                2021-11-25T06:03:09.811Z   user       1       1        {}
profile:12            2021-11-25T06:04:17.178Z   profile    12      4        {}
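If it helps, here is a rough sketch of the primary access pattern (change history for one srcTable:srcId key within a date range) using the aws-sdk-go-v2 DynamoDB client; the table name AuditLog and the attribute names PK/SK follow the layout above and are assumptions:

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    client := dynamodb.NewFromConfig(cfg)

    // Change history for profile:10 within November 2021.
    out, err := client.Query(context.TODO(), &dynamodb.QueryInput{
        TableName:              aws.String("AuditLog"),
        KeyConditionExpression: aws.String("PK = :pk AND SK BETWEEN :from AND :to"),
        ExpressionAttributeValues: map[string]types.AttributeValue{
            ":pk":   &types.AttributeValueMemberS{Value: "profile:10"},
            ":from": &types.AttributeValueMemberS{Value: "2021-11-01T00:00:00.000Z"},
            ":to":   &types.AttributeValueMemberS{Value: "2021-11-30T23:59:59.999Z"},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("found %d changes", len(out.Items))
}

The GSI1 and GSI2 queries look the same, with IndexName set and the srcTable or user value as the partition key.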
You might have looked at this already but for the sake of future readers, I would highly suggest watching or following Rick Houlihan's videos on Advanced DynamoDB modeling. Another great resource is Alex DeBrie's DynamoDB Book.
Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination, so I need the ORDER BY. How can I improve the performance of this query?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If Tenant is the only label that can live in an apartment, you could possibly save a Filter step by dropping the label from your query, e.g. MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to check the profile of your query and see whether it uses an index-backed ORDER BY.
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
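If you want to inspect the plan from code rather than in the browser, here is a rough sketch assuming the neo4j-go-driver v5 API (connection details are placeholders); a Sort operator in the output means the ORDER BY was not index-backed:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
    ctx := context.Background()

    driver, err := neo4j.NewDriverWithContext("neo4j://localhost:7687",
        neo4j.BasicAuth("neo4j", "password", ""))
    if err != nil {
        log.Fatal(err)
    }
    defer driver.Close(ctx)

    session := driver.NewSession(ctx, neo4j.SessionConfig{})
    defer session.Close(ctx)

    // PROFILE executes the query and attaches the profiled plan to the summary.
    result, err := session.Run(ctx, `
        PROFILE
        MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
        RETURN t ORDER BY t.id LIMIT 10`, nil)
    if err != nil {
        log.Fatal(err)
    }
    summary, err := result.Consume(ctx)
    if err != nil {
        log.Fatal(err)
    }

    // Walk down the left-most branch of the plan and print each operator.
    plan := summary.Profile()
    for plan != nil {
        fmt.Printf("%s (db hits: %d)\n", plan.Operator(), plan.DbHits())
        children := plan.Children()
        if len(children) == 0 {
            break
        }
        plan = children[0]
    }
}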
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relation from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant, with no major gains. I tested with 150,000 tenants in City1; perhaps the gains are more visible as you approach millions, but I am not sure.
I have one index with two index patterns created via aliases.
Example:
Index Name: my_index
Fields: sender_name, receiver_name, item_name
Alias: my_index_alias_1, my_index_alias_2
Index Patterns: my_index_alias_1, my_index_alias_2
I have a dashboard with two data tables using my_index_alias_1 and my_index_alias_2.
The same person can be both a sender and a receiver, but there should be only one filter to select the user.
Example:
If a user named Bob is selected in the filter:
my_index_alias_1 Data Table should filter by receiver_name
my_index_alias_2 Data Table should filter by sender_name
I don't want to have a duplicate index, so I think a scripted field is the better option.
But a scripted field can only solve this if I can access the alias name as a doc value, so that I can write a condition like the pseudocode below:
if doc['_alias'].value == 'my_index_alias_1' then doc['receiver_name'].value
if doc['_alias'].value == 'my_index_alias_2' then doc['sender_name'].value
Is there any way to categorise an aggregated record in Amazon QuickSight based on the records being aggregated?
I'm trying to build a pivot table in Amazon QuickSight to pull out counts of various records, split by what state each record is in and limited to a date range in which the record first had activity.
This is all being done in SPICE since the raw data is CSV files in S3. The two CSV files are:
Student
Id: String
Current State: One of pass / fail
Department: String
Test
Id: String
Student Id: String
DateOfTest: DateTime
State: One of pass / fail
The pivot table I'm trying to build would have a row for each department and 3 columns:
fail - Count of students in the department where the current state is fail
pass-never-failed - Count of students in the department where the current state is pass and there are no failed tests for the Student
pass-failed-some - Count of students in the department where the current state is pass and there is at least one failed test
To do this I've created a dataset that contains Student joined to Test. I can then create a Pivot table with:
Row: Department
Column: Current State
Value: Count Distinct(Id[Student])
I thought I'd be able to create a calculated field to give me the three categories and use that in place of Current State using the following calculation:
ifelse(
{Current State} = 'pass',
ifelse(
countIf({Id[Test]}, {State} = 'fail') = 0,
'pass-never-failed',
'pass-failed-some'
),
{Current State}
)
But that shows invalid with the following error:
Mismatched aggregation. Custom aggregations can’t contain both aggregate "`COUNT`" and non-aggregated fields “COUNT(CASE WHEN "State" = _UTF16'fail' THEN "Id[Test]" ELSE NULL END)”, in any combination.
Is there any way to categorise the Students based on an aggregation of the Tests in QuickSight, or do I have to pre-calculate this information in the source data?
I've been able to work around this for now by defining three separate calculations for the three columns and adding these as values in the QuickSight pivot rather than setting a column dimension at all.
Count Passed Never Failed
distinct_countIf({Id[Student]}, maxOver(
ifelse({State[Test]} = 'fail' AND {Current State[Student]} = 'pass',
1, 0)
, [{Id[Student]}], PRE_AGG) = 0)
Count Passed Failed Some
distinct_countIf({Id[Student]}, maxOver(
ifelse({State[Test]} = 'fail' AND {Current State[Student]} = 'pass',
1, 0)
, [{Id[Student]}], PRE_AGG) > 0)
Count Failed
distinct_countIf({Id[Student]}, {Current State[Student]} = 'fail')
This works, but I'd still like to know whether it's possible to build a dimension I could use for this, as it would be more flexible if new states are added and would avoid the special handling of pass.
We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns such as 'host', 'errorlevel', 'message', etc., and a special indexed column 'indexTimestamp'.
This column contains the time rounded to the hour.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions on host, errorlevel, etc.
When getting records just by indexTimestamp everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra runs for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? For a given indexTimestamp there are no more than 250,000 records. Shouldn't it be possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Queries on indexed columns are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query arbitrarily. It is a key-value storage system. To understand that, have a quick look here: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows combined with ranged slice queries.
Let's say you have the object
user : {
    name : "XXXXX",
    country : "UK",
    city : "London",
    postal_code : "N1 2AC",
    age : "24"
}
and of course you want to query by city OR by age (AND combined with OR is yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
The range query for age EQ 24 would be:
get_range_slice(row = "bucket_20_to_25", from = "24^", to = "24`")
As a note, "^" is "_" - 1 and "`" is "_" + 1 in byte order, giving you effectively all the columns whose names start with "24_".
This also allows you to query for ages between 21 and 24, for example.
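For what it's worth, in CQL (a different API from the Thrift-style get_range_slice above) the same bucketing idea maps to a partition key plus clustering columns. A rough gocql sketch with hypothetical keyspace, table and column names:

package main

import (
    "fmt"
    "log"

    "github.com/gocql/gocql"
)

func main() {
    // Hypothetical schema mirroring the bucket pattern:
    //   CREATE TABLE users_by_age_bucket (
    //       bucket text,   -- e.g. 'bucket_20_to_25'
    //       age    int,
    //       name   text,
    //       city   text,
    //       PRIMARY KEY (bucket, age, name)
    //   );
    cluster := gocql.NewCluster("127.0.0.1") // placeholder address
    cluster.Keyspace = "logs"                // placeholder keyspace
    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // All users aged 21 to 24 in the 20-25 bucket, ordered by the clustering key.
    iter := session.Query(
        `SELECT age, name, city FROM users_by_age_bucket
         WHERE bucket = ? AND age >= ? AND age <= ?`,
        "bucket_20_to_25", 21, 24).Iter()

    var age int
    var name, city string
    for iter.Scan(&age, &name, &city) {
        fmt.Println(age, name, city)
    }
    if err := iter.Close(); err != nil {
        log.Fatal(err)
    }
}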
Hope it was useful.