According to the application logic I have implemented own Cypher query builder which creates a complex queries based on the user input from the UI.
This is the example of such queries:
MATCH (dg:DecisionGroup)-[:CONTAINS]->(childD:Decision)
WHERE dg.id = {decisionGroupId}
MATCH (childD)-[relationshipValueRel2:HAS_VALUE_ON]-(filterCharacteristic2:Characteristic)
WHERE filterCharacteristic2.id = 2
WITH relationshipValueRel2, childD, dg
WHERE (ANY (id IN {relationshipValueRel21} WHERE id IN relationshipValueRel2.value )) AND ( relationshipValueRel2.`property.1.3` = {property2} ) OR ( relationshipValueRel2.`property.1.3` = {property3} )
WITH childD, dg
MATCH (childD)-[relationshipValueRel3:HAS_VALUE_ON]-(filterCharacteristic3:Characteristic)
WHERE filterCharacteristic3.id = 3
WITH relationshipValueRel3, childD, dg
WHERE (relationshipValueRel3.`value` = 'London')
WITH childD , dg
SKIP 0 LIMIT 100
WITH *
MATCH (childD)-[ru:CREATED_BY]->(u:User)
OPTIONAL MATCH (childD)-[rup:UPDATED_BY]->(up:User)
RETURN ru, u, rup, up, childD AS decision,
[ (dg)<-[:DEFINED_BY]-(entityGroup)-[:CONTAINS]->(entity)<-[:COMMENTED_ON]-(comg:CommentGroup)-[:COMMENTED_FOR]->(childD)
| {entityId: toInt(entity.id), types: labels(entity), totalComments: toInt(comg.totalComments)} ] AS commentGroups,
[ (c1)<-[vg1:HAS_VOTE_ON]-(childD)
| {criterionId: toInt(c1.id), weight: vg1.avgVotesWeight, totalVotes: toInt(vg1.totalVotes)} ] AS weightedCriteria,
[ (dg)<-[:DEFINED_BY]-(chg:CharacteristicGroup)-[rchgch1:CONTAINS]->(ch1:Characteristic)<-[v1:HAS_VALUE_ON]-(childD)
| {characteristicId: toInt(ch1.id), v1:v1, optionIds: v1.optionIds, valueIds: v1.valueIds, value: v1.value, available: v1.available, totalHistoryValues: v1.totalHistoryValues, totalFlags: v1.totalFlags, description: v1.description, valueType: ch1.valueType, visualMode: ch1.visualMode} ] AS valuedCharacteristics
With the small number of the child:Decision(for example < 1000) it works pretty fast but the performance is noticeably degrades after the data growth.
The primary reason of this are the filtering operation based on the relationship Properties like:
WHERE (ANY (id IN {relationshipValueRel21}
WHERE id IN relationshipValueRel2.value ))
AND ( relationshipValueRel2.`property.1.3` = {property2} )
OR ( relationshipValueRel2.`property.1.3` = {property3} )
Please note that this is just a particular case and the query based on the user input can contains tens such condition with (<, <=, =, >, >=, IN, ALL IN, ANY IN) operations where property value can be a single value or arrays.
Right now I'm thinking how to improve the performance of this query because I expect tens or hundreds of thousands child:Decision there. In order to do this I'm looking into APOC Manual Index on Relationship Properties
Taking this into account - does it make sense to reimplement my query builder in order to use APOC Manual Index on Relationship Properties for the mentioned filtering operation or it will not help there?
Related
I am currently working on a social media site which exactly the same in terms of users' timeline, like user can follow, create, share the posts, block, unblock, etc. So for that, we have created 2 types of labels "User" and "Post" and have several relations like follow, block, private, etc.
currently, we have approximately 41000 nodes and 650000 relationships.
Hardware conf:
8 gb ram
2 core
50 GB HDD
1 Master and 2 Slave
and using the following query to get the users' timeline
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'}),(p:Post{is_deleted:'0'}),(po:User{user_id:p.owner_id})
WHERE (p.post_type = '1' OR p.post_type = '4') WITH n,p,po
WHERE po.is_active='1' AND (n)-[:CREATED{own_status:'1'}]->(p) OR
(n)-[:FOLLOWS{follow_status:'1'}]->(:User{is_active:'1'})-[:CREATED{own_status:'1'}]->(p)
OR (n)-[:FOLLOWS{follow_status:'1'}]->(:Keyword{is_deleted:'0'})-[:KEYWORD]->(p)
WITH n,p,po
OPTIONAL MATCH (n)-[fr:FOLLOWS]->(po)
WHERE fr.follow_status='1' WITH p,n,po,fr
WHERE NOT ((n)-[:FOLLOWS{is_blocked:true}]->(po) OR (n)-[:FOLLOWS{is_mute:true}]->(po)) WITH p,n,po,fr
WHERE NOT (n)<-[:FOLLOWS{is_blocked:true}]-(po) WITH p,n,po,fr
WHERE (fr is not null and toInteger(po.is_private) <= 1 AND po.user_id <> n.user_id)
OR (toInteger(po.is_private) <= 1 AND po.user_id = n.user_id)
OR (toInteger(po.is_private) = 0 AND po.user_id <> n.user_id) WITH p,n,po
RETURN p,po,SIZE(()-[:LIKED]->(p)) as likecount,
SIZE((n)-[:LIKED]->(p)) as likestatus,count(*) as postcount
ORDER BY p.created_at DESC
SKIP 0 LIMIT 10
This query takes more than 10 sec. which is too high
Here is Profile of the above query
Here is the index list
Can anybody suggest where am I doing wrong?
If you're trying to get a user's timeline, I would think you'd start with the specific user, then connect to other nodes via the relationships you're interested in. The current query isn't taking advantage of pattern matching or the connected nature of a graph database.
The first match statement of the query as it's currently written finds a specific user, then all Post nodes that have the property is_deleted:'0' and then all User nodes that are connected to any of the Post nodes. Searching this way is giving you more database hits (54,984) in the first middle Expand(All) than there are nodes in the database (41,000).
Where you should get the most lift in optimizing this query is to focus your search on the single user then expand out from there using the relationships:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
This will match the user and all qualifying posts connected to the user via a relationship. Note, if a user isn't connected to any qualifying posts, there won't be any matches even if that user does exist in the database.
If you only want to include certain relationship types, you can specify that in this first MATCH statement like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post{is_deleted:'0'})
Or you can put it in the WHERE clause like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
WHERE type(r) in ['CREATED', 'FOLLOWS' , 'KEYWORD']
I didn't follow all your conditional statements (and I think you might be able to remove some of them once you convert it to pattern matching), but once you have your initial pattern you can add in whatever conditional statements you need. Example:
WHERE (p.post_type = '1' OR p.post_type = '4')
AND (r.own_status = '1' OR r.follow_status = '1')
AND NOT r.is_blocked = true
For more on pattern matching, check out section 2.9 of the Neo4j Cypher Manual.
I have two facts tables, First and Second, and two dimension tables, dimTime and dimColour.
Fact table First looks like this:
and facet table Second looks like this:
Both dim-tables have 1:* relationships to both fact tables and the filtering is one-directional (from dim to fact), like this:
dimColour[Color] 1 -> * First[Colour]
dimColour[Color] 1 -> * Second[Colour]
dimTime[Time] 1 -> * First[Time]
dimTime[Time] 1 -> * Second[Time_]
Adding the following measure, I would expect the FILTER-functuion not to have any affect on the calculation, since Second does not filter First, right?
Test_Alone =
CALCULATE (
SUM ( First[Amount] );
First[Alone] = "Y";
FILTER(
'Second';
'Second'[Colour]="Red"
)
)
So this should evaluate to 7, since only two rows in First have [Alone] = "Y" with values 1 and 6 and that there is no direct relationship between First and Second. However, this evaluates to 6. If I remove the FILTER-function argument in the calculate, it evaluates to 7.
There are thre additional measures in the pbix-file attached which show the same type of behaviour.
How is filtering one fact table which has no direct relationship to a second fact table affecting the calculation done on the second table?
Ziped Power BI-file: PowerBIFileDownload
Evaluating the table reference 'Second' produces a table that includes the columns in both the Second table, as well as those in all the (transitive) parents of the Second table.
In this case, this is a table with all of the columns in dimColour, dimTime, Second.
You can't see this if you just run:
evaluate 'Second'
as when 'evaluate' returns the results to the user, these "Parent Table" (or "Related") columns are not included.
Even so, these columns are certainly present.
When a table is converted to a row context, these related columns become available via RELATED.
See the following queries:
evaluate FILTER('Second', ISBLANK(RELATED(dimColour[Color])))
evaluate 'Second' order by RELATED(dimTime[Hour])
Similarly, when arguments to CALCULATE are used to update the filter context, these hidden "Related" columns are not ignored; hence, they can end up filtering First, in your example. You can see this, by using a function that strips the related columns, such as INTERSECT:
Test_ActuallyAlone = CALCULATE (
SUM ( First[Amount] ),
First[Alone] = "Y",
//This filter now does nothing, as none of the columns in Second
//have an impact on 'SUM ( First[Amount] )'; and the related columns
//are removed by the INTERSECT.
FILTER(
INTERSECT('Second', 'Second')
'Second'[Colour]="Red"
)
)
(See these resources that describe the "Expanded Table"
(this is an alternative but equivalent explanation of this behaviour)
https://www.sqlbi.com/articles/expanded-tables-in-dax/
https://www.sqlbi.com/articles/context-transition-and-expanded-tables/
)
I would like to store user purchase custom tags on each transaction, example if user bought shoes then tags are "SPORTS", "NIKE", SHOES, COLOUR_BLACK, SIZE_12,..
These tags are that seller interested in querying back to understand the sales.
My idea is when ever new tag comes in create new code(something like hashcode but sequential) for that tag, and code starts from "a-z" 26 letters then "aa, ab, ac...zz" goes on. Now keep all the tags given for in one transaction in the one column called tag (varchar) by separating with "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So storing the above purchase transaction, tag will be like tag="|a|z|ay|bc|cq|" And now allowing seller to search number of SHOES sold by adding WHERE condition tag LIKE %|ay|%. Now the problem is i cannot use index (sort key in redshift db) for "LIKE starts with %". So how to solve this issue, since i might have 100 millions of records? dont want full table scan..
any solution to fix this?
Update_1:
I have not followed bridge table concept (cross-reference table) since I want to perform group by on the results after searching the specified tags. My solution will give only one row when two tags matched in a single transaction, but bridge table will give me two rows? then my sum() will be doubled.
I got suggestion like below
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' and trans_id
= tr.trans_id) in the WHERE clause once for each tag (note: assumes tr is an alias to the transaction table in the surrounding query)
I have not followed this; since i have to perform AND and OR condition on the tags, example ("SPORTS" AND "ADIDAS") ---- "SHOE" AND ("NIKE" OR "ADIDAS")
Update_2:
I have not followed bitfield, since dont know redshift has this support also I assuming if my system will be going to have minimum of 3500 tags, and allocating one bit for each; which results in 437 bytes for each transaction, though there will be only max of 5 tags can be given for a transaction. Any optimisation here?
Solution_1:
I have thought of adding min (SMALL_INT) and max value (SMALL_INT) along with tags column, and apply index on that.
so something like this
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And query for searching shoe(ay=51) is
maxTag <= 51 AND tag LIKE %|ay|%
And query for searching shoe(ay=51) AND SIZE_12 (cq=95) is
minTag >= 51 AND maxTag <= 95 AND tag LIKE %|ay|%|cq|%
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too-late in the process. Tedious and involves lot of hard-coding
While loading to S3 tag it using the AWS s3API
example below
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
capture tags dynamically by sending and as a parameter
2.load the tags to dynamodb as a metadata store
3.load data to Redshift using S3 COPY command
You can store tags column as varchar bit mask, i.e. a strictly defined bit sequence of 1s or 0s, so that if a purchase is marked by a tag there will be 1 and if not there will be 0, etc. For every row, you will have a sequence of 0s and 1s that has the same length as the number of tags you have. This sequence is sortable, however you would still need lookup into the middle but you will know at which specific position to look so you don't need like, just substring. For further optimization, you can convert this bit mask to integer values (it will be unique for each sequence) and make matching based on that but AFAIK Redshift doesn't support that yet out of box, you will have to define the rules yourself.
UPD: Looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps tags into tabular structure of order_id, tag_id, distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins the this one with the order table. Then lookups for orders with a particular tag and further aggregations of orders should be efficient. There is no silver bullet for optimizing this in a flat table, at least I don't know of such that would not bring a lot of unnecessary complexity versus "relational" solution.
We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc and special indexed column 'indexTimestamp'.
This column contains time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with first IndexExpression by indexTimestamp ( with EQ operator ) and then some other IndexExpressions - by host, errorlevel, etc.
When getting records just by indexTimestamp everything works fine.
But, when getting records by indexTimestamp and, for example, host - cassandra works for long ( more than 15-20 seconds ) and throws timeout exception.
As I understand, when getting records by indexed column and non-indexed column, Cassandra firstly gets all records by indexed column and than filters them by non-indexed columns.
So, why Cassandra does it so slow? By indexTimestamp there are no more than 250,000 records. Isn't it possible to filter them at 10 seconds?
Our Cassandra cluster is running on one machine ( Windows 7 ) with 4 CPUs and 4 GBs memory.
You have to bear in mind that Cassandra is very bad with this kind of queries. Indexed columns queries are not meant for big tables. If you want to search for your data around this type of queries you have to tailor your data model around it.
In fact Cassandra is not a DB you can query. It is a key-value storage system. To understand that please go there and have a quick look: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucket-rows and ranged range-slice-queries.
Let's say you have the object
user : {
name : "XXXXX"
country : "UK"
city : "London"
postal_code :"N1 2AC"
age : "24"
}
and of course you want to query by city OR by age (and & or is another data model yet).
Then you would have to save your data like this, assuming the name is a unique id :
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
the range query for age EQ 24 would be
get_range_slice(row= "bucket_20_to_25", from = "24-", to = "24=")
as a note "minus" == "under_score" - 1 and "equals" == "under_score" + 1, giving you effectively all the columns that start with "24_"
This also allow you to query for age between 21 and 24 for example.
hope it was useful
I am a total newbie with Azure! The purpose is to return the rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimize the number of transactions/queries whilst maintain performance
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - is an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like to get back from a single query the rows and any childrows that is > or < DateOfMessage_MessageId
Can this be done via my proposed PartitionKeys and RowKeys?
ie.. (in psuedo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if there I have a number of accounts, and only want to return back the first 10, could it be done via a single query
ie.. (in psuedo code)
var results = (
(
ctx.PartitionKey.StartsWith(TextCache_(AccountId1)) &&
&& ctx.RowKey > (TimeStamp1)_MessageId1 )
)
||
(
ctx.PartitionKey.StartsWith(TextCache_(AccountId2)) &&
&& ctx.RowKey > (TimeStamp2)_MessageId2 )
) ...
)
.Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure table storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ you can use .CompareTo() (> and < don't translate properly) which will mean that if you run a search for account 1 and you ask the query to return 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s you could do something like this (pseudo code here as well)
ctx.PartionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId I'd move this to a property.
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While you're second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, then it won't work (see my point above about sort orders). If you ran this query as it is and account 1 had 11 messages it will return only the first 10 messages related to account 1 regardless if whether account 2 had an earlier message.
While trying to minimise the number of transactions you use is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs. 1,000,000 transactions will cost you $1 which is less than the cost of running one small instance for 9 hours.