I am a total newbie with Azure! The purpose is to return rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimize the number of transactions/queries whilst maintaining performance.
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like to get back, from a single query, the rows and any child rows that are > or < DateOfMessage_MessageId.
Can this be done via my proposed PartitionKeys and RowKeys?
i.e. (in pseudo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if I have a number of accounts and only want to return the first 10 rows, could it be done via a single query?
i.e. (in pseudo code)
var results = (
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId1))
        && ctx.RowKey > (TimeStamp1)_MessageId1
    )
    ||
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId2))
        && ctx.RowKey > (TimeStamp2)_MessageId2
    ) ...
)
.Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure Table Storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ you can use .CompareTo() (> and < don't translate properly). This means that if you run a search for account 1 and ask the query for 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s you could do something like this (pseudo code here as well):
ctx.PartitionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
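If you're on the classic Microsoft.WindowsAzure.Storage client, the same range filter can also be built explicitly rather than through LINQ. A minimal sketch, using the padded example values above (the use of DynamicTableEntity is just for illustration):

using Microsoft.WindowsAzure.Storage.Table;

// Prefix scan: the PartitionKey range bounds the account,
// the RowKey condition bounds the timestamp.
string filter = TableQuery.CombineFilters(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition(
            "PartitionKey", QueryComparisons.GreaterThan, "TextCache_0000000001"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition(
            "PartitionKey", QueryComparisons.LessThan, "TextCache_0000000002")),
    TableOperators.And,
    TableQuery.GenerateFilterCondition(
        "RowKey", QueryComparisons.GreaterThan, "123465798"));

var query = new TableQuery<DynamicTableEntity>().Where(filter);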
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId, I'd move it to a property.
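For example, a minimal sketch of an entity with ParentMessageId demoted to a plain property (the class and property names are hypothetical, using the classic TableEntity base class):

using Microsoft.WindowsAzure.Storage.Table;

public class MessageEntity : TableEntity
{
    // PartitionKey = "TextCache_" + padded AccountId
    // RowKey = DateOfMessage.Ticks.ToString("d19") + "_" + MessageId
    public string ParentMessageId { get; set; }
    public string Message { get; set; }
}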
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While your second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, it won't work (see my point above about sort order). If you ran this query as is and account 1 had 11 messages, it would return only the first 10 messages related to account 1, regardless of whether account 2 had an earlier message.
While trying to minimise the number of transactions you use is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs: 1,000,000 transactions will cost you $1, which is less than the cost of running one small instance for 9 hours.
I am currently working on a social media site with much the same timeline features: users can follow, create and share posts, block, unblock, etc. For that we have created two node labels, "User" and "Post", and several relationship types like follow, block, private, etc.
Currently we have approximately 41,000 nodes and 650,000 relationships.
Hardware configuration:
8 GB RAM
2 cores
50 GB HDD
1 master and 2 slaves
We are using the following query to get a user's timeline:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'}),(p:Post{is_deleted:'0'}),(po:User{user_id:p.owner_id})
WHERE (p.post_type = '1' OR p.post_type = '4') WITH n,p,po
WHERE po.is_active='1' AND (n)-[:CREATED{own_status:'1'}]->(p) OR
(n)-[:FOLLOWS{follow_status:'1'}]->(:User{is_active:'1'})-[:CREATED{own_status:'1'}]->(p)
OR (n)-[:FOLLOWS{follow_status:'1'}]->(:Keyword{is_deleted:'0'})-[:KEYWORD]->(p)
WITH n,p,po
OPTIONAL MATCH (n)-[fr:FOLLOWS]->(po)
WHERE fr.follow_status='1' WITH p,n,po,fr
WHERE NOT ((n)-[:FOLLOWS{is_blocked:true}]->(po) OR (n)-[:FOLLOWS{is_mute:true}]->(po)) WITH p,n,po,fr
WHERE NOT (n)<-[:FOLLOWS{is_blocked:true}]-(po) WITH p,n,po,fr
WHERE (fr is not null and toInteger(po.is_private) <= 1 AND po.user_id <> n.user_id)
OR (toInteger(po.is_private) <= 1 AND po.user_id = n.user_id)
OR (toInteger(po.is_private) = 0 AND po.user_id <> n.user_id) WITH p,n,po
RETURN p,po,SIZE(()-[:LIKED]->(p)) as likecount,
SIZE((n)-[:LIKED]->(p)) as likestatus,count(*) as postcount
ORDER BY p.created_at DESC
SKIP 0 LIMIT 10
This query takes more than 10 seconds, which is too slow.
Here is the profile of the above query
Here is the index list
Can anybody suggest where I am going wrong?
If you're trying to get a user's timeline, I would think you'd start with the specific user, then connect to other nodes via the relationships you're interested in. The current query isn't taking advantage of pattern matching or the connected nature of a graph database.
The first MATCH statement of the query as it's currently written finds a specific user, then all Post nodes that have the property is_deleted:'0', and then all User nodes that are connected to any of the Post nodes. Searching this way gives you more database hits (54,984) in the first Expand(All) than there are nodes in the database (41,000).
Where you should get the most lift in optimizing this query is to focus your search on the single user then expand out from there using the relationships:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
This will match the user and all qualifying posts connected to the user via a relationship. Note, if a user isn't connected to any qualifying posts, there won't be any matches even if that user does exist in the database.
If you only want to include certain relationship types, you can specify that in this first MATCH statement like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post{is_deleted:'0'})
Or you can put it in the WHERE clause like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
WHERE type(r) in ['CREATED', 'FOLLOWS' , 'KEYWORD']
I didn't follow all your conditional statements (and I think you might be able to remove some of them once you convert it to pattern matching), but once you have your initial pattern you can add in whatever conditional statements you need. Example:
WHERE (p.post_type = '1' OR p.post_type = '4')
AND (r.own_status = '1' OR r.follow_status = '1')
AND NOT r.is_blocked = true
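Putting those pieces together, a rough sketch of a restructured query might look like this (not a drop-in replacement: the property names come from your original query, but the conditional logic still needs to be filled out to match your rules):

MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342', is_active:'1'})
      -[r:CREATED|FOLLOWS|KEYWORD]-(p:Post {is_deleted:'0'})
WHERE (p.post_type = '1' OR p.post_type = '4')
  AND (r.own_status = '1' OR r.follow_status = '1')
  AND NOT r.is_blocked = true
RETURN p, SIZE(()-[:LIKED]->(p)) AS likecount, SIZE((n)-[:LIKED]->(p)) AS likestatus
ORDER BY p.created_at DESC
LIMIT 10

Because the pattern starts from the indexed user_id lookup and only expands that user's own relationships, the number of database hits should be bounded by the user's neighbourhood rather than by every Post in the graph.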
For more on pattern matching, check out section 2.9 of the Neo4j Cypher Manual.
I am having trouble understanding why a for-loop construction does not work. I am not really used to for loops, so I apologize if I am missing something basic. Anyhow, I appreciate any piece of advice you might have.
I am using a party-level dataset from the parlgov project. I am trying to create a variable which captures how many times a party has been in government before the current observation. Time is important: the counter should be zero if a party has not been in government before, even if it entered government multiple times after the observation period. Parties are nested in countries and in cabinet dates.
The code is as follows:
use "http://eborbath.github.io/stackoverflow/loop.dta", clear //to get the data
If this does not work, I have also uploaded the data in CSV format; try:
import delimited "http://eborbath.github.io/stackoverflow/loop.csv", bindquote(strict) encoding(UTF-8) clear
The loop should go through each country-specific cabinet date, identify the previous observation and check if the party has already been in government. This is how far I have got:
gen date2=cab_date
gen gov_counter=0
levelsof country, local(countries) // to get to the unique values in countries
foreach c of local countries{
preserve // I think I need this to "re-map" the unique cabinet dates in each country
keep if country==`c'
levelsof cab_date, local(dates) // to get to the unique cabinet dates in individual countries
restore
foreach i of local dates {
egen min_date=min(date2) // this is to identify the previous cabinet date
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 if date2==min_date & cabinet_party[_n-1]==1 // this should be the counter
bysort country: replace date2=. if date2==min_date // this is to drop the observation which was counted
drop min_date //before I restart the nested loop, so that it again gets to the minimum value in `dates'
}
}
The code runs without error, but it does not do the job. Evidently there's a mistake somewhere; I am just not sure where.
BTW, this is a specific application of a problem I encounter very often: how do you count frequencies of distinct values in a multilevel data structure? This version is slightly more specific in that "time matters", so it should not just sum all encounters. Let me know if you have an easier solution for this.
Thanks!
The problem with your loop is that it does not keep the replaced gov_counter after the loop. However, there is a much easier solution I'd recommend:
sort country party_id cab_date
by country party_id: gen gov_counter=sum(cabinet_party[_n-1])
This sorts the data into groups and then creates a sum by group, always up to (but not including) the current observation.
I would start here. I have stripped the comments so that we can look at the code. I have made some tiny cosmetic alterations.
foreach i of local dates {
egen min_date = min(date2)
sort country party_id date2
bysort country party_id: replace gov_counter=gov_counter+1 ///
if date2 == min_date & cabinet_party[_n-1] == 1
bysort country: replace date2 = . if date2 == min_date
drop min_date
}
This loop includes no reference to the loop index i defined in the foreach statement. So, the code is the same and completely unaffected by the loop index. The variable min_date is just a constant for the dataset and the same each time around the loop. What does depend on how many times the loop is executed is how many times the counter is incremented.
The fallacy here appears to be a false analogy with constructs in other software, in which a loop automatically spawns separate calculations for different values of a loop index.
It's not illegal for loop contents never to refer to the loop index, as is easy to see:
forval j = 1/3 {
di "Hurray"
}
produces
Hurray
Hurray
Hurray
But if you want different calculations for different values of the loop index, that has to be explicit.
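For instance, a minimal sketch where the body does use the index (Stata expands the local macro `j' each time around the loop):

forval j = 1/3 {
    di "this is iteration `j'"
}

prints a different message on each iteration.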
I would like to store custom user-purchase tags on each transaction; for example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying back to understand the sales.
My idea is: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). Codes start with the 26 letters "a" to "z", then "aa", "ab", "ac", ... "zz", and so on. Then keep all the tags given in one transaction in a single varchar column called tag, separated by "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
"SHOES" = ay
...
...
"COLOUR_BLACK" = bc
"COLOUR_RED" = bd
"COLOUR_BLUE" = be
...
"SIZE_12" = cq
...
So, storing the above purchase transaction, the column value will be tag="|a|z|ay|bc|cq|", and the seller can search for the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. Now the problem is that I cannot use an index (sort key, in Redshift) for a LIKE pattern that starts with %. So how do I solve this, given that I might have 100 million records? I don't want a full table scan.
Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (a cross-reference table) since I want to perform GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, so my SUM() would be doubled.
I got a suggestion like the one below: add
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
to the WHERE clause once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query).
I have not followed this either, since I have to combine AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOE" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bit-field approach, since I don't know whether Redshift supports it. Also, my system is going to have a minimum of 3,500 tags, and allocating one bit for each results in about 437 bytes per transaction, even though at most 5 tags can be given for one transaction. Any optimisation here?
Solution_1:
I have thought of adding min (SMALLINT) and max (SMALLINT) value columns alongside the tags column, and applying the index (sort key) on those.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching for shoes (ay = 51) is
minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'
And the query for searching for shoes (ay = 51) AND SIZE_12 (cq = 95) is
minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
While loading to S3, tag the objects using the AWS s3api, for example:
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
1. Capture the tags dynamically by sending the key and value as parameters.
2. Load the tags to DynamoDB as a metadata store.
3. Load the data to Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, with a 1 at a tag's position if the purchase is marked by that tag and a 0 if not. For every row you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; you would still need to look into the middle of it, but you will know exactly which position to check, so you don't need LIKE, just SUBSTRING. For further optimisation, you can convert this bit mask to integer values (it will be unique for each sequence) and do the matching on that, but AFAIK Redshift doesn't support that out of the box yet; you would have to define the rules yourself.
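To illustrate, a minimal sketch of that positional lookup (assuming a hypothetical transactions table with a varchar tag_mask column, and SHOES at position 51):

-- No leading-wildcard LIKE needed: test the one character
-- at the tag's fixed position in the mask.
SELECT COUNT(*)
FROM transactions
WHERE SUBSTRING(tag_mask, 51, 1) = '1';

AND/OR combinations of tags are then just additional SUBSTRING predicates on other positions.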
UPD: It looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of order_id, tag_id, distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table to the order table. Lookups for orders with a particular tag, and further aggregations of those orders, should then be efficient. There is no silver bullet for optimising this in a flat table, at least none I know of that would not bring a lot of unnecessary complexity compared to the "relational" solution.
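A minimal sketch of that layout (all table and column names are hypothetical):

-- One row per (order, tag); distribute by order, sort by tag for fast tag filters.
CREATE TABLE order_tag (
    order_id BIGINT NOT NULL,
    tag_id   INTEGER NOT NULL
)
DISTKEY (order_id)
SORTKEY (tag_id);

-- "SHOES (51) AND SIZE_12 (95)": one row per order that carries both tags.
SELECT order_id
FROM order_tag
WHERE tag_id IN (51, 95)
GROUP BY order_id
HAVING COUNT(DISTINCT tag_id) = 2;

The GROUP BY ... HAVING pattern also answers the doubled-SUM() worry from Update_1: the inner query returns each matching order exactly once, and you can join that back to the order table before aggregating.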
I'm attempting to make a LINQ Where/Contains query quicker.
The data set contains 256,999 clients. The Ids list is just a simple list of GUIDs and might contain only 3 records.
The query below can take up to a minute to return the 3 records. This is because the logic goes through all 256,999 records to see whether any of them are within the list of 3 records.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to turn this around and get the query to check whether the three records are within the pot of 256,999; that way round it should be much quicker.
I don't want to do a loop, as the 3 records could be far more (thousands), and the more loops, the more hits to the DB.
I don't want to grab all the DB records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 from the DB, it takes a second. This is where the Ids come from (a filtered, small and simple list).
Any ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains sub-queries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1] ,[Address2],[Address3],[Town],[County],[Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
Convert(nvarchar(10), Max(Assessments.TestDate),103) as LastVisit,
CASE WHEN Max(Convert(integer,Assessments.Submitted)) = 1 Then 'true' ELSE 'false' END AS Submitted,
CASE WHEN Max(Convert(integer,Assessments.GPSubmit)) = 1 Then 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN Max(Convert(integer,Assessments.QualForPay)) = 1 Then 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
Left JOIN Assessments ON Clients.Id = Assessments.ClientId
Left JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1] ,[Address2],[Address3],[Town],[County],[Postcode],Clients.Consent, Clients.Dob, Clients.IsMale,Clients.UserIds
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains element, as the pool of Ids is smaller than the main pool.
The way I've sped it up for now is to String.Join the list of Ids and add them in a WHERE clause within the main SQL. This has reduced the time to a second or so.
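A rough sketch of that workaround (untested; note the IN list has to be spliced in ahead of the GROUP BY of the main SQL, and inlining values like this is only reasonable because the Ids are GUIDs that came from your own DB):

using System.Linq; // for Select()

// Build "'id1','id2','id3'" from the small list of Ids.
var inClause = string.Join(",", ids.Select(id => $"'{id}'"));
// Insert the WHERE before the GROUP BY of the main SQL (string surgery is
// brittle; a parameterised query or table-valued parameter would be safer).
var filteredSql = sql.Replace("GROUP BY", $"WHERE Clients.Id IN ({inClause}) GROUP BY");
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(filteredSql).ToList();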
We are using Cassandra for log collection.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc and special indexed column 'indexTimestamp'.
This column contains time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions, by host, errorlevel, etc.
When getting records just by indexTimestamp, everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra works for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? There are no more than 250,000 records per indexTimestamp value. Isn't it possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Secondary-index queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query freely. It is a key-value storage system. To understand that, please go here and have a quick look: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows and ranged slice queries.
Let's say you have the object
user : {
name : "XXXXX"
country : "UK"
city : "London"
postal_code :"N1 2AC"
age : "24"
}
and of course you want to query by city OR by age (AND & OR would require yet another data model).
Then you would have to save your data like this, assuming the name is a unique id :
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
The range query for age EQ 24 would be
get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")
As a note, the backtick ("`") is the ASCII character immediately after the underscore, so this slice effectively gives you all the columns whose names start with "24_".
This also allows you to query for ages between 21 and 24, for example.
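In the same pseudo code, that between-query would be a single slice across the bucket (both ages live in the 20-to-25 bucket):

get_range_slice(row = "bucket_20_to_25", from = "21_", to = "24`")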
Hope it was useful.