LINQ query Where Contains - performance

I'm attempting to make a LINQ Where/Contains query quicker.
The data set contains 256,999 clients. ids is just a simple list of GUIDs, and in this case contains only 3 records.
The query below can take up to a minute to return the 3 records, because the logic checks each of the 256,999 records to see whether it is within the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to turn the check around, so the query looks the three records up within the pool of 256,999; framed that way it should be much quicker.
I don't want to do a loop, as the 3 records could be far more (thousands); the more loops, the more hits to the DB.
I don't want to grab all the db records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 records from the DB, it takes about a second. This is where the Ids come from (a filtered, small and simple list).
Any ideas?
Thanks

You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
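One other thing worth checking: ExecuteQuery returns an IEnumerable, so the Where runs client-side, and List<Guid>.Contains is a linear scan for every row. With only 3 ids that scan is trivial, but you said the list could grow to thousands, in which case a HashSet<Guid> makes each lookup O(1). A minimal sketch, assuming ids is the List<Guid> from your question:

// Build the set once; each Contains is then O(1)
// instead of scanning the whole ids list per row.
var idSet = new HashSet<Guid>(ids);

returnItems = context
    .ExecuteQuery<DataClass.SelectClientsGridView>(sql)
    .Where(x => idSet.Contains(x.ClientId))
    .ToList();

This only speeds up the in-memory filtering, though; if the minute is spent executing the SQL itself, an index (or pushing the filter into the SQL) is still the real fix.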

The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains subqueries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1], [Address2], [Address3], [Town], [County], [Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
CONVERT(nvarchar(10), MAX(Assessments.TestDate), 103) as LastVisit,
CASE WHEN MAX(CONVERT(integer, Assessments.Submitted)) = 1 THEN 'true' ELSE 'false' END AS Submitted,
CASE WHEN MAX(CONVERT(integer, Assessments.GPSubmit)) = 1 THEN 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN MAX(CONVERT(integer, Assessments.QualForPay)) = 1 THEN 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
LEFT JOIN Assessments ON Clients.Id = Assessments.ClientId
LEFT JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1], [Address2], [Address3], [Town], [County], [Postcode], Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains element, as the pool of Ids is smaller than the main pool.
The way I've sped it up for now: I String.Join the list of Ids and add them in a WHERE clause within the main SQL. This has reduced the time to a second or so.
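For anyone after the same trick, this is roughly what it looks like (a sketch, not the exact code from my project; filteredSql and the Replace splice are illustrative, and since the values are GUIDs generated by the code itself the concatenation isn't an injection vector, though parameterizing is still the safer habit):

// Build an IN (...) list from the small, pre-filtered id list
// and splice it into the main SQL ahead of the GROUP BY.
var inClause = string.Join(",", ids.Select(id => $"'{id}'"));
var filteredSql = sql.Replace("GROUP BY", $"WHERE Clients.Id IN ({inClause}) GROUP BY");

returnItems = context
    .ExecuteQuery<DataClass.SelectClientsGridView>(filteredSql)
    .ToList();

This way SQL Server filters the rows before the joins and grouping run, which is presumably why it drops from a minute to about a second.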

Related

Neo4j query taking long time

I am currently working on a social media site which is exactly like a typical network in terms of users' timelines: a user can follow, create and share posts, block, unblock, etc. For that we have created 2 node labels, "User" and "Post", and several relationships like follow, block, private, etc.
Currently we have approximately 41,000 nodes and 650,000 relationships.
Hardware configuration:
8 GB RAM
2 cores
50 GB HDD
1 master and 2 slaves
and we are using the following query to get a user's timeline:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'}),(p:Post{is_deleted:'0'}),(po:User{user_id:p.owner_id})
WHERE (p.post_type = '1' OR p.post_type = '4') WITH n,p,po
WHERE po.is_active='1' AND (n)-[:CREATED{own_status:'1'}]->(p) OR
(n)-[:FOLLOWS{follow_status:'1'}]->(:User{is_active:'1'})-[:CREATED{own_status:'1'}]->(p)
OR (n)-[:FOLLOWS{follow_status:'1'}]->(:Keyword{is_deleted:'0'})-[:KEYWORD]->(p)
WITH n,p,po
OPTIONAL MATCH (n)-[fr:FOLLOWS]->(po)
WHERE fr.follow_status='1' WITH p,n,po,fr
WHERE NOT ((n)-[:FOLLOWS{is_blocked:true}]->(po) OR (n)-[:FOLLOWS{is_mute:true}]->(po)) WITH p,n,po,fr
WHERE NOT (n)<-[:FOLLOWS{is_blocked:true}]-(po) WITH p,n,po,fr
WHERE (fr is not null and toInteger(po.is_private) <= 1 AND po.user_id <> n.user_id)
OR (toInteger(po.is_private) <= 1 AND po.user_id = n.user_id)
OR (toInteger(po.is_private) = 0 AND po.user_id <> n.user_id) WITH p,n,po
RETURN p,po,SIZE(()-[:LIKED]->(p)) as likecount,
SIZE((n)-[:LIKED]->(p)) as likestatus,count(*) as postcount
ORDER BY p.created_at DESC
SKIP 0 LIMIT 10
This query takes more than 10 seconds, which is too slow.
(The PROFILE output of the query and the index list were attached as screenshots, not reproduced here.)
Can anybody suggest where I am going wrong?
If you're trying to get a user's timeline, I would think you'd start with the specific user, then connect to other nodes via the relationships you're interested in. The current query isn't taking advantage of pattern matching or the connected nature of a graph database.
The first MATCH statement of the query as it's currently written finds a specific user, then all Post nodes that have the property is_deleted:'0', and then all User nodes connected to any of those Post nodes. Searching this way gives you more database hits (54,984) in the first Expand(All) alone than there are nodes in the whole database (41,000).
Where you should get the most lift in optimizing this query is to focus your search on the single user then expand out from there using the relationships:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
This will match the user and all qualifying posts connected to the user via a relationship. Note, if a user isn't connected to any qualifying posts, there won't be any matches even if that user does exist in the database.
If you only want to include certain relationship types, you can specify that in this first MATCH statement like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r:CREATED|FOLLOWS|KEYWORD]-(p:Post{is_deleted:'0'})
Or you can put it in the WHERE clause like this:
MATCH (n:User {user_id:'12129bca-9b90-44c9-aae8-d80e61f9c342',is_active:'1'})-[r]-(p:Post{is_deleted:'0'})
WHERE type(r) in ['CREATED', 'FOLLOWS' , 'KEYWORD']
I didn't follow all your conditional statements (and I think you might be able to remove some of them once you convert it to pattern matching), but once you have your initial pattern you can add in whatever conditional statements you need. Example:
WHERE (p.post_type = '1' OR p.post_type = '4')
AND (r.own_status = '1' OR r.follow_status = '1')
AND NOT r.is_blocked = true
For more on pattern matching, check out section 2.9 of the Neo4j Cypher Manual.

2 similar LINQ statements with different syntax yielding different output

I have to modify some old code in an application that someone before me made. In the variable test below, I have two tables (which are set up with relational models). In test2, I have rewritten the query in the more SQL-like syntax I'm used to. I want to join the Lines and Shifts tables where the LineIds match. When I view the test2 output, I get 6 values whose end times are 2-28-2017 8:30, 9:30 ... 1:30, and 2:30. That makes sense. When I view the test output, I see one Line with around 500 Shift entries associated with it, and inspecting those elements yields end times that go back to 2017. Should I not get the same 6 entries in the test output that I got in the test2 output? Is there something behind the scenes that LINQ is doing differently in the test output? Any help would be greatly appreciated!
var test = entityFrameworkDateModel.Lines.Where(line => line.Shifts.Any(shift => shift.EndTime >= DateTime.Now));
var test2 = from line in entityFrameworkDateModel.Lines
            join shift in entityFrameworkDateModel.Shifts on line.LineID equals shift.LineId
            where shift.EndTime >= DateTime.Now
            select new
            {
                line.LineID,
                shift.EndTime
            };
test is a collection of Line objects that has 0 to many Shift objects. I would expect that
test.SelectMany(t => t.Shifts).Count() == 500 // approx. 500 anyways
test2 is a collection of anonymous objects. test2 flattens your data, with one object per LineId / Shift EndTime pair, whereas test keeps your data in a hierarchy.
Inspecting those elements yields end times that go back to 2017.
test can and will contain shifts that do not match your Where criteria, since you are only returning Line objects that have at least one shift with an end time greater than now. So each Line object will have 1 or more shifts matching EndTime >= DateTime.Now, but the .Any() does not filter out the other Shift objects where EndTime < DateTime.Now.
You can add a SelectMany then Where to return all Shift objects matching your criteria:
var test = entityFrameworkDateModel.Lines
.SelectMany(line => line.Shifts)
.Where(shift => shift.EndTime >= DateTime.Now);
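Or, if you'd rather keep the shape of test2 (one row per matching line/shift pair) while still starting from Lines, a sketch along the same lines (test3 is just an illustrative name):

// Flatten to one anonymous object per matching line/shift pair,
// filtering the shifts before projecting.
var test3 = entityFrameworkDateModel.Lines
    .SelectMany(line => line.Shifts
        .Where(shift => shift.EndTime >= DateTime.Now)
        .Select(shift => new { line.LineID, shift.EndTime }));

That should give you the same 6 anonymous-object rows as the join in test2.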
Those 2 are not the same, even though they feel similar. For the first query, the nested Any filtering isn't needed; the Where alone is enough. The Any is actually just going to return true, which short-circuits the Where. I'd lay out the correct syntax for the Where clause, but I'm on mobile SO and can't see the question while I'm answering.

Slicing neo4j Cypher results in chunks

I want to slice Cypher results in chunks of 100 rows, and be able to retrieve a specific chunk.
At the moment, the only way to ensure that rows are not mixed up is to use ORDER BY, which makes the query very inefficient (3 sec. is too much for me):
MATCH (p:Person) RETURN p.id ORDER BY p.id SKIP {chunk}*100 LIMIT 100
where {chunk} is an external parameter to identify a specific chunk.
Any suggestions?
PS: the property p.id is indexed.
You may try something like adding a temporary label to the Person nodes before extracting chunks, then peeling off 100 at a time with a query like:
MATCH (p:Chunk:Person) WITH p LIMIT 100
MATCH (p) REMOVE p:Chunk
RETURN *
If the p.id values are unique and dense (say, the value starts at 1 and increments, without any gaps), then this query will take advantage of the index on :Person(id) to efficiently get each hundred-Person chunk:
WITH (({chunk} - 1) * 100 + 1) AS startId
MATCH (p:Person)
WHERE p.id IN RANGE(startId, startId + 99)
RETURN p.id
ORDER BY p.id
Now, practically speaking, your id space will probably not remain dense, even if it started out that way. Person nodes will be deleted over time. In that case, the above query can return fewer than 100 rows. So, you can make your chunk size bigger than 100 and do some post-processing to get the 100 you need. In the worst case, you may need to make multiple requests to get the 100 you need, but each request will be fast. (Ideally, you would want to assign no-longer-unused id values to new Person nodes, to fill up gaps in the id space -- but this would require you to scan for the gaps.)

how can I group sum and count with sequel ORM and postgresl?

This is too tough for me guys. It's for Jeremy!
I have two tables (although I can also envision needing to join a third) and I want to sum one field and count rows in the same table, while joining with another table, and return the result in JSON format.
First of all, the data type of the field that needs to be summed is numeric(10,2), and the data is inserted as params['amount'].to_f.
The tables are expense_projects, which has the name of the project and the company id, and expense_items, which has the company_id, item and amount (to mention just the critical columns) - the company_id columns are disambiguated.
So, the following code:
expense_items = DB[:expense_projects].left_join(:expense_items, :expense_project_id => :project_id).where(:project_company_id => company_id).to_a.to_json
works fine but when I add
expense_total = expense_items.sum(:amount).to_f.to_json
I get an error message which says
TypeError - no implicit conversion of Symbol into Integer:
so the first question is: why, and how can this be fixed?
Then I want to join the two tables and get all the project names from the left (first) table and sum amount and count items in the second table. I have tried
DB[:expense_projects].left_join(:expense_items, :expense_items_company_id => expense_projects_company_id).count(:item).sum(:amount).to_json
and variations of this, all of which fail.
I would like a result which gets all the project names (even if there are no expense entries) and returns something like:
project    item_count    item_amount
pr 1       7             34.87
pr 2       0             0
and so on. How can this be achieved with one query returning the result in json format?
Many thanks, guys.
Figured it out, I hope this helps somebody else. (The TypeError above came from calling .sum(:amount) on the result of .to_a.to_json, which is a plain Ruby object rather than a Sequel dataset; the aggregation has to happen inside the query itself, as below.)
DB[:expense_projects___p].where(:project_company_id=>user_company_id).
left_join(:expense_items___i, :expense_project_id=>:project_id).
select_group(:p__project_name).
select_more{count(:i__item_id)}.
select_more{sum(:i__amount)}.to_a.to_json

Azure Table Storage - PartitionKey and RowKey selection to use between query

I am a total newbie with Azure! The purpose is to return rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimise the number of transactions/queries whilst maintaining performance.
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - is an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like to get back, from a single query, the rows and any child rows that are > or < DateOfMessage_MessageId.
Can this be done via my proposed PartitionKeys and RowKeys?
i.e. (in pseudo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if I have a number of accounts and only want to return the first 10, could it be done via a single query?
i.e. (in pseudo code)
var results = (
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId1))
        && ctx.RowKey > (TimeStamp1)_MessageId1
    )
    ||
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId2))
        && ctx.RowKey > (TimeStamp2)_MessageId2
    ) ...
)
.Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure table storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ, you can use .CompareTo() instead (> and < don't translate properly). This means that if you run a search for account 1 and ask the query for 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s you could do something like this (pseudo code here as well):
ctx.PartitionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
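With the later WindowsAzure.Storage table SDK (rather than the LINQ/TableServiceContext style above), the same padded range scan would look roughly like this; a sketch only, where table is an assumed CloudTable reference:

// PartitionKey/RowKey comparisons are lexicographic, which is
// why the zero-padded account id matters for the range scan.
string filter = TableQuery.CombineFilters(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThan, "TextCache_0000000001"),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, "TextCache_0000000002")),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThan, "123465798"));

var query = new TableQuery<DynamicTableEntity>().Where(filter).Take(10);
var results = table.ExecuteQuery(query);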
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId I'd move this to a property.
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While your second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, it won't work (see my point above about sort order). If you ran this query as-is and account 1 had 11 messages, it would return only the first 10 messages related to account 1, regardless of whether account 2 had an earlier message.
While trying to minimise the number of transactions is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs: 1,000,000 transactions will cost you $1, which is less than the cost of running one small instance for 9 hours.
