Linq query returns duplicate results when .Distinct() isn't used - why?

When I use the following Linq query in LinqPad I get 25 results returned:
var result = (from l in LandlordPreferences
              where l.Name == "Wants Student" && l.IsSelected == true
              join t in Tenants on l.IsSelected equals t.IsStudent
              select new { Tenant = t });
result.Dump();
When I add .Distinct() to the end I only get 5 results returned, so, I'm guessing I'm getting 5 instances of each result when the above is used.
I'm new to Linq, so I'm wondering if this is because of a poorly built query? Or is this the way Linq always behaves? Surely not - if I returned 500 rows with .Distinct(), does that mean without it there's 2,500 returned? Would this compromise performance?

It's a poorly built query.
You are joining LandlordPreferences with Tenants on a boolean value instead of a foreign key.
So, most likely, you have 5 selected landlords and 5 tenants who are students. Each student is returned once for each landlord: 5 x 5 = 25. This is a Cartesian product and has nothing to do with LINQ; a similar query in SQL would behave the same way.
If you added the landlord to your result (select new { Tenant = t, Landlord = l }), you would see that no two results are actually the same.
If you can't fix the query somehow, Distinct is your only option.
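For illustration, here's a rough sketch of what a keyed join could look like, assuming (hypothetically) that Tenants carries a LandlordId foreign key and LandlordPreferences exposes the matching LandlordId; once the join condition is a real key instead of a constant boolean, the cross product disappears:
// Sketch only: LandlordId on both sides is an assumption about your schema.
var result = from l in LandlordPreferences
             where l.Name == "Wants Student" && l.IsSelected
             join t in Tenants on l.LandlordId equals t.LandlordId
             where t.IsStudent
             select new { Tenant = t, Landlord = l };
result.Dump();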

Related

Linq query increase performance efficient query

int mappedCount = (from product in products
                   from productMapping in DbContext.ProductCategoryMappings
                       .Where(x => product.TenantId == x.TenantId.ToString() &&
                                   x.ProductId.ToString().ToUpper() == product.ProductGuid.ToUpper())
                   join tenantCustMapping in DbContext.TenantCustCategories
                       on productMapping.Value equals tenantCustMapping.Id
                   select 1).ToList().Sum();
I need to increase the performance. I'm mapping products across the two tables, and each item can have multiple products.
If you want to increase performance, you'd need to know the volumes of data that are getting sent around. How many "products" are in your variable?
It may be quicker to update your products list to contain integers / guids and send that to your database rather than send strings that the database has to run ToUpper() on before comparing them.
something like:
var convertedList = products.Select(p => new { TenantId = int.Parse(p.TenantId), ProductId = Guid.Parse(p.ProductGuid) }).ToList();
Then send that to your DB and compare the values directly.
I think changing "select 1).ToList().Sum()" to "select 1).Count()" will improve performance. Even if it doesn't, it will help readability.
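Putting both suggestions together, a hedged sketch (it assumes the database columns really are int and uniqueidentifier, and note that filtering tenant ids and product ids independently is only equivalent to the pairwise join if each product id belongs to a single tenant):
// Assumption: TenantId is an int and ProductId a Guid in the database, so parse once up front.
var keys = products
    .Select(p => new { TenantId = int.Parse(p.TenantId), ProductId = Guid.Parse(p.ProductGuid) })
    .ToList();

var tenantIds = keys.Select(k => k.TenantId).ToList();
var productIds = keys.Select(k => k.ProductId).ToList();

// Let the database do the filtering and counting; no ToUpper()/ToString() per row.
int mappedCount = (from mapping in DbContext.ProductCategoryMappings
                   where tenantIds.Contains(mapping.TenantId) &&
                         productIds.Contains(mapping.ProductId)
                   join tenantCustMapping in DbContext.TenantCustCategories
                       on mapping.Value equals tenantCustMapping.Id
                   select mapping).Count();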

Linq Query Where Contains

I'm attempting to make a linq where contains query quicker.
The data set contains 256,999 clients. The Ids variable is just a simple list of GUIDs and might contain only 3 records.
The query below can take up to a minute to return the 3 records, because the logic goes through all 256,999 records to see whether any of them are in the list of 3 Ids.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like to turn the query around and check whether the three records are within the pot of 256,999, which in a way should be much quicker.
I don't want to do a loop, as the 3 records could be far more (thousands); the more loops, the more hits to the DB.
I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time.
If I grab just the Ids for all the 256,999 from the DB it would take a second. This is where the Ids come from (a filtered, small and simple list).
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
The reason the Id query is quicker is that only one field is returned and it's only a single-table query.
The main query contains sub-queries (below). So I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id AS ClientId, Clients.ClientRef AS ClientRef,
       Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname AS FullName,
       [Address1], [Address2], [Address3], [Town], [County], [Postcode],
       Clients.Consent AS Consent,
       CONVERT(nvarchar(10), Clients.Dob, 103) AS FormatedDOB,
       CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END AS Gender,
       CONVERT(nvarchar(10), MAX(Assessments.TestDate), 103) AS LastVisit,
       CASE WHEN MAX(CONVERT(integer, Assessments.Submitted)) = 1 THEN 'true' ELSE 'false' END AS Submitted,
       CASE WHEN MAX(CONVERT(integer, Assessments.GPSubmit)) = 1 THEN 'true' ELSE 'false' END AS GPSubmit,
       CASE WHEN MAX(CONVERT(integer, Assessments.QualForPay)) = 1 THEN 'true' ELSE 'false' END AS QualForPay,
       Clients.UserIds AS LinkedUsers
FROM Clients
LEFT JOIN Assessments ON Clients.Id = Assessments.ClientId
LEFT JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname,
         [Address1], [Address2], [Address3], [Town], [County], [Postcode],
         Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains part, as the pool of Ids is smaller than the main pool.
A way I've sped it up for now: I've done a String.Join on the list of Ids and added them in a WHERE clause within the main SQL. This has reduced the time down to a second or so.
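For reference, a minimal sketch of that workaround (sqlTemplate here is a hypothetical copy of the query above with a {0} placeholder between the JOINs and the GROUP BY; concatenating GUIDs is tolerable, but a parameterised IN list or table-valued parameter would be the more robust option):
// Hypothetical sketch: splice the already-filtered ids into the main SQL as an IN clause.
var idList = string.Join(",", ids.Select(id => "'" + id.ToString() + "'"));
var filteredSql = string.Format(sqlTemplate, " WHERE Clients.Id IN (" + idList + ") ");

returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(filteredSql).ToList();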

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
RETURN t.id
But for group G I sometimes have their screen names and sometimes have their ids, hence the OR clause above. Unfortunately this query is long running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
Indexes
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. That is even more problematic because I really want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'd need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
Update
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
ORDER BY c DESC
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of indexes due to the OR condition; you can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If there's no sign of index usage, you might split the query up into two parts. The first one fetches the combined list of user ids; instead of the OR we do a UNION of two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
RETURN t.id
I've intentionally removed the labels for t and y, on the assumption that you can only follow User nodes and no other kind of node. This removes an unnecessary label check.
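If it helps, here's a rough client-side sketch of that two-step flow. It assumes the modern official Neo4j.Driver package over Bolt (which postdates the 2.1.3 setup in the question), and gScreenNames / gIds are placeholder variables holding group G:
// Sketch only: adapt to whatever Neo4j client you are actually using.
using Neo4j.Driver;

var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "password"));
await using var session = driver.AsyncSession();

// Step 1: resolve group G to node ids via UNION (each branch can use an index).
var idCursor = await session.RunAsync(
    "MATCH (x:User) WHERE x.screen_name IN $names RETURN id(x) AS ids " +
    "UNION MATCH (x:User) WHERE x.id IN $gids RETURN id(x) AS ids",
    new { names = gScreenNames, gids = gIds });
var nodeIds = (await idCursor.ToListAsync()).Select(r => r["ids"].As<long>()).ToList();

// Step 2: reuse the node ids as a parameter in the follows/followed-by query.
var cursor = await session.RunAsync(
    "MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y) " +
    "WHERE id(x) IN $ids AND id(y) IN $ids RETURN t.id AS tid",
    new { ids = nodeIds });
var tIds = (await cursor.ToListAsync()).Select(r => r["tid"].As<long>()).ToList();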
JnBrymn,
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
WITH x
MATCH (x)-[:FOLLOWS]->(t:User)
WITH t
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}
RETURN t.id
Grace and peace,
Jim

Using Linq to bring back last 3,4...n orders for every customer

I have a database with customer orders.
I want to use Linq (to EF) to query the db to bring back the last (most recent) 3, 4...n orders for every customer.
Note:
Customer 1 may have made 12 orders in the last hour, but customer 2 may not have made any since last week.
I can't for the life of me work out how to write the query in linq (lambda expressions) to get the data set back.
Any good ideas?
Edit:
Customers and orders is a simplification. The table I am querying is actually a record of outbound messages to various web services. It just seemed easier to describe it as customers and orders. The relationship is the same.
I am building a task that checks the last n messages for each web service to see if there were any failures. We want a semi-real-time health status of the web services.
#CoreySunwold
My table Looks a bit like this:
MessageID, WebserviceID, SentTime, Status, Message, Error,
Or from a customer/order context if it makes it easier:
OrderID, CustomerID, StatusChangedDate, Status, WidgetName, Comments
Edit 2:
I eventually worked out something
(Hat tip to #StephenChung who basically came up with the exact same, but in classic linq)
var q = myTable.Where(d => d.EndTime > DateTime.Now.AddDays(-1))
    .GroupBy(g => g.ConfigID)
    .Select(g => new
    {
        ConfigID = g.Key,
        Data = g.OrderByDescending(d => d.EndTime)
                .Take(3)
                .Select(s => new
                {
                    s.Status,
                    s.SentTime
                })
    }).ToList();
It does take a while to execute. So I am not sure if this is the most efficient expression.
This should give the last 3 orders of each customer (if they have any orders at all):
from o in db.Orders
group o by o.CustomerID into g
select new {
CustomerID=g.Key,
LastOrders=g.OrderByDescending(o => o.TimeEntered).Take(3).ToList()
}
However, I suspect this will force the database to return the entire Orders table before picking out the last 3 for each customer. Check the SQL generated.
If you need to optimize, you'll have to manually construct a SQL to only return up to the last 3, then make it into a view.
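A quick way to see what EF actually sends (so you can tell whether the whole Orders table is being pulled back) is to keep the query as an IQueryable and dump its command text before enumerating it. In EF 6 calling ToString() on the query returns the store SQL; older ObjectQuery-based contexts expose the same thing via ToTraceString():
// Sketch: inspect the generated SQL before calling ToList().
var lastOrdersQuery =
    from o in db.Orders
    group o by o.CustomerID into g
    select new
    {
        CustomerID = g.Key,
        LastOrders = g.OrderByDescending(o => o.TimeEntered).Take(3)
    };

Console.WriteLine(lastOrdersQuery.ToString()); // EF 6: prints the SQL that will run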
You can use SelectMany for this purpose:
customers.SelectMany(x=>x.orders.OrderByDescending(y=>y.Date).Take(n)).ToList();
How about this? I know it'll work with regular collections but don't know about EF.
yourCollection.OrderByDescending(item=>item.Date).Take(n);
var ordersByCustomer =
db.Customers.Select(c=>c.Orders.OrderByDescending(o=>o.OrderID).Take(n));
This will return the orders grouped by customer.
var lastOrders = orders.Where(x => x.CustomerID == 1).OrderByDescending(x => x.Date).Take(4);
This will take the last 4 orders. The specific query depends on your table / entity structure.
Btw: you can read x as an order, so the query reads: get orders where order.CustomerID is equal to 1, order them by order.Date descending, and take the first 4 'rows'.
Somebody might correct me here, but I think doing this in LINQ with a single query is probably very difficult, if not impossible. I would use a stored procedure and something like this:
SELECT *
FROM (
    SELECT
        o.*,
        RANK() OVER (PARTITION BY c.id ORDER BY o.order_time DESC) AS order_rank
    FROM customers c
    INNER JOIN orders o ON o.cust_id = c.id
) ranked
WHERE order_rank < 10 -- this is "n"
I've not used this syntax for a while so it might not be quite right, but if I understand the question then I think this is the best approach.

How can this query be improved?

I am using LINQ to write a query - one query shows all active customers, and another shows all active as well as inactive customers.
if (showall)
{
    var prod = Dataclass.Customers.Where(multiple factors);                  // all inactive + active
}
else
{
    var prod = Dataclass.Customers.Where(multiple factors & active == true); // only active
}
Can I do this using only one query? The issue is that the multiple factors are repeated in both queries.
thanks
var customers = Dataclass.Customers.Where(multiple factors);
var activeCust = customers.Where(x => x.active);
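Building on that, a common pattern is to compose the query conditionally; nothing hits the database until the result is enumerated, so the shared filter is only written once. A sketch, with SomeFactor / AnotherFactor standing in for the real "multiple factors" and Customer as a placeholder entity name:
// Sketch only: the predicate and type names are illustrative.
IQueryable<Customer> query = Dataclass.Customers
    .Where(c => c.SomeFactor == 1 && c.AnotherFactor > 0);

if (!showall)
{
    query = query.Where(c => c.active); // narrow to active customers only
}

var prod = query.ToList();              // single round trip, filters applied in SQL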
I really don't understand the question either. I wouldn't want to make this a one-liner because it would make the code unreadable.
I'm assuming you are trying to minimize the number of roundtrips?
If "multiple factors" is the same, you can just filter for active users after your first query:
var onlyActive = prod.Where(p => p.active == true);
Wouldn't you just use your first query to return all customers? If not, you'd be returning the active users twice.
Options I'd consider
Bring all customers once, order by 'status' column so you can easily split them into two sets
Focus on minimizing DB roundtrips. Whatever you do in the front end costs an order of magnitude less than going to the DB.
Revise user requirements. For example, consider paging the results - it's unlikely that the end user will need all customers at once.
