How to sort IEnumerable with limited result count? (another implementation of .OrderBy.Take) - performance

I have a binary file that contains more than 100 million objects. I read it with BinaryReader and yield each object (the file reader and IEnumerable implementation are here: Performance comparison of IEnumerable and raising event for each item in source?).
One of the object's properties indicates its rank (like A5). Assume that I want the top n objects, sorted by that property.
I looked at the code of the OrderBy function: it uses the QuickSort algorithm. I tried sorting the IEnumerable result with OrderBy and Take(n) together, but I got an OutOfMemory exception, because OrderBy creates an array the size of the total object count to run the quicksort.
Actually, the memory I need is proportional to n, so there is no need to create a big array. For instance, Take(1000) returns only 1000 objects, regardless of the total number of objects.
How can I get the result of OrderBy combined with Take? In other words, I need a bounded sorted list whose capacity is defined by the end user.

If you want the top N items from an ordered source using only the default LINQ operators, the only option is to load all items into memory, sort them, and select the first N results:
    items.OrderBy(keySelector).Take(N) // out of memory
If you only need to sort the first N items of the source, simply take them first and then sort them:
    items.Take(N).OrderBy(keySelector)
UPDATE: you can use a buffer that keeps the N largest items in order:
    public static IEnumerable<T> TakeOrdered<T, TKey>(
        this IEnumerable<T> source, int count, Func<T, TKey> keySelector)
    {
        Comparer<T, TKey> comparer = new Comparer<T, TKey>(keySelector);
        List<T> buffer = new List<T>();
        if (count <= 0)
            return buffer; // nothing to keep; also avoids reading buffer[0] on an empty list
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                T current = iterator.Current;
                if (buffer.Count == count)
                {
                    // skip the current item unless it is greater than the smallest buffered item
                    if (comparer.Compare(current, buffer[0]) <= 0)
                        continue;
                    buffer.RemoveAt(0); // remove the smallest item
                }
                // find the insertion index that keeps the buffer sorted
                int index = buffer.BinarySearch(current, comparer);
                buffer.Insert(index >= 0 ? index : ~index, current);
            }
        }
        return buffer;
    }
This solution also uses a custom comparer that compares items by their keys:
    public class Comparer<T, TKey> : IComparer<T>
    {
        private readonly Func<T, TKey> _keySelector;
        private readonly Comparer<TKey> _comparer = Comparer<TKey>.Default;

        public Comparer(Func<T, TKey> keySelector)
        {
            _keySelector = keySelector;
        }

        public int Compare(T x, T y)
        {
            return _comparer.Compare(_keySelector(x), _keySelector(y));
        }
    }
Sample usage:
    string[] items = { "b", "ab", "a", "abcd", "abc", "bcde", "b", "abc", "d" };
    var top5byLength = items.TakeOrdered(5, s => s.Length);
    var top3byValue = items.TakeOrdered(3, s => s);

LINQ does not have a built-in operator that lets you take the top n elements without loading the whole collection into memory, but you can definitely build one yourself.
One simple approach would be to use a SortedDictionary of lists: keep adding elements to it until you hit the limit of n. After that, compare each element that you are about to add against the smallest element that you have found so far (i.e. dict.Keys.First()). If the new element is smaller, discard it; otherwise, remove the smallest element and add the new one.
At the end of the loop your sorted dictionary will hold at most n elements, sorted according to the comparer that you set on the dictionary.
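The SortedDictionary approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: the extension method name TopN and the class TopNExtensions are made up for this example, keys are compared with the default comparer, and an item whose key equals the current minimum is discarded once the buffer is full.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopNExtensions
{
    // Hypothetical helper (not part of LINQ): returns the 'count' largest items
    // by key, in ascending key order, holding at most 'count' items in memory.
    public static List<T> TopN<T, TKey>(
        this IEnumerable<T> source, int count, Func<T, TKey> keySelector)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (keySelector == null) throw new ArgumentNullException("keySelector");
        if (count <= 0) return new List<T>();

        // Each key maps to the list of items sharing it, so duplicate keys are kept.
        var buckets = new SortedDictionary<TKey, List<T>>();
        int stored = 0;

        foreach (T item in source)
        {
            TKey key = keySelector(item);
            if (stored == count)
            {
                // The first key of a SortedDictionary is the smallest one.
                TKey smallest = buckets.Keys.First();
                if (Comparer<TKey>.Default.Compare(key, smallest) <= 0)
                    continue; // not larger than the current minimum: discard

                // Evict one item with the smallest key to make room.
                List<T> smallestBucket = buckets[smallest];
                smallestBucket.RemoveAt(smallestBucket.Count - 1);
                if (smallestBucket.Count == 0) buckets.Remove(smallest);
                stored--;
            }

            List<T> bucket;
            if (!buckets.TryGetValue(key, out bucket))
            {
                bucket = new List<T>();
                buckets.Add(key, bucket);
            }
            bucket.Add(item);
            stored++;
        }

        // Enumerating the dictionary yields the keys in ascending order.
        return buckets.SelectMany(pair => pair.Value).ToList();
    }
}
```

With this sketch, something like items.TopN(1000, x => x.Rank) would enumerate the source once while keeping no more than 1000 items buffered (Rank being a placeholder for the ranking property).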

Related

How to use Stream, to write an efficient shuffling method

I have an ArrayList of Residence objects. Each Residence object has two fields, type::String and price::BigInteger. I was wondering if there is an efficient way to restructure the list so that no two Residence objects with the same type are next to each other. The goal is to write an efficient shuffling method.
I suggest you use a HashMap<String, Stack<Residence>> and store the elements of each type in the corresponding stack.
Then loop over the keys of the hash map in a round-robin way and pop one item from each non-empty stack, adding every item you get to a new list. Note that if one type has many more items than the others, the tail of the result will still contain adjacent items of that type.
Assuming your ArrayList of Residence is residences, the code should be something like this (not tested, only to show the algorithm):
    HashMap<String, Stack<Residence>> hm = new HashMap<>();
    ArrayList<Residence> resultList = new ArrayList<>();
    // group the residences by type, one stack per type
    for (Residence r : residences) {
        hm.computeIfAbsent(r.type, k -> new Stack<>()).push(r);
    }
    // round-robin over the types until every stack is empty
    boolean exist = true;
    while (exist) {
        exist = false;
        for (Map.Entry<String, Stack<Residence>> m : hm.entrySet()) {
            if (!m.getValue().isEmpty()) {
                exist = true;
                resultList.add(m.getValue().pop());
            }
        }
    }

Most efficient way to determine if there are any differences between specific properties of 2 lists of items?

In C# .NET 4.0, I am struggling to come up with the most efficient way to determine if the contents of 2 lists of items contain any differences.
I don't need to know what the differences are, just true/false whether the lists are different based on my criteria.
The 2 lists I am trying to compare contain FileInfo objects, and I want to compare only the FileInfo.Name and FileInfo.LastWriteTimeUtc properties of each item. All the FileInfo items are for files located in the same directory, so the FileInfo.Name values will be unique.
To summarize, I am looking for a single Boolean result for the following criteria:
Does ListA contain any items with FileInfo.Name not in ListB?
Does ListB contain any items with FileInfo.Name not in ListA?
For items with the same FileInfo.Name in both lists, are the FileInfo.LastWriteTimeUtc values different?
Thank you,
Kyle
I would use a custom IEqualityComparer<FileInfo> for this task:
    public class FileNameAndLastWriteTimeUtcComparer : IEqualityComparer<FileInfo>
    {
        public bool Equals(FileInfo x, FileInfo y)
        {
            if (Object.ReferenceEquals(x, y)) return true;
            if (x == null || y == null) return false;
            return x.FullName.Equals(y.FullName) && x.LastWriteTimeUtc.Equals(y.LastWriteTimeUtc);
        }

        public int GetHashCode(FileInfo fi)
        {
            unchecked // Overflow is fine, just wrap
            {
                int hash = 17;
                hash = hash * 23 + fi.FullName.GetHashCode();
                hash = hash * 23 + fi.LastWriteTimeUtc.GetHashCode();
                return hash;
            }
        }
    }
Now you can use a HashSet<FileInfo> with this comparer and HashSet<T>.SetEquals:
    var comparer = new FileNameAndLastWriteTimeUtcComparer();
    var uniqueFiles1 = new HashSet<FileInfo>(list1, comparer);
    bool anyDifferences = !uniqueFiles1.SetEquals(list2);
Note that I've used FileInfo.FullName instead of Name, since names are not unique in general (only within a single directory).
Sidenote: another advantage is that you can use this comparer for many LINQ methods like GroupBy, Except, Intersect or Distinct.
This is not the most efficient way (it probably ranks a 4 out of 5 in the quick-and-dirty category):
    // project only the compared properties; including the FileInfo itself
    // would add a reference comparison and make every item look different
    var comparableListA = ListA.Select(a => new { a.Name, LastWrite = a.LastWriteTimeUtc });
    var comparableListB = ListB.Select(b => new { b.Name, LastWrite = b.LastWriteTimeUtc });
    // Except is one-directional, so check both directions
    var youHaveDiff = comparableListA.Except(comparableListB).Any()
        || comparableListB.Except(comparableListA).Any();
Explanation:
Anonymous types are compared by property values, which is what you're looking to do, and that led me to think of a LINQ projection along those lines.
P.S.
You should double check the syntax, I just rattled this off without the compiler.

How to sort a list of strings by using the order of the items in another list?

I want to sort a list of strings (with possibly duplicate entries) by using as ordering reference the order of the entries in another list. So, the following list is the list I want to sort
List<String> list = ['apple','pear','apple','x','x','orange','x','pear'];
And the list that specifies the order is
List<String> order = ['orange','apple','x','pear'];
And the output should be
List<String> result = ['orange','apple','apple','x','x','x','pear','pear'];
Is there a clean way of doing this?
I'm not sure whether I can use the list's sort with a compare function for this problem. I tried using map, iterable, intersection, etc.
There might be a more efficient way but at least you get the desired result:
    main() {
      List<String> list = ['apple','pear','apple','x','x','orange','x','pear'];
      List<String> order = ['orange','apple','x','pear'];
      list.sort((a, b) => order.indexOf(a).compareTo(order.indexOf(b)));
      print(list);
    }
The closure passed to list.sort(...) is a custom comparer which, instead of comparing the passed items themselves, compares their positions in order and returns the result.
Using a map for better lookup performance:
    main() {
      List<String> list = ['apple','pear','apple','x','x','orange','x','pear'];
      List<String> orderList = ['orange','apple','x','pear'];
      Map<String,int> order = new Map.fromIterable(
          orderList, key: (key) => key, value: (key) => orderList.indexOf(key));
      list.sort((a, b) => order[a].compareTo(order[b]));
      print(list);
    }

Item-by-item list comparison, updating each item with its result (no third list)

The solutions I have found so far in my research on comparing lists of objects have usually generated a new list of objects, say of those items existing in one list, but not in the other. In my case, I want to compare two lists to discover the items whose key exists in one list and not the other (comparing both ways), and for those keys found in both lists, checking whether the value is the same or different.
The object being compared has multiple properties that constitute the key, plus a property that constitutes the value, and finally, an enum property that describes the result of the comparison, e.g., {Equal, NotEqual, NoMatch, NotYetCompared}. So my object might look like:
    class MyObject
    {
        //Key combination
        string columnA;
        string columnB;
        decimal columnC;
        //The Value
        decimal columnD;
        //Enum for comparison, used for styling the item (value hidden from UI)
        //Alternatively...this could be a string type, holding the enum.ToString()
        MyComparisonEnum result;
    }
These objects are collected into two ObservableCollection<MyObject> to be compared. When bound to the UI, the grid rows are styled based on the comparison result enum, so the user can easily see which keys are in the new dataset but not in the old and vice versa, along with the keys that are in both datasets with a different value. Both lists are presented in the UI in data grids.
Would LINQ be a suitable tool to solve this efficiently, or should I use loops to scan the lists and break out when the key is found, etc. (a solution like this comes naturally to me from my procedural programming background)... or some other method?
Thank you!
You can use Except and Intersect:
    var list1 = new List<MyObject>();
    var list2 = new List<MyObject>();
    // initialization code
    var notIn2 = list1.Except(list2);
    var notIn1 = list2.Except(list1);
    var both = list1.Intersect(list2);
To find objects with different values (columnD) you can use this (quite efficient) LINQ query:
    var diffValue = from o1 in list1
                    join o2 in list2
                    on new { o1.columnA, o1.columnB, o1.columnC } equals new { o2.columnA, o2.columnB, o2.columnC }
                    where o1.columnD != o2.columnD
                    select new { Object1 = o1, Object2 = o2 };
    foreach (var diff in diffValue)
    {
        MyObject obj1 = diff.Object1;
        MyObject obj2 = diff.Object2;
        Console.WriteLine("Obj1-Value:{0} Obj2-Value:{1}", obj1.columnD, obj2.columnD);
    }
Except and Intersect only match objects by key when you override Equals and GetHashCode appropriately:
    class MyObject
    {
        //Key combination
        string columnA;
        string columnB;
        decimal columnC;
        //The Value
        decimal columnD;
        //Enum for comparison, used for styling the item (value hidden from UI)
        //Alternatively...this could be a string type, holding the enum.ToString()
        MyComparisonEnum result;

        public override bool Equals(object obj)
        {
            MyObject other = obj as MyObject;
            if (other == null) return false;
            // string.Equals is null-safe, unlike columnA.Equals(...)
            return string.Equals(columnA, other.columnA)
                && string.Equals(columnB, other.columnB)
                && columnC == other.columnC;
        }

        public override int GetHashCode()
        {
            unchecked // overflow is fine, just wrap
            {
                int hash = 19;
                hash = hash * 31 + (columnA ?? "").GetHashCode();
                hash = hash * 31 + (columnB ?? "").GetHashCode();
                hash = hash * 31 + columnC.GetHashCode();
                return hash;
            }
        }
    }
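Since the question asks for the result to be written back onto each item rather than collected into a third list, a loop with a dictionary lookup is one possible alternative. The following is only a sketch under assumptions: Row, ComparisonResult and RowComparer are hypothetical stand-ins for MyObject, MyComparisonEnum and the comparison code, the properties are public, and each key is assumed to occur at most once per list.

```csharp
using System;
using System.Collections.Generic;

public enum ComparisonResult { NotYetCompared, Equal, NotEqual, NoMatch }

// Simplified, hypothetical stand-in for MyObject with public properties.
public class Row
{
    public string ColumnA { get; set; }
    public string ColumnB { get; set; }
    public decimal ColumnC { get; set; }
    public decimal ColumnD { get; set; }
    public ComparisonResult Result { get; set; }
}

public static class RowComparer
{
    // Sets Result on every row of both lists in place; no list of differences is built.
    public static void Compare(IList<Row> oldRows, IList<Row> newRows)
    {
        // Index the new rows by their composite key for fast lookups.
        var newByKey = new Dictionary<Tuple<string, string, decimal>, Row>();
        foreach (Row row in newRows)
        {
            row.Result = ComparisonResult.NoMatch; // stays NoMatch unless a partner is found
            newByKey[KeyOf(row)] = row;
        }

        foreach (Row oldRow in oldRows)
        {
            Row newRow;
            if (newByKey.TryGetValue(KeyOf(oldRow), out newRow))
            {
                ComparisonResult result = oldRow.ColumnD == newRow.ColumnD
                    ? ComparisonResult.Equal
                    : ComparisonResult.NotEqual;
                oldRow.Result = result;
                newRow.Result = result;
            }
            else
            {
                oldRow.Result = ComparisonResult.NoMatch;
            }
        }
    }

    // Tuples compare structurally, so they can serve as the composite key.
    private static Tuple<string, string, decimal> KeyOf(Row row)
    {
        return Tuple.Create(row.ColumnA, row.ColumnB, row.ColumnC);
    }
}
```

Each list is scanned once, so the cost grows with the combined size of the two lists rather than with their product, which is what a nested scan-and-break loop would give.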

Linq: Ignore non joinable items from two lists without throwing error?

In the code below, the two lists are joined on index. But either list could have more items than the other, and I just want to join up to the length of the shorter list and discard the remaining items of the longer one. So, if list 1 has 5 items and list 2 has 7 items, I want to join both up to item 5 and ignore list 2's remaining items (and vice versa).
    var joinLbxs = lbxShtCols.Items
        .Cast<ListItem>()
        .Select((xlFldList, index) => new
        {
            xlFldList,
            tblFldList = lbxSqlTablesCols.Items[index]
        });
Zip is not too complicated to implement by yourself.
    public static IEnumerable<TResult> Zip<TSource, TOther, TResult>(
        this IEnumerable<TSource> source,
        IEnumerable<TOther> other,
        Func<TSource, TOther, TResult> resultSelector)
    {
        using (var e1 = source.GetEnumerator())
        {
            using (var e2 = other.GetEnumerator())
            {
                while (e1.MoveNext() && e2.MoveNext())
                {
                    yield return resultSelector(e1.Current, e2.Current);
                }
            }
        }
    }
As @Steven suggested in a comment, if you are using .NET 4.0, use the built-in Zip() method. If you aren't, you could use MoreLinq to provide the same functionality.
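As a small illustration of the built-in Enumerable.Zip, the sketch below pairs two string sequences; Zip stops as soon as either sequence runs out, so the surplus items of the longer one are ignored. The class ZipExample, the method PairUp and the parameter names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipExample
{
    // Hypothetical helper: pairs items by position up to the length of the shorter sequence.
    public static List<string> PairUp(
        IEnumerable<string> sheetColumns, IEnumerable<string> tableColumns)
    {
        // Enumerable.Zip (available since .NET 4.0) ends when either input ends.
        return sheetColumns
            .Zip(tableColumns, (sheet, table) => sheet + " -> " + table)
            .ToList();
    }
}
```

For the list boxes in the question, the inputs could be obtained with lbxShtCols.Items.Cast<ListItem>() and lbxSqlTablesCols.Items.Cast<ListItem>(), with the result selector building whatever pair object is needed.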
Or you could do it yourself (assuming both lists are IList<T> and have fast indexer):
    from i in Enumerable.Range(0, new[] { list1.Count, list2.Count }.Min())
    select new
    {
        item1 = list1[i],
        item2 = list2[i]
    }
Try intersecting the two lists; that will give you the common items. Then Take() as many items as the shorter of the two lists contains. If you don't know in advance which list is shorter, compute the smaller count first. Optionally sort the list BEFORE the Take() if you need to.
    // take at most as many items as the shorter list holds
    int numToTake = Math.Min(lbxShtCols.Items.Count, lbxSqlTablesCols.Items.Count);
    var commons = lbxShtCols.Items.Cast<ListItem>()
        .Intersect(lbxSqlTablesCols.Items.Cast<ListItem>())
        .Take(numToTake);

Resources