glib: sort by multiple attributes

I am using GLib to sort an array of pointers to structs:
gint cmp_values_by_attr1(gpointer a, gpointer b) {
    my_struct *seq_a = *((my_struct**) a);
    my_struct *seq_b = *((my_struct**) b);
    return (seq_a->attr1 - seq_b->attr1);
}
values = g_ptr_array_sized_new(4);
v = new_struct();
g_ptr_array_add(values, v);
...
g_ptr_array_sort(values, (GCompareFunc) cmp_values_by_attr1);
Now I would like to sort the array first by attr1 and then by attr2. How can I implement this?

It's quite simple. The comparison function returns a value less than, equal to, or greater than zero depending on whether the first value is less than, equal to, or greater than the second. All you need to do is compare the first attributes and, if the result is non-zero, return it; otherwise compare the second attributes and return that result:
gint comp_values(gpointer a, gpointer b) {
    gint res;
    my_struct *seq_a = *((my_struct**) a);
    my_struct *seq_b = *((my_struct**) b);
    res = seq_a->attr1 - seq_b->attr1;
    if (res == 0) {
        res = (seq_a->attr2 - seq_b->attr2);
    }
    return res;
}
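The call from the question stays the same; only the comparator changes:
g_ptr_array_sort(values, (GCompareFunc) comp_values);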

I have used a bit of a 'hack' to implement this sorting (a sketch follows the list of disadvantages below):
Make the two attributes to be sorted uint32_t.
Add another attribute, uint64_t for_sort, to the struct.
Put attr1 into the upper 32 bits of for_sort and attr2 into the lower 32 bits.
Sort the array by for_sort; the items then come out sorted first by attr1 and then by attr2.
I have implemented this and verified that it works.
Disadvantages:
Extra memory for the for_sort field.
Extra processing to keep for_sort up to date.
If the sorting attributes have other types, the packing has to change accordingly.
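For reference, here is a minimal sketch of this packing; the struct layout and the function names are assumptions for illustration, not the asker's actual code:
#include <glib.h>
#include <stdint.h>

typedef struct {
    uint32_t attr1;
    uint32_t attr2;
    uint64_t for_sort;   /* attr1 in the high 32 bits, attr2 in the low 32 bits */
} my_struct;

/* refresh the packed key whenever attr1 or attr2 changes */
static void update_sort_key(my_struct *s) {
    s->for_sort = ((uint64_t) s->attr1 << 32) | (uint64_t) s->attr2;
}

/* comparator over the packed key, usable with g_ptr_array_sort() */
static gint cmp_by_for_sort(gconstpointer a, gconstpointer b) {
    const my_struct *seq_a = *((my_struct * const *) a);
    const my_struct *seq_b = *((my_struct * const *) b);
    if (seq_a->for_sort < seq_b->for_sort) return -1;
    if (seq_a->for_sort > seq_b->for_sort) return 1;
    return 0;
}
Every time attr1 or attr2 changes, update_sort_key() has to be called again, which is the extra processing mentioned in the disadvantages.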

storing map<string, struct> into vector to sort

I am trying the following code. I want to sort by ascending or descending size, and from z to a (three different sorts). I can't figure out how to even store it in the vector, let alone sort it. Thanks for the help!
struct countSize {
    int count;
    uintmax_t size;
};

void sortMap(map<string, countSize> &extCount)
{
    // Copy
    vector<string, countSize> v(extCount.begin(), extCount.end());
    // Sort the vector according to either file size or desc alphabetically
    // print
}
int main()
{
    map<string, countSize> mp;
    mp["hello"] = { 1, 200 };
    mp["Ace"] = { 5, 600 };
    mp["hi"] = { 3, 300 };
    mp["br"] = { 2, 100 };
    sortMap(mp);
}
If you iterate over a map, you get a stream of std::pair<const X, Y>. That's a bit awkward for storing in a vector, because of the const. One solution is to just drop the const:
using my_map = std::map<std::string, countSize>;
// Mutable element type.
using my_map_element = std::pair<typename my_map::key_type,
                                 typename my_map::mapped_type>;
using my_element_list = std::vector<my_map_element>;
Then it's very straightforward to build a vector and sort it. Here, we use a template parameter for the comparison function, which makes it a lot easier to pass a lambda as the comparator:
template<typename Functor>
my_element_list sortMap(const my_map& the_map, Functor compare) {
    my_element_list v(the_map.begin(), the_map.end());
    std::sort(v.begin(), v.end(), compare);
    return v;
}
Unlike your code, this returns the sorted list; the caller can print it if desired. See, for example, the live example on Coliru.
That's not really ideal, though. If the individual elements of the map are at all complicated, it may well be more efficient to make a vector of pointers to the elements, rather than copies of the elements. Amongst other things, this does not require readjusting the element type, and that makes it possible to be agnostic about the base container type as well. However, you need to remember that the comparison functor will now receive pointers to the elements to be compared. See the modified example.
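A minimal sketch of that pointer-based variant, using hypothetical names (this is not the code behind the "modified example" link):
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct countSize { int count; std::uintmax_t size; };
using my_map = std::map<std::string, countSize>;

// Build a vector of pointers to the map's elements and sort the pointers.
template <typename Functor>
std::vector<const my_map::value_type*> sortMapByPointer(const my_map& the_map, Functor compare) {
    std::vector<const my_map::value_type*> v;
    v.reserve(the_map.size());
    for (const auto& element : the_map)
        v.push_back(&element);
    std::sort(v.begin(), v.end(), compare);
    return v;
}

// Usage: descending by size; note the comparator receives pointers to elements.
// auto by_size = sortMapByPointer(mp,
//     [](const my_map::value_type* a, const my_map::value_type* b) {
//         return a->second.size > b->second.size;
//     });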

Algorithm to match sets with overlapping members

I am looking for an efficient algorithm to match sets within a group of sets, ordered by the number of overlapping members. Two identical sets, for example, are the best match, while sets with no overlapping members are the worst.
So the algorithm takes a list of sets as input and returns matching set pairs, ordered by the number of overlapping members.
I would be interested in ideas for doing this efficiently. The brute-force approach is to try all combinations and sort, which is obviously not very performant when the number of sets is very large.
Edit: Use case - assume a large number of sets already exists. When a new set arrives, the algorithm is run and the output includes the matching sets (with at least one element of overlap), sorted from most matching to least (it doesn't matter how many items are in the new/incoming set). I hope that clarifies my question.
If you can afford an approximation algorithm with a chance of error, then you should probably consider MinHash.
This algorithm allows estimating the similarity between two sets in constant time. For each set, a fixed-size signature is computed, and then only the signatures are compared when estimating similarities. The similarity measure used is the Jaccard index, which ranges from 0 (disjoint sets) to 1 (identical sets) and is defined as the ratio of the size of the intersection to the size of the union: J(A, B) = |A ∩ B| / |A ∪ B|.
With this approach, any new set has to be compared against all existing ones (in linear time), and then the results can be merged into the top list (you can use a bounded search tree/heap for this purpose).
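As a rough illustration of how MinHash signatures work, here is a sketch; the mixing function and the signature length are assumptions for the example, not part of this answer:
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

constexpr int kHashes = 128;  // signature length; more hashes -> lower estimation error

// Parameterised 64-bit mixer used as the i-th hash function (an assumption for this sketch).
static std::uint64_t mix(std::uint64_t x, std::uint64_t seed) {
    x ^= seed;
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Fixed-size signature: for each hash function, keep the minimum hash over the set's elements.
std::vector<std::uint64_t> signature(const std::vector<std::uint64_t>& elements) {
    std::vector<std::uint64_t> sig(kHashes, std::numeric_limits<std::uint64_t>::max());
    for (std::uint64_t e : elements)
        for (int i = 0; i < kHashes; ++i)
            sig[i] = std::min(sig[i], mix(e, 0x9e3779b97f4a7c15ULL * (i + 1)));
    return sig;
}

// The fraction of matching slots estimates the Jaccard similarity of the underlying sets.
double estimated_similarity(const std::vector<std::uint64_t>& a, const std::vector<std::uint64_t>& b) {
    int equal = 0;
    for (int i = 0; i < kHashes; ++i)
        equal += (a[i] == b[i]);
    return static_cast<double>(equal) / kHashes;
}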
Since the number of possible different values is not very large, you get a fairly efficient hashing if you simply set the nth bit in a "large integer" when the nth number is present in your set. You can then look for overlap between sets with a simple bitwise AND followed by a "count set bits" operation. On a 64-bit architecture, that means you can compute the similarity between two sets (out of 1000 possible values) in about 16 cycles, regardless of the number of values in each cluster. As the clusters get more sparse, this becomes a less efficient algorithm.
Still, I implemented some of the basic functions you might need in the code attached here - not documented, but reasonably understandable, I think. In this example I made the numbers small so I can check the result by hand - you might want to change some of the #defines to get larger ranges of values, and obviously you will want dynamic lists etc. to keep up with the growing catalog.
#include <stdio.h>
#include <stdlib.h>

// biggest number you will come across: want this to be much bigger
#define MAXINT 25
// use the biggest type you have - not int
#define BITSPER (8*sizeof(int))
#define NWORDS (MAXINT/BITSPER + 1)
// max number in a cluster
#define CSIZE 5

typedef struct {
    unsigned int num[NWORDS]; // want to use longest type but not for demo
    int newmatch;
    int rank;
} hmap;

// convert a set of numbers to its bitmap representation:
void hashIt(int* t, int n, hmap* h) {
    int ii;
    for (ii = 0; ii < n; ii++) {
        int a, b;
        a = t[ii] % BITSPER;
        b = t[ii] / BITSPER;
        h->num[b] |= 1u << a;
    }
}

// print a binary number:
void printBinary(unsigned int n) {
    unsigned int jj;
    jj = 1u << 31;
    while (jj != 0) {
        printf("%c", ((n & jj) != 0) ? '1' : '0');
        jj >>= 1;
    }
    printf(" ");
}

// print the array of binary numbers:
void printHash(hmap* h) {
    unsigned int ii;
    for (ii = 0; ii < NWORDS; ii++) {
        printf("0x%08x: ", h->num[ii]);
        printBinary(h->num[ii]);
    }
    //printf("\n");
}

// count the set bits in b:
int countBits(unsigned int b) {
    int count;
    for (count = 0; b != 0; count++) {
        b &= b - 1; // this clears the lowest set bit
    }
    return count;
}

// find the maximum overlap of set m with the other n sets
int maxOverlap(hmap* h, int m, int n) {
    int ii, jj;
    int overlap, maxOverlap = -1;
    for (ii = 0; ii < n; ii++) {
        if (ii == m) continue; // don't compare with yourself
        overlap = 0;
        for (jj = 0; jj < NWORDS; jj++) {
            // just to see what's going on: take these print statements out
            printBinary(h[ii].num[jj]);
            printBinary(h[m].num[jj]);
            int bc = countBits(h[ii].num[jj] & h[m].num[jj]);
            printBinary(h[ii].num[jj] & h[m].num[jj]);
            printf("%d bits overlap\n", bc);
            overlap += bc;
        }
        if (overlap > maxOverlap) maxOverlap = overlap;
    }
    return maxOverlap;
}

int main(void) {
    int temp[CSIZE];
    int ii, jj;
    static hmap H[20]; // make them all 0 initially
    for (jj = 0; jj < 20; jj++) {
        for (ii = 0; ii < CSIZE; ii++) {
            temp[ii] = rand() % MAXINT;
        }
        hashIt(temp, CSIZE, &H[jj]);
    }
    for (ii = 0; ii < 20; ii++) {
        printHash(&H[ii]);
        printf("max overlap: %d\n", maxOverlap(H, ii, 20));
    }
    return 0;
}
See if this helps at all...

Can a red-black tree cope with this comparison function?

I was thinking of using a red-black tree that does not support multiple insertions of the same key, with a comparison function similar to this one:
int compare(MyObject A, MyObject B)
{
    if (A.error > B.error) return 1;
    if (A.error < B.error) return -1;
    if (A.name == B.name) return 0;
    return 1;
}
This trick would be useful for having multiple items with the same error but different names: if two items with the same error are found but the name does not coincide, the item being compared is simply treated as "bigger".
I am pretty sure this trick would work with a normal BST, but I am having trouble with a red-black tree. I do not know the red-black tree algorithm (I am using an existing implementation), so I wonder if there is any reason why this should not work.
P.S.: name does not have an ordering relation, so the only thing I can do is check whether two names are the same.
P.P.S.: Assuming this does not work, and knowing that I cannot define an order relation between the "name" values, what other possibilities do I have? I could use a data structure that allows inserting multiple values with the same key, but that won't work for me, because when I delete a value I must be sure I am deleting the value I actually passed (basically, for me the key and the value are the same thing; I need a sort of ordered multiset data structure!).
Binary search trees expect your comparison function to obey the usual rules for a total ordering over the elements you are going to insert into the tree. Your current comparison function doesn't obey this: if you have objects A and B with the same error key but different value keys, then according to compare both A < B and B < A hold.
I think it should all work correctly if you change your comparison function to
int compare(MyObject A, MyObject B)
{
    if (A.error > B.error) return 1;
    if (A.error < B.error) return -1;
    if (A.value > B.value) return 1;
    if (A.value < B.value) return -1;
    return 0;
}
You did not define an order relation.
In your case, your objects are two-dimensional. I understand from your question that priority in the ordering should be given to the error field. Thus, an order relation (using lexicographic order) could be:
struct my_object {
    int error;
    int value;
};

int compare(struct my_object *a, struct my_object *b)
{
    int ret;
    if (!a) {
        return 1;
    }
    else if (!b) {
        return -1;
    }
    ret = a->error - b->error;
    if (!ret) {
        ret = a->value - b->value;
    }
    return ret;
}

Find out which combinations of numbers in a set add up to a given total

I've been tasked with helping some accountants solve a common problem they have - given a list of transactions and a total deposit, which transactions are part of the deposit? For example, say I have this list of numbers:
1.00
2.50
3.75
8.00
Since I know that my total deposit is 10.50, I can easily see that it's made up of the 8.00 and 2.50 transactions. However, given a hundred transactions and a deposit in the millions, it quickly becomes much more difficult.
In testing a brute force solution (which takes way too long to be practical), I had two questions:
With a list of about 60 numbers, it seems to find a dozen or more combinations for any reasonable total. I was expecting a single combination to satisfy my total, or maybe a few possibilities, but there always seem to be a ton of combinations. Is there a math principle that describes why this is? It seems that, given a collection of random numbers of even medium size, you can find multiple combinations that add up to just about any total you want.
I built a brute force solution for the problem, but it's clearly O(n!), and quickly grows out of control. Aside from the obvious shortcuts (exclude numbers larger than the total themselves), is there a way to shorten the time to calculate this?
Details on my current (super-slow) solution:
The list of detail amounts is sorted largest to smallest, and then the following process runs recursively:
Take the next item in the list and see if adding it to your running total makes your total match the target. If it does, set aside the current chain as a match. If it falls short of your target, add it to your running total, remove it from the list of detail amounts, and then call this process again.
This way it excludes the larger numbers quickly, cutting the list down to only the numbers it needs to consider. However, it's still n! and larger lists never seem to finish, so I'm interested in any shortcuts I might be able to take to speed this up - I suspect that even cutting 1 number out of the list would cut the calculation time in half.
Thanks for your help!
This special case of the Knapsack problem is called Subset Sum.
C# version
setup test:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main(string[] args)
    {
        // subtotal list
        List<double> totals = new List<double>(new double[] { 1, -1, 18, 23, 3.50, 8, 70, 99.50, 87, 22, 4, 4, 100.50, 120, 27, 101.50, 100.50 });
        // get matches
        List<double[]> results = Knapsack.MatchTotal(100.50, totals);
        // print results
        foreach (var result in results)
        {
            Console.WriteLine(string.Join(",", result));
        }
        Console.WriteLine("Done.");
        Console.ReadKey();
    }
}
code:
using System.Collections.Generic;
using System.Linq;

public class Knapsack
{
    internal static List<double[]> MatchTotal(double theTotal, List<double> subTotals)
    {
        List<double[]> results = new List<double[]>();
        while (subTotals.Contains(theTotal))
        {
            results.Add(new double[1] { theTotal });
            subTotals.Remove(theTotal);
        }
        // if no subtotals were passed
        // or all matched the Total
        // return
        if (subTotals.Count == 0)
            return results;
        subTotals.Sort();
        double mostNegativeNumber = subTotals[0];
        if (mostNegativeNumber > 0)
            mostNegativeNumber = 0;
        // if there aren't any negative values
        // we can remove any values bigger than the total
        if (mostNegativeNumber == 0)
            subTotals.RemoveAll(d => d > theTotal);
        // if there aren't any negative values
        // and the sum is less than the total, no need to look further
        if (mostNegativeNumber == 0 && subTotals.Sum() < theTotal)
            return results;
        // get the combinations for the remaining subTotals
        // skip 1 since we already removed subTotals that match
        for (int choose = 2; choose <= subTotals.Count; choose++)
        {
            // get combinations for each length
            IEnumerable<IEnumerable<double>> combos = Combination.Combinations(subTotals.AsEnumerable(), choose);
            // add combinations where the sum matches the total to the result list
            results.AddRange(from combo in combos
                             where combo.Sum() == theTotal
                             select combo.ToArray());
        }
        return results;
    }
}

public static class Combination
{
    public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> elements, int choose)
    {
        return choose == 0 ?                    // if choose = 0
            new[] { new T[0] } :                // return an empty T array
            elements.SelectMany((element, i) => // else recursively iterate over the array to create combinations
                elements.Skip(i + 1).Combinations(choose - 1).Select(combo => (new[] { element }).Concat(combo)));
    }
}
results:
100.5
100.5
-1,101.5
1,99.5
3.5,27,70
3.5,4,23,70
3.5,4,23,70
-1,1,3.5,27,70
1,3.5,4,22,70
1,3.5,4,22,70
1,3.5,8,18,70
-1,1,3.5,4,23,70
-1,1,3.5,4,23,70
1,3.5,4,4,18,70
-1,3.5,8,18,22,23,27
-1,3.5,4,4,18,22,23,27
Done.
If subTotals are repeated, there will appear to be duplicate results (the desired effect). In reality, you will probably want to use the subTotal tupled with some ID, so you can relate it back to your data.
If I understand your problem correctly, you have a set of transactions, and you merely wish to know which of them could have been included in a given total. So if there are 4 possible transactions, then there are 2^4 = 16 possible sets to inspect. The problem is that for 100 possible transactions, the search space has 2^100 = 1267650600228229401496703205376 possible combinations to search over. For 1000 potential transactions in the mix, it grows to a total of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
sets that you must test. Brute force will hardly be a viable solution on these problems.
Instead, use a solver that can handle knapsack problems. But even then, I'm not sure that you can generate a complete enumeration of all possible solutions without some variation of brute force.
There is a cheap Excel Add-in that solves this problem: SumMatch
The Excel Solver add-in, as posted over on superuser.com, has a great solution (if you have Excel): https://superuser.com/questions/204925/excel-find-a-subset-of-numbers-that-add-to-a-given-total
It's kind of like the 0-1 Knapsack problem, which is NP-complete but can be solved with dynamic programming in pseudo-polynomial time (polynomial in the size of the target amount).
http://en.wikipedia.org/wiki/Knapsack_problem
At the end of the algorithm you also need to check that the sum is exactly what you wanted.
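As a sketch of that dynamic programming approach (the names are mine, not from the linked article): amounts are converted to integer cents to avoid floating-point equality issues, and the table size grows with the target, so this is only practical when the deposit in cents is not huge. It finds one matching combination, not all of them.
#include <cstdint>
#include <vector>

// Returns the indices of one subset of `cents` that sums to `target`, or an empty
// vector if no such subset exists. Classic 0-1 knapsack / subset-sum DP.
std::vector<int> find_subset(const std::vector<std::int64_t>& cents, std::int64_t target) {
    // dp[s] = index of the transaction used to reach sum s (or -1 if s is unreachable);
    // dp[0] holds a sentinel so that "reachable" is simply dp[s] != -1.
    std::vector<std::int64_t> dp(static_cast<std::size_t>(target) + 1, -1);
    dp[0] = static_cast<std::int64_t>(cents.size());      // sentinel for the empty subset
    for (int i = 0; i < static_cast<int>(cents.size()); ++i) {
        if (cents[i] <= 0 || cents[i] > target) continue;  // this simple version assumes positive amounts
        for (std::int64_t s = target; s >= cents[i]; --s)
            if (dp[s] == -1 && dp[s - cents[i]] != -1)
                dp[s] = i;                                 // item i completes the sum s
    }
    std::vector<int> subset;
    if (dp[target] == -1) return subset;                   // no combination adds up to the target
    for (std::int64_t s = target; s > 0; s -= cents[dp[s]])
        subset.push_back(static_cast<int>(dp[s]));         // walk the chain back to zero
    return subset;
}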
Depending on your data, you could first look at the cents portion of each transaction. As in your initial example, you know that 2.50 has to be part of the total because it is the only combination of non-zero-cents transactions whose cents add up to 50.
Not a super efficient solution, but here's an implementation in CoffeeScript.
combinations returns all possible combinations of the elements in list
combinations = (list) ->
  permutations = Math.pow(2, list.length) - 1
  out = []
  combinations = []
  while permutations
    out = []
    for i in [0...list.length]
      y = (1 << i)
      if (y & permutations)
        out.push(list[i])
    if out.length <= list.length and out.length > 0
      combinations.push(out)
    permutations--
  return combinations
and then find_components makes use of it to determine which numbers add up to total
find_components = (total, list) ->
  # given a list that is assumed to have only unique elements
  list_combinations = combinations(list)
  for combination in list_combinations
    sum = 0
    for number in combination
      sum += number
    if sum is total
      return combination
  return []
Here's an example:
list = [7.2, 3.3, 4.5, 6.0, 2, 4.1]
total = 7.2 + 2 + 4.1
console.log(find_components(total, list))
which returns [ 7.2, 2, 4.1 ]
#include <stdio.h>
#include <stdlib.h>

/* Takes at least 3 numbers as arguments.
 * First number is the desired sum.
 * Find the subset of the rest that comes closest
 * to the desired sum without going over.
 */

static long *elements;
static int nelements;

/* A linked list of some elements, not necessarily all.
 * The list represents the optimal subset for elements in the range [index..nelements-1]. */
struct status {
    long sum;            /* sum of all the elements in the list */
    struct status *next; /* points to the next element in the list */
    int index;           /* index into the elements array of this element */
};

/* find the subset of elements[startingat .. nelements-1] whose sum is closest to but does not exceed desiredsum */
struct status *reportoptimalsubset(long desiredsum, int startingat) {
    struct status *sumcdr = NULL;
    struct status *sumlist = NULL;

    /* sum of zero elements or summing to zero */
    if (startingat == nelements || desiredsum == 0) {
        return NULL;
    }

    /* optimal sum using the current element */
    /* if the current elements[startingat] is too big, it won't fit, so don't try it */
    if (elements[startingat] <= desiredsum) {
        sumlist = malloc(sizeof(struct status));
        sumlist->index = startingat;
        sumlist->next = reportoptimalsubset(desiredsum - elements[startingat], startingat + 1);
        sumlist->sum = elements[startingat] + (sumlist->next ? sumlist->next->sum : 0);
        if (sumlist->sum == desiredsum)
            return sumlist;
    }

    /* optimal sum not using the current element */
    sumcdr = reportoptimalsubset(desiredsum, startingat + 1);

    if (!sumcdr) return sumlist;
    if (!sumlist) return sumcdr;
    return (sumcdr->sum < sumlist->sum) ? sumlist : sumcdr;
}

int main(int argc, char **argv) {
    struct status *result = NULL;
    long desiredsum = strtol(argv[1], NULL, 10);

    nelements = argc - 2;
    elements = malloc(sizeof(long) * nelements);
    for (int i = 0; i < nelements; i++) {
        elements[i] = strtol(argv[i + 2], NULL, 10);
    }

    result = reportoptimalsubset(desiredsum, 0);
    if (result)
        printf("optimal subset = %ld\n", result->sum);
    while (result) {
        printf("%ld + ", elements[result->index]);
        result = result->next;
    }
    printf("\n");
    return 0;
}
By the way, it's best to avoid floats and doubles when doing arithmetic and equality comparisons like this.

Suggest a good method with the lowest lookup time complexity

I have a structure which has 3 identifier fields and one value field. I have a list of these objects. To give an analogy, the identifier fields are like the primary keys to the object. These 3 fields uniquely identify an object.
Class
{
    int a1;
    int a2;
    int a3;
    int value;
};
I would have a list of, say, 1000 objects of this datatype. I need to check for specific values of these identity keys by passing values of a1, a2 and a3 to a lookup function, which checks whether any object with those specific values of a1, a2 and a3 is present and returns its value. What is the most effective way to implement this to achieve the best lookup time?
One solution I could think of is a 3-dimensional matrix of length, say, 1000, and to populate the values in it. This has a lookup time of O(1), but the disadvantages are:
1. I need to know the length of the array in advance.
2. For more identity fields (say 20), I would need a 20-dimensional matrix, which would be overkill on memory. In my actual implementation I have 23 identity fields.
Can you suggest a good way to store this data that would give me the best lookup time?
Create a key class that contains all the identity fields, and define an appropriate equals function and hash method, and then use a hash map to map from the key class to its associated value. This will give you a time complexity of O(1) per lookup in the expected case, and it only requires space proportional to the number of actual key combinations observed (typically twice the number, although you can adjust the constant for the time/space tradeoff that you desire), rather than space proportional to all possible key combinations.
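In C++, for example, this could be an unordered_map keyed by a small struct; the names below are illustrative, not from the original answer:
#include <cstddef>
#include <functional>
#include <unordered_map>

struct Key {
    int a1, a2, a3;
    bool operator==(const Key& o) const { return a1 == o.a1 && a2 == o.a2 && a3 == o.a3; }
};

struct KeyHash {
    std::size_t operator()(const Key& k) const {
        // combine the three fields; any reasonable mixing scheme will do
        std::size_t h = std::hash<int>()(k.a1);
        h = h * 31 + std::hash<int>()(k.a2);
        h = h * 31 + std::hash<int>()(k.a3);
        return h;
    }
};

std::unordered_map<Key, int, KeyHash> table;   // maps (a1, a2, a3) -> value

// Usage:
//   table[{1, 2, 3}] = 42;
//   auto it = table.find({4, 5, 6});           // O(1) expected; it == table.end() if absent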
Use a hash table (map). Construct the key as the string "a1-a2-a3", and store the data as H(key) = data.
I would simply sort the array by key, and use a binary search.
(untested)
/* the three key fields plus the value (definition implied by the question) */
typedef struct {
    int a1, a2, a3;
    int value;
} ENTRY;

int compare_entry(ENTRY *k1, ENTRY *k2) {
    int d = k1->a1 - k2->a1;
    if (d == 0) {
        d = k1->a2 - k2->a2;
        if (d == 0) {
            d = k1->a3 - k2->a3;
        }
    }
    return d; // >0 if k1 > k2, 0 if k1 == k2, <0 if k1 < k2
}

// Lower-bound binary search, derived from Wikipedia
int find(ENTRY *list, int size, ENTRY *value) {
    int low = 0;
    int high = size;                       // search the half-open range [low, high)
    while (low < high) {
        int mid = low + (high - low) / 2;
        int cmp = compare_entry(&list[mid], value);
        if (cmp < 0) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    if (low < size && compare_entry(&list[low], value) == 0) {
        return low;                        // found item at index 'low'
    }
    return -1;                             // not found
}
Absolutely worst case, you run through this thing, what, 10 times, and end up actually doing all of the comparisons in the key comparison. So that's, what, 85 integer math operations (additions, subtraction, and 1 shift)?
If your a1-a3 each range from 0 to 99, then you can make your key a1 * 10000 + a2 * 100 + a3 and do a single compare, and the worst case is 63 integer math operations. And your entire array fits within cache on most any modern processor. And it's memory efficient.
You can burn memory with a perfect hash or some other sparse matrix. Even with a perfect hash, I bet the hash calculation itself is competitive with this time, considering multiplication is expensive. This hits the memory bus harder, obviously.
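A quick sketch of that packed key, assuming each field really does stay below 100 (the type and function names here are illustrative):
struct Entry { int a1, a2, a3, value; };

// Collision-free only while every field stays in 0..99.
inline int packed_key(const Entry& e) { return e.a1 * 10000 + e.a2 * 100 + e.a3; }

inline int compare_entry(const Entry& lhs, const Entry& rhs) {
    return packed_key(lhs) - packed_key(rhs);   // one subtraction instead of three field compares
}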
