How to sort multiple columns: CSV? c++ - sorting

I am attempting to sort a CSV file by specifying which column order to sort in:
for example: ./csort 3, 1, 5 < DATA > SORTED_DATA
or ./csort 3, 4, 6, 2, 1, 5 < DATA ...
example line of DATA: 177,27,2,42,285,220
I used a vector split(string str) function to store the columns specified in the arguments which require sorting. Creating a vector:
vector<string> columns {"3", "1", "5"}; // for example
Not entirely sure how to use this columns vector to proceed with the sorting process; though, I am aware that I could use sort.
sort(v.begin(), v.end(), myfunction);

As I understand your question, you have already parsed your data into 4 vectors, 1 vector per column, and you want to be able to sort your data while specifying the precedence of the columns to sort by, i.e. sort by col1, then col3, then col4...
What you want to do isn't too difficult, but you'll have to backtrack a bit. There are multiple ways to approach the problem, but here's a rough outline. Based on the level of expertise you exhibit in your question, you might have to look up a few terms in the following outline, but if you do you'll have a good flexible solution to your problem.
You want to store your data by row, since you want to sort rows... 4 vectors for 4 columns won't help you here. If all 4 elements in the row are going to be of the same type, you could use a std::vector or std::array for the row. std::array is solid if the number of columns is known at compile time; use std::vector if it is only known at runtime. If the types are inhomogeneous, you could use a tuple or a struct. Whatever type you use, let's call it RowT.
Parse and store into your rows, make a vector of RowT.
Define a function object which provides the () operator for a left and right hand side of RowT. It must implement the "less than" operation following the precedence you want. Let's call that class CustomSorter.
Once you have that in place, your final sort will be:
CustomSorter cs(/*precedence arguments*/);
std::sort(rows.begin(), rows.end(), cs);
Everything is really straightforward; a basic example can be seen here in the customsort example. In my experience the only part you will have to work at is the sort algorithm itself.

The easiest way is to use a class that has a list of indexes as a member, and go through the list in order to see if the item is less than the other.
class VecLess
{
    std::vector<int> indexes; // column indexes, in order of precedence
public:
    VecLess(std::vector<int> init) : indexes(init)
    {
    }
    // Both rows are taken by const reference to avoid copying on every comparison.
    bool operator()(const std::vector<string> & lhs, const std::vector<string> & rhs) const
    {
        for (auto i = indexes.begin(); i != indexes.end(); ++i)
        {
            if (lhs[*i] < rhs[*i])
                return true;
            if (rhs[*i] < lhs[*i])
                return false;
        }
        return false; // equal on every listed column
    }
};
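To show how a comparator like this plugs into the whole task, here is a minimal sketch, not a definitive implementation. It assumes that every field is an integer (so fields are compared numerically, whereas a plain string comparison would put "27" before "3") and that the column numbers on the command line are 1-based. The names Row, split_csv_line, ColumnLess and sort_rows are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;

// Split one CSV line on commas (no quoting or escaping is handled).
Row split_csv_line(const std::string& line) {
    Row fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, ',')) {
        fields.push_back(field);
    }
    return fields;
}

// Compares rows column by column, in the given order of precedence.
// Assumption: every compared field parses as an integer (std::stol throws otherwise).
class ColumnLess {
public:
    explicit ColumnLess(const std::vector<std::size_t>& columns) : columns_(columns) {}

    bool operator()(const Row& lhs, const Row& rhs) const {
        for (std::size_t c : columns_) {
            const long a = std::stol(lhs.at(c)); // at() throws if the row is too short
            const long b = std::stol(rhs.at(c));
            if (a < b) return true;
            if (b < a) return false;
        }
        return false; // equal on every listed column
    }

private:
    std::vector<std::size_t> columns_; // zero-based column indexes
};

// Sort a copy of the rows; the column numbers are 1-based, as on the command line.
std::vector<Row> sort_rows(std::vector<Row> rows,
                           const std::vector<std::size_t>& one_based_columns) {
    std::vector<std::size_t> zero_based;
    for (std::size_t c : one_based_columns) {
        zero_based.push_back(c - 1);
    }
    std::stable_sort(rows.begin(), rows.end(), ColumnLess(zero_based));
    return rows;
}
```

Reading the rows would then be a loop of std::getline over std::cin that calls split_csv_line, followed by one call to sort_rows with the columns parsed from argv. std::stable_sort is used so that rows which tie on every listed column keep their input order.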

Related

Eigen cast with auto return type - Less efficient than explicit return type?

When casting a vector of integers (i.e. Eigen::VectorXi) to a vector of doubles, and then operating on that vector of doubles, the generated assembly is dramatically different if the return type of the cast is auto.
In other words, using:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
Eigen::VectorXd dbl_vec = int_vec.cast<double>();
Compared to:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
auto dbl_vec = int_vec.cast<double>();
Here are two examples on godbolt:
VectorXd return type: https://godbolt.org/z/0FLC4r
auto return type: https://godbolt.org/z/MGxCaL
What are the ramifications of using auto for the return here? I thought it would be more efficient by avoiding a copy, but now I'm not sure.
In the code in your question you do avoid a copy (in fact, until dbl_vec is used, it's essentially a no-op). However, in the code on godbolt, you traverse the original int_vec and evaluate dbl_vec at least twice, possibly thrice:
max + std::log((dbl_vec.array() - max)
^^^             ^^^^^^^           ^^^
I'm not sure if the two calls to max are collapsed into a temporary or not. I'd hope so.
In any case, kmdreko is right and you should avoid using auto with Eigen unless you know exactly what you're doing. In this case, the auto is an expression template that does not get evaluated until used. If you use it more than once, then it gets evaluated more than once. If the evaluation is expensive, then the savings from not using a copy are lost (with interest) to the additional evaluation times.
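The cost of re-evaluating a lazy expression can be illustrated without Eigen. The sketch below is a toy stand-in written in plain C++ and is not how Eigen is implemented: LazyCast, max_of, sum_of and the two counting functions are hypothetical names. It only counts how many int-to-double conversions happen when a lazy "cast" is traversed twice, compared with materializing it once first (the analogue of assigning to a VectorXd).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a lazy cast expression. This is NOT Eigen code; it only
// mimics the idea that an element is converted every time it is read.
struct LazyCast {
    const std::vector<int>& source;
    std::size_t& conversions; // incremented on every element conversion

    std::size_t size() const { return source.size(); }
    double operator[](std::size_t i) const {
        ++conversions;
        return static_cast<double>(source[i]);
    }
};

// Largest element; assumes v is not empty.
template <typename Vec>
double max_of(const Vec& v) {
    double best = v[0];
    for (std::size_t i = 1; i < v.size(); ++i) {
        const double x = v[i];
        if (x > best) best = x;
    }
    return best;
}

template <typename Vec>
double sum_of(const Vec& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Use the lazy expression directly in two operations: two full traversals.
std::size_t conversions_when_lazy(const std::vector<int>& data) {
    std::size_t n = 0;
    LazyCast lazy{data, n};
    const double m = max_of(lazy);
    const double s = sum_of(lazy);
    (void)m;
    (void)s;
    return n;
}

// Evaluate once into a concrete vector, then reuse the stored doubles.
std::size_t conversions_when_materialized(const std::vector<int>& data) {
    std::size_t n = 0;
    LazyCast lazy{data, n};
    std::vector<double> dbl(lazy.size());
    for (std::size_t i = 0; i < lazy.size(); ++i) dbl[i] = lazy[i];
    const double m = max_of(dbl);
    const double s = sum_of(dbl);
    (void)m;
    (void)s;
    return n;
}
```

For a three-element input the lazy version performs six conversions and the materialized one three. Whether the extra copy pays off in real Eigen code depends on how expensive the expression is and how often it is reused, so measuring is still the safest guide.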

Reordering members in a template by alignment

Assume I write the following code:
template<typename T1, typename T2>
struct dummy {
T1 first;
T2 second;
};
I would like to know in general how I can order members in a template class by descending size. In other words, I would like the above class to be
struct dummy {
int first;
char second;
};
when instantiated as dummy<int, char>. However, I would like to obtain
struct dummy {
int second;
char first;
};
in the case dummy<char, int>.
On most platforms, padding for std::pair occurs only at "natural" alignment. This sort of padding will end up the same for either order.
For std::tuple, some arrangements can be more efficient than others, but the library can choose any memory layout it likes, so any TMP you add on top is only second-guessing.
In general, yes, you can define a sorting algorithm using templates, but it would be a fair bit of work.
This can be done; the only issue is the naming: how would you name your fields?
I did what you are asking for not long ago. I used std::tuple and some meta-programming, with a merge sort to reorder the template arguments. It is really fun to do (if you like functional programming).
For the naming I used some macros to access the fields.
I really encourage you to do it yourself, as it is intellectually interesting. However, if you would like to see some code, please tell me!
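For the two-member case from the question, a full sort is not needed: std::conditional can pick the larger type first. This is only a sketch under that assumption. The name size_ordered_pair is hypothetical, and it orders by sizeof as the question asks (alignof could be substituted).

```cpp
#include <cassert>
#include <type_traits>

// Stores the larger of the two types first, whatever the argument order.
template <typename T1, typename T2>
struct size_ordered_pair {
    static const bool first_is_larger = sizeof(T1) >= sizeof(T2);

    typedef typename std::conditional<first_is_larger, T1, T2>::type larger_type;
    typedef typename std::conditional<first_is_larger, T2, T1>::type smaller_type;

    larger_type larger;   // declared first, so it comes first in memory
    smaller_type smaller;
};
```

Note how the members end up being called larger and smaller rather than first and second: this is the naming problem mentioned above, since a member name can no longer tell you which template argument it holds.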

Create 3rd vector while looping through 2 others

I'm a total newbie in C++ and I need to solve a problem with vectors. I need to combine two existing vectors into a third one. I have seen several answers on merging, but the difference here is that vector #3 (values3) should contain not all values, but only those which are in both vector #1 (values1) and vector #2 (values2). So, if integer 2 is in vector 1 but not in vector 2, that number does not qualify. I have to use the function provided below. The lines marked with a // ? comment are the ones where I don't know what to write. The other lines are working.
void CommonValues(vector<MainClass> & values1, vector<MainClass> & values2, vector<MainClass> & values3)
{
MainClass Class;
string pav;
int kiek;
vector<MainClass>::iterator iter3; // ?
for (vector<MainClass>::iterator iter1 = values1.begin(); iter1 != values1.end(); iter1++)
{
for (vector<MainClass>::iterator iter2 = values2.begin(); iter2 != values2.end(); iter2++)
{
if (iter1 == iter2)
{
pav = iter2->TakePav();
iter3->TakePav(pav); // ?
kiek = iter1->TakeKiek() + iter2->TakeKiek();
iter3->TakeKie(kiek); // ?
iter3++; // ?
}
}
}
}
You can sort values1 and values2, then use std::set_intersection: http://en.cppreference.com/w/cpp/algorithm/set_intersection
Your code at the moment won't work. Among other problems, you are comparing an iterator from vector 1 with an iterator from vector 2, which doesn't make any sense. If you want to do it by looping, you should iterate through one vector and check whether the value, for example *iter1, is in the 2nd vector.
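As a minimal sketch of the sort-then-intersect approach, here it is on plain int vectors. MainClass is not shown in the question, so the element type and the function name common_values are assumptions made for illustration; for a class type you would also pass the same comparison to both std::sort and std::set_intersection.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Returns the values present in both inputs.
// The vectors are taken by value so the callers' data is left unsorted.
std::vector<int> common_values(std::vector<int> a, std::vector<int> b) {
    // std::set_intersection requires both ranges to be sorted.
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

With inputs {3, 1, 2, 5} and {5, 2, 7} this yields {2, 5}. Note that std::set_intersection copies the matching elements from the first range, so summing a field from both matches (as the question does with TakeKiek) would still need a hand-written loop over the two sorted vectors.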

Good algorithm to turn stl map into sorted list of the keys based on a numeric value

I have a stl map that's of type:
map<Object*, baseObject*>
where
class baseObject{
int ID;
//other stuff
};
If I wanted to return a list of objects (std::list< Object* >), what's the best way to sort it in order of the baseObject.ID's?
Am I just stuck looking through for every number or something? I'd prefer not to change the map to a boost map, although I wouldn't be necessarily against doing something that's self contained within a return function like
GetObjectList(std::list<Object*> &objects)
{
//sort the map into the list
}
Edit: maybe I should iterate through and copy the obj->baseobj into a map of baseobj.ID->obj ?
What I'd do is first extract the keys (since you only want to return those) into a vector, and then sort that:
std::vector<Object*> out; // the keys of the map are Object*
std::transform(myMap.begin(), myMap.end(), std::back_inserter(out), [](std::pair<Object*, baseObject*> p) { return p.first; });
std::sort(out.begin(), out.end(), [&myMap](Object* lhs, Object* rhs) { return myMap[lhs]->ID < myMap[rhs]->ID; });
If your compiler doesn't support lambdas, just rewrite them as free functions or function objects. I just used lambdas for conciseness.
For performance, I'd probably reserve enough room in the vector initially, instead of letting it gradually expand.
(Also note that I haven't tested the code, so it might need a little bit of fiddling)
Also, I don't know what this map is supposed to represent, but holding a map where both key and value types are pointers really sets my "bad C++" sense tingling. It smells of manual memory management and muddled (or nonexistent) ownership semantics.
You mentioned getting the output in a list, but a vector is almost certainly a better performing option, so I used that. The only situation where a list is preferable is really when you have no intention of ever iterating over it, and if you need the guarantee that pointers and iterators stay valid after modification of the list.
The first thing is that I would not use a std::list, but rather a std::vector. As for the particular problem, you need to perform two operations: generate the container, then sort it by whatever your criterion is.
// Extract the data:
std::vector<Object*> v;
v.reserve( m.size() );
std::transform( m.begin(), m.end(),
                std::back_inserter(v),
                []( const std::map<Object*, baseObject*>::value_type& v ) {
                    return v.first;
                } );
// Order according to the values in the map
std::sort( v.begin(), v.end(),
           [&m]( Object* lhs, Object* rhs ) {
               return m[lhs]->ID < m[rhs]->ID;
           } );
Without C++11 you will need to create functors instead of the lambdas, and if you insist on returning a std::list then you should use std::list<>::sort( Comparator ). Note that this is probably inefficient. If performance is an issue (after you get this working and you profile and know that this is actually a bottleneck) you might want to consider using an intermediate map<int,Object*>:
std::map<int,Object*> mm;
for ( auto it = m.begin(); it != m.end(); ++it ) {
    mm[ it->second->ID ] = it->first;
}
std::vector<Object*> v;
v.reserve( mm.size() ); // mm might have fewer elements than m!
std::transform( mm.begin(), mm.end(),
                std::back_inserter(v),
                []( const std::map<int, Object*>::value_type& v ) {
                    return v.second;
                } );
Again, this might be faster or slower than the original version... profile.
I think you'll do fine with:
void GetObjectList(std::list<Object*> &objects)
{
    std::vector<Object*> vec;
    vec.reserve(map.size());
    for(auto it = map.begin(), it_end = map.end(); it != it_end; ++it)
        vec.push_back(it->first); // the keys are what gets returned
    // Order the keys by the ID of the baseObject each one maps to.
    std::sort(vec.begin(), vec.end(), [&](Object* a, Object* b) { return map[a]->ID < map[b]->ID; });
    objects.assign(vec.begin(), vec.end());
}
Here's how to do what you said, "sort it in order of the baseObject.ID's":
typedef std::map<Object*, baseObject*> MapType;
MapType mymap; // don't care how this is populated
               // except that it must not contain null baseObject* values.
struct CompareByMappedId {
    const MapType &map;
    CompareByMappedId(const MapType &map) : map(map) {}
    bool operator()(Object *lhs, Object *rhs) const {
        return map.find(lhs)->second->ID < map.find(rhs)->second->ID;
    }
};
void GetObjectList(std::list<Object*> &objects) {
    assert(objects.empty()); // pre-condition, or could clear it
                             // or for that matter return a list by value instead.
    // copy keys into list
    for (MapType::const_iterator it = mymap.begin(); it != mymap.end(); ++it) {
        objects.push_back(it->first);
    }
    // sort the list
    objects.sort(CompareByMappedId(mymap));
}
This isn't desperately efficient: it does more looking up in the map than is strictly necessary, and manipulating list nodes in std::list::sort is likely a little slower than std::sort would be at manipulating a random-access container of pointers. But then, std::list itself isn't very efficient for most purposes, so you expect it to be expensive to set one up.
If you need to optimize, you could create a vector of pairs of (int, Object*), so that you only have to iterate over the map once, no need to look things up. Sort the pairs, then put the second element of each pair into the list. That may be a premature optimization, but it's an effective trick in practice.
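The vector-of-pairs idea could look roughly like the sketch below. Object and baseObject here are empty stand-ins for the real classes (with ID made public so it can be read), and the function name keys_sorted_by_id is hypothetical; like the code above, it assumes the map holds no null baseObject* values.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <map>
#include <utility>
#include <vector>

struct Object {};              // stand-in for the real key class
struct baseObject { int ID; }; // stand-in; ID is public here for simplicity

typedef std::map<Object*, baseObject*> MapType;
typedef std::pair<int, Object*> IdAndKey;

// Compare on the ID only, so the pointers themselves are never ordered.
bool id_less(const IdAndKey& a, const IdAndKey& b) {
    return a.first < b.first;
}

// One pass over the map, one sort, no lookups during the sort.
std::list<Object*> keys_sorted_by_id(const MapType& m) {
    std::vector<IdAndKey> pairs;
    pairs.reserve(m.size());
    for (MapType::const_iterator it = m.begin(); it != m.end(); ++it) {
        pairs.push_back(std::make_pair(it->second->ID, it->first));
    }

    std::sort(pairs.begin(), pairs.end(), id_less);

    std::list<Object*> out;
    for (std::vector<IdAndKey>::const_iterator it = pairs.begin(); it != pairs.end(); ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

Copying each ID next to its key trades a little memory for a comparator that does no map lookups at all. Keys whose IDs are equal end up in an unspecified relative order; std::stable_sort would keep them in map order instead.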
I would create a new map that had a sort criterion that used the component id of your objects. Populate the second map from the first map (just iterate through or std::copy in). Then you can read this map in order using the iterators.
This has a slight overhead in terms of insertion over using a vector or list (log(n) time instead of constant time), but it avoids the need to sort after you've created the vector or list which is nice.
Also, you'll be able to add more elements to it later in your program and it will maintain its order without needing to be re-sorted.
I'm not sure I completely understand what you're trying to store in your map but perhaps look here
The third template argument of a std::map is a less functor. Perhaps you can utilize this to sort the data stored in the map on insertion. Then it would be a straightforward loop over a map iterator to populate a list.

Store and update huge (and sparse?) multi-dimensional array efficiently to count conditional probabilities

Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next to last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j,k,i are successive words).
The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.
What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.
C++ code:
struct bigram_key{
    int i, j;// words - indexes of the words in a dictionary
    // a constructor to be easily constructible
    bigram_key(int a_i, int a_j):i(a_i), j(a_j){}
    // you need to sort keys to be used in a map container
    bool operator<(bigram_key const &other) const{
        return i<other.i || (i==other.i && j<other.j);
    }
};
struct bigram_data{
    int count;// n(ij)
    map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
};
map<bigram_key, bigram_data> trigrams;
The dictionary could be a vector of all found words like:
vector<string> dictionary;
but for better lookup word->index it could be a map:
map<string, int> dictionary;
When you read a new word, you add it to the dictionary and get its index k. You already have the indexes i and j of the previous two words, so then you just do:
trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;
For better performance you may search for bigram only once:
bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
Is it understandable? Do you need more details?
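Putting the pieces above together, a self-contained sketch might look like this. The trigram_counter class and its member names are made up for illustration, and tokenizing the text into words is assumed to happen elsewhere. Note that, as in the update step above, a bigram is only counted when a third word follows it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct bigram_key {
    int i, j; // dictionary indexes of two successive words
    bigram_key(int a_i, int a_j) : i(a_i), j(a_j) {}
    bool operator<(bigram_key const& other) const {
        return i < other.i || (i == other.i && j < other.j);
    }
};

struct bigram_data {
    int count = 0;                     // how often (i, j) was followed by any word
    std::map<int, int> trigram_counts; // trigram_counts[k]: how often k followed (i, j)
};

class trigram_counter {
public:
    // Count every window of three successive words.
    void feed(const std::vector<std::string>& words) {
        for (std::size_t n = 2; n < words.size(); ++n) {
            const int i = index_of(words[n - 2]);
            const int j = index_of(words[n - 1]);
            const int k = index_of(words[n]);
            bigram_data& bigram = trigrams_[bigram_key(i, j)]; // single lookup
            bigram.count++;
            bigram.trigram_counts[k]++;
        }
    }

    int count(const std::string& a, const std::string& b) const {
        const bigram_data* d = find(a, b);
        return d ? d->count : 0;
    }

    int count(const std::string& a, const std::string& b, const std::string& c) const {
        const bigram_data* d = find(a, b);
        std::map<std::string, int>::const_iterator w = dictionary_.find(c);
        if (!d || w == dictionary_.end()) return 0;
        std::map<int, int>::const_iterator it = d->trigram_counts.find(w->second);
        return it == d->trigram_counts.end() ? 0 : it->second;
    }

private:
    // Returns the index of a word, adding it to the dictionary if it is new.
    int index_of(const std::string& word) {
        const int next = static_cast<int>(dictionary_.size());
        return dictionary_.insert(std::make_pair(word, next)).first->second;
    }

    const bigram_data* find(const std::string& a, const std::string& b) const {
        std::map<std::string, int>::const_iterator ia = dictionary_.find(a);
        std::map<std::string, int>::const_iterator ib = dictionary_.find(b);
        if (ia == dictionary_.end() || ib == dictionary_.end()) return nullptr;
        std::map<bigram_key, bigram_data>::const_iterator it =
            trigrams_.find(bigram_key(ia->second, ib->second));
        return it == trigrams_.end() ? nullptr : &it->second;
    }

    std::map<std::string, int> dictionary_;
    std::map<bigram_key, bigram_data> trigrams_;
};
```

The conditional probability of a word c after the pair (a, b) would then be estimated as count(a, b, c) divided by count(a, b), whenever the latter is non-zero. Only the combinations that actually occur take up memory, which is the point of using maps instead of a dense 3-D array.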

Resources