Create 3rd vector while looping through 2 others - c++11

I'm a total newbie in C++ and I need to solve a problem with vectors. I need to merge two existing vectors and create a third one. I have seen several answers, but the difference here is that vector #3 (values3) should not contain all values, only those which are in both vector #1 (values1) and vector #2 (values2). So, if integer 2 is in vector 1 but not in vector 2, it should not be included. I have to use the function provided below. The commented lines are the ones where I don't know what to write. The other lines are working.
void CommonValues(vector<MainClass> & values1, vector<MainClass> & values2, vector<MainClass> & values3)
{
MainClass Class;
string pav;
int kiek;
vector<MainClass>::iterator iter3; // ?
for (vector<MainClass>::iterator iter1 = values1.begin(); iter1 != values1.end(); iter1++)
{
for (vector<MainClass>::iterator iter2 = values2.begin(); iter2 != values2.end(); iter2++)
{
if (iter1 == iter2)
{
pav = iter2->TakePav();
iter3->TakePav(pav); // ?
kiek = iter1->TakeKiek() + iter2->TakeKiek();
iter3->TakeKie(kiek); // ?
iter3++; // ?
}
}
}
}

You can sort values1 and values2, then use std::set_intersection: http://en.cppreference.com/w/cpp/algorithm/set_intersection
This requires a way to order MainClass objects (operator< or a comparison function).
Your code won't work at the moment. Among other problems, you are comparing an iterator from vector 1 with an iterator from vector 2, which doesn't make any sense. If you want to do it by looping, you should iterate through one vector and check whether the value, for example *iter1, is in the 2nd vector.
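The looping approach can be sketched roughly as follows. It is written in Python rather than C++ purely for brevity. The Item record and its name and count fields are hypothetical stand-ins for MainClass and whatever TakePav() and TakeKiek() return. Summing the counts mirrors what the question's loop attempts, and names are assumed to be unique within each list.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str   # hypothetical stand-in for the value returned by TakePav()
    count: int  # hypothetical stand-in for the value returned by TakeKiek()

def common_values(values1, values2):
    """Return items whose name occurs in both lists, with their counts summed."""
    # Index the second list by name so each lookup is a dictionary access.
    counts2 = {item.name: item.count for item in values2}
    result = []
    for item in values1:
        if item.name in counts2:
            # Append a new element instead of writing through an iterator.
            result.append(Item(item.name, item.count + counts2[item.name]))
    return result
```

In C++ the matching step would likewise append to values3 with push_back, since an iterator into an empty vector does not point at any element that could be written to.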


Getting the Shortest Path question right in Kotlin

So I got a question that was delivered as a 2D List
val SPE = listOf(
listOf('w', 'x'),
listOf('x', 'y'),
listOf('z', 'y'),
listOf('z', 'v'),
listOf('w', 'v')
)
It asks to find the shortest path between w and z. So obviously, BFS would be the best course of action here to find that path the fastest. Here's my code for it
fun shortestPath(edges: List<List<Char>>, root: Char, destination: Char): Int {
val graph = buildGraph3(edges)
val visited = hashSetOf(root)
val queue = mutableListOf(mutableListOf(root, 0))
while (queue.size > 0){
val node = queue[0].removeFirst()
val distance = queue[0].removeAt(1)
if (node == destination) return distance as Int
graph[node]!!.forEach{
if (!visited.contains(it)){
visited.add(it)
queue.add(mutableListOf(it, distance + 1))
}
}
}
queue.sortedByDescending { it.size }
return queue[0][1]
}
fun buildGraph3(edges: List<List<Char>>): HashMap<Char, MutableList<Char>> {
val graph = HashMap<Char, MutableList<Char>>()
for (i in edges.indices){
for (n in 0 until edges[i].size){
var a = edges[i][0]
var b = edges[i][1]
if (!graph.containsKey(a)) { graph[a] = mutableListOf() }
if (!graph.containsKey(b)) { graph[b] = mutableListOf() }
graph[a]?.add(b)
graph[b]?.add(b)
}
}
return graph
}
I am stuck on the return part. I wanted to use a list to keep track of the distance for each char, but it won't let me return the number. I could have done this wrong, so any help is appreciated. Thanks.
If I paste your code into an editor I get this warning on your return queue[0][1] statement:
Type mismatch: inferred type is {Comparable<*> & java.io.Serializable} but Int was expected
The problem here is queue contains lists that hold Chars and Int distances, mixed together. You haven't specified the type that list holds, so Kotlin has to infer it from the types of the things you've put in the list. The most general type that covers both is Any?, but the compiler tries to be as specific as it can, inferring the most specific type that covers both Char and Int.
In this case, that's Comparable<*> & java.io.Serializable. So when you pull an item out with queue[0][1], the value you get is a Comparable<*> & java.io.Serializable, not an Int, which is what your function is supposed to be returning.
You can "fix" this by casting - since you know how your list is meant to be organised, two elements with a Char then an Int, you can provide that information to the compiler, since it has no idea what you're doing beyond what it can infer:
val node = queue[0].removeFirst() as Char
val distance = queue[0].removeAt(1) as Int
...
return queue[0][1] as Int
But ideally you'd be using the type system to create some structure around your data, so the compiler knows exactly what everything is. The most simple, generic one of these is a Pair (or a Triple if you need 3 elements):
val queue = mutableListOf(Pair<Char, Int>(root, 0))
// or if you don't want to explicitly specify the type
val queue = mutableListOf(root to 0)
Now the type system knows that the items in your queue are Pairs where the first element is a Char, and the second is an Int. No need to cast anything, and it will be able to help you as you try to work with that data, and tell you if you're doing the wrong thing.
It might be better to make actual classes that reflect your data, e.g.
data class Step(val node: Char, val distance: Int)
because a Pair is pretty general, but it's up to you. You can pull the data out of it like this:
val node = queue[0].first
val distance = queue[0].second
// or use destructuring to assign the components to multiple variables at once
val (node, distance) = queue[0]
If you make those changes, you'll have to rework some of your algorithm - but you'll have to do that anyway, it's broken in a few ways. I'll just give you some pointers:
your return queue[0][1] line can only be reached when queue is empty
queue[0].removeAt(1) is happening on a list that now has 1 element (i.e. at index 0)
don't you need to remove items from your queue instead?
when building your graph, you call add(b) twice (the second call, graph[b]?.add(b), presumably should be graph[b]?.add(a))
try printing your graph, the queue at each stage in the loop etc to see what's happening! Make sure it's doing what you expect. Comment out any code that doesn't work so you can make sure the stuff that runs before that is working.
Good luck with it! Hopefully once you get your types sorted out things will start to fall into place more easily
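For comparison, here is a minimal sketch of the same breadth-first search with the pointers above applied: each queue entry is a (node, distance) pair, entries are removed from the front of the queue, and every edge is stored in both directions. It is in Python rather than Kotlin, the function names are only illustrative, and returning -1 when no path exists is an assumption the question does not specify.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency list from pairs of nodes."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)  # the reverse direction, not b -> b
    return graph

def shortest_path(edges, root, destination):
    graph = build_graph(edges)
    visited = {root}
    queue = deque([(root, 0)])  # each entry is a (node, distance) pair
    while queue:
        node, distance = queue.popleft()  # remove the entry from the queue
        if node == destination:
            return distance
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return -1  # assumption: signal "no path" with -1
```

With the edge list from the question, shortest_path(edges, 'w', 'z') follows w, v, z and gives a distance of 2.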

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170 odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input & output has its own set of distinct value sets e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n) and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note, since p[i] <= 1 the log(p[i]) is never positive, and therefore the entropy is never negative. Also note, if p[i] = 0 then we assume that p[i] * log(p[i]) is also zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v (weighted by how often each value occurs, which matters once you work from a sample rather than a full enumeration).
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
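A rough sketch of the selection step, assuming the state table is held as a list of (inputs, output) pairs where inputs is a tuple. The function names are illustrative, not taken from any library.

```python
import math
from collections import Counter

def entropy(outputs):
    """Shannon entropy (natural log) of a list of output values."""
    n = len(outputs)
    counts = Counter(outputs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def gain_ratio(rows, var):
    """Information gained from input `var`, per unit of information needed to specify it."""
    total = entropy([out for _, out in rows])
    groups = {}
    for inputs, out in rows:
        groups.setdefault(inputs[var], []).append(out)
    # Average of the per-value entropies, weighted by how often each value occurs.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    cost = math.log(len(groups))  # information needed to specify var
    return (total - remainder) / cost if cost > 0 else 0.0

def best_variable(rows, variables):
    """Pick the input to switch on first."""
    return max(variables, key=lambda v: gain_ratio(rows, v))
```

To build the nested conditions, you would split rows by the chosen variable's value and call best_variable again on each subset with the remaining variables, stopping once a subset has only one output value.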
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 for each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
A trickier approach for the validation: if you can find fuzzing software written for your language, run it with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then for each output we can eliminate the inputs that do not matter for that output. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that do not matter for that output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
InputPossibleValues = new object[][]
{
new object[] { 1, 2, 3, }, // input #0
new object[] { 'a', 'b', 'c' }, // input #1
new object[] { "foo", "bar" }, // input #2
},
Converters = new System.Func<object[], object>[]
{
input => input[0], // output #0
input => (int)input[0] + (int)(char)input[1], // output #1
input => (string)input[2] == "foo" ? 1 : 42, // output #2
input => input[2].ToString() + input[1].ToString(), // output #3
input => (int)input[0] % 2, // output #4
}
};
The heart of the backward conversion is below. The full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
// generate all possible input vectors and record the result for each case
// then for each output we can figure out which inputs matter
object[][] inputs = schema.GenerateInputVectors();
// reversal path
for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
{
List<int> inputsThatDoNotMatter = new List<int>();
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
// find all groups for input vectors where all other inputs (excluding current) are the same
// if across these groups outputs are exactly the same, then it means that current input
// does not matter for given output
bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
.Where(x => x.Distinct().Count() > 1)
.Any();
if (!inputMatters)
{
inputsThatDoNotMatter.Add(inputIdx);
Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
}
}
// mapping table (only inputs that matter)
var mapping = new List<dynamic>();
foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
{
dynamic record = new ExpandoObject();
object[] sampleInput = inputGroup.First();
object output = schema.Convert(sampleInput)[outputIdx];
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
if (inputsThatDoNotMatter.Contains(inputIdx))
continue;
AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
}
AddProperty(record, $"Output #{outputIdx}", output);
mapping.Add(record);
}
// input x, ..., input y, output z form is needed
mapping.Dump();
}
}
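The elimination step on its own can be sketched in a few lines of Python. This is a simplified illustration of the idea, not a port of the LINQPad code. The name irrelevant_inputs is hypothetical, func stands in for one output column of the legacy code, and output values are assumed to be hashable.

```python
from itertools import product

def irrelevant_inputs(domains, func):
    """Return the indexes of inputs that never change the result of func."""
    vectors = list(product(*domains))  # every possible input vector
    irrelevant = []
    for idx in range(len(domains)):
        groups = {}
        for vec in vectors:
            # Group vectors that agree on every input except idx.
            rest = vec[:idx] + vec[idx + 1:]
            groups.setdefault(rest, set()).add(func(vec))
        # If every group has a single output, varying idx never matters.
        if all(len(outs) == 1 for outs in groups.values()):
            irrelevant.append(idx)
    return irrelevant
```

With the sample schema above, an output such as `1 if v[2] == "foo" else 42` would report inputs 0 and 1 as irrelevant.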

How to sort multiple columns: CSV? c++

I am attempting to sort a CSV file by specifying which column order to sort in:
for example: ./csort 3, 1, 5 < DATA > SORTED_DATA
or ./csort 3, 4, 6, 2, 1, 5 < DATA ...
example line of DATA: 177,27,2,42,285,220
I used a vector split(string str) function to store the columns specified in the arguments which require sorting. Creating a vector:
vector<string> columns {3, 1, 5}; // for example
Not entirely sure how to use this columns vector to proceed with the sorting process; though, I am aware that I could use sort.
sort(v.begin(), v.end(), myfunction);
As I understand your question, you have already parsed your data into 4 vectors, 1 vector per column, and you want to be able to sort your data, specifying the precedence of the columns to sort by, i.e. sort by col1, then col3, then col4...
What you want to do isn't too difficult, but you'll have to backtrack a bit. There are multiple ways to approach the problem, but here's a rough outline. Based on the level of expertise you exhibit in your question, you might have to look up a few terms in the following outline, but if you do you'll have a good flexible solution to your problem.
You want to store your data by row, since you want to sort rows... 4 vectors for 4 columns won't help you here. If all 4 elements in the row are going to be of the same type, you could use a std::vector or std::array for the row. std::array is solid if the number of columns is known at compile time, std::vector if it is only known at runtime. If the types are inhomogeneous, you could use a tuple, or a struct. Whatever type you use, let's call it RowT.
Parse and store into your rows, make a vector of RowT.
Define a function object which provides the () operator for a left- and right-hand side of type RowT. It must implement the "less than" operation following the precedence you want. Let's call that class CustomSorter.
Once you have that in place, your final sort will be:
CustomSorter cs(/*precedence arguments*/);
std::sort(rows.begin(), rows.end(), cs);
Everything is really straightforward; a basic example can be seen in the customsort example. In my experience the only part you will have to work at is the sort algorithm itself.
The easiest way is to use a class that has a list of indexes as a member, and go through the list in order to see if the item is less than the other.
#include <string>
#include <vector>

class VecLess
{
    std::vector<int> indexes;
public:
    VecLess(std::vector<int> init) : indexes(init)
    {
    }
    // Compare two rows column by column, in the order given by indexes.
    bool operator()(const std::vector<std::string> & lhs, const std::vector<std::string> & rhs) const
    {
        for (auto i = indexes.begin(); i != indexes.end(); ++i)
        {
            if (lhs[*i] < rhs[*i])
                return true;
            if (rhs[*i] < lhs[*i])
                return false;
        }
        return false; // equal on every listed column
    }
};
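The same precedence idea can be sketched compactly with a key function instead of a comparison object. This sketch is in Python for brevity. It assumes the column numbers on the command line are 1-based and that every field is an integer, as in the example line of DATA. Neither is confirmed by the question, and sort_csv is a hypothetical name.

```python
import csv
import io

def sort_csv(text, columns):
    """Sort CSV rows by the given 1-based columns, in order of precedence."""
    rows = list(csv.reader(io.StringIO(text)))
    # Build a tuple key; tuples compare element by element, like the loop in VecLess.
    rows.sort(key=lambda row: tuple(int(row[c - 1]) for c in columns))
    return rows
```

Converting fields to int matters: compared as strings, "177" sorts before "9", which is also how VecLess above behaves on std::string fields.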

Equality of All Elements in A Range

What Phobos algorithm should I use to check if all elements in a range are equal or not? I've looked in std.algorithm and the closest I've found is equal but it takes two ranges as argument. I also cannot find a way to apply reduce to solve this problem.
Nice, Adam. A few more possibilities:
foo.empty || foo.equal(repeat(foo.front, foo.length))
or
foo.empty || repeat(foo.front).startsWith(foo)
or
foo.findAdjacent!("a != b").empty
One option would be to use canFind:
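import std.algorithm;
import std.range;
import std.stdio;

void main() {
    int[] foo = [1, 1, 2];
    if (!foo.empty) {
        // look for any element that differs from the first one
        if (!canFind!"a != b"(foo, foo.front))
            writeln("all elements are equal");
        else
            writeln("not all elements are equal");
    } else { /* nothing to compare against */ }
}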
The logic here is if they are all equal, then it should not be able to find an item that is not equal to the first item.
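The same logic, written out language-neutrally as a small Python sketch (not Phobos code). Treating an empty range as "all equal" is an assumption here; the D snippet above leaves that case open.

```python
def all_equal(items):
    """True if no element differs from the first one."""
    iterator = iter(items)
    try:
        first = next(iterator)
    except StopIteration:
        return True  # assumption: an empty range counts as "all equal"
    # Stops at the first mismatch, like canFind.
    return not any(item != first for item in iterator)
```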
Andrei's answer has several more options!

word distribution problem

I have a big file of words (~100 GB) and limited memory (4 GB). I need to calculate the word distribution from this file. One option is to divide it into chunks, sort each chunk, and then merge to calculate the word distribution. Is there any other way it can be done faster? One idea is to sample, but I'm not sure how to implement it so that it returns something close to the correct solution.
Thanks
You can build a trie structure where each leaf (and some inner nodes) contains the current count. As many words share prefixes with each other, 4 GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write this out. Finally, you can do n-way merging of each chunk. At most you will have 100/4 chunks or so, but probably many fewer provided some words are more common than others (and how they cluster).
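A minimal sketch of that spill-and-merge idea in Python. The max_distinct limit is a hypothetical stand-in for "until it hits a certain limit in memory", and the helper names and the one-pair-per-line chunk format are assumptions for illustration.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby

def spill(counts, directory):
    """Write one chunk as sorted 'word count' lines and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, text=True)
    with os.fdopen(fd, "w") as out:
        for word in sorted(counts):
            out.write(f"{word} {counts[word]}\n")
    return path

def read_chunk(path):
    """Lazily yield (word, count) pairs from one sorted chunk."""
    with open(path) as chunk:
        for line in chunk:
            word, count = line.split()
            yield word, int(count)

def count_words(words, max_distinct, directory):
    """Yield (word, total) pairs in sorted order using bounded memory."""
    counts, paths = Counter(), []
    for word in words:
        counts[word] += 1
        if len(counts) >= max_distinct:  # hash table is "full": spill it
            paths.append(spill(counts, directory))
            counts = Counter()
    if counts:
        paths.append(spill(counts, directory))
    # n-way merge of the sorted chunks, then sum the counts per word.
    merged = heapq.merge(*(read_chunk(p) for p in paths))
    for word, group in groupby(merged, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)
```

Only one hash table and one line per chunk are held in memory at a time, so the limit can be tuned to the available RAM.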
Another option is to use a trie which was built for this kind of thing. Each character in the string becomes a branch in a 256-way tree and at the leaf you have the counter. Look up the data structure on the web.
If you can pardon the pun, "trie" this:
using System.Collections.Generic;
using System.Linq; // needed for Skip() and ToArray()

public class Trie : Dictionary<char, Trie>
{
public int Frequency { get; set; }
public void Add(string word)
{
this.Add(word.ToCharArray());
}
private void Add(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
this.Add(first, new Trie());
}
if (chars.Length == 1)
{
this[first].Frequency += 1;
}
else
{
this[first].Add(chars.Skip(1).ToArray());
}
}
public int GetFrequency(string word)
{
return this.GetFrequency(word.ToCharArray());
}
private int GetFrequency(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
return 0;
}
if (chars.Length == 1)
{
return this[first].Frequency;
}
else
{
return this[first].GetFrequency(chars.Skip(1).ToArray());
}
}
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
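One possible shape for that traversal, sketched in Python rather than C#. The class below is a simplified re-implementation for illustration, not a translation of the code above: it stores children in a dictionary and, unlike the C# version, accepts an empty word.

```python
class Trie:
    def __init__(self):
        self.children = {}  # maps a character to the child node
        self.frequency = 0  # number of words ending at this node

    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.frequency += 1

    def items(self, prefix=""):
        """Yield (word, frequency) pairs in sorted order, depth first."""
        if self.frequency:
            yield prefix, self.frequency
        for ch in sorted(self.children):
            yield from self.children[ch].items(prefix + ch)
```

Because items is a generator, the flat list never has to exist in memory at once; the pairs can be written straight to a file.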
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.
Do you know how many different words you have? If not a lot (e.g. a few hundred thousand), then you can stream the input, split it into words, and use a hash table to keep the counts. After the input is done, just traverse the result.
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
Why not use any relational DB? The procedure would be as simple as:
Create a table with the word and count.
Create an index on word. Some databases have a word index (e.g. Progress).
Do SELECT on this table with the word.
If word exists then increase counter.
Otherwise - add it to the table.
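As a rough illustration of those steps, here is a sketch using Python's built-in sqlite3 module. SQLite is chosen only because it ships with Python, and the table and column names are made up. The primary key on word serves as the index, and an UPDATE that touches no rows replaces the explicit SELECT.

```python
import sqlite3

def count_words(words, db_path=":memory:"):
    """Count words in a table; db_path would be a file on disk for real data."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    for word in words:
        # Try to increase the counter; rowcount tells us whether the word existed.
        cur = conn.execute("UPDATE words SET count = count + 1 WHERE word = ?", (word,))
        if cur.rowcount == 0:
            conn.execute("INSERT INTO words (word, count) VALUES (?, 1)", (word,))
    conn.commit()
    return conn
```

One statement per word will be slow for 100 GB of input, so in practice you would probably count each chunk in memory first and apply the totals in batches.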
If you are using Python, you can iterate over the file object directly, for example inside an __iter__ method written as a generator. It will read the file line by line and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
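Applied to the word-count problem, the same lazy iteration could look like the sketch below. It assumes the number of distinct words fits in memory, and the function names are illustrative. Lower-casing and whitespace splitting follow the sample above.

```python
from collections import Counter

def iter_words(lines):
    """Yield words one at a time without loading all lines."""
    for line in lines:
        yield from line.lower().split()

def word_distribution(lines):
    """Return the relative frequency of each word."""
    counts = Counter(iter_words(lines))  # hash table of word -> count
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}
```

Passing an open file object as lines keeps only one line in memory at a time, plus the table of counts.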
