Decompress delta compressed sequence of numbers in Scala - algorithm

Yesterday, I asked about From a List representation of a Map, to a real Map in Scala
After few really smart response, I know that it is necessary change my mind to work in Scala and I am sure that stackoverflow is the way.
In this occasion, I want to decompress a sequence of number stored using delta compression.
I think that my implementation is really simple, but I am sure that you guys are going to find other more functional way.
def deltaDecompression(compressed : Seq[Long]) = {
var previous = 0L
compressed.map(current => {
previous += current
previous
})
}
assert(deltaDecompression(Seq(100,1,2,3)) == Seq(100,101,103,106))
So like in my previous question, the question is: Is possible to implement this function using a more functional way?
Example input data and expected output in the last line of the code, as an assertion.

compressed.scanLeft(0l) { _ + _ }.drop(1)

Related

Direct accessing a nested Map of dynamic depth

I want to input data in a map at its depth. It is easy to do the following:
static Map map;
insert(keyList, x) {
if(keyList.length==1) map[0] = x;
if(keyList.length==2) map[keyList[0]][keyList[1]] = x;
...
}
I would, however, like a more intuitive way to do the same (i.e. using a loop over the keyList) which does not rely on hardcoding iterations.
My attempt at implementing the same led to a resource-heavy recursive implementation:
Map insertInMap(Map<String, dynamic> m, List<String> keyList, Object o) {
if (keyList.length == 1) {
m[keyList.first] = o;
} else {
m[keyList.first] = insertInMap(
Map.from(m[keyList.first]),
keyList.sublist(1),
o,
);
}
return m;
}
I understand the above code is making copies of the Map and refitting them in the original map bottom up. I would like a way to edit the map directly, similar to how map[i][j] works, which I believe does not replicate maps and should thus be much more efficient in memory use as an in-place algorithm.
My search led me to this as question;
Accessing nested array at a dynamic depth which resembles my non-optimal recursive solution. Other answers I attempted searching were about iterating over maps using loops, which are a different topic altogether without the mention of their depth.
If it is not possible, assistance by way of confirming it would save me time and mental anguish.
Thanks in advance.

Removing mutability without losing speed

I have a function like this:
fun randomWalk(numSteps: Int): Int {
var n = 0
repeat(numSteps) { n += (-1 + 2 * Random.nextInt(2)) }
return n.absoluteValue
}
This works fine, except that it uses a mutable variable, and I would like to make everything immutable when possible, for better safety and readability. So I came up with an equivalent version that doesn't use any mutable variables:
fun randomWalk_seq(numSteps: Int): Int =
generateSequence(0) { it + (-1 + 2 * Random.nextInt(2)) }
.elementAt(numSteps)
.absoluteValue
This also works fine and produces the same results, but it takes 3 times longer.
I used the following way to measure it:
#OptIn(ExperimentalTime::class)
fun main() {
val numSamples = 100000
val numSteps = 15708
repeat(5) {
val randomWalkSamples: IntArray
val duration = measureTime {
randomWalkSamples = IntArray(numSamples) { randomWalk(numSteps) }
}
println(duration)
}
}
I know it's a bit hacky (I could have used JMH but this is just a quick test - at least I know that measureTime uses a monotonic clock). The results for the iterative (mutable) version:
2.965358406s
2.560777033s
2.554363661s
2.564279403s
2.608323586s
As expected, the first line shows it took a bit longer on the first run due to the warming up of the JIT, but the next 4 lines have fairly small variation.
After replacing randomWalk with randomWalk_seq:
6.636866719s
6.980840906s
6.993998111s
6.994038706s
7.018054467s
Somewhat surprisingly, I don't see any warmup time - the first line is always lesser duration than the following 4 lines, every time I run this. And also, every time I run it, the duration keeps increasing, with line 5 always being the greatest duration.
Can someone explain the findings, and also is there any way of making this function not use any mutable variables but still have performance that is close to the mutable version?
Your solution is slower for two main reasons: boxing and the complexity of the iterator used by generateSequence()'s Sequence implementation.
Boxing happens because a Sequence uses its types generically, so it cannot use primitive 32-bit Ints directly, but must wrap them in classes and unwrap them when retrieving the items.
You can see the complexity of the iterator by Ctrl+clicking the generateSequence function to view the source code.
#Михаил Нафталь's suggestion is faster because it avoids the complex iterator of the sequence, but it still has boxing.
I tried writing an overload of sumOf that uses IntProgression directly instead of Iterable<T>, so it won't use boxing, and that resulted in equivalent performance to your imperative code with the var. As you can see, it's inline and when put together with the { -1 + 2 * Random.nextInt(2) } lambda suggested by #Михаил Нафталь, then the resulting compiled code will be equivalent to your imperative code.
inline fun IntProgression.sumOf(selector: (Int) -> Int): Int {
var sum: Int = 0.toInt()
for (element in this) {
sum += selector(element)
}
return sum
}
Ultimately, I don't think you're buying yourself much in the way of code clarity by removing a single var in such a small function. I would say the sequence code is arguably harder to read. vars may add to code complexity in complex algorithms, but I don't think they do in such simple algorithms, especially when there's only one of them and it's local to the function.
Equivalent immutable one-liner is:
fun randomWalk2(numSteps: Int) =
(1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue
Probably, even more performant would be to replace
with
so that you'll have one multiplication and n additions instead of n multiplications and (2*n-1) additions:
fun randomWalk3(numSteps: Int) =
(-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Update
As #Tenfour04 noted, there is no specific stdlib implementation for IntProgression.sumOf, so it's resolved to Iterable<T>.sumOf, which will add unnecessary overhead for int boxing.
So, it's better to use IntArray here instead of IntProgression:
fun randomWalk4(numSteps: Int) =
(-numSteps + 2 * IntArray(numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Still encourage you to check this all with JMH
I think:"Removing mutability without losing speed" is wrong title .because
mutability thing comes to deal with the flow that program want to achieve .
you are using var inside function.... and 100% this var will not ever change from outside this function and that is mutability concept.
if we git rid off from var everywhere why we need it in programming ?

In Scala using variables in a map reduces the performance?

maybe it is a stupid question, but I have this doubt and I cannot find a response...
If I have a map operation on a list of complex objects and to make the code more readable I use intermediate variables inside the map the performance can change?
For example this are two versions of the same code:
profilesGroupedWithIds map {
c =>
val blockId = c._2
val entityIds = c._1._2
val entropy = c._1._1
if (separatorID < 0) BlockDirty(blockId, entityIds.swap, entropy)
else BlockClean(blockId, entityIds, entropy)
}
..
profilesGroupedWithIds map {
c =>
if (separatorID < 0) BlockDirty(c._2, c._1._2.swap, c._1._1)
else BlockClean(c._2, c._1._2, c._1._1)
}
As you can see the first version is more readable than the second one.
But the efficiency is the same? Or the three variables that I have created inside the map have to be removed by the garbage collector and this will reduce the perfomance (suppose that 'profilesGroupedWithIds' is a very big list)?
Thanks
Regards
Luca
The vals are just aliases for the tuple elements. So the generated java bytecode will be identical in both cases, and so will be the performance.
More importantly, the first variant is much better code since it clearly conveys the intent.
Here is a third variant that avoids accessing the tuple elements _1 and _2 entirely:
profilesGroupedWithIds map {
case ((entropy,entityIds),blockId) =>
if (separatorID < 0) BlockDirty(blockId, entityIds.swap, entropy)
else BlockClean(blockId, entityIds, entropy)
}

Storing pointers wrong/not using Unordered_map.find correctly

so the title essentially says it all. I am writing a symbol table in c++ for a compiler project I am working on, and all is going well except for looking up identifiers in the table.
So this is how I store into the table (pseudo like):
vector<symbolTable*>* symbolStack = new symbolTable();
//where a symbolStack is a vector of unordered_maps (symbolTables),
//each iteration in vector referencing a new block of code.
string* check = new string(root->children[0]->lexicode->c_str());
symbol* sym = new symbol();
...... //setting sym info
symbol_entry pair = make_pair(check, test)
//the unordered_map has keys of (string*, symbol*)
symbolStack[tableNumber]->insert(pair);
I am pretty solid that this works, as I have tested printing the size/infos from the map and it all seems to be storing as expect. Here is where the problem is happening for me (this takes place in a different function later):
for(int i = 0; i =< tableNumber;i++){
auto finder = symbolStack[i]->find(checkS) //checkS == check from above
if(finder == symbolStack[i]->end()) cout<<not found;
else cout<<we did it!!!!
My else is never reached. However, if I do this assuming the string*->c_str() == "test":
cout<<string->c_str(); // prints out "test"
cout<<finder->second->c_str() //prints out "test".
So the question. Why is it finding the key, and knowing it found the key, but at the same time returning that is has reached the end of the symbol stack without finding it? I have been trying to figure this out for a good 4 days solid now. Is it that my pointers are somehow off? Any insight is appreciated greatly.
So somewhat answer to my own question.
First I will say this: I have concluded the comparison with find() or similar methods do not work because for some reason the pointers are not matching up. I have no clue why this is still, or what I am doing wrong.
What I did to solve my issue and complete my code is this:
for(int k = 0; k<= tableNumber; k++){
unordered_map<string*,symbol*>::iterator it;
for(it = symbolStack[k]->begin(); it != symbolStack[k]->end(); it++)
{
string a = targetString->c_str();
string b = it->first->c_str();
if(a.compare(b) == 0) cout<<"You have found the match! \n";
}
}
}
So this answers how to get it working pragmatically if somebody else is in a similar ship, however not really answers why my other attempt failed other than noticing the pointer values were different.
In symbolTable you store pointers to strings as keys, not strings themselves. Therefore unordered_map compares pointers, not strings, and cannot find matching items. When you reconstruct the key string (as in your answer, using string b = it->first->c_str()), the comparison on strings works again. So, either you need to store string instead of string * in symbolTable, or you need to provide your own comparison function that will compare keys of type string *.

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in matlab. Suppose, for argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second) so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
stock(i).return = sort(stock(i).return);
stock(i).returnLength = length(stock(i).return);
stock(i).medianReturn = median(stock(i).return);
stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
for j = length(stock(i+1).volume):-1:1
if(isnan(stock(i+1).volume(j)))
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end
end
stock(i+1).volume = sort(stock(i+1).volume);
stock(i+1).volumeLength = length(stock(i+1).volume);
stock(i+1).medianVolume = median(stock(i+1).volume);
stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
for j=stock(i).returnLength:-1:1
if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end;
end;
end;
for i=numStock:-1:1
for j=stock(i+1).volumeLength:-1:1
if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end;
end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(#str2double,returns);
sp500=cellfun(#str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
if (abs(returns(i) - medianReturn) > 3*madReturn)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
if (abs(volume(i) - medianVolume) > 3*madVolume)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep with the for-loops to keep with the structure array, since reparsing this data will be way worse time-wise). The new code snipet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
for j=1:length(stock(i).return)
stock_return(i,j) = stock(i).return(j);
end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle structures, therefore I would restructure the code so that instead of structures it uses just arrays.
For example instead of stock(i).return(j) I would do stock_returns(i,j).
I show you on a part of your code how to get rid of for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from medianVolume, you can write a similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (eg. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, stock_return_length, 1);
is_bad = abs(stock_return - median_return) > 3.* stock_mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so you stick to stock(i).return, you can speed-up your code using essentially the same scheme but you can only get rid of one less for-loop, meaning that your program will be substantially slower.

Resources