PySpark least optimization - performance

I'm quite new to PySpark and I was wondering about the performance of this piece of code. I have a dataframe on which I create a new column that takes the least of three possible values.
Is this code optimal:
new_df = input_df.withColumn("aCol",
    sf.least(sf.lit(1),
             sf.when(sf.col("categorie") == 0, 0)
               .otherwise(sf.col("quotite") / sf.col("categorie"))))
return new_df
Or is it better to pull the "when" out of the "least":
new_df = input_df.withColumn("aCol",
    sf.when(sf.col("categorie") == 0, 0)
      .otherwise(sf.least(sf.lit(1),
                          sf.col("quotite") / sf.col("categorie"))))
return new_df
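For what it's worth, the physical plans Catalyst produces for the two variants can be compared directly; a minimal sketch (assuming sf is pyspark.sql.functions and input_df as above):
v1 = input_df.withColumn("aCol",
    sf.least(sf.lit(1),
             sf.when(sf.col("categorie") == 0, 0)
               .otherwise(sf.col("quotite") / sf.col("categorie"))))
v2 = input_df.withColumn("aCol",
    sf.when(sf.col("categorie") == 0, 0)
      .otherwise(sf.least(sf.lit(1),
                          sf.col("quotite") / sf.col("categorie"))))
v1.explain()  # print the physical plan for the first variant
v2.explain()  # ...and for the second, to see how the expressions differ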
Thank you, community ;)

One coding problem, two different solutions: how to prove one is correct?

I have a coding problem:
The awards committee of your alma mater (i.e. your college/university) asked for your assistance with a budget allocation problem they're facing. Originally, the committee planned to give N research grants this year. However, due to spending cutbacks, the budget was reduced to newBudget dollars and now they need to reallocate the grants. The committee made a decision that they'd like to impact as few grant recipients as possible by applying a maximum cap on all grants. Every grant initially planned to be higher than cap will now be exactly cap dollars. Grants less than or equal to cap, obviously, won't be impacted.
Given an array grantsArray of the original grants and the reduced budget newBudget, write a function findGrantsCap that finds in the most efficient manner a cap such that the least number of recipients is impacted and that the new budget constraint is met (i.e. the sum of the N reallocated grants equals newBudget).
Analyse the time and space complexities of your solution.
Example:
input: grantsArray = [2, 100, 50, 120, 1000], newBudget = 190
output: 47
The recommended solution is:
fun findCorrectGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
    grantsArray.sortDescending()
    val grantsArray = grantsArray + 0.0
    var surplus = grantsArray.sum() - newBudget
    if (surplus <= 0)
        return grantsArray[0]
    var lastIndex = 0
    for (i in 0 until grantsArray.lastIndex) {
        lastIndex = i
        surplus -= (i + 1) * (grantsArray[i] - grantsArray[i + 1])
        if (surplus <= 0)
            break
    }
    return grantsArray[lastIndex + 1] + (-surplus / (lastIndex.toDouble() + 1))
}
It's compact, and its complexity is O(n log n).
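To see it work on the example input: after sortDescending() and appending 0.0, grantsArray = [1000, 120, 100, 50, 2, 0] and surplus = 1272 - 190 = 1082. The loop then levels the largest grants down one step at a time:
i = 0: surplus -= 1 * (1000 - 120) = 880, leaving 202
i = 1: surplus -= 2 * (120 - 100) = 40, leaving 162
i = 2: surplus -= 3 * (100 - 50) = 150, leaving 12
i = 3: surplus -= 4 * (50 - 2) = 192, leaving -180, so the loop breaks
The return line then backs off the overshoot: grantsArray[4] + (180 / 4) = 2 + 45 = 47, matching the expected output.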
I came up with an O(n) solution whose result differs from the suggested solution's only by a tiny fractional amount:
fun DoubleArray.calcSumAndCount(averageCap: Double, round: Boolean): Pair<Double, Int> {
    var count = 0
    var sum = 0.0
    forEach {
        if (round && it > round(averageCap))
            count++
        else if (!round && it > averageCap)
            count++
        else
            sum += it
    }
    return sum to count
}

fun Pair<Double, Int>.calcCap(budget: Double) =
    (budget - first) / second
fun findGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
    if (grantsArray.isEmpty())
        return 0.0
    val averageCap = newBudget / grantsArray.size
    if (grantsArray.sum() <= newBudget)
        return grantsArray.maxOf { it }
    val sumAndCount = grantsArray.calcSumAndCount(averageCap, false)
    val cap = sumAndCount.calcCap(newBudget)
    val finalSum = grantsArray.sumOf {
        if (it > cap)
            cap
        else it
    }
    return if (finalSum == newBudget)
        cap
    else
        grantsArray
            .calcSumAndCount(averageCap, true)
            .calcCap(newBudget)
}
I wonder if there is a test case that proves my solution incorrect (or, conversely, correct), since the two approaches to this coding problem are completely different.
The original source doesn't provide rich test cases.
UPDATE
As PaulHankin suggested, I wrote a simple test:
repeat(1000000) {
    val grants = (0..Random.nextInt(6)).map { Random.nextDouble(0.0, 9000000000.0) }.toDoubleArray()
    val newBudget = Random.nextDouble(0.0, 9000000000.0)
    val cap1 = findCorrectGrantsCap(grants, newBudget)
    val cap2 = findGrantsCap(grants, newBudget)
    if (abs(cap1 - cap2) > .00001)
        println("FAILED: $cap1 != $cap2 (${grants.joinToString()}), $newBudget")
}
And it failed; he was right. But then I redesigned my solution:
fun findGrantsCap(grantsArray: DoubleArray, newBudget: Double): Double {
    if (grantsArray.isEmpty())
        return 0.0
    if (grantsArray.sum() <= newBudget)
        return grantsArray.maxOf { it }
    grantsArray.sort()
    var size = grantsArray.size
    var averageCap = newBudget / size
    var tempBudget = newBudget
    for (grant in grantsArray) {
        if (grant <= averageCap) {
            size--
            tempBudget -= grant
            averageCap = tempBudget / size
        } else break
    }
    return averageCap
}
After that, the test cases pass successfully; the only remaining problem is a double precision/overflow error when I use large Doubles, i.e. when I increase the input limits for the grants and/or budget (it can be fixed by using BigDecimal for large inputs).
So is the latest solution correct now? Or can there still be test cases where it fails?
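One more way to gain confidence, independent of both implementations: sum(min(g, cap) for g in grants) grows monotonically with cap, so a binary search yields a reference cap to fuzz against. A minimal sketch in Python (the function names are mine, not part of the original problem):
def capped_sum(grants, cap):
    # Total payout when every grant is clipped at `cap`.
    return sum(min(g, cap) for g in grants)

def find_cap_bisect(grants, new_budget, iters=100):
    # Reference oracle via binary search; assumes a non-empty grants list.
    if sum(grants) <= new_budget:
        return max(grants)  # nobody needs to be capped
    lo, hi = 0.0, max(grants)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if capped_sum(grants, mid) > new_budget:
            hi = mid  # cap too generous
        else:
            lo = mid  # cap too strict (or exact)
    return (lo + hi) / 2

# Reproduces the expected output from the problem statement:
print(find_cap_bisect([2, 100, 50, 120, 1000], 190))  # ~47.0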

How do I add a predetermined stop loss and take profit?

This is my first post. I am a prop trader and I'm really trying hard to learn how to code, as it would take my trading to another level. It is quite overwhelming at the beginning, but working on things that are useful to me is motivating.
I have a script for trading view that I would like to edit. I have tried myself but I am obviously doing something wrong. Any help would be very much appreciated.
I just want to add my own predetermined stop loss and take profit to the strategy. The code is below:
strategy(title="Z-Score Strategy", shorttitle="Z-Score Strategy")
Period = input(20, minval=1)
Trigger = input(0)
reverse = input(false, title="Trade reverse")
hline(Trigger, color=purple, linestyle=line)
xStdDev = stdev(close, Period)
xMA = sma(close, Period)
nRes = (close - xMA) / xStdDev
pos = iff(nRes > Trigger, 1,
     iff(nRes < Trigger, -1, nz(pos[1], 0)))
possig = iff(reverse and pos == 1, -1,
     iff(reverse and pos == -1, 1, pos))
if (possig == 1)
    strategy.entry("Long", strategy.long)
if (possig == -1)
    strategy.entry("Short", strategy.short)
barcolor(possig == -1 ? red : possig == 1 ? green : blue)
plot(nRes, color=blue, title="Z-Score")
You should close the position via strategy.exit: https://www.tradingview.com/pine-script-reference/v4/#fun_strategy{dot}exit
//@version=4
strategy("strategy")
strategy.entry("entryId", strategy.long)
// `profit` is a target measured in ticks; `stop` is an absolute stop price.
// Use `loss = ...` instead of `stop` if you want the stop distance in ticks too.
strategy.exit("exitId", "entryId", profit = 5, stop = 7)
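For the Z-Score script above, the same call can be wired to the existing entry ids; a sketch with placeholder tick distances (pick values that suit your own risk model):
strategy.exit("Long exit", "Long", profit = 100, loss = 50)   // take profit / stop loss for the long entry
strategy.exit("Short exit", "Short", profit = 100, loss = 50) // and for the short entry
The first argument is a new exit id; the second must match the id used in the corresponding strategy.entry call.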

Halide: How to avoid unwanted assertions

During development of a pipeline in Halide, I want to avoid unnecessary checks on buffer layouts. I know I can turn off the majority of assertions using the 'no_asserts' target feature. However, I generated code from the following simple generator:
#define LUT_SIZE 17 /* Size in each dimension of the 4D LUT */

class ApplyLut : public Halide::Generator<ApplyLut> {
public:
    // We declare the Inputs to the Halide pipeline as public
    // member variables. They'll appear in the signature of our generated
    // function in the same order as we declare them.
    Input<Buffer<uint8_t>>  Lut             { "Lut",             1 }; // LUT to apply
    Input<Buffer<int>>      indexToLut      { "indexToLut",      1 }; // Precalculated mapping of uint8_t to LUT index
    Input<Buffer<uint8_t>>  inputImageLine  { "inputImageLine",  1 }; // Input line
    Output<Buffer<uint8_t>> outputImageLine { "outputImageLine", 1 }; // Output line

    void generate();
};

HALIDE_REGISTER_GENERATOR(ApplyLut, outputImageLine)

void ApplyLut::generate()
{
    Var x("x");

    outputImageLine(x) = Lut(clamp(indexToLut(inputImageLine(x)), 0, LUT_SIZE));

    inputImageLine .dim(0).set_min(0);    // Input image sample index starts at 0
    inputImageLine .dim(0).set_stride(1); // Input image samples are contiguous
    outputImageLine.dim(0).set_bounds(0, inputImageLine.dim(0).extent()); // Output line matches input line
    outputImageLine.dim(0).set_stride( inputImageLine.dim(0).stride());   // Output line matches input line
    Lut       .dim(0).set_bounds(0, LUT_SIZE); // Lut: limited number of values
    Lut       .dim(0).set_stride(1);
    indexToLut.dim(0).set_bounds(0, 256);      // indexToLut: one entry per uint8_t value
    indexToLut.dim(0).set_stride(1);
}
Among others, I used the target feature 'no_asserts' during generation (as can be seen in the output below).
I then get the following output code:
module name=applyIccProfile, target=x86-64-windows-disable_llvm_loop_opt-mingw-no_asserts-no_bounds_query-no_runtime-sse41 {
  func applyIccProfile(Lut, indexToLut, inputImageLine, outputImageLine) {
    assert((reinterpret(outputImageLine.buffer) != (uint64)0), halide_error_buffer_argument_is_null("outputImageLine"))
    assert((reinterpret(inputImageLine.buffer) != (uint64)0), halide_error_buffer_argument_is_null("inputImageLine"))
    assert((reinterpret(indexToLut.buffer) != (uint64)0), halide_error_buffer_argument_is_null("indexToLut"))
    assert((reinterpret(Lut.buffer) != (uint64)0), halide_error_buffer_argument_is_null("Lut"))
    let Lut = _halide_buffer_get_host(Lut.buffer)
    let Lut.min.0 = _halide_buffer_get_min(Lut.buffer, 0)
    let Lut.extent.0 = _halide_buffer_get_extent(Lut.buffer, 0)
    let Lut.stride.0 = _halide_buffer_get_stride(Lut.buffer, 0)
    let indexToLut = _halide_buffer_get_host(indexToLut.buffer)
    let indexToLut.min.0 = _halide_buffer_get_min(indexToLut.buffer, 0)
    let indexToLut.extent.0 = _halide_buffer_get_extent(indexToLut.buffer, 0)
    let indexToLut.stride.0 = _halide_buffer_get_stride(indexToLut.buffer, 0)
    let inputImageLine = _halide_buffer_get_host(inputImageLine.buffer)
    let inputImageLine.min.0 = _halide_buffer_get_min(inputImageLine.buffer, 0)
    let inputImageLine.extent.0 = _halide_buffer_get_extent(inputImageLine.buffer, 0)
    let inputImageLine.stride.0 = _halide_buffer_get_stride(inputImageLine.buffer, 0)
    let outputImageLine = _halide_buffer_get_host(outputImageLine.buffer)
    let outputImageLine.min.0 = _halide_buffer_get_min(outputImageLine.buffer, 0)
    let outputImageLine.extent.0 = _halide_buffer_get_extent(outputImageLine.buffer, 0)
    let outputImageLine.stride.0 = _halide_buffer_get_stride(outputImageLine.buffer, 0)
    assert((Lut.stride.0 == 1), 0)
    assert((Lut.min.0 == 0), 0)
    assert((Lut.extent.0 == 17), 0)
    assert((indexToLut.stride.0 == 1), 0)
    assert((indexToLut.min.0 == 0), 0)
    assert((indexToLut.extent.0 == 256), 0)
    assert((inputImageLine.stride.0 == 1), 0)
    assert((inputImageLine.min.0 == 0), 0)
    assert((outputImageLine.stride.0 == 1), 0)
    assert((outputImageLine.min.0 == 0), 0)
    assert((outputImageLine.extent.0 == inputImageLine.extent.0), 0)
    produce outputImageLine {
      for (outputImageLine.s0.x, 0, inputImageLine.extent.0) {
        outputImageLine[outputImageLine.s0.x] = Lut[max(min(indexToLut[int32(inputImageLine[outputImageLine.s0.x])], 17), 0)]
      }
    }
  }
}
In the generated output, a number of assertions are present that check the dimensions of the provided buffers. I know that these assertions are executed 'only' once per call. However, given the number of calls, I would like to turn off these assertions because of the execution overhead.
So the questions are:
How can I turn off the assignments w.r.t. min/extent/stride for those that are already known (because these were set in the generator code)?
How can I turn off the generation of these assertions?
While asserts still show up in the Halide IR with no_asserts on, any remaining ones get stripped in the final lowering to LLVM IR. They just exist in the Halide IR because they let the Halide simplifier know that something can be assumed to be true after that point in the code, but they compile to a no-op.
With the asserts gone, LLVM will dead-code-eliminate the unnecessary assignments. I'd check the generated assembly rather than the Halide IR to be sure all of those checks are gone.
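If the generator is driven through Halide's standard generator main, one way to verify is to emit the assembly alongside the other outputs and inspect it. A sketch of the invocation (the executable name and output directory are mine, and the exact -e emit-option spelling is an assumption to check against your Halide version):
./apply_lut_generator -g outputImageLine -o bin -e assembly,stmt target=x86-64-windows-disable_llvm_loop_opt-mingw-no_asserts-no_bounds_query-no_runtime-sse41
Searching the emitted .s file for halide_error_* calls should confirm whether any argument checks survived.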

Best practice to evaluate permutations

I came across this question, where the OP wanted to improve the following if-block. I'm opening this as a new question because I'm searching for a more general solution to this kind of problem.
public int fightMath(int one, int two) {
    int result = 0; // declared locally so the snippet compiles on its own
    if      (one == 0 && two == 0) { result = 0; }
    else if (one == 0 && two == 1) { result = 0; }
    else if (one == 0 && two == 2) { result = 1; }
    else if (one == 0 && two == 3) { result = 2; }
    else if (one == 1 && two == 0) { result = 0; }
    else if (one == 1 && two == 1) { result = 0; }
    else if (one == 1 && two == 2) { result = 2; }
    else if (one == 1 && two == 3) { result = 1; }
    else if (one == 2 && two == 0) { result = 2; }
    else if (one == 2 && two == 1) { result = 1; }
    else if (one == 2 && two == 2) { result = 3; }
    else if (one == 2 && two == 3) { result = 3; }
    else if (one == 3 && two == 0) { result = 2; }
    else if (one == 3 && two == 1) { result = 1; }
    else if (one == 3 && two == 2) { result = 3; }
    else if (one == 3 && two == 3) { result = 3; }
    return result;
}
Now there are k^n possible input combinations, where k = 4 values per input and n = 2 inputs (4^2 = 16 cases here).
Some answers suggest using a multi-dimensional array as a table to reduce the if-jungle.
But I would like to know how to solve such a problem with big n and k, because solutions based on if, switch, or the suggested array approach will not scale well, and typing out things like that in code should be avoided.
If I think about combinatoric problems, there has to be a way to evaluate them easily.
It's just a table of data. The answer to the question is looked up by multiple keys. It is no different from returning data held in a database table, which could itself be huge and perhaps span multiple tables.
There are two ways to solve this:
Data-based. For example, you could create a HashMap mapping the pair of values to the result.
class Pair {
    final int one, two;
    Pair(int one, int two) { this.one = one; this.two = two; }
    // equals and hashCode so Pair works as a HashMap key
    @Override public boolean equals(Object o) {
        return o instanceof Pair && ((Pair) o).one == one && ((Pair) o).two == two;
    }
    @Override public int hashCode() { return 31 * one + two; }
}
Map<Pair, Integer> results = new HashMap<>();
Pattern-based. Identify a rule/formula that can be used to determine the new value.
This is obviously better but relies on being able to identify a rule that covers all cases.
I would like to know how to solve such a problem with big n and k.
Since the output is determined arbitrarily (a game designer's whims) instead of mathematically (a formula), there's no guarantee of any pattern. Therefore the only general solution is some kind of lookup table.
Essentially, the question is similar to asking for a program that does f(a,b) -> c mapping, but you don't know any of the data beforehand -- instead it's provided at runtime. That program could process the data and find a pattern/formula (which might not exist) or it could build a lookup table.
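To make the lookup-table idea concrete, here's a minimal sketch in Python (the table values are copied from the original if-block; the names are mine):
FIGHT_RESULT = {
    (0, 0): 0, (0, 1): 0, (0, 2): 1, (0, 3): 2,
    (1, 0): 0, (1, 1): 0, (1, 2): 2, (1, 3): 1,
    (2, 0): 2, (2, 1): 1, (2, 2): 3, (2, 3): 3,
    (3, 0): 2, (3, 1): 1, (3, 2): 3, (3, 3): 3,
}

def fight_math(one, two):
    # Direct lookup replaces the 16-branch if-chain; a KeyError flags invalid input.
    return FIGHT_RESULT[(one, two)]
At runtime the table could just as well be loaded from a file or database, which is what lets this approach scale to big n and k.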
Personally, I think it's clearer to change the logic to operate on intent (so reading the code explains how the attack works) instead of the actual outcomes (enumerating the list of inputs and matching outputs). Instead of building an if-jungle or a lookup table, structure your code based on how you want the logic to work. JAB's enum-based solution expresses the fight logic explicitly which makes it easier to see where to add new functionality and easier to see bugs (an off by one error in a lookup table isn't obviously wrong on inspection). A lookup table is a likely optimization, but that's only necessary if a profiler says so.
Looking at your question and the original one, there appears to be no deducible pattern between the two players' inputs and the output (perhaps I'm wrong). Given this, the only options are the "if-jungle" you mention or a data structure.
To solve such a problem for big n and k values, my suggestion would be to create a rule that determines the output (either none, one, or both players hit), while ensuring that this rule isn't easily deducible by the players. You could do this by making the rule a function of the turn number (e.g. if both players press button 1 on turn #1, the output will be different from when they take the same action on turn #2).

In MongoDB, how can I replicate this simple query using map/reduce in Ruby?

So, using the regular MongoDB library in Ruby, I have the following query to find the average filesize across a set of 5001 documents:
avg = 0
total = collection.count()
Rails.logger.info "#{total} asset creation stats in the system"
collection.find().each {|row| avg += (row["filesize"] * (1/total.to_f)) if row["filesize"]}
It's pretty simple, so I'm trying to do the same using map/reduce as a learning exercise. This is what I came up with:
map = 'function(){ emit("filesizes", {size: this.filesize, num: 1}); }'
reduce = 'function(k, vals){
  var result = {size: 0, num: 0};
  for (var x in vals) {
    var new_total = result.num + vals[x].num;
    result.num = new_total;
    result.size = result.size + (vals[x].size * (vals[x].num / new_total));
  }
  return result;
}'
@results = collection.map_reduce(map, reduce)
However, the two queries come back with two different results!
What am I doing wrong?
You're weighting the results by doing the division in every reduce function.
Say you had [{size: 5, num: 1}, {size: 5, num: 1}, {size: 5, num: 1}]. Your reduce would calculate:
result.size = 0 + (5 * (1/1)) = 5
result.size = 5 + (5 * (1/2)) = 7.5
result.size = 7.5 + (5 * (1/3)) ≈ 9.17
As you can see, this weights the results towards the earliest elements.
Fortunately, there's a simple solution. Just add a finalize function, which will be run once after the reduce step is finished.
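A sketch of how that could look with the Ruby driver used above (keep reduce to plain sums so it stays associative, and do the division once in finalize; the :finalize and :out option names follow the classic mongo gem, so double-check them against your driver version):
map = 'function(){ emit("filesizes", {size: this.filesize, num: 1}); }'
reduce = 'function(k, vals){
  var result = {size: 0, num: 0};
  for (var x in vals) {  // only add here: reduce may run many times per key
    result.size += vals[x].size;
    result.num += vals[x].num;
  }
  return result;
}'
finalize = 'function(k, v){ return v.size / v.num; }'  // runs once per key, after all reduces
@results = collection.map_reduce(map, reduce, :finalize => finalize, :out => "avg_filesize")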
