Seaborn global hue order - seaborn

I very often use the hue argument to distinguish between categories but it seems like seaborn isn't consistent in how it matches hues to categories (from what I've read it depends on the plotted data, in particular its order). I would like to avoid passing the hue_order argument everywhere because I know I will forget it at some point and not notice it (which will lead to misinterpretations because I will suppose hues are correct).
Is there a way to set the hue_order globally (fixed order for all plots)?
Even better, would it be possible to set categorical indexes to all behave the same (e.g., alphanumeric order)?
For now I use the following ugly strategy:
SNS_SETTINGS = dict(hue_order=[...])
sns.displot(df, **SNS_SETTINGS, x="time", kind="ecdf", hue="algorithm")

A very practical solution is to add the hue parameter in the SNS_SETTINGS dictionary. This coupling will ensure the needed consistency across your plots.
Another solution, that may or may not be adequate in your specific case, would be to define custom functions with functools.partial, defining the parameters once to have shorter function calls:
from functools import partial
displot_by_algorithm = partial(sns.displot, hue="algorithm", hue_order=[...])
This way, you can later call
displot_by_algorithm(df, x="time", kind="ecdf")
Of course, you will have to define such a function for each of the plotting functions you want to use, so the trade-off might not be worth it.
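One way to reduce that per-function repetition is a small helper that binds the same hue settings to any plotting function. This is only a sketch: the names ALGORITHMS, HUE_DEFAULTS and with_hue_defaults are made up for illustration, and the category values are placeholders for whatever your hue column actually contains.

```python
from functools import partial

# Hypothetical category names; replace with the real values of the hue column.
ALGORITHMS = ["greedy", "annealing", "baseline"]

# One shared definition of the hue column and its (alphanumeric) order.
HUE_DEFAULTS = {"hue": "algorithm", "hue_order": sorted(ALGORITHMS)}


def with_hue_defaults(plot_func, **overrides):
    """Bind the shared hue settings to any plotting function.

    Keyword arguments given here, or later at call time, take precedence
    over the shared defaults.
    """
    return partial(plot_func, **{**HUE_DEFAULTS, **overrides})


# Intended usage, assuming seaborn is imported as sns:
#   displot = with_hue_defaults(sns.displot)
#   boxplot = with_hue_defaults(sns.boxplot)
#   displot(df, x="time", kind="ecdf")
```

Because the order lives in one place, forgetting hue_order in an individual call is no longer possible as long as you plot through the wrapped functions.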

Optimize other parameters than the predefined, using step-wise algorithm in optuna.integration.lightgbm

As far as I understand, the LightGBM integration in Optuna uses a step-wise algorithm to optimize the hyper-parameters such as lambda_l1, lambda_l2 etc.
Although it is great, I would very much want to add additional parameters such as learning_rates.
I know I can just use Optuna the "regular" way but since the integrated lgbm part should be way faster, I would prefer using that.
Is there a way to add additional parameters to optimize, or are we forced to use all of (and only) the specified parameters? I can see there's e.g. a parameter called learning_rates, but the docs do not specify what it does or how to use it (I think it's the learning rate for each tree). Setting it in the lgb.train like
model = lgb.train(
params,
dtrain,
valid_sets=[dtrain, dval],
)

Is there such a thing as ‘class bloat’ - i.e. too many classes causing inefficiencies?

E.g. let’s consider I have the following classes:
Item
ItemProperty which would include objects such as Colour and Size. There's a relation-property of the Item class which lists all of the ItemProperty objects applicable to this Item (i.e. for one item you might need to specify the Colour and for another you might want to specify the Size).
ItemPropertyOption would include objects such as Red, Green (for Colour) and Big, Small (for Size).
Then an Item Object would relate to an ItemProperty, whereas an ItemChoice Object would relate to an ItemPropertyOption (and the ItemProperty which the ItemPropertyOption refers to could be inferred).
The reason for this is so I could then make use of queries much more effectively. i.e. give me all item-choices which are Red. It would also allow me to use the Parse Dashboard to quickly add elements to the site as I could easily specify more ItemPropertys and ItemPropertyOptions, rather than having to add them in the codebase.
This is just a small example and there's many more instances where I'd like to use classes so that 'options' for various drop-downs in forms are in the database and can easily be added and edited by me, rather than hard-coded.
1) I’ll probably be doing this in a similar way for 5+ more similar kinds of class-structures
2) there could be hundreds of nested properties that I want to access via ‘inverse querying’
So, I can think of 2 potential causes of inefficiency and wanted to know if they’re founded:
Is having lots of classes inefficient?
Is back-querying against nested classes inefficient?
The other option I can think of, if ‘class-bloat’ really is a problem, is to make fields on parent classes that, instead of being nested across other classes (that represent further properties, as above), are represented directly as a nested JSON property.
The job of designing is to render in object descriptions truths about the world that are relevant to the system's requirements. In the world of the OP's "items", it's a fact that items have color, and it's a relevant fact because users care about an item's color. You'd only call a system inefficient if it consumes computing resources that it doesn't need to consume.
So, for something like a configurator, the fact that we have items, and that those items have properties, and those properties have an enumerable set of possible values sounds like a perfectly rational design.
Is it inefficient or "bloated"? The only place I'd raise doubt is in the explicit assertion that items have properties. Of course they do, but that's natively true of JavaScript objects and Parse entities.
In other words, you might be able to get along with just item and several flavors of propertyOptions: e.g. Item has an attribute called "colorProperty" that is a pointer to an instance of "ColorProperty" (whose instances have a name property like 'red', 'green', etc. and maybe describe other pertinent facts, like a more precise description in RGB form).
There's nothing wrong with lots of classes if they represent relevant truth. Do that first. You might discover empirically that your design is too resource consumptive (I doubt you will in this case), at which point we'd start looking for cheats to be somehow skinnier. But do it the right way first, cheat later only if you must.
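To make the slimmer model concrete, here is a rough sketch in plain Python rather than the Parse SDK. The class and field names (ColorProperty, Item, color_property, items_with_color) are illustrative assumptions, and the list scan stands in for what would be a pointer query in Parse.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColorProperty:
    """One enumerable option, e.g. 'red', with extra facts such as its RGB value."""
    name: str
    rgb: Tuple[int, int, int]


@dataclass
class Item:
    """An item points directly at the option it uses (or at nothing)."""
    name: str
    color_property: Optional[ColorProperty] = None


RED = ColorProperty("red", (255, 0, 0))
GREEN = ColorProperty("green", (0, 128, 0))

items = [Item("shirt", RED), Item("mug", GREEN), Item("tent"), Item("scarf", RED)]


def items_with_color(color, all_items):
    """'Inverse query': find every item that points at the given option."""
    return [item for item in all_items if item.color_property == color]


red_names = [item.name for item in items_with_color(RED, items)]
```

The point of the sketch is only that "which items are red?" stays a simple lookup without a separate ItemProperty class sitting between the item and its option.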
Is having lots of classes inefficient?
It's certainly inefficient for poor humans who have to remember what all those classes do and how they're related to each other. It takes time to write all those classes in the first place, and every line that you write is a line that has to be maintained.
Beyond that, there's certainly some cost for each class in any OOP language, and creating more classes than you really need will mean that you're paying more than you need to for the work that you're doing, which is pretty much the definition of inefficient.
I’ll probably be doing this in a similar way for 5+ more similar kinds of class-structures
Maybe you could spend some time thinking about the similarity between these cases and come up with a single set of more flexible classes that you can use in all those cases. Writing general code is harder than writing very specific code, but if you do a good job you'll recoup the extra effort many times over through reuse.

Is there a difference between fun(n::Integer) and fun(n::T) where T<:Integer in performance/code generation?

In Julia, I most often see code written like fun(n::T) where T<:Integer, when the function works for all subtypes of Integer. But sometimes, I also see fun(n::Integer), which some guides claim is equivalent to the above, whereas others say it's less efficient because Julia doesn't specialize on the specific subtype unless the subtype T is explicitly referred to.
The latter form is obviously more convenient, and I'd like to be able to use that if possible, but are the two forms equivalent? If not, what are the practical differences between them?
Yes, Bogumił Kamiński is correct in his comment: f(n::T) where T<:Integer and f(n::Integer) will behave exactly the same, with the exception that the former method will have the name T already defined in its body. Of course, in the latter case you can just explicitly assign T = typeof(n) and it'll be computed at compile time.
There are a few other cases where using a TypeVar like this is crucially important, though, and it's probably worth calling them out:
f(::Array{T}) where T<:Integer is indeed very different from f(::Array{Integer}). This is the common parametric invariance gotcha (docs and another SO question about it).
f(::Type) will generate just one specialization for all DataTypes. Because types are so important to Julia, the Type type itself is special and allows parameterization like Type{Integer} to allow you to specify just the Integer type. You can use f(::Type{T}) where T<:Integer to require Julia to specialize on the exact type of Type it gets as an argument, allowing Integer or any subtypes thereof.
Both definitions are equivalent. Normally you will use fun(n::Integer) form and apply fun(n::T) where T<:Integer only if you need to use specific type T directly in your code. For example consider the following definitions from Base (all following definitions are also from Base) where it has a natural use:
zero(::Type{T}) where {T<:Number} = convert(T,0)
or
(+)(x::T, y::T) where {T<:BitInteger} = add_int(x, y)
And even if you need type information, in many cases it is enough to use the typeof function. Again, an example definition is:
oftype(x, y) = convert(typeof(x), y)
Even if you are using a parametric type, you can often avoid the where clause (which is a bit verbose), as in:
median(r::AbstractRange{<:Real}) = mean(r)
because you do not care about the actual value of the parameter in the body of the function.
Now, if you are a Julia user like me, the question is how to convince yourself that this works as expected. There are the following methods:
you can check that one definition overwrites the other in methods table (i.e. after evaluating both definitions only one method is present for this function);
you can check the code generated by both functions using @code_typed, @code_warntype, @code_llvm or @code_native etc. and find out that it is the same;
finally you can benchmark the code for performance using BenchmarkTools
A nice plot explaining what Julia does with your code is here http://slides.com/valentinchuravy/julia-parallelism#/1/1 (I also recommend the whole presentation to any Julia user - it is excellent). You can see in it that, after lowering the AST, Julia applies a type inference step to specialize the function call before the LLVM codegen step.
You can hint the Julia compiler to avoid specialization. This is done using the @nospecialize macro on Julia 0.7 (it is only a hint, though).

Syntax when dereferencing database-backed tree

I'm using MongoDB, so my clusters of data are in dictionaries. Some of these contain references to other Mongo objects. For example, say I have a Person document which has a separate Employer document. I would like to control element access so I can automatically dereference documents. I also have some data with dates, and since PyMongo can't store timezone info, I'd like to store a string timezone alongside the UTC time and have an accessor to the converted times easily.
Which of these options seems the best to you?
Person = {'employer': ObjectID}
Employer = {'name': str}
Option 1: Augmented operations are methods
Examples
print person.get_employer()['name']
person.get_employer()['name'] = 'Foo'
person.set_employer(new_employer)
Pro: Method syntax makes it clear that getting the employer is not just dictionary access
Con: The syntax differs between referenced and non-referenced objects, making it hard to normalize the schema if necessary. Augmenting an element would require changing the callers
Option 2: Everything is an attribute
Examples
print person.employer.name
person.employer.name = 'Foo'
person.employer = new_employer
Pro: Uniform syntax for augmented and non-augmented
?: Makes it unclear that this is backed by a dictionary, but provides a layer of abstraction?
Con: Requires morphing a dictionary to an object, not pythonic?
Option 3: Everything is a dictionary item
Examples
print person['employer']['name']
person['employer']['name'] = 'Foo'
person['employer'] = new_employer
Pro: Uniform syntax for augmented and non-augmented
?: Makes it unclear that some of these accesses are actually method calls, but provides a layer of abstraction?
Con: Dictionary item syntax is error-prone to type IMHO.
Your first 2 options would require making a "Person" class and an "Employer" class, and using __dict__ to read values and setattr for writing values. This approach will be slower, but will be more flexible (you can add new methods, validation, etc.)
The simplest way would be to use only dictionaries (option 3). It wouldn't require any OOP. Personally, I also find it to be the most readable of the 3.
So, if I were you, I would use option 3. It is nice and simple, and easy to expand on later if you change your mind. If I had to choose between the first two, I would choose the second (I don't like overusing getters and setters).
P.S. I'd keep away from person.get_employer()['name'] = 'Foo', regardless of what you do.
Do not be afraid to write a custom class when that will make the subsequent code easier to write/read/debug/etc.
Option 1 is good when you're calling something that's slow/intensive/whatever -- and you'll want to save the results so you can use option 2 for subsequent access.
Option 2 is your best bet -- less typing, easier to read, create your classes once then instantiate and away you go (no need to morph your dictionary).
Option 3 doesn't really buy you anything over option 2 (besides more typing, plus allowing typos to pass instead of erroring out)
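For what it's worth, option 2 can be sketched without morphing the dictionary at all, by wrapping it. This is a minimal illustration under several assumptions: the Document class, the resolver callback and the in-memory EMPLOYERS dict are all hypothetical stand-ins (a real resolver would look the id up in a Mongo collection), and nothing here writes back to the database.

```python
class Document:
    """Attribute-style view over a dict; reference fields are resolved lazily."""

    def __init__(self, data, resolver, references=()):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_data", data)
        object.__setattr__(self, "_resolver", resolver)
        object.__setattr__(self, "_references", frozenset(references))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. for document fields.
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        if name in self._references:
            return self._resolver(name, value)  # dereference on access
        return value

    def __setattr__(self, name, value):
        # Assigning a Document to a reference field stores only its id.
        if name in self._references and isinstance(value, Document):
            value = value._data["_id"]
        self._data[name] = value


# Stand-in for an employer collection, keyed by id (assumption for the demo).
EMPLOYERS = {1: {"_id": 1, "name": "Acme"}}


def resolve(field, ref_id):
    return Document(EMPLOYERS[ref_id], resolve)


person = Document({"employer": 1}, resolve, references={"employer"})
person.employer.name = "Foo"       # updates the referenced employer's dict
print(person.employer.name)
```

The underlying dictionary stays intact in _data, so dictionary-style access (option 3) could be layered on the same wrapper with __getitem__ if both syntaxes are wanted.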

Defining constants for 0 and 1

I was wondering whether others find it redundant to do something like this...
const double RESET_TIME = 0.0;
timeSinceWhatever = RESET_TIME;
rather than just doing
timeSinceWhatever = 0.0;
Do you find the first example to aid in readability? The argument comes down to using magic numbers, and while 0 and 1 are considered "exceptions" to the rule, I've always kind of thought that these exceptions only apply to initializing variables, or index accessing. When the number is meaningful, it should have a variable attached to its meaning.
I'm wondering whether this assumption is valid, or if it's just redundant to give 0 a named constant.
Well, in your particular example it doesn't make much sense to use a constant.
But, for example, if there was even a small chance that RESET_TIME will change in the future (and become, let's say, 1) then you should definitely use a constant.
You should also use a constant if your intent is not obvious from the number. But in your particular example I think that timeSinceWhatever = 0; is more clear than timeSinceWhatever = RESET_TIME.
Typically, one benefit of defining a constant rather than just using a literal is if the value ever needs to change in several places at once.
From your own example, what if RESET_TIME needed to be -1.5 due to some obscure new business rule? You could change it in one place, the definition of the constant, or you could change it everywhere you had used 0.0 as a float literal.
In short, defining constants, in general, aids primarily in maintainability.
If you want to be more specific and let others know why you're doing what you're doing, you might want to instead create a function (if your language permits functions to float about) such as
timeSinceWhenever = ResetStopWatch();
or better yet, when dealing with units, either find a library which has built-in types for them or create your own. I wouldn't suggest creating your own for time, as there is an abundance of such libraries. I've seen this before in code, if it helps:
Temperature groundTemp = Temperature.AbsoluteZero();
which is a nice way of indicating what is going on.
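The same idea translates to other languages. As a rough Python sketch (the Temperature class below is invented for illustration and is not taken from any particular units library), a named constructor carries the meaning that a bare 0.0 would not:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Temperature:
    """Immutable temperature value stored in kelvin."""
    kelvin: float

    @classmethod
    def absolute_zero(cls):
        # Named constructor: the call site says *why* the value is 0.0.
        return cls(0.0)

    @property
    def celsius(self):
        return self.kelvin - 273.15


ground_temp = Temperature.absolute_zero()
```

Compared with a RESET_TIME-style constant, the reader does not have to look up a definition to learn the value, and the unit is part of the type.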
I would define it only if there was ever a chance that RESET_TIME could be something different than 0.0, that way you can make one change and update all references. Otherwise 0.0 is the better choice to my eye just so you don't have to trace back and see what RESET_TIME was defined to.
Constants are preferable because they let you use a value that can then be changed in successive versions of the code. It is not always possible to use constants, especially if you are programming in an OO language, and it is not possible to define a constant that doesn't contain a basic datatype. Generally, a programming language always has a way to define non-modifiable objects / datatypes.
Well, suppose that RESET_TIME is used often in your code and you want to change the value; it will be better to do it once and not in every statement.
Better than a constant, make it a configuration variable, and set it to a default value. But yes, RESET_TIME is more readable, provided it's used more than once; otherwise just use a code comment.
That code is OK. const variables are unchangeable variables, so whenever you want to reset something, you can always use your const to do that.
