I want to do a combined fit, where the normalization of a background is related between two channels (regions in my case) using an ABCD formula. I've been using a standard normfactor so far to vary the normalization in a correlated way, but that also means that the absolute parameter value is the same.
Is there a possibility to define two parameters related by a formula (effectively a scale parameter in my case)?
(I suppose this is more or less the same as pyhf: POI application using formula and corresponding issue)
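For concreteness, a stripped-down sketch of what I am doing now (channel names, yields, and the parameter name are just placeholders): a single normfactor attached to the background sample in both regions, which correlates the normalizations but forces the parameter value itself to be identical.

```python
import pyhf

# Toy spec: the same "mu_bkg" normfactor scales the background in both regions.
spec = {
    "channels": [
        {
            "name": "regionA",
            "samples": [
                {
                    "name": "background",
                    "data": [50.0, 60.0],
                    "modifiers": [
                        {"name": "mu_bkg", "type": "normfactor", "data": None}
                    ],
                }
            ],
        },
        {
            "name": "regionB",
            "samples": [
                {
                    "name": "background",
                    "data": [30.0, 40.0],
                    "modifiers": [
                        {"name": "mu_bkg", "type": "normfactor", "data": None}
                    ],
                }
            ],
        },
    ]
}

model = pyhf.Model(spec, poi_name="mu_bkg")
```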
I want to build an application that would do something equivalent to running lsof in a loop (possibly changing its output format, since string processing may make it less than real-time) and then associate each line (entry) with the iteration in which it was present, which I will refer to as a frame from here on, since that makes the rest easier to follow. My intention is that showing the periods during which files are held open by applications can reveal something about their structure, while having little impact on their execution, which is often a problem with other approaches.

One problem I have is processing the output, which would be a table relating frames to entries, and I am already anticipating wildly variable entry lengths. That runs into the classic problem of representing very different scales geometrically: the smaller values become vanishingly small while the bigger ones become huge, and fragmentation makes it even worse. So my question is whether plotting libraries deal with this problem, and how they do it.
The easiest and most well-established technique for showing both small and large values in reasonable detail is a logarithmic scale. Instead of plotting raw values, plot their logarithms. This is notoriously problematic if you can have zero or even negative values, but as I understand your situation all your lengths would be strictly positive, so this should work.
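For example, with matplotlib (assuming that's the plotting library you end up with, and with made-up entry lengths), switching to a log scale is a one-line change:

```python
import matplotlib.pyplot as plt

# Made-up entry lengths spanning several orders of magnitude.
lengths = [3, 17, 120, 4_500, 98_000, 2_300_000]

fig, ax = plt.subplots()
ax.bar(range(len(lengths)), lengths)
ax.set_yscale("log")  # compresses the large values, expands the small ones
ax.set_xlabel("entry")
ax.set_ylabel("length (log scale)")
plt.show()
```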
Another statistical solution you could apply is to plot ranks instead of raw values. Take all the observed values, and put them in a sorted list. When plotting any single data point, instead of plotting the value itself you look up that value in the list of values (possibly using binary search since it's a sorted list) then plot the index at which you found the value.
This is a monotonic transformation, so small values map to small indices and big values to big indices. On the other hand, it completely discards the actual magnitudes; only the relative comparisons matter.
If this is too radical, you could consider using it as an ingredient for something more tuneable. You could experiment with a linear combination, i.e. plot
a*x + b*log(x) + c*rank(x)
then tweak a, b and c till the result looks pleasing.
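A rough numpy-only sketch of that blend (the weights here are arbitrary starting points to tweak):

```python
import numpy as np

def rank(values):
    # Index of each value in the sorted list of all observed values.
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks

def blended(values, a=0.0, b=1.0, c=0.5):
    x = np.asarray(values, dtype=float)
    # Linear combination of raw value, log, and rank; tweak a, b, c to taste.
    return a * x + b * np.log(x) + c * rank(x)

lengths = np.array([3, 17, 120, 4_500, 98_000, 2_300_000], dtype=float)
print(blended(lengths))
```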
I am training multiple word2vec models with Gensim. Each word2vec model has the same parameters and dimensionality, but is trained on slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Its similarities to other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same position in vector space, or at least are as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
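If you just want the underlying idea without going through that class, here is a rough numpy-only sketch of aligning one vector space onto another via an orthogonal Procrustes fit (a close relative of the least-squares projection that TranslationMatrix learns); the anchor vectors below are synthetic stand-ins for rows you would pull from your two models:

```python
import numpy as np

def learn_alignment(src_vectors, tgt_vectors):
    """Orthogonal rotation mapping src anchor vectors onto tgt anchor vectors.

    src_vectors, tgt_vectors: arrays of shape (n_anchor_words, dim), where row i
    of each is the same anchor word in the two models.
    """
    # Orthogonal Procrustes: R = argmin ||src @ R - tgt||_F with R orthogonal.
    u, _, vt = np.linalg.svd(src_vectors.T @ tgt_vectors)
    return u @ vt

# Toy usage with random "models"; in practice these rows would come from
# model_a.wv[word] and model_b.wv[word] for a shared list of anchor words.
rng = np.random.default_rng(0)
dim, n_anchors = 50, 200
model_b_anchor = rng.normal(size=(n_anchors, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
model_a_anchor = model_b_anchor @ true_rotation.T  # model A is a rotated copy

R = learn_alignment(model_a_anchor, model_b_anchor)
aligned = model_a_anchor @ R  # now directly comparable to model B's vectors
print(np.allclose(aligned, model_b_anchor))  # True for this synthetic case
```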
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, both to retain some words that are the same across all corpuses (such as very frequent words) and to make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying words. A rough sketch of this tagging step is shown after these suggestions.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
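A rough sketch of the tagging step from the first suggestion above (the word lists and labels are made up; you would train a single Word2Vec model on the combined, shuffled result):

```python
import random

# Words frequent/stable enough to leave untouched, acting as shared anchors.
ANCHOR_WORDS = {"hot", "cold", "the", "a", "is"}

def tag_corpus(sentences, label):
    """Append a subcorpus label to every non-anchor word."""
    return [
        [w if w in ANCHOR_WORDS else f"{w}({label})" for w in sentence]
        for sentence in sentences
    ]

corpus_a = [["hot", "tamale", "is", "spicy"], ["skiing", "is", "cold"]]
corpus_b = [["cold", "tamale", "leftovers"], ["skiing", "season", "is", "hot"]]

combined = tag_corpus(corpus_a, "A") + tag_corpus(corpus_b, "B")
random.shuffle(combined)  # mix the subcorpora before a single training session
print(combined[:2])
# After training one Word2Vec model on `combined`, compare e.g.
# model.wv.similarity("tamale(A)", "tamale(B)").
```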
This might be a basic or trivial question with a straightforward answer. Still, I would like to ask it to clear up my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is indeed categorical data, so it makes perfect sense to convert it to a categorical variable. As I understand it, algorithms then tend to look for a pattern specific to each class. But at the same time, if you treat it as a numeric variable, it might also denote a range for, say, a decision tree: for example, passengers falling between first class and second class.
It looks like both are correct, and both will affect the machine learning algorithm's output in different ways.
Which one is appropriate, and is there an extensive discussion about this anywhere? Should we use such ambiguous variables as numeric and also keep a copy as a categorical variable, which might prove to be a technique for uncovering more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as, "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in a higher class (e.g. 2 or 3) to being in the most common class, 1, which again would be left out.
The problem with mapping categorical values to numbers is that some algorithms (e.g. neural networks) will interpret the value itself as having meaning, i.e. you would get different results if you assigned the values 1, 2, 3 to the passenger classes than if you assigned, for example, 0, 1, 2 or 3, 2, 1. The correspondence between the passenger classes and the numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is; however, it's still hard to interpret that as "first class is twice as good as second class", unless you define some measure of "goodness" that makes the relation between the numbers 1 and 2 sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
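One common way to treat them as independent categories in practice is one-hot (dummy) encoding. A minimal sketch with pandas, assuming the usual Pclass column from the Titanic data:

```python
import pandas as pd

# Toy passenger classes; in the real data this would be the Pclass column.
df = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1]})

# One indicator column per class (Pclass_1, Pclass_2, Pclass_3), so no
# ordering or spacing between the classes is implied.
dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")
print(dummies)
```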
I have had a look at this post about geohashes. According to the author, the final step in calculating the hash is interleaving the x and y index values. But is this really necessary? Is there a proper reason not to just concatenate these values, as long as the hash table is built according to that altered indexing rule?
From the wiki page:
Geohashes offer properties like arbitrary precision and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).
If you simply concatenated the x and y coordinates, then users trying to reduce precision would have to take a lot more care, removing exactly the right number of characters from both the x and the y coordinate.
There is a related (and more important) reason than arbitrary precision: Geohashes with a common prefix are close to one another. The longer the common prefix, the closer they are.
54.321 -2.345 has geohash gcwm48u6
54.322 -2.346 has geohash gcwm4958
(See http://geohash.org to try this)
This feature enables fast lookup of nearby points (though there are some complications), and only works because we interleave the two dimensions to get a sort of approximate 2D proximity metric.
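To make the interleaving concrete, here is a rough, self-contained sketch of the standard encoding in Python (not a production implementation); it should reproduce the hashes for the two points above:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # the geohash alphabet

def geohash(lat, lon, precision=8):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    use_lon = True  # the interleaving starts with a longitude bit
    while len(bits) < precision * 5:
        interval = lon_range if use_lon else lat_range
        coord = lon if use_lon else lat
        mid = (interval[0] + interval[1]) / 2
        if coord >= mid:          # upper half of the interval -> bit 1
            bits.append(1)
            interval[0] = mid
        else:                     # lower half -> bit 0
            bits.append(0)
            interval[1] = mid
        use_lon = not use_lon     # alternate lon, lat, lon, lat, ...
    # Pack each group of 5 bits into one base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)

print(geohash(54.321, -2.345))  # gcwm48u6
print(geohash(54.322, -2.346))  # gcwm4958
```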
As the wikipedia entry goes on to explain:
When used in a database, the structure of geohashed data has two advantages. First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.
Note that the converse is not always true - if two points happen to lie on either side of a subdivision (e.g. either side of the equator) then they may be extremely close but have no common prefix. Hence the complications I mentioned earlier.
I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (eg 3 values per object, colour/taste/age, though it could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ by more than 1), and the distribution of values per set should be as fair as possible (eg try to have close to the same number of red objects in each set, and likewise for the other colours, as well as for tastes and ages, etc).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, ie each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem: http://en.wikipedia.org/wiki/Cluster_analysis
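For instance, a rough sketch of one way to act on that (not a tuned algorithm, and the attribute names are made up): group objects that share the same value combination (the degenerate "clusters" for purely categorical attributes), then deal each group out round-robin across the sets, so that set sizes differ by at most one and identical objects are spread as evenly as possible.

```python
from collections import defaultdict

def distribute(objects, num_sets):
    """objects: list of dicts like {"colour": "red", "taste": "sweet", "age": "old"}."""
    # Group objects with identical value combinations together.
    groups = defaultdict(list)
    for obj in objects:
        groups[tuple(sorted(obj.items()))].append(obj)

    # Deal each group out round-robin, carrying the pointer across groups so
    # the overall set sizes stay within one of each other.
    sets = [[] for _ in range(num_sets)]
    i = 0
    for group in groups.values():
        for obj in group:
            sets[i % num_sets].append(obj)
            i += 1
    return sets

objects = [
    {"colour": "red", "taste": "sweet", "age": "new"},
    {"colour": "red", "taste": "sweet", "age": "new"},
    {"colour": "red", "taste": "sour", "age": "old"},
    {"colour": "blue", "taste": "sweet", "age": "old"},
    {"colour": "blue", "taste": "sour", "age": "new"},
    {"colour": "blue", "taste": "sour", "age": "old"},
]
for s in distribute(objects, 2):
    print(s)
```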