Specifying the form of names for categories generated by patsy/statsmodels 'C' - categorical-data

By default, Patsy's C seems to generate categories with names of the form
C(color, Treatment('White'))[T.Green]
at least when used in a formula provided to statsmodels ols. Is there a way to make C generate less verbose category names, e.g., of the form
colorGreen
or even simply
Green

There's an issue for this open. Please discuss alternatives there.
https://github.com/pydata/patsy/issues/19

A bit late to the party, but for those searching for this in 2021:
If you're prepared to do a bit of wrangling, you can take apart the statsmodels Summary object (returned when calling summary() on a fitted model), convert it to a DataFrame, and format it from there.
The Summary object has a tables attribute. The first table is the result of the fit; the second is the coefficients table. The tables have an as_html() method whose output you can pass to the pandas read_html() function.
df = pd.read_html(your_fitted_model.summary().tables[1].as_html(), header=0)[0]
From there you can strip out the patsy formatting via regular string and dataframe methods.
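As a minimal sketch of that last step, the helper below shortens names of the form shown in the question using a regular expression. The function name short_name and its keep_variable flag are made up for illustration, and the pattern only covers single categorical terms such as C(color, Treatment('White'))[T.Green] or color[T.Green]; interaction terms are not handled.

```python
import re

# Matches an optional "C(", the variable name, anything up to the last "[",
# and the level inside the brackets (with an optional "T." prefix).
_PATSY_NAME = re.compile(r"^(?:C\()?(?P<var>\w+).*\[(?:T\.)?(?P<level>[^\]]+)\]$")


def short_name(name, keep_variable=True):
    """Return 'colorGreen' or 'Green' for a patsy-style categorical name."""
    match = _PATSY_NAME.match(name)
    if match is None:
        # Names without a bracketed level (e.g. "Intercept") are left alone.
        return name
    prefix = match.group("var") if keep_variable else ""
    return prefix + match.group("level")


print(short_name("C(color, Treatment('White'))[T.Green]"))
print(short_name("C(color, Treatment('White'))[T.Green]", keep_variable=False))
```

With the DataFrame from above, something like df.iloc[:, 0].map(short_name) should apply it to the column of coefficient names.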


Web2py Number Formatting for Thousands

I'm sort of new to Web2py. I have a system that's working just fine, but I want to make an improvement regarding visualization. There's a couple of fields that use numbers (defined as double in their respective define_table methods) to represent currency, but I want them to also show with a separator for thousands, such as 183,403,293.34. I checked some documentation, but I couldn't find a direct way to handle this form of customization, though I could be missing something.
Any suggestions regarding this? Cheers!
First, if representing currency, you should use the decimal field type rather than double (some calculations using double values may yield incorrect results due to the use of floating point representations internally). However, if using SQLite, there is no distinction between decimal and double, so in that case, you might want to multiply all values by 100 and instead store integers.
In any case, to display a given numeric value with thousands separators in Python, you can do:
'{:,}'.format(myvalue)
For more details, see https://stackoverflow.com/a/10742904/440323 and https://stackoverflow.com/a/21208495/440323.
If you are using the values via web2py functionality that makes use of the field's represent function (e.g., the grid or the .render() method), you can define a custom represent function, such as:
Field('amount', 'decimal(12, 2)',
represent=lambda v, r: '{:,}'.format(v) if v is not None else '')
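A small sketch of the formatting itself, assuming two decimal places are always wanted. The helper names format_amount and format_cents are hypothetical; format_cents covers the integer-cents storage suggested for SQLite.

```python
from decimal import Decimal


def format_amount(value):
    """Format a number with thousands separators and two decimals."""
    if value is None:
        return ''
    return '{:,.2f}'.format(value)


def format_cents(cents):
    """Format an integer number of cents as a currency amount."""
    return format_amount(Decimal(cents) / 100)


print(format_amount(Decimal('183403293.34')))
print(format_cents(18340329334))
```

Either helper could be used inside a represent lambda in place of the inline format call.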
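You could also use the locale module. Note that locale.format was removed in Python 3.12 in favour of locale.format_string, and that grouping only takes effect once a locale with a thousands separator has been set via locale.setlocale():
{{=locale.format_string('%.2f', your_value, grouping=True)}}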

Change lgbm internal parameter (threshold) by hand

I have trained a model with lgbm. I can dump its internal values with
booster.dump_model()
and see all the internal parameters that have been optimized during training (leaf values, thresholds, index of the variable for each split, ...). For testing purposes I would like to change some of them. Is there a way? I guess that changing just the output of dump_model will do nothing.
You can save your model to a human-readable format using booster.save_model('model.txt'), make your modifications in model.txt, and load the modified model back using modified_booster = lightgbm.Booster(model_file='model.txt').
I hope it helps!
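The edit step can be scripted rather than done by hand. The sketch below is plain text manipulation and assumes the saved file contains one "Tree=<n>" line per tree followed by a "threshold=" line of space-separated values; check your own model.txt before relying on that layout. The function name set_threshold is hypothetical.

```python
def set_threshold(model_text, tree_index, split_index, new_value):
    """Return model_text with one threshold of one tree replaced."""
    lines = model_text.splitlines()
    in_tree = False
    for i, line in enumerate(lines):
        if line.startswith("Tree="):
            # Track whether we are inside the requested tree's block.
            in_tree = line == f"Tree={tree_index}"
        elif in_tree and line.startswith("threshold="):
            values = line[len("threshold="):].split()
            values[split_index] = repr(float(new_value))
            lines[i] = "threshold=" + " ".join(values)
            return "\n".join(lines) + "\n"
    raise ValueError(f"no threshold line found for tree {tree_index}")


# Assumed usage, after booster.save_model('model.txt'):
# with open('model.txt') as f:
#     edited = set_threshold(f.read(), tree_index=0, split_index=0, new_value=1.25)
# with open('model_edited.txt', 'w') as f:
#     f.write(edited)
```

If the file header records per-tree sizes, changing the length of a tree's text may require adjusting it as well; this has not been verified here, so compare predictions of the reloaded model against the original on a few rows.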

Gensim's FastText KeyedVector out of vocab

I want to use the read-only version of Gensim's FastText Embedding to save some RAM compared to the full model.
After loading the KeyedVectors version, I get the following error when fetching a vector:
IndexError: index 878080 is out of bounds for axis 0 with size 761210
The error occurs when using words that should be out-of-vocabulary e.g. "lawyerxy" instead of "lawyer". The full model returns a vector for both.
from gensim.models import KeyedVectors
model = KeyedVectors.load("model.kv")
model.wv.__getitem__("lawyerxy")
So, my assumption is that KeyedVectors does not offer FastText's out-of-vocabulary functionality, a key feature for my use case. This limitation is not mentioned in the documentation:
https://radimrehurek.com/gensim/models/word2vec.html
Can anyone confirm that assumption and/or name a fix that allows vectors for "lawyerxy" etc.?
The KeyedVectors name is (as of gensim-3.8.0) just an alias for class Word2VecKeyedVectors, which only maintains a simple word (as key) to vector (as value) mapping.
You shouldn't expect FastText's advanced ability to synthesize vectors for out-of-vocabulary words to appear in any model/representation that doesn't explicitly claim to offer that ability.
(I would expect a lookup of an out-of-vocabulary word to give a clearer KeyError rather than the IndexError you've reported. But, you'd need to show exactly what code created the file you're loading, and triggered the error, and the full error stack, to further guess what's going wrong in your case.)
Depending on how your model.kv file was saved, you might be able to load it, with retained OOV-vector functionality, by using the class FastTextKeyedVectors instead of plain KeyedVectors.
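A hedged sketch of that suggestion: the import location of FastTextKeyedVectors has differed between gensim releases, so the helper below (a hypothetical name, load_fasttext_vectors) tries both. Whether out-of-vocabulary lookups then work still depends on how model.kv was originally saved.

```python
def load_fasttext_vectors(path):
    """Load saved vectors with the FastText-aware KeyedVectors class."""
    try:
        # Location in newer gensim releases.
        from gensim.models.fasttext import FastTextKeyedVectors
    except ImportError:
        # Location in the gensim 3.x series.
        from gensim.models.keyedvectors import FastTextKeyedVectors
    return FastTextKeyedVectors.load(path)


# Assumed usage:
# vectors = load_fasttext_vectors("model.kv")
# vectors["lawyerxy"]  # may be synthesized from n-grams if they were saved
```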

Correlating multiple dynamic values

How can I get the value of important id and ValueType?
I have tried using web_reg_save_param_regexp (but unfortunately I don't fully understand how the function works).
I have also tried using web_reg_save_param (with the help of offset and length).
Unfortunately, I still cannot get accurate values, because some values change in length, especially the total amount values, which change dynamically on each run.
<important id=\"insertsomevalueshere\" record=\"1\" nucTotal=\"NUC609.40\"><total amount=\"68.75\" currency=\"USD\"/><total amount=\"609.40\" currency=\"USD\"/><out avgsomecost=\"540.65\" ValueType=\"insertsomevalueshere\" containsawesomeness=\"1\" Score=\"-97961\" somedatatype=\"1\" typeofData=\"VAL\" web=\"1\">
Put these lines of code before the line of code which does your web request:
web_reg_save_param_regexp("ParamName=importantid","Regexp=<important id=\\\"(.*?)\\\"",LAST);
web_reg_save_param_regexp("ParamName=ValueType","Regexp= ValueType=\\\"(.*?)\\\"",LAST);
You will then have two stored parameters 'importantid' and 'ValueType'
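To see what those two patterns capture, here is a small check written in Python rather than in the LoadRunner script itself. The sample body is a shortened copy of the response in the question, with "abc123" and "XYZ" as made-up stand-ins for the dynamic values; the patterns are the same ones as above once the C string escaping is removed.

```python
import re

# Shortened sample response; the backslash before each quote is part of the data.
body = (r'<important id=\"abc123\" record=\"1\" nucTotal=\"NUC609.40\">'
        r'<total amount=\"68.75\" currency=\"USD\"/>'
        r'<out avgsomecost=\"540.65\" ValueType=\"XYZ\" web=\"1\">')


def first_capture(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# "\\" matches the literal backslash, and "(.*?)" lazily captures up to the next \".
important_id = first_capture(r'<important id=\\"(.*?)\\"', body)
value_type = first_capture(r' ValueType=\\"(.*?)\\"', body)
print(important_id, value_type)
```

Because the capture is lazy and bounded by the closing \", it does not depend on the length of the value, which is what made the offset-and-length approach unreliable.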
Dynamic number of elements to correlate? Your path for resubmission is through web_custom_request(). You will need to build the string you need dynamically with the name:value pairs for all of the data which needs to be included.
This path will place a premium on your string manipulation skills in the language of the tool. The default path is through C, but you have other language options if your skills are more refined in another language.

How can I retrieve object keys from a sequence in freemarker?

I have a list of objects that is returned as a sequence, and I would like to retrieve the keys of each object so that I can display the object correctly. At the moment I try data?first?keys, which seems to return something like the queries that produce the objects (I'm not sure how to explain that last part, but the image below shows what I mean).
The number of objects returned is correct (7), but my aim is to display the keys of each object. The macro that attempts this is here (from the Apache OFBiz development book, chapter 8).
It seems that my sequence is a list of hashes, and as explained by Daniel Dekany in this post:
The original problem is that, someHash[key] expects a
string as key. Because, the hash type of FTL, by definition, maps
string keys to arbitrary values. It's not the same as Java's Map.
(Note that to further complicate the matters, in FTL
someSequenceOrString[index] expects an integer index. So, the [] thing
is used for that too.) Now someBeanWrappedMap(key) has technically
nothing to do with all the []-s, it's just a method call, so it
accepts all kind of keys. If you have a Map with non-string keys, you
must use that.
Thanks D Dekany if you're on stack, this ended my half day frustration with the ftl template.
