How to implement custom Gibbs sampling scheme in pymc - pymc

I have a hidden Markov stochastic volatility model (represented as a linear state space model). I am using a hand-written Gibbs sampling scheme to estimate parameters for the model. The actual sampler requires some fairly sophisticated update rules that I believe I need to write by hand. You can see an example of a Julia version of these update rules here.
My question is the following: how can I specify the model in a custom way and then hand the job of running the sampler and collecting the samples to pymc? In other words, I am happy to provide code to do all the heavy lifting (how to update each block of parameters on each scan -- utilizing full conditionals within each block), but I want to let pymc handle the "accounting" for me.
I realize that I will probably need to provide more information so that others can answer this question. The problem is I am not sure exactly what information will be useful. So, if you feel you can help me out with this, but need more information -- please let me know in a comment and I will update the question.

Here is an example of a custom sampler in PyMC2:
class BDSTMetropolis(mc.Metropolis):
def __init__(self, stochastic):
mc.Metropolis.__init__(self, stochastic, scale=1., proposal_sd='custom',
proposal_distribution='custom', verbose=None, tally=False)
def propose(self):
T = self.stochastic.value
T.u_new, T.v_new = T.edges()[0]
while T.has_edge(T.u_new, T.v_new):
T.u_new, T.v_new = random.choice(T.base_graph.edges())
T.path = nx.shortest_path(T, T.u_new, T.v_new)
i = random.randrange(len(T.path)-1)
T.u_old, T.v_old = T.path[i], T.path[i+1]
T.remove_edge(T.u_old, T.v_old)
T.add_edge(T.u_new, T.v_new)
self.stochastic.value = T
def reject(self):
T = self.stochastic.value
T.add_edge(T.u_old, T.v_old)
T.remove_edge(T.u_new, T.v_new)
self.stochastic.value = T
It pretty different than your model, but it should demonstrate all the parts. Does that give you enough to go on?

Related

How to include an array of weights to adjust importance of observed data in sm.tsa.UnobservedComponents?

I have used the following 5 lines to achieve a kalman filter with your work for a smoothed pricing model, and it worked great.
mod = sm.tsa.UnobservedComponents(obs, 'local level')
lm = sm.OLS(obs, xlm, missing='drop').fit()
obs_noise = abs(lm.resid).mean()
params = [obs_noise, obs_noise / obs_noise_level]
mod_filter, mod_smooth = mod.filter(params), mod.smooth(params)
However currently I would like to adjust the filtering smoothness at certain time, for example, when unemployment rate or interest rate made a big surge, I would like to make the output (Kalman filtered/smoothed) value closer to the observed value, while in most other time I will keep the what it is from the model. So, I have created an array, while a few items greater than 1, and the others will be exactly 1.
e.g.: ir_coeff = np.array([1,1,1,1,1.345,1.23,1.78,1,1,1])
What could be the best approach to achieve this? Thank you a lot in advance.
I have tried to include it in the output file with a dot product operation, however it is not reasonable to do this.

What's difference RobertaModel, RobertaSequenceClassification (hugging face)

I try to use hugging face transformers api.
As I import library , I have some questions. If anyone who know the answer, please tell me your knowledge.
transformers library have several models that are trained. transformers provide not only bare model like 'BertModel, RobertaModel, ... but also convenient heads like 'ModelForMultipleChoice' , 'ModelForSequenceClassification', 'ModelForTokenClassification' , ModelForQuestionAnswering.
I wonder what's difference between bare model adding new linear transformation myself and modelforsequenceclassification.
what's different custom model (pretrained model with random intialized linear) and transformers modelforsequenceclassification.
is ModelforSequenceClassification trained from glue data?
I look forward to someone's reply Thanks.
I think it's easiest to understand if we have a look at the actual implementation, where I randomly chose RobertaModel and RobertaForSequenceClassification as an example. However, the conclusion is valid for all other models, too.
You can find the implementation for RobertaForSequenceClassification here, which looks roughly like this:
class RobertaForSequenceClassification(RobertaPreTrainedModel):
authorized_missing_keys = [r"position_ids"]
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)
self.classifier = RobertaClassificationHead(config)
self.init_weights()
[...]
def forward([...]):
[...]
As we can see, there is no indication about the pretraining here, and it simply adds another linear layer on top (the implementation of the RobertaClassificationHead can be found a bit further down, namely here):
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
So, to answer your question: These models come without any pretrained additional layers on top, and you could easily implement them yourself*.
Now for the asterisk: While it could be easy to wrap this yourself, also note that it is an inherited class RobertaPreTrainedModel. This has several advantages, the most important one being a consistent design between different implementations (sequence classification model, sequence tagging model, etc.). Further, there are some neat functionalities that they are providing, like the forward call including extensive parameters (padding, masking, attention output, ...), which would cost quite some time to implement.
Last but not least, there are existing trained models based on these specific implementations, which you can search for on the Huggingface Model Hub. There, you might find models that are fine-tuned on a sequence classification task (e.g., this one), and then directly load its weights in a RobertaForSequenceClassification model. If you had your own implementation of a sequence classification model, loading and aligning these pre-trained weights would be incredibly more complicated.
I hope this answers your main concern, but feel free to elaborate (either as comment or new question) on any points that have not been addressed!

how to handle spelling mistake(typos) in entity extraction in Rasa NLU?

I have few intents in my training set(nlu_data.md file) with sufficient amount of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in training file are working fine. But if any input query is having spelling mistake e.g, hotol/hetel/hotele for hotel keyword then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only training data, also restricted not to write any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorFeaturizer. That should be in the default pipeline described on this page already
One thing I would highly suggest you to use is to use look-up tables with fuzzywuzzy matching. If you have limited number of entities (like country names) look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (searching for typo variations of those entities). There's a whole blogpost about it here: on Rasa.
There's a working implementation of fuzzy wuzzy as a custom component:
class FuzzyExtractor(Component):
name = "FuzzyExtractor"
provides = ["entities"]
requires = ["tokens"]
defaults = {}
language_list ["en"]
threshold = 90
def __init__(self, component_config=None, *args):
super(FuzzyExtractor, self).__init__(component_config)
def train(self, training_data, cfg, **kwargs):
pass
def process(self, message, **kwargs):
entities = list(message.get('entities'))
# Get file path of lookup table in json format
cur_path = os.path.dirname(__file__)
if os.name == 'nt':
partial_lookup_file_path = '..\\data\\lookup_master.json'
else:
partial_lookup_file_path = '../data/lookup_master.json'
lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
with open(lookup_file_path, 'r') as file:
lookup_data = json.load(file)['data']
tokens = message.get('tokens')
for token in tokens:
# STOP_WORDS is just a dictionary of stop words from NLTK
if token.text not in STOP_WORDS:
fuzzy_results = process.extract(
token.text,
lookup_data,
processor=lambda a: a['value']
if isinstance(a, dict) else a,
limit=10)
for result, confidence in fuzzy_results:
if confidence >= self.threshold:
entities.append({
"start": token.offset,
"end": token.end,
"value": token.text,
"fuzzy_value": result["value"],
"confidence": confidence,
"entity": result["entity"]
})
file.close()
message.set("entities", entities, add_to_output=True)
But I didn't implement it, it was implemented and validated here: Rasa forum
Then you will just pass it to your NLU pipeline in config.yml file.
Its a strange request that they ask you not to change the code or do custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data. E.g. Generate misspelled words (typos)
First of all, add samples for the most common typos for your entities as advised here
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not you need to create a custom component. Otherwise, dealing with only training data is not feasible. You can't create samples for each typo.
Using Fuzzywuzzy is one of the ways, generally, it is slow and it doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code in any way.

ROC on multiple test sets in h2o (python)

I had a use-case that I thought was really simple but couldn't find a way to do it with h2o. I thought you might know.
I want to train my model once, and then evaluate its ROC on a few different test sets (e.g. a validation set and a test set, though in reality I have more than 2) without having to retrain the model. The way I know to do it now requires retraining the model each time:
train, valid, test = fr.split_frame([0.2, 0.25], seed=1234)
rf_v1 = H2ORandomForestEstimator( ... )
rf_v1.train(features, var_y, training_frame=train, validation_frame=valid)
roc = rf_v1.roc(valid=1)
rf_v1.train(features, var_y, training_frame=train, validation_frame=test) # training again with the same training set - can I avoid this?
roc2 = rf_v1.roc(valid=1)
I can also use model_performance(), which gives me some metrics on an arbitrary test set without retraining, but not the ROC. Is there a way to get the ROC out of the H2OModelMetrics object?
Thanks!
You can use the h2o flow to inspect the model performance. Simply go to: http://localhost:54321/flow/index.html (if you changed the default port change it in the link); type "getModel "rf_v1"" in a cell and it will show you all the measurements of the model in multiple cells in the flow. It's quite handy.
If you are using Python, you can find the performance in your IDE like this:
rf_perf1 = rf_v1.model_performance(test)
and then print the ROC like this:
print (rf_perf1.auc())
Yes, indirectly. Get the TPRs and FPRs from the H2OModelMetrics object:
out = rf_v1.model_performance(test)
fprs = out.fprs
tprs = out.tprs
roc = zip(fprs, tprs)
(By the way, my H2ORandomForestEstimator object does not seem to have an roc() method at all, so I'm not 100% sure that this output is in the exact same format. I'm using h2o version 3.10.4.7.)

Managing Setup code with TimeIt

As part of a pet project of mine, I need to test the performance of various different implementations of my code in Python. I anticipate this to be something I do alot of, and I want to try to make the code I write to serve this aim as easy to update and modify as possible.
It's still in its infancy at the moment, but I've taken to using strings to manage common setup or testing code, eg:
naiveSetup = 'from PerformanceTests.Vectors import NaiveVector\n' \
+ 'left = NaiveVector([1,0,0])\n' \
+ 'right = NaiveVector([0,1,0])'
This allows me to only write the code once, at the expense of making it harder to read and clunky to update.
Is there a better way?
Use triple quotes """
setup_code = """
from PerformanceTests.Vectors import NaiveVector
left = NaiveVector([1,0,0])
right = NaiveVector([0,1,0])
"""
Another interesting method is provided in the docs of timeit:
def test():
"Stupid test function"
L = []
for i in range(100):
L.append(i)
if __name__=='__main__':
from timeit import Timer
t = Timer("test()", "from __main__ import test")
print t.timeit()
Though this isn't suitable for all needs.
Timing code is fine, but it will still leave you guessing what's going on.
To find out what's actually going on, manually pause it a few random times in the debugger, and examine the call stack.
For example, in the code that is 30x slower in one implementation than in another, each sample of the stack has a 96.7% chance of falling in the extra time that it is spending, so you can see why.
No guesswork required.

Resources