Using multiple models with repeated keywords in `hydra` - yaml

I'm pretty new to Hydra and I'm trying to better understand the config.yaml file. I'm running a deep learning experiment with two separate models: an embedding network and a simple fully connected neural network. The first one creates features, and the second basically fine-tunes the results.
I would like to quickly access some configuration parameters for both models. For now I have simply tried to put everything in the same config.yaml file:
parameters_embnet:
  _target_: model.EmbNet_Lightning
  model_name: 'EmbNet'
  num_atom_feats: 200
  dim_target: 128
  loss: 'log_ratio'
  lr: 1e-3
  wd: 5e-6
data_embnet:
  _target_: data.CompositionDataModule
  dataset_name: 's'
  batch_size: 64
  data_path: './s.csv'
wandb_embnet:
  _target_: pytorch_lightning.loggers.WandbLogger
  name: embnet_logger
trainer_embnet:
  max_epochs: 1000
parameters_nn:
  _target_: neuralnet.SimpleNeuralNetwork_Lightning
  input_size: 200
  lr: 1e-3
  wd: 5e-6
  loss: 'log_ratio'
data_nn:
  _target_: neuralnet.nn_dataset_lightning
  batch_size: 128
wandb_nn:
  _target_: pytorch_lightning.loggers.WandbLogger
  name: neuralnet_logger
trainer_nn:
  max_epochs: 150
but trying to use this configuration results in a ConstructorError, since some keys (like lr) are duplicated across the two models. I'm wondering whether this is the correct way to proceed, or whether I should set up multiple config.yaml files, and if so, what the best way to do that is.

It's not clear exactly what you are trying to do, but it is not legal for a YAML mapping to contain the same key multiple times.
This block in particular looks like it both has the same key multiple times and is incorrectly indented:
parameters_nn:
  _target_: neuralnet.SimpleNeuralNetwork_Lightning
  input_size: 200
  lr: 1e-3
  wd: 5e-6
  loss: 'log_ratio'
  lr: 1e-3
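As a quick check (a sketch assuming OmegaConf, which Hydra uses to parse YAML; the snippet is illustrative, not taken from the question's code), loading a mapping with a repeated key reproduces this kind of error:

from omegaconf import OmegaConf

# A mapping that repeats a key fails to load; OmegaConf/Hydra reports
# a yaml.constructor.ConstructorError ("found duplicate key").
bad_yaml = """
parameters_nn:
  lr: 1e-3
  lr: 1e-3
"""
cfg = OmegaConf.create(bad_yaml)  # raises ConstructorError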

Based upon the OP's comment ("I would like to quickly access some parameters relative to the configuration for both models"), I infer a question about common params: a base param set plus customization over a couple of key concepts.
See my post in Use a parameter multiple times in hydra config file. I give a secondary example that may answer your implied question.
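As a minimal sketch of that idea (assuming OmegaConf-style ${...} interpolation, which Hydra configs support; the shared block name is illustrative):

from omegaconf import OmegaConf

# Common values live in one 'shared' block; each model section references
# them via interpolation, so no single mapping repeats a key.
cfg = OmegaConf.create("""
shared:
  lr: 1e-3
  wd: 5e-6
  loss: log_ratio

parameters_embnet:
  _target_: model.EmbNet_Lightning
  lr: ${shared.lr}
  wd: ${shared.wd}
  loss: ${shared.loss}

parameters_nn:
  _target_: neuralnet.SimpleNeuralNetwork_Lightning
  input_size: 200
  lr: ${shared.lr}
  wd: ${shared.wd}
  loss: ${shared.loss}
""")

print(cfg.parameters_nn.lr)  # same value as shared.lr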
.. Otto

Related

Meaning of diff_from_typical in ElasticSearch Machine Learning custom rules

When configuring Machine Learning jobs in ES, you can customise your detectors by using custom_rules.
I'm wondering about the actual meaning of diff_from_typical (one of the values that applies_to can take). My main question is whether diff_from_typical considers the absolute difference or not. I know that you can use lt or gt operators later (among others), but let's imagine the following situation:
I have a custom rule for two jobs. The rule is the same but the scenarios are different. Let's say that the custom rule is:
"custom_rules": [{
"actions": ["skip_model_update"],
"conditions": [
{
"applies_to": "diff_from_typical",
"operator": "gt",
"value": 2000
}
]
}]
Case scenario A:
Typical value: 5000
Actual value: 2000
diff_from_typical: 5000 - 2000 = 3000
Case scenario B:
Typical value: 5000
Actual value: 8000
diff_from_typical: 5000 - 8000 = -3000
Will the aforementioned custom rule apply in both cases? I mean, using the absolute difference from typical? Or will it only work in the first case (case A)?
I assume that if it only works for the first case, I should write the "inverse" custom rule to manage both cases.
Thanks in advance!
The question was answered in the ES forum: https://discuss.elastic.co/t/explain-diff-from-typical-meaning-in-custom-rules/304419
Basically, it means absolute difference, so it covers both sides.
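As a small illustration (a standalone sketch of the behaviour, not the ES implementation), with the absolute difference both scenarios trigger the rule:

# Sketch of the rule's behaviour under the absolute-difference semantics
# confirmed in the forum answer (operator "gt", value 2000).
def rule_applies(typical, actual, value=2000):
    return abs(typical - actual) > value

print(rule_applies(5000, 2000))  # True: scenario A, |5000 - 2000| = 3000
print(rule_applies(5000, 8000))  # True: scenario B, |5000 - 8000| = 3000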

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file) with a sufficient number of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At testing time, all sentences in the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele instead of hotel, then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am only allowed to change the training data, and I am also restricted from writing any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be part of the default pipeline described on this page.
One thing I would highly suggest is using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when an entity exists in your look-up table (searching for typo variations of those entities). There's a whole blog post about it on the Rasa blog.
There's a working implementation of fuzzywuzzy as a custom component:
import os
import json

from fuzzywuzzy import process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # rasa_nlu.components in older Rasa versions

# STOP_WORDS is just a set of stop words from NLTK
STOP_WORDS = set(stopwords.words('english'))

class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities'))

        # Get file path of the lookup table (json format) relative to this file
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')
        for token in tokens:
            if token.text not in STOP_WORDS:
                # Fuzzy-match the token against every lookup entry
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value']
                        if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum.
Then you just add it to your NLU pipeline in the config.yml file.
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. one that generates misspelled words (typos), as sketched below.
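A minimal, hypothetical sketch of such a generator (the helper name typo_variants is made up for illustration), emitting lines in the synonym annotation format:

import random

def typo_variants(word, n=5, seed=0):
    # Produce n distinct variants via one random deletion, swap, or substitution.
    rng = random.Random(seed)
    letters = 'abcdefghijklmnopqrstuvwxyz'
    variants = set()
    while len(variants) < n:
        i = rng.randrange(len(word))
        op = rng.choice(['delete', 'swap', 'substitute'])
        if op == 'delete' and len(word) > 2:
            v = word[:i] + word[i+1:]
        elif op == 'swap' and i < len(word) - 1:
            v = word[:i] + word[i+1] + word[i] + word[i+2:]
        else:
            v = word[:i] + rng.choice(letters) + word[i+1:]
        if v != word:
            variants.add(v)
    return sorted(variants)

for v in typo_variants('hotel'):
    print('- looking for a [{}](place:hotel)'.format(v))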
First of all, add samples for the most common typos for your entities, as advised here.
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Dealing with the training data alone is otherwise not feasible: you can't create samples for every typo.
Using fuzzywuzzy is one of the options; generally, it is slow and it doesn't solve all the issues.
The Universal Sentence Encoder is another solution.
There should be more options for spell correction, but you will need to write code either way.
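As a rough sketch of the spellchecker route (assuming the pyspellchecker package; correcting the text before it reaches the NLU pipeline is one way to go):

from spellchecker import SpellChecker

spell = SpellChecker()

def correct_text(text):
    # correction() returns the most likely spelling, or None if no
    # candidate is found, in which case we keep the original word.
    return ' '.join(spell.correction(w) or w for w in text.split())

print(correct_text('find good hotol for me in Mumbai'))
# expected: 'find good hotel for me in Mumbai'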

How to increase the maximum title length from 100 to 200 in Testlink

In "Create Test case" option, I want to give test case name length more than 100 characters. But it takes only 100 characters and trims the rest of the name. I want to change that limit to 200 characters.
Kindly guide me as in which file in TestLink-1.9.7 I need to make changes and where?
That's easy!
Please read the guidelines from:
http://www.embeddedsystemtesting.com/2011/11/how-to-change-test-case-title-in-test.html
In general, what you need to do is modify the DB design for one field and modify one file:

1. In the database: change the defined size of the name field in the nodes_hierarchy table to 255 (the default is 100).

2. Modify /gui/templates/input_dimensions.conf: search for the following parameters and increase them (the default for each is 100):

SRS_TITLE_MAXLEN=255
REQ_SPEC_TITLE_MAXLEN=255
TESTCASE_NAME_MAXLEN=255

In case you need to increase the limits for other items, you can make the changes described in step 2 above for other parameters as well, such as TESTPLAN_NAME_MAXLEN, CONTAINER_NAME_MAXLEN, etc.

3. For the use case of creating test cases via import, you also need to modify the file /cfg/const.inc.php:

$g_field_size->node_name = 255;
$g_field_size->testcase_name = 255;

etc. (depending on which items' limits you want to increase: names of test cases, test suites, requirements, and so on).

Questions about creating Stanford CoreNLP training models

I've been working with Stanford CoreNLP to perform sentiment analysis on some data I have, and I'm working on creating a training model. I know we can create a training model with the following command:
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
I know what goes in the train.txt file. You score sentences and put them in train.txt, something like this:
(0 (2 Today) (0 (0 (2 is) (0 (2 a) (0 (0 bad) (2 day)))) (..)))
But I don't understand what goes in the dev.txt file.
I read through this question several times to try to understand what goes in dev.txt, but it's still unclear to me. Also, scoring these sentences manually has become a pain. Is there a tool available that makes it easier? I'm worried that I've been using the wrong number of parentheses or making some other silly mistake like that.
Also, any suggestions on how long my train.txt file should be? I'm thinking of scoring 1000 sentences. Is that number too small, too large?
All your help is appreciated :)
dev.txt should be the same as train.txt just with a different set of sentences. Note that the same sentence should not appear in dev.txt and train.txt. The development set is used to evaluate the quality of the model you train on the training data.
We don't distribute a tool for tagging sentiment data. This class could be helpful in building data: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html
Here are the sizes of the train, dev, and test sets used for the sentiment model: train=8544, dev=1101, test=2210
Here is some sample code for evaluating a model
// load a model
SentimentModel model = SentimentModel.loadSerialized(modelPath);
// load devTrees
List<Tree> devTrees;
devTrees = SentimentUtils.readTreesWithGoldLabels(devPath);
// evaluate on devTrees
Evaluate eval = new Evaluate(model);
eval.eval(devTrees);
eval.printSummary();
You can find what you need to import, etc... by looking at:
edu/stanford/nlp/sentiment/SentimentTraining.java
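Since unbalanced parentheses are an easy mistake to make when hand-scoring trees, here is a small standalone sketch (Python, not part of CoreNLP) that checks each line of a data file:

# Verify that every non-empty line in train.txt / dev.txt has balanced
# parentheses; unbalanced lines are reported with their line number.
def balanced(line):
    depth = 0
    for ch in line:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

with open('train.txt') as f:
    for lineno, line in enumerate(f, start=1):
        if line.strip() and not balanced(line):
            print('Unbalanced parentheses on line', lineno)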

Bestw.d format syntax in SAS

I am converting a character variable to a numeric variable. I am using a bestw.d format. I also tried just best. as the format in the input statement, and this worked fine. I can't find any mention of using just best. instead of bestw. in SAS help, though I know from SAS help that the d can be omitted. I have been playing around with using just best., and I am wondering if there is a default w assigned when using best. on its own.
All formats have a default w. It is generally not good practice to use best. (or any <format>. without a width) in most cases, as you should know and plan for the specific width needed, but a default always exists.
Also, informats and formats often have different defaults, even in cases where an informat and a format share the same name.
In the case of bestw., the default width is 12. See this documentation page for details.
I always find it's worth using a worked example. This shows the different outcomes when using widths on the BEST. format:
data _NULL_;
  a=1031564321300.302;
  put '==================================';
  put 'Different "BEST" formats';
  put '==================================';
  put 'BEST8. - ' a best8.;
  put 'BEST12. - ' a best12.;
  put 'BEST13. - ' a best13.;
  put '==================================';
  put 'BEST. - ' a best.;
  put '==================================';
run;
You can run this in your environment and check the outcome. On my machine it looks like this:
==================================
Different "BEST" formats
==================================
BEST8. - 1.032E12
BEST12. - 1.0315643E12
BEST13. - 1031564321300
==================================
BEST. - 1.0315643E12
==================================
i.e. it looks like BEST12. is the matching format when no width is specified.
