Should I adjust the weights of the embeddings of newly added tokens? - huggingface-transformers

I'm a beginner in natural language processing. Recently, I have been trying to train a text generation model based on GPT-2 with Hugging Face Transformers. I added some new tokens to the tokenizer and resized the model's embeddings with model.resize_token_embeddings(len(tokenizer)). Suppose I added 6 new tokens; should I add the weights of those 6 tokens to the optimizer? How should I do it? Thank you very much!

Just call the resize_token_embeddings function:
from transformers import AutoModelForCausalLM, AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')
gpt2_model = AutoModelForCausalLM.from_pretrained('gpt2')

ATTR_TO_SPECIAL_TOKEN = {'additional_special_tokens': ['SPEC1', 'SPEC2']}
orig_num_tokens = len(gpt2_tokenizer)
num_added_tokens = gpt2_tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)  # doesn't add them if they are already there
if num_added_tokens > 0:
    gpt2_model.resize_token_embeddings(new_num_tokens=orig_num_tokens + num_added_tokens)
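As for the optimizer part of the question: if you build the optimizer after resizing and pass it model.parameters(), the enlarged embedding matrix (including the rows for the 6 new tokens) is included automatically, so there is nothing extra to register. A minimal sketch, assuming you fine-tune the whole model with AdamW (the optimizer choice is an assumption, not part of the original answer):
import torch

# Build the optimizer *after* resizing, so the new embedding rows are
# already part of model.parameters() and get updated like any other weight.
optimizer = torch.optim.AdamW(gpt2_model.parameters(), lr=5e-5)
Note that resize_token_embeddings initializes the new rows from the model's default weight initialization, so the new tokens only become meaningful after some training.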

Related

Set the name for each ParallelFor iteration in KFP v2 on Vertex AI

I am currently using kfp.dsl.ParallelFor to train 300 models. It looks something like this:
...
models_to_train_op = get_models()
with dsl.ParallelFor(models_to_train_op.outputs["data"], parallelism=100) as item:
    prepare_data_op = prepare_data(item)
    train_model_op = train_model(prepare_data_op.outputs["train_data"])
...
Currently, the iterations in Vertex AI are labeled in a dropdown as something like for-loop-worker-0, for-loop-worker-1, and so on. For tasks (like prepare_data_op), there's a method called set_display_name. Is there a similar method that allows you to set the iteration name? It would be helpful to relate the iterations to the training data so that it's easier to look through the dropdown UI that Vertex AI provides.
I reached out to a contact I have at Google. They suggested passing the corresponding element of the list that is fed to ParallelFor to set_display_name for each 'iteration' of the loop. When the pipeline is compiled, it will know to set the corresponding iteration name.
# Create a component that returns a range list
model_list_op = model_list(n_models)

# Parallelize jobs
with dsl.ParallelFor(model_list_op.outputs["model_list"], parallelism=100) as x:
    x.set_display_name(str(model_list_op.outputs["model_list"]))
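A hedged variant of the same idea applied to the pipeline from the question: name the per-iteration training task from the loop item. prepare_data and train_model are the components from the question, and whether the item placeholder resolves into a readable label depends on your KFP/Vertex AI version, so treat this as a sketch rather than a confirmed recipe:
with dsl.ParallelFor(model_list_op.outputs["model_list"], parallelism=100) as item:
    prepare_data_op = prepare_data(item)
    train_model_op = train_model(prepare_data_op.outputs["train_data"])
    # set_display_name renames the task shown in the Vertex AI dropdown;
    # the loop item is interpolated into the label when the pipeline is compiled.
    train_model_op.set_display_name(f"train-model-{item}")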

How to remove input from generated text in GPTNeo?

I'm writing a program to generate text...
I need to remove the input from the generated text. How can I do this?
The code:
input_ids = tokenizer(context, return_tensors="pt").input_ids
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
strs = tokenizer.batch_decode(gen_tokens)[0]
Here strs contains the input I've given...
How can I remove it?
The Transformers library does not provide a way to do it out of the box, but this is something you can achieve with one line of code:
strs = strs.replace(context, "")
This is actually what I'm doing behind my NLP Cloud API, as it uses Transformers under the hood.
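A string replace can miss if decoding normalizes whitespace or special characters in the prompt, so an alternative (a sketch reusing the variables from the question) is to slice the prompt off at the token level before decoding:
# Decode only the tokens generated after the prompt.
prompt_length = input_ids.shape[1]
generated_only = tokenizer.decode(gen_tokens[0, prompt_length:], skip_special_tokens=True)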

How can I extract, edit and replot a data matrix in Abaqus?

Good afternoon,
We've been working on an animal model (skull), applying a series of forces and evaluating the resulting stresses in Abaqus. We got some of those beautiful and colourful (blue-to-red) contour plots. Now, we'd like to obtain a similar image but coloured by a new matrix, which will be the result of some mathematical transformations.
So, how can I extract the data matrix used to set those colour patterns (I guess with X-, Y-, Z-, and von Mises values or so), apply my transformation, and replot the data to get a new (comparable) figure with the new values?
Thanks a lot and have a great day!
I've never done it myself but I know that this is possible. You can start with the documentation (e.g. here and here).
After experimenting with the GUI, you can check the corresponding Python code, which should be automatically recorded in the abaqus.rpy file in your working directory (or in C:\temp). Working it through, you could get something like:
myodb = session.openOdb('my_fem.odb')  # or session.odbs['my_fem.odb'] if it is already loaded into the session

# Define a temporary step for accessing your transformed output
tempStep = myodb.Step(name='TempStep', description='', domain=TIME, timePeriod=1.0)

# Define a temporary frame to store your transformed output
tempFrame = tempStep.Frame(frameId=0, frameValue=0.0, description='TempFrame')

# Define a new field output
s1f2_S = myodb.steps['Step-1'].frames[2].fieldOutputs['S']  # stress tensor at the second frame of the 'Step-1' step
s1f1_S = myodb.steps['Step-1'].frames[1].fieldOutputs['S']  # stress tensor at the first frame of the 'Step-1' step
tmpField = s1f2_S - s1f1_S

userField = tempFrame.FieldOutput(
    name='Field-1', description='s1f2_S - s1f1_S', field=tmpField
)
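If the transformation you need is not a frame difference, field outputs also support scalar extraction and basic arithmetic. A hedged sketch (getScalarField and the MISES invariant come from the Abaqus Scripting Reference, but verify the exact names for your version):
# Extract the von Mises scalar field and apply an example transformation.
s1f2_mises = s1f2_S.getScalarField(invariant=MISES)
myTransformedField = s1f2_mises * 2.0  # replace with your own transformation
userField2 = tempFrame.FieldOutput(
    name='Field-2', description='transformed Mises', field=myTransformedField
)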
Now, to display your new field output using Python, you can do the following:
session.viewports['Viewport: 1'].odbDisplay.setFrame(
    step='TempStep', frame=0
)
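To actually contour-plot the new field, you can also select it as the primary variable and switch the viewport to a contour plot. This is a hedged sketch; setPrimaryVariable and CONTOURS_ON_DEF are taken from the Abaqus Scripting Reference, so double-check them against your Abaqus version:
viewport = session.viewports['Viewport: 1']
# Point the display at the user-defined field and show it as a contour plot
viewport.odbDisplay.setPrimaryVariable(field=userField, outputPosition=INTEGRATION_POINT)
viewport.odbDisplay.display.setValues(plotState=(CONTOURS_ON_DEF,))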
For more information on the methods and objects used, you can consult the "Abaqus Scripting Reference Guide":
Step(): Odb commands -> OdbStep object -> Step();
Frame(): Odb commands -> OdbFrame object -> Frame();
FieldOutput object: Odb commands -> FieldOutput object;

Any AR.js multimarkers learning tutorial?

I have been searching for an AR.js multimarkers tutorial or anything that explains it, but all I can find is two examples with no tutorials or explanations.
So far, I understand that it requires learning the pattern or order of the markers, which is then stored in localStorage. This data is used later to display the image.
What I don't understand is how this "learner" is implemented. Also, the learning process is only used once by the "creator", right? The output file should be stored and then served later when needed, not created from scratch on each person's phone or computer.
Any help is appreciated.
Since the question is mostly about the learner page, I'll try to break it down as much as I can:
1) You need to have an array of {type, URL} objects.
A sample of creating the default array is shown below (source code):
var markersControlsParameters = [
    {
        type : 'pattern',
        patternUrl : 'examples/marker-training/examples/pattern-files/pattern-hiro.patt',
    },
    {
        type : 'pattern',
        patternUrl : 'examples/marker-training/examples/pattern-files/pattern-kanji.patt',
    }
]
2) You need to feed this to the 'learner' object.
By default, the above object is encoded into the URL (source) and then decoded by the learner site. What is important happens on the site:
for each object in the array, an ArMarkerControls object is created and stored:
// array.forEach(function(markerParams){
var markerRoot = new THREE.Group()
scene.add(markerRoot)
// create markerControls for our markerRoot
var markerControls = new THREEx.ArMarkerControls(arToolkitContext, markerRoot, markerParams)
subMarkersControls.push(markerControls)
The subMarkersControls array is used to create the object that does the learning. At long last:
var multiMarkerLearning = new THREEx.ArMultiMakersLearning(arToolkitContext, subMarkersControls)
The example learner site has multiple utility functions, but as far as I know, the most important here are the ArMultiMakersLearning members, which can be used in the following order (or any other):
// this method resets previously collected statistics
multiMarkerLearning.resetStats()
// this member flag enables data collection
multiMarkerLearning.enabled = true
// this member flag stops data collection
multiMarkerLearning.enabled = false
// To obtain the 'learned' data, simply call .toJSON()
var jsonString = multiMarkerLearning.toJSON()
That's all. If you store the jsonString as
localStorage.setItem('ARjsMultiMarkerFile', jsonString);
then it will be used as the default multimarker file later on. If you want a custom name or more areas, then you'll have to modify the name in the source code.
3) 2.1.4 debugUI
It seems that the debug UI is broken: the UI buttons do exist but are nowhere to be seen. A hot fix would be to use the 'markersAreaEnabled' span style for the div containing the buttons (see this source bit).
It's all in this glitch, you can find it under the phrase 'CHANGES HERE' in the arjs code.

How to parse a list of sentences?

I want to parse a list of sentences with the Stanford NLP parser.
My list is an ArrayList; how can I parse the whole list with LexicalizedParser?
I want to get from each sentence this form:
Tree parse = (Tree) lp1.apply(sentence);
Although one can dig into the documentation, I am going to provide code here on SO, especially since links move and/or die. This particular answer uses the whole pipeline; if you are not interested in the whole pipeline, an alternative is provided further below.
The example below is the complete way of using the Stanford pipeline. If you are not interested in coreference resolution, remove dcoref from the third line of code. In the example below, the pipeline does the sentence splitting for you (the ssplit annotator) if you just feed it a body of text (the text variable). Have just one sentence? That is fine; you can feed it in as the text variable.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }

    // this is the parse tree of the current sentence
    Tree tree = sentence.get(TreeAnnotation.class);

    // this is the Stanford dependency graph of the current sentence
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
Actually, the documentation from Stanford NLP provides a sample of how to parse sentences.
You can find the documentation here.
So as promised, if you don't want to access the full Stanford pipeline (although I believe that is the recommended approach), you can work with the LexicalizedParser class directly. In this case, you would download the latest version of the Stanford Parser (whereas the other approach uses the CoreNLP tools). Make sure that, in addition to the parser jar, you have the model file for the appropriate parser you want to work with. Example code:
LexicalizedParser lp1 = new LexicalizedParser("englishPCFG.ser.gz", new Options());
String sentence = "It is a fine day today";
Tree parse = lp1.parse(sentence);
Note this works for version 3.3.1 of the parser.
