With Gensim < 4.0, we can retrain a word2vec model using the following code:
model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)
However, what I understand is that Gensim 4.0 is no longer supporting Word2Vec.load_word2vec_format. Instead, I can only load the keyedVectors.
How to fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?
I don't think that code would've ever have worked in Gensim versions before 4.0. A plain list-of-word-vectors, like GoogleNews-vectors-negative300.bin, does not (& never has) had enough info to continue training.
It's missing the hidden-to-output layer weights & word-frequency info essential for training.
Looking at past source code, as of release 1.0.0 (February 2017), that code wouldn've already given a deprecation-error with a pointer to the method for loading a plain set-of-word-vectors - to address people with the mistaken notion that could work – and raised other errors on any attempts to train() such a model. (Pre-1.0.0, docs also warned that this would not work, & would have failed with a less-helpful error.)
As one of those errors mentioned, there has at times been experimental support for loading some of a prior set-of-word-vectors to clobber any words in an existing model's already-initialized vocabulary, via .intersect_word2vec_format(). But by default that both (1) locks the imported vectors against further change; (2) brings in no new words. That's unlike what people most often want from "fine-tuning", so it's not a ready-made help for that goal.
I believe some people have cobbled together custom code to achieve various kinds of fine-tuning in their projects – but I don't know of anyone who's published a reliable recipe or strong results. (And I suspect some of the people who think they're doing this well just haven't rigorously evaluated the steps they are taking.)
If you have any recipe you know worked pre-Gensim-4.0.0, it should be adaptable - 4.0 changes to the Word2Vec-related classes were mainly refactorings, optimizations, & new options (with little-to-none removal of functionality). But a reliable description of what used-to-work, or which particular fine-tuning strategy is being pursued for what specific benefits, to make more specific recommendations.
Related
I am trying to use the Stanford Shift Reduce Parser with the Spanish model supplied. I am noticing, however, that unlike the Lexicalized Parser, I cannot get the TypedDependencies, despite sending the adequate flag -outputFormat typedDependencies, as it can be seen in lexparser.bat/sh.
Just in case, this is the Java code I'm using to pass the flags and creating the parser.
ShiftReduceParser model = ShiftReduceParser.loadModel(modelPath);
model.setOptionFlags("-factored", "-outputFormat", "penn,typedDependencies");
ArrayList<TaggedWord> taggedWords = new ArrayList<TaggedWord>();
Thank you
The problem here is not the ShiftReduceParser, but simply that we don't currently support typed dependencies for Spanish currently - we only have them for English and Chinese.
(Looking ahead, the most likely thing to appear first is support for Universal Dependencies in the Neural Network Dependency Parser. Indeed, you could probably train such a model yourself now.)
So iOS 6 deprecates presentModalViewController:animated: and dismissModalViewControllerAnimated:, and it replaces them with presentViewController:animated:completion: and dismissViewControllerAnimated:completion:, respectively. I suppose I could use find-replace to update my app, although it would be awkward with the present* methods, since the controller to be presented is different every time. I know I could handle that situation with a regex, but I don't feel comfortable enough with regex to try using it with my 1000+-files-big app.
So I'm wondering: Does Xcode have some magic "update deprecated methods" command or something? I mean, I've described my particular situation above, but in general, deprecations come around with every OS release. Is there a better way to update an app than simply to use find-replace?
You might be interested in Program Transformation Systems.
These are tools that can automatically modify source code, using pattern-directed source-to-source transformations ("if you see this source-level pattern, replace it by that source-level pattern") that operate on code structures rather than text. Done properly, these transformations can be reliable and semantically correct, and they're a lot easier to write than low-level procedural code that navigates and smashes nanoscopic actual tree structures.
It is not the case that using such tools is easy; such tools have to know how to parse the language of interest into compiler data structures, (e.g., ObjectiveC), process the patterns, and regenerate compilable source code from the modified structures. Even with the basic transformation engine, somebody needs to carefully define parsers (and unparsers!) for the dialects of the languages of interest. And it takes time to learn how to use such a even if you have such parsers/unparsers. This is worth it if the changes you need to make are "regular" (in the program transformation sense, not the regexp sense) and widespread (as yours seem to be).
Our DMS Software Reengineering toolkit has an ObjectiveC front end, and can carry out such transformations.
no there is no magic like that
Being somewhat new to Ruby I'm exploring existing libraries to do what I'd normally do in other scripting languages, and I'm a bit stumped by the localization libraries that might be available for something built on top of Sinatra/Sequel (Rails/AR being a bit too opinionated to my taste).
Now, I ran into a couple (i18n, r18n, GetText) though this wiki page, and there apparently is an extra library used in Padrino (based on the i18n thing from Rails?); and apparently plenty more.
Except for the obvious (i.e. GetText mo/po style vs yml files), I'm somewhat confused as to how these options might be different. The wiki doesn't point to much in that respect except saying the that they exist; not how they're different.
Adding to this confusion is the fact that essentially every piece of documentation seems to cover a single one of them (and typically in a RoR context). Moreover, these options don't look entirely incompatible with one another on closer inspection -- in the sense that, if I understood this properly, they can understand each other's files to a large extent.
Might anyone here be able to give a quick and to the point explanation/overview of these libraries, and outline the difference between the them? Some pointers on performance would also be welcome, if you're aware of any (besides the ones from the fast_gettext docs, which made little sense considering my lack of understanding the difference between these options).
I can see how this situation is confusing without knowing some of the history of i18n/l10n libraries in Ruby. I should probably write a few words up on that, but for now I'll try to give an overview from my perspective:
Gettext is obviously the oldest player in this game and it inherits both strengths and weaknesses from its ancestry which is being invented for a C dominated world. It has most features one needs, comes with some tools support that others lack (like desktop po file editors) and is widely accepted in the so called enterprise world.
Gettext as such defines an API and there are basically two libraries that implement it in the Ruby world, the traditional Ruby Gettext packages by Masao Mutoh and the fast_gettext gem by Michael Grosser.
Ruby Gettext is quite powerful and ships a lot of features that you may or may not need. The fast_gettext gem on the other hand focusses on raw speed and is implemented as a shiny, modern code-style Ruby library that is easily hackable and the author is a very smart and supportive person. Out of the two I'd personally strongly recommend fast_gettext.
The I18n gem is the result of the joint effort of various Ruby i18n/l10n solutions that existed a few years ago and that all strived to supersede Gettext for various reasons at that point of time. The resulting I18n API is basically covering the requirements and usecases of all the i18n/l10n solutions involved at that time, including the API of Gettext. So, today's Ruby I18n API is a superset of Gettext's API from the early 90s.
Today the I18n gem is the official solution that is shipped with Ruby on Rails, but it is also the probably most popular one in the Ruby world in general.
The I18n gem also makes it very easy to extend the featureset and add things like caching, other storage mechanisms (like Gettext po files, database tables, key-value stores; storage defaults to plain Ruby files and YAML) etc. and it ships with a number of modules for that (but external or custom modules can easily be crafted, tested and integrated).
There are translation files for 70+ languages (locales) for strings used by Ruby on Rails (which are useful in other projects, too) maintained by the community.
I can not tell much about R18n except that it was invented right after I18n hit its first release and as far as I remember it originated from the Merb community. It seems to be rather strong in the Russian Ruby world, but I might be wrong with all these assertions.
So, unless you have a very good reason to pick any other solution I'd strongly recommend using I18n.
Then on the other hand that means nothing because I've been leading this project more or less since it was invented.
I hope this helps.
[EDIT] added links to various references
I18n is a main stream.
R18n is a alternative with some extra features (model translations, syntax sugar) and some difference in ideology and architecture (flex extensibility by powerful filters).
G18n need to add model transtions to I18n.
Padrino is not a i18n library, it is just Sinatra framework with build-in I18n.
Gettext is IMHO old conception with very ugly format and problem with pluralization. Anyway, it isn’t popular in Ruby community.
First:
as svenfuchs wrote, I18n is a framework that provides modules for many translation and internationalisation approaches.
'gettext' is just one of many modules.
So there is really no question to use I18n.
The default setup of a Rails application is to use I18n with the YAML backend and I understand part of your question to compare that backend with other ones.
IMHO there are two major differences between the gettext and YAML based approaches:
life cycle support
hierarchy
gettext
One idea of gettext is, that translating an app is not a singular event but a life cycle process.
It is build to support this live cycle.
gettext is designed to use plain english as the keys for the translations. So the idea is to write the app in english and mark all text that is to be translated, typically by wrapping it with _().
As a result, the app source code is easily readable in english.
Then a programm scans all source code and extracts the textes to be translated and builds a repository (the .pot file) of these textes.
In the next step, and here comes the live cycle, the repository is merged with existing translations (.po files, one for each target language) and new or changed items are marked.
Mature editors support the translators by focusing on the new and changed items. Additionally project specific dictionaries can support partial automatic translations.
gettext is flat, meaning that each key phrase is translates exactly once in the translation files. There is no hierarchy. But there is context. In the translation files, all the source code positions of a key phrase are listed. An editor with access to the source code can display the source along with the translation (and some do).
Finally, .po files are translated to machine readable fast access forms (can be .mo, the classic standard, or a database or json or …)
YAML
YAML an the other hand is hierarchical so it’s easy to have variations of translations in different contexts.
I18n uses this structure to support scopes and uses the current file path as scope when using keys starting with a dot.
There is no information, where a key is used in the project (well unless auto scoped, but the key may be used in other places explicitly).
There is no information, whether there are any changes.
Unless your IDE supports you, the developer has to find the right place to put a key in the YAML and searching the usage can be cumbersome.
A lot more is said in the other answers.
I18n
I intentionally said YAML and not I18n, because I18n is a framework for internationalization (not only translation), and YAML is only one possible backend.
Plural support in I18n differs from plural support of vanilla gettext. I don’t have experience how they cooperate.
Examples
gettext with positional parameters:
sprintf(
_('Do you really want to delete tour %1$s_%2$s? Only empty tours can be deleted!'),
tag, idx)
translations are text files, but PO-Editors provide GUIs:
#: js/addDelRow.js:15
msgid "" "Do you really want to delete tour %1$s_%2$s? Only empty tours can be deleted!"
msgstr "" "Wollen sie die Spalte %1$s_%2$s wirklich löschen? Nur leere Spalten können "
"gelöscht werden."
YAML with parameters:
Source
<%= t('.checked_at', ts: l(checked_at), user: full_name) %>
translation
from
en:
hotels:
form:
checked_at: „set to checked by %{user} on %{ts}“
to
de:
hotels:
form:
checked_at: "geprüft gesetzt am %{ts} von %{user}“
Conclusion
YAML is much easier to start with, especially if you have support by an IDE.
Vanilla RAILS has it built in.
The is no native language. The first translation can be any language.
With growing projects and multiple laguages, my YAML files tend to repetition (same translation scattered over the hierarchy) and tracing of changes and therefore new translations is cumbersome.
gettext needs an extra toolchain and therefore a more difficult setup.
It supports the whole life cycle of continous translation of developing apps.
It is based on english source code.
I usually use the best parts of both, using YAML for internationalisation (number and date format, maybe model names?) and gettext for translation.
Andrey's response as to point me back to the r18n docs, which basically break it down to a single line:
R18n uses hierarchical, not English-centric, YAML format for translations by default.
Found this slideshare from Andrey. It's in Russian, but it's making a lot more sense now (slides 7 to 9 in particular clear-cut differences between i18n and r18n):
http://www.slideshare.net/iskin/r18n
I am an engineering student, and deciding upon my final year project.
One of the many candidates is an online UML tool with code generation facilities. But I did not take compiler designing classes, so I am not much aware of the code generation techniques.
I want to know about the techniques that I should look to study in order to build something like this. If these techniques are as complicated as writing a compiler, then perhaps I will have to abandon this idea.
Compilation is really the opposite of the kind of code generation you are describing, so I don't think you need to know how to write a compiler.
Code generation can be as simple as combining text strings or using templates, or as complex as using Reflection.Emit to create classes at runtime.
I would start with this Wikipedia article.
The creation of an UML tool is a long term project. You need many to acquire different expertises which can not be known by just one member of the team.
Your academic project is too ambitious.
An easy project which has never been done is to generate code from an activity or state diagram. You should not try to recreate the graphical editor because this is very very complex but only to take the xmi export and generate code from it using a xml parser. This would be a good 6 months project for your thesis :-)
Most UML tools generate source code. The generation is normally quite a bit simpler than a compiler as well. For example, a class diagram will have a collection of data structures representing classes and links between those classes (inheritance). To generate output, you walk through the class objects, and for each you "print" out a representation of that object in the syntax of the target language.
I'm not sure exactly what capabilities your code generation will require, but the UML tools that I have used are not very sophisticated in their code generation.
Tools that I have used simply create files and drop your function names into them with arguments derived from the inputs. This would not require any understanding of compilers. Most of the difficulty would be in the user interface and how you store the data to make code generation easy.
You can just find that here:
http://yuml.me and http://askuml.com
As a pet project, I was thinking about writing a program to migrate applications written in a language A into a language B.
A and B would be object-oriented languages. I suppose it is a very hard task : mapping language constructs that are alike is doable, but mapping libraries concepts will be a very long task.
I was wondering what tools to use, I know this has to do with compilation, but I'm a bit afraid to use Lex and Yacc and all that stuff.
I was thinking of maybe using the Eclipse Modeling Framework, which would help me write models (of application code) transformations in a readable form.
But first I would have to write parsers for creating the models (and also create the metamodel from the language grammar).
Are there tools that exist that would make my task easier?
You can use special transformation tools/languages for that TXL or Stratego/XT.
Also you can have a look and easily try Java to Python and Java to Tcl migrating projects made by me with TXL.
You are right about mapping library concepts. It is rather hard and long task. There are two ways here:
Fully migrate the class library from language A to B
Migrate classes/functions from language A to the corresponding concepts in language B
The approach you will choose depends on your goals and time/resources available. Also in many cases you wont be doing a general A->B migration which will cover all possible cases, you will need just to convert some project/library/etc. so you will see in your particular cases what is better to do with classes/libraries.
I think this is almost impossibly hard, especially as a personal project. But if you are going to do it, don't make life even more difficult for yourself by trying to come up with a general solution. Choose two specific real-life programming languages ind investigate the possibities of converting between them. I think you will be shocked by the number of problems and issues this will expose.
There are some tools for direct migration for some combinations of A and B.
There are a variety of reverse engineering and code generation tools for different languages and platforms. It's fairly rare to see reverse engineering tools which capture all the semantics of the source language, and the semantics of UML are not well defined ( since it's designed to map to different implementation languages, it itself doesn't define a complete execution model for its behavioural representations ), so you're unlikely to be able to reverse engineer and generate code between tools. You may find one tool that does full reverse engineering and full code generation for your A and B languages, and so may be able to get somewhere.
In general you don't use the same idioms on different platforms, so you're more likely to get something which emulates A code on B rather than something which corresponds to a native B solution.
If you want to use Java as the source language(that language you try to convert) than you might use Checkstyle AST(its used to write Rules). It gives you tree structure with every operation done in the source code. This will be much more easier than writing your own paser or using regex.
You can run com.puppycrawl.tools.checkstyle.gui.Main from checkstyle-4.4.jar to launch Swing GUI that parse Java Source Code.
Based on your comment
I'm not sure yet, but I think the source language/framework would be Java/Swing and the target some RIA language like Flex or a Javascript/Ajax framework. – Alain Michel 3 hours ago
Google Web Toolkit might be worth a look.
See this answer: What kinds of patterns could I enforce on the code to make it easier to translate to another programming language?