gensim w2v - additional file - gensim

I trained w2v on a rather big corpus (> 200 million sentences) and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was loaded successfully, read all the .npy files without any exceptions, and the obtained model performed OK.
Now I have retrained the model on a much bigger corpus (> 1 billion sentences). The same three files were automatically saved, as expected.
When I try to load my new retrained model:
w2v_model = Word2Vec.load(path_filename)
I get:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/...../w2v_US.model.trainables.vectors_lockf.npy'
But no .npy file with that suffix was saved by gensim at the end of training
(I save all output files in the same directory, as required).
What should I do to obtain such a file as part of the output .npy files (maybe some option in gensim w2v when training)? Maybe there are other ways to overcome this issue?

If a .save() is creating any files with the word trainables in them, you're using an older version of Gensim. Any new training should definitely prefer a current version; as of now (January 2022), that's gensim-4.1.2, released September 2021.
If an attempt at a .load() generated that particular error, then that file should have been created, alongside the others you mention, when the .save() was done. (In fact, the only way the main file you named with path_filename could know that other filename is if that other file was written successfully, allowing the main file to complete writing.)
Are you sure that file wasn't written, but then somehow left behind, perhaps getting deleted or not moved alongside the other files to some new filesystem path?
In general, I would suggest:
use the latest Gensim for any new training
always enable Python logging at the INFO level, and watch the logging/console output of the training/saving processes closely for confirmation of the expected activity/steps (a minimal setup is sketched just after this list)
keep all files from a .save() that begin with the same main filename (in your example above, w2v_US.model) together, and keep in mind that a larger model may produce a larger roster of files than a small test model
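For example, putting this near the top of your training/saving script turns on the INFO-level output Gensim emits:

import logging

logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)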
You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:
save aside all files of any potential use
from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, and exact same training corpus – repeat all the model-building steps you did before, up through the .build_vocab() step. (That is: no need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
use .save() to save that dummy model, watching the logs/output for errors, as sketched below. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that file away, rename it to be the file expected by the original model whose load failed, and place it alongside that original model; the .load() might then succeed, or fail in a different way.
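A rough sketch of those steps (the parameter values, the 'sentences' corpus iterator, and the filenames are placeholders; they must match your original run, under the same older Gensim version):

from gensim.models import Word2Vec

# Re-create the model with the *same* settings as the original training run
dummy = Word2Vec(size=300, window=5, min_count=5, workers=8)   # placeholders
dummy.build_vocab(sentences)   # same corpus iterator as before; no .train() needed
dummy.save('dummy.model')      # should also write dummy.model.trainables.vectors_lockf.npy

# Then copy that lockf .npy file next to the broken model, renamed to the
# filename the failed .load() was looking for, and retry Word2Vec.load().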
(If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)

Related

Can't load the pre-trained word2vec for the Korean language

I would like to download and load the pre-trained word2vec for analyzing Korean text.
I download the pre-trained word2vec here: https://drive.google.com/file/d/0B0ZXk88koS2KbDhXdWg1Q2RydlU/view?resourcekey=0-Dq9yyzwZxAqT3J02qvnFwg
from the GitHub repository "Pre-trained word vectors of 30+ languages": https://github.com/Kyubyong/wordvectors
My gensim version is 4.1.0, so I used
KeyedVectors.load_word2vec_format('./ko.bin', binary=False) to load the model, but I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I have already tried many options, including ones from Stack Overflow and GitHub, but it still does not work.
Would you mind letting me know the suitable solution?
Thanks,
While the page at https://github.com/Kyubyong/wordvectors isn't clear about the formats this author has chosen, their source code at...
https://github.com/Kyubyong/wordvectors/blob/master/make_wordvectors.py#L61
...shows it using the Gensim model .save() method.
Such saved models should be reloaded using the .load() class method of the same model class. For example, if a Word2Vec model was saved with...
model.save('language.bin')
...then it could be reloaded with...
loaded_model = Word2Vec.load('language.bin')
Note, though, that:
Models saved this way are often split over multiple files that should be kept together (all starting with the same root name), but I don't see those here.
This work appears to be about 5 years old and based on a pre-1.0 version of Gensim, so there might be issues loading the models directly into the latest Gensim. If you do run into such issues, and absolutely need to make these vectors work, you might need to temporarily use a prior version of Gensim to .load() the model. Then you could save the plain word vectors out with .save_word2vec_format() for later reloading across any version. (Or, re-save the whole model with .save() from the latest version that can still load it, and repeat that load-and-resave process with progressively newer versions until you reach current Gensim.)
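A sketch of that fallback, assuming you can find an older Gensim (roughly the 0.x/1.x era this repository used) that loads the file; note that in very old versions .save_word2vec_format() lives on the model itself rather than on .wv:

# Run this under the older Gensim version:
from gensim.models import Word2Vec

old_model = Word2Vec.load('./ko.bin')    # full-model load, not load_word2vec_format
old_model.wv.save_word2vec_format('./ko.vec', binary=False)   # plain-text word vectors

# Later, under any current Gensim version:
from gensim.models import KeyedVectors
ko_vectors = KeyedVectors.load_word2vec_format('./ko.vec', binary=False)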
But, you also might want to find a more recent & better-documented set of pretrained word-vectors.
For example, Facebook makes FastText pretrained vectors available in both a 'text' format and a 'bin' format for many languages at https://fasttext.cc/docs/en/pretrained-vectors.html (trained on Wikipedia only) or https://fasttext.cc/docs/en/crawl-vectors.html (trained on Wikipedia plus web crawl data).
The 'text' format should in fact be loadable with KeyedVectors.load_word2vec_format(filename, binary=False), but will only include full-word vectors. (It is also relatively easy to view as text, or to write simple code to massage into other formats.)
The 'bin' format is Facebook's own native FastText model format, and should be loadable with either the load_facebook_model() or load_facebook_vectors() utility methods. Then, the loaded model (or vectors) will be able to create the FastText algorithm's substring-based guesstimate vectors even for many words that weren't in the model or training data.
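For example (the filenames below are just the typical names of the Korean downloads from those pages; adjust to whatever you actually download):

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# 'bin' format: Facebook's native FastText model, including subword information
ft_vectors = load_facebook_vectors('cc.ko.300.bin')
print(ft_vectors.most_similar('컴퓨터', topn=3))   # works even for out-of-vocabulary words

# 'text' format: plain full-word vectors only
plain_vectors = KeyedVectors.load_word2vec_format('cc.ko.300.vec', binary=False)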

Nvidia Digits accuracy and loss plots data

I trained my model in Nvidia DIGITS 5 and I would now like to extract the accuracy and loss plots that were generated during training for a report. Is this data saved somewhere so that it would be possible to extract it and plot it in Python, and perhaps ultimately modify the plots to compare different models, etc.?
The best solution I have found is either to look at the HTML file or to scan the text file caffe_output.log that is produced by Caffe. The text file is usually stored in /var/digits/jobs/insert_your_job_id/, but on Linux systems you can also just run:
locate caffe_output.log
Go to your DIGITS job folder and locate your job's subfolder. Inside you'll find a file status.pickle, which is a pickled object containing all your job's information.
You can load it in python like so:
import digits
import pickle
data = pickle.load(open('status.pickle','rb'))
This object is somewhat generic and may contain multiple tasks. For a typical classification task there will likely be just one, but you will still need to access it via data.tasks[0]. From there you can grab the plots:
data.tasks[0].combined_graph_data()
which returns a somewhat convoluted dict (unfortunately, since your network can produce many accuracy/loss outputs, as well as custom ones). It contains everything you need, though; I managed to plot accuracy with:
plt.plot( data.tasks[0].combined_graph_data()['columns'][2][1:] )
but it's likely that you'll have to write a bit of custom code. As always, dir() is your friend.
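For instance, a slightly fuller version of that snippet, with the imports spelled out; the column index and the assumption that the first element of each column is its label are taken from the dict layout described above, so verify them against your own job:

import pickle
import matplotlib.pyplot as plt

import digits  # make sure the DIGITS package is importable so the pickle can be loaded

with open('status.pickle', 'rb') as f:
    data = pickle.load(f)

graph = data.tasks[0].combined_graph_data()
column = graph['columns'][2]                 # same column as in the snippet above
plt.plot(column[1:], label=str(column[0]))   # element 0 appears to be the series label
plt.xlabel('training progress')
plt.legend()
plt.show()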

How can I resolve MSI paths in VBScript?

I am looking to resolve the paths of files in an MSI, with or without installing it (whichever is faster), from outside the MSI, using VBScript.
I found a similar query using C#, and Christopher had provided a solution there: How can I resolve MSI paths in C#?
I am going through the very same pain now, but is there any way to achieve this using the WindowsInstaller object in VBScript, rather than going back and forth with endless queries through the MSI's SQL tables? Any direction would be welcome, as I have tried and tested whatever I can with very limited success.
Yes, there is a solution without installing the MSI, using VBScript.
There is a very good example in the Windows Installer SDK called "WiFilVer.vbs".
Using that example, I've thrown together a quick example script that does exactly what you need:
set installer = CreateObject("WindowsInstaller.Installer")
const READONLY = 0
set db = installer.OpenDataBase("<FULL PATH TO YOUR MSI>", READONLY)
set session = installer.OpenPackage(db, READONLY)
session.DoAction("CostInitialize")
session.DoAction("CostFinalize")
set view = db.OpenView("SELECT File, Directory_, FileName, Component_, Component FROM File,Component WHERE Component=Component_ ORDER BY Directory_")
view.Execute
set record = view.Fetch
do until record is nothing
    file = record.StringData(1)
    directoryName = record.StringData(2)
    fileName = record.StringData(3)
    if instr(fileName, "|") then fileName = split(fileName, "|")(1)
    wsh.echo(session.TargetPath(directoryName) & fileName)
    set record = view.Fetch
loop
Just add the path to your MSI file.
Tell me if you need a more detailed answer; I will have some more time to answer this in detail this evening.
EDIT: the promised background (and why I need to call CostFinalize)
Naveen, actually MSDN is the only resource that can give a definitive answer on this, but you need to know where and how to look, since Windows Installer is IMHO a pretty complex topic.
I really recommend a mix of the MSDN installer function reference, the database reference, and the examples from the Windows Installer SDK (sorry, couldn't find a download link; I think it's hidden somewhere in the roughly 3 GB Windows SDK).
First, you need some general knowledge of MSIs:
An MSI is actually a relational database.
Everything is stored in tables that relate to each other.
(Actually not everything, but I will try to keep it simple. ;))
This database is interpreted by the Windows Installer,
which creates a 'session'.
Also, some parts are resolved dynamically, depending on the system you install the MSI on,
like 'special' folders, similar to environment variables.
E.g. an MSI has a "ProgramFilesFolder" where Windows generally has %ProgramFiles%.
All the dynamic stuff exists only in the installer session, not in the database itself.
In your case there are three tables you need to look at, taking care of their relations and resolving them:
the 'File' table contains all the files, the 'Component' table tells you which file goes into which directory, and the 'Directory' table contains all the information about the filesystem structure.
Using a SQL query, I could join the Component and File tables to find the directory name (or primary key, in database jargon).
But the Directory table has relations within itself; it is structured like a tree.
Take a look at this example Directory table (taken from the InstEd MSI).
The columns are Directory, Directory_Parent and DefaultDir:
InstEdAllUseAppDat InstEdAppData InstEd
INSTALLDIR InstEdPF InstEd
CUBDIR INSTALLDIR hkyb3vcm|Validation
InstEdAppData CommonAppDataFolder instedit.com
CommonAppDataFolder TARGETDIR .
TARGETDIR SourceDir
InstEdPF ProgramFilesFolder instedit.com
ProgramFilesFolder TARGETDIR .
ProgramMenuFolder TARGETDIR .
SendToFolder TARGETDIR .
WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2 TARGETDIR Win
SystemFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2 WindowsFolder_x86_VC.1DEE2A86_2F57_3629_8107_A71DBB4DBED2 System
The Directory_Parent links a row to its parent directory; the DefaultDir column contains the actual name.
You could now resolve the tree yourself and replace all the special folders (which in VBScript would be very tedious; a rough sketch of that manual approach follows below, for comparison)...
...or let the Windows Installer handle it (just like when installing an MSI).
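Just to illustrate what "resolve the tree yourself" means (this is not part of the original answer), here is a rough Python sketch using the standard-library msilib module (Windows-only, deprecated/removed in the newest Python versions). It only walks Directory/Directory_Parent/DefaultDir and does not substitute the special folders the way CostFinalize would:

import msilib  # Windows-only standard library module

db = msilib.OpenDatabase(r"C:\path\to\your.msi", msilib.MSIDBOPEN_READONLY)  # placeholder path
view = db.OpenView("SELECT Directory, Directory_Parent, DefaultDir FROM Directory")
view.Execute(None)

parents = {}   # Directory -> Directory_Parent
names = {}     # Directory -> long name from DefaultDir

while True:
    try:
        rec = view.Fetch()
    except Exception:      # some Python versions raise when the rows run out...
        break
    if rec is None:        # ...newer ones return None instead
        break
    key = rec.GetString(1)
    parents[key] = rec.GetString(2)
    names[key] = rec.GetString(3).split("|")[-1]   # DefaultDir may be "short|long"

def resolve(key):
    # Walk Directory_Parent up to the root; TARGETDIR has no parent.
    if not key or key not in parents or not parents[key]:
        return ""
    return resolve(parents[key]) + "\\" + names[key]

for key in sorted(parents):
    print(key, "->", resolve(key) or "(root)")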
Now I have to introduce a new thing: actions (and sequences).
When running (installing, removing, repairing) an MSI, a defined list of actions is performed.
Some actions just collect information; some change the actual database.
There are lists of actions (called sequences) for the various things an MSI can do,
like one sequence for installing (called InstallExecuteSequence), one for collecting information from the user (the UI of the MSI: InstallUISequence) and one for administrative installations (AdminExecuteSequence).
In our case we don't want to run a whole sequence (which could alter the system or just take too long);
luckily, the Windows Installer lets us run single actions without running a whole sequence.
Reading the reference for the Directory table on MSDN (the Remarks section), you can see which action you need:
Directory resolution is performed during the CostFinalize action
So, putting all this together, the script is easier to read:
* open the MSI file
* 'parse' it (providing the session)
* query the Component and File tables
* run the CostFinalize action to resolve the Directory table (without running the whole MSI)
* get the resolved path with the TargetPath function
BTW, I found the TargetPath function by browsing the Installer reference on MSDN.
Also, I just noticed that CostInitialize is not actually required; it is only needed if you want to get the source path of a file.
I hope this makes everything clearer. It is very hard to explain, since it took me about half a year to understand it myself. ;)
And regarding PhilmE's answer:
Yes, there are more influences on the resolution of the Directory table, like custom actions.
Keeping that in mind, an administrative installation might also result in different directories (e.g. because a different sequence might hold different custom actions).
Components have conditions, so maybe a file is not installed at all.
I'm pretty sure InstEd doesn't take custom actions into account either.
So yes, there is no 100% solution. Maybe a mix of everything is necessary.
The script given by weberik (derived from MS SDK VB code) is very good, because it makes it easy to analyse the Directory table without writing your own algorithm (which is a mid-size effort with a loop or a recursive routine).
But it does not give a 100% perfect view for all files; see below.
The method of the script is semi-dynamic (it can be extended with other actions), but in effect it gives only the static directory structure, similar to a default administrative install or to advanced MSI viewers.
Normally this is enough and is what we want.
But be aware that this is not a 100% solution (knowing in advance the exact final path of each file). That means it will not give you the correct paths in some cases:
You use command line parameters which substitute predefined directory table entries.
The MSI uses custom actions which change paths.
In particular, it is not guaranteed that every file is installed at all. There may be optional conditions and features, and this may depend on the install environment.
In effect, a 100% solution is very, very hard to achieve without really installing; it comes close to reprogramming nearly the whole Windows Installer engine. So simplifications are normally sufficient and accepted.
But you can extend the method to cover custom actions, e.g. by adding a line session.DoAction(..) for each additional action needed, or to include command line parameters.
Sometimes it can be tricky. The simpler the structure of the MSI, the more likely you are to succeed without further effort.
Alternative to writing your own program:
The question is what you really want to find out, and whether it is really necessary to program it.
If you don't want to write an automated, every-day MSI analyzer, maybe the following is sufficient for you.
First tip: install the MSI administratively with msiexec /a mysetup.msi TARGETDIR="c:\mytestpath" (similar restrictions as the script above by weberik).
If the MSI does not use custom actions that change paths, including the case of forgetting to add them to the admin sequence ("forgetting" should be taken as the normal case for 99% of existing setups :-), you get the file structure as if you had installed for real, with some special names for the Windows predefined folders which you will figure out easily.
If the administrative install lacks some folders, it is often a better idea to fix the custom action (adding it to the admin sequence) and use this scenario as your primary test case.
The advantage is that only you limit the dynamics used by the admin install. BTW, you can use the same command line parameters or path-setting custom actions as in a real install.
Second tip: Google for the InstEd tool, go to the File or Component table, and you will see the resulting MSI paths in the same static way as with the mentioned VBScript after calling CostInitialize/CostFinalize. For human viewing, such an editor view may even be better.
For automatic testing, or for better accuracy, you need your own program, of course.
For those cases, the snippet given above is a good starting point. :-)
The rest of you will have an easier life with one of the two methods given, without programming.

With CRF++, MIRA works for me but CRF-L1 and CRF-L2 do not

It may not matter, but I am using the Windows distribution of CRF++ 0.58.
So I have successfully used Mallet to train a CRF model and then test it. When I try to use the same train and test files with CRF++ (and after creating a template file), I get a
The line search routine mcsrch failed: error code:0
error when I use either
-a CRF-L1
or the default
-a CRF-L2
When I use
-a MIRA
though, training works without error and same with test.
The format of the test and training data can be the same for both Mallet and CRF++, so that is not the issue. My template file is as simple as:
#Mixed
M00:%x[0,0]
M01:%x[0,1]
M02:%x[0,2]
......
M12:%x[0,12]
My last column in the training data is either 0 or 1, which is the value to classify on. There is no whitespace in any of my features; I use underscores when necessary. Am I missing something simple here? What would cause the L1 and L2 regularizations to fail like that?
I knew it was something silly ...
To use features the way I am using them, you need to use the U prefix (as in unigram); something like U00:%x[0,0] is fine. You can't just name your feature templates anything you want. The corrected template is shown below.
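So the template above, corrected, looks like this (keeping the elision from the original):
#Mixed
U00:%x[0,0]
U01:%x[0,1]
U02:%x[0,2]
......
U12:%x[0,12]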
I also discovered that if I stripped my test data down to a single sentence, I would get the same error message. When I restored my test data to its original size of around 2600 sentences, the regularization algorithms ran. Overfitting is a common cause of this error message across various NLP and ML applications, but that was not the true problem in my case.
It can also happen in the extreme case of a dataset with just one class (e.g. due to a bug in the training-set generation procedure).

How do you manage the String Translation Process?

I am working on a software project that needs to be translated into 30 languages. This means that changing any string incurs a relatively high cost. Additionally, translation does not happen overnight, because the translation package needs to be worked on by different translators, so it might take a while.
Adding new features is somewhat cumbersome. We can think up all the strings that will be needed before we actually code the UI, but sometimes we still need to add new strings because of bug fixes or oversights.
So the question is: how do you manage this whole process? Any tips on how to ease the impact of translation on the software project? How to rule the strings, instead of having the strings rule you?
EDIT: We are using Java and all strings are internationalized using resource bundles, so the problem is not internationalization per se, but the management of the strings.
I'm not sure which platform you're internationalizing for. I've written an answer before on the best way to i18n an application; see What do I need to know to globalize an asp.net application?
That said, managing the translations themselves is hard. The problem is that you'll be using the same piece of text across multiple pages. Your framework may not, however, support having that piece of text in only one file (resource files in asp.net, for instance, encourage you to have one resource file per language).
The way that we found to work with things was to have a central database repository of translations. We created a small .net application to import translations from resource files into that database and to export translations from that database to resource files. There is, thus, an additional step in the build process to build the resource files.
The other issue you're going to have is passing translations to your translation vendor and back. There are a couple of ways to handle this: see if your translation vendor is willing to accept XML files and return properly formatted XML files. This is really one of the best ways, since it allows you to automate your import and export of translation files. Another alternative, if your vendor allows it, is to create a website where they can edit the translations.
In the end, your answer for translations will be the same for any other process that requires repetition and manual work. Automate, automate, automate. Automate every single thing that you can. Copy and paste is not your friend in this scenario.
Pootle is a web app that lets you manage the translation process over the web.
There are a number of major issues that need to be considered when internationalizing an application.
Not all strings are created equal. Depending on the language, the length of a sentence can change significantly; in some languages it can be half as long, and in others it can be triple the length. Make sure to design your GUI widgets with enough space to handle strings that are longer than your English strings.
Translators are typically not programmers. Do not expect the translators to be able to read and maintain the correct file formats for resource files. You should set up a mechanism to transform the translated data round trip between your resource files and something like a spreadsheet (a rough export sketch is shown below). One possibility is to use XSL filters with OpenOffice, so that you can save to resource files directly from a spreadsheet application. Also, translators or translation service companies may already have their own databases, so it is good to ask what they use and write some tools to automate the exchange.
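As an illustration of the export half of such a round trip (a minimal sketch; it assumes .properties files named like messages_de.properties and ignores the .properties escaping and encoding rules), collecting everything into one spreadsheet-friendly CSV:

import csv
import glob

rows = {}          # key -> {language: value}
languages = []

for path in sorted(glob.glob('messages_*.properties')):    # e.g. messages_de.properties
    language = path.split('_', 1)[1].rsplit('.', 1)[0]      # "de" from "messages_de.properties"
    languages.append(language)
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#') or '=' not in line:
                continue                                     # skip comments and blank lines
            key, value = line.split('=', 1)
            rows.setdefault(key.strip(), {})[language] = value.strip()

with open('translations.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['key'] + languages)                     # languages as headers, keys as rows
    for key in sorted(rows):
        writer.writerow([key] + [rows[key].get(lang, '') for lang in languages])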
You will need to append data to strings. Don't pretend that you will never have to, or that you will always be able to put the variable part at the end. Make sure that you have a string formatter set up for replacing placeholders in strings. Furthermore, make sure to document for the translators what typical values will be substituted. Remember, the order of the placeholders may change in different languages.
Name your i18n string variables something that reflects their meaning. Do you really want to be looking up numbers in a resource file to find out what the contents of a given string are? Developers depend on being able to read the string output in code a lot more than they often realize.
Don't be afraid of code generation. In my current project, I have written a small Java program, invoked by Ant, that parses all of the keys of the default-language (master) resource file and then maps each key to a constant defined in my localization class. See below: the lines between the //---- comments are auto-generated. I run the generator every time I add a string.
public final class l7d {
    ...normal junk
    /**
     * Reference to the localized strings resource bundle.
     */
    public static final ResourceBundle l7dBundle =
            ResourceBundle.getBundle(BUNDLE_PATH);
    //---- start l7d fields ----\
    public static final String ERROR_AuthenticationException;
    public static final String ERROR_cannot_find_algorithm;
    public static final String ERROR_invalid_context;
    ...many more
    //---- end l7d fields ----\
    static {
        //---- start setting l7d fields ----\
        ERROR_AuthenticationException = l7dBundle.getString("ERROR_AuthenticationException");
        ERROR_cannot_find_algorithm = l7dBundle.getString("ERROR_cannot_find_algorithm");
        ERROR_invalid_context = l7dBundle.getString("ERROR_invalid_context");
        ...many more
        //---- end setting l7d fields ----\
    }
The approach above offers a few benefits.
Since your string key is now defined as a field, your IDE should support code completion for it. This will save you a lot of typing. It gets really frustrating looking up every key name and fixing typos every time you want to print a string.
Someone please correct me if I am wrong, but loading all of the strings into memory at static initialization (as in the example) should result in a quicker load time, at the cost of additional memory usage. I have found the additional amount of memory used to be negligible, and worth the trade-off.
The localised projects I've worked on had 'string freeze' dates. After this time, the only way strings were allowed to be changed was with permission from a very senior member of the project management team.
It isn't exactly a perfect solution, but it did enable us to put defects regarding strings on hold, with a valid reason, until the next release. Once the string freeze has occurred, you also have a valid reason to deny adding brand-new features to the project on spur-of-the-moment decisions. And having the permission come from high up meant that middle managers had no power to change specs on you :)
If available, use a database for this. Each string gets an id, and there is either a table for each language, or one table for all languages with the language in a column (depending on how the site is accessed, performance dictates which is better). This allows updates from translators without trying to manage code files and version control details. Further, it's almost trivial to run reports on what isn't translated, and to keep track of what was an automatic (engine) translation versus a real human translation.
If there is no database, then I stick each language in a separate file so that version control issues are reduced. But the structure is basically the same: each string has an id.
-Adam
Not only did we use a database instead of the vaunted resource files (I have never understood why people use something that is such a pain to manage when we have such good tools for dealing with databases), but we also avoided the need to tag things in the application (forgetting to tag controls with numbers in VB6 forms was always a problem) by using reflection to identify the controls for translation. Then we used an XML file which maps the controls to the phrase IDs in the dictionary database.
Although the mapping file had to be managed, it could still be managed independently of the build process, and the translation of the application was actually possible by end users who had rights in the database.
The solution we have come up with so far is a small application in Excel that reads all the property files and then shows a matrix with all the translations (languages as headers, keys as rows). It is then quite evident what is missing. The sheet is sent to the translators, and when it comes back it can be processed to generate the same property bundles again. So far it has eased the pain somewhat, but I wonder what else is around.
This Google book on resource file management gives some good tips.
You can use resource file management software to keep track of strings that have changed and to control the workflow to get them translated; otherwise you end up in a mess of freezes and overbearing version control.
Some tools that do this sort of thing (no connection, and I haven't actually used them, just researching):
http://www.sisulizer.com/
http://www.translationzone.com/en/products/
I put in a makefile target that finds all the .properties files and puts them in a zip file to send off to the translators. I offered to send them just diffs, but for some reason they want the whole bundle of files each time. I think they have their own system for tracking just the differences, because they charge us based on how many strings have changed from one delivery to the next. When I get their delivery back, I manually diff all their files against the previous delivery to see if anything unexpected has changed; one time all the PT_BR (Brazilian Portuguese) strings changed, and it turned out they had used a PT_PT (Portuguese Portuguese) translator for that batch in spite of the order for PT_BR.
In Java, internationalization is accomplished by moving the strings to resource bundles ... the translation process is still long and arduous, but at least it's separated from the process of producing the software, releasing service packs etc. One thing that helps is to have a CI system that repackages everything any time changes are made. We can have a new version tested and out in a matter of minutes whether it's a code change, new language pack or both.
For starters, I'd use default strings in case a translation is missing. For example, the English or Spanish value.
Secondly, you might want to consider a web app or something similar for your translators to use. This requires some resources up front, but at least you won't need to send files around, and it will be obvious to the translators which strings are new, etc.

Resources