I have pretrained embeddings in word2vec text format. I loaded them and then saved them to .bin. But I cannot load these embeddings back; loading fails with EOFError: unexpected end of input; is count incorrect or file otherwise damaged?
My original code is:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(wordfile)                      # load text-format vectors
model.save_word2vec_format("file.bin", binary=True, write_header=True)   # re-save in binary format
bin_model = KeyedVectors.load_word2vec_format("file.bin", binary=True)   # fails with EOFError
And I can load this file.bin with a limit argument: KeyedVectors.load_word2vec_format("file.bin", binary=True, limit=10000).
Is there some other process needed when I save embeddings?
There's a good chance that your .bin file has an incorrect leading count, or the file has otherwise been damaged/truncated, because that error means the file declared a larger number of word-vectors in its header (the first line) than were actually found during the attempted load.
So, if you downloaded it or copied it from somewhere, check the original source, to make sure you've got the full file.
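A quick sanity check is to read the header line of the .bin file, since the binary word2vec format declares the vocabulary size there. A small sketch, using the file name from the question:

from gensim.models import KeyedVectors

# The word2vec binary format begins with an ASCII header line: "<count> <dimensions>\n".
# If the declared count is larger than the number of vectors actually present in the file,
# loading fails with the EOFError from the question.
with open("file.bin", "rb") as f:
    print(f.readline())   # e.g. b'400000 300\n'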
Is there a reason you're performing this conversion? The formats are essentially equivalent and result in the exact same object in Python after loading.
If there's any tiny on-disk size saving in the binary format, you could probably save more by gzipping the file (which .load_word2vec_format() will also happily decompress, if it sees a trailing .gz on the filename).
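For example, a minimal sketch (the file names here are placeholders, not from the question):

import gzip
import shutil

from gensim.models import KeyedVectors

# Compress the plain-text vectors; keeping the .gz suffix lets
# load_word2vec_format() decompress the file transparently.
with open("vectors.txt", "rb") as src, gzip.open("vectors.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

model = KeyedVectors.load_word2vec_format("vectors.txt.gz")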
I have to work with a PDF form created by a person unknown to me. Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF? There is no visual break between the three parts.
/TT1 1 Tf
11.04 0 0 11.04 59.16 476.1203 Tm
(Datum)Tj
/C2_1 1 Tf
<0003>Tj
/TT1 1 Tf
(der)Tj
0.424 -1.315 Td
(Tätigkeit)Tj
-0.0022 Tc 0 11.04 -11.04 0 261.24 437.7203 Tm
[(Ve)-4.6<7267fc74>-4.2(ungssat)-4.2(z)]TJ
/C2_1 1 Tf
0 Tc <0003>Tj
/TT1 1 Tf
-0.0021 Tc 0.935 -1.315 Td
[<2880>-6.1(/)-7.2(S)0.8(t)-4.1(unde)-4.5(\))]TJ % <<< the important line
0 Tc 11.04 0 0 11.04 340.92 468.8003 Tm
(Anlass/Art)Tj
/C2_1 1 Tf
resulting in the rendered form text (screenshot omitted).
To get the source code above, I decoded the PDF file as described here. I have no know-how concerning the PDF file format.
Background: I had to replace the word "Stunde"; it drove me crazy to find the place where "Stunde" was written (in parts) within the source code, since no free PDF editor seems to be able to work with horizontal text without problems.
Academic bonus questions: Is it possible to set the sum over a column as the default value for a form field? (Modifiable; recalculated every time the column is changed.) Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now-irregular offsets?
Why did the program with which the form was created (Word + PDF export?) split the term "Stunde" into "S", "t" and "unde" in line 6909 of the decoded PDF?
As @gettalong mentioned in his answer, in your case this most likely has been done to apply kerning.
If you start looking into the outputs of some other PDF producers, you'll see that this export from Word actually is very unobtrusive in regard to splitting words:
there are PDF producers that draw each character individually after explicitly setting the text matrix for it, and
there also are PDF producers that have the width information for the characters of the used fonts set to zero and use the numbers in TJ instructions to advance the current text position between characters accordingly.
And this doesn't cover all the variants to be found, not by far...
Thus,
I had to replace the word "Stunde", it drove me crazy to find the place where "Stunde" was written (in parts) within the source code
in your case replacing actually was a fairly trivial task...
Is it possible to set the sum over a column as default value for a form field? (Modifiable; changed every time the column is changed.)
If all the column values in question are stored in form fields, you can use JavaScript to recalculate sums after form changes. To have it serve as a "default" only, you can use some other (hidden) field to store a flag indicating whether the field has already been touched. Beware, though: JavaScript is not supported by all PDF viewers. Furthermore, the JavaScript object model for PDF is not specified in an independent (e.g. ISO) specification but in an Adobe one, which can make interpretation of the specification biased.
Why was I able to replace "Stunde" with "Einsatz" without making the PDF file corrupt due to now irregular offsets?
As we don't know how exactly you applied the changes, this obviously is hard to tell.
Most likely, though, you did corrupt the PDF, and the PDF viewers you opened it in merely repair the corruption under the hood. There is a strong tendency in PDF viewers to do such under-the-hood repairs without informing the user; the result is that a large part of the PDFs in the wild are actually broken.
You don't see a visual break, but the standard distance between "S", "t" and "unde" has been changed nonetheless. This is done by PDF writers that support e.g. kerning so that the word appears nicer. This is the reason why it is split that way.
I have a big gensim Doc2Vec model. I only need to infer vectors, since I am loading the training documents' vectors from another source.
Is it possible to load it as is, without the big .npy files?
Edit: here is what I did:
from gensim.models.doc2vec import Doc2Vec

model_path = r'C:\model/model'
model = Doc2Vec.load(model_path)
model.delete_temporary_training_data(keep_doctags_vectors=False, keep_inference=True)
model.save(model_path)

Then I removed the files (model.trainables.syn1neg.npy, model.wv.vectors.npy) manually and tried to reload:

model = Doc2Vec.load(model_path)
but it asks for the deleted file:
Traceback (most recent call last):
File "<ipython-input-5-7f868a7dbe0c>", line 1, in <module>
model = Doc2Vec.load(model_path)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\doc2vec.py", line 1113, in load
return super(Doc2Vec, cls).load(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 1244, in load
model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\models\base_any2vec.py", line 603, in load
return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 427, in load
obj._load_specials(fname, mmap, compress, subname)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 458, in _load_specials
getattr(self, attrib)._load_specials(cfname, mmap, compress, subname)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\gensim\utils.py", line 469, in _load_specials
val = np.load(subname(fname, attrib), mmap_mode=mmap)
File "C:\ProgramData\Anaconda3\envs\py\lib\site-packages\numpy\lib\npyio.py", line 428, in load
fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\model/model.trainables.syn1neg.npy'
Note: those files do not exist in the directory. The model runs on a server and downloads the model file from storage.
My question is: does the model need those files for inference? I want to run it with as little memory consumption as possible.
Thanks.
Edit:
Is the file model.trainables.syn1neg.npy the model's weights?
Is the file model.wv.vectors.npy necessary for running inference?
I'm not a fan of the delete_temporary_training_data() method. It implies a cleaner separation between training state and the state needed for later uses than actually exists. (Inference is very similar to training, though it doesn't need the cached doc-vectors for the training texts.)
That said, if you've used that method, you shouldn't then be deleting any of the side-files that were still part of the save. If they were written by .save(), they'll be expected, by name, by the .load(). They must be kept with the main model file. (There might be fewer such files, or smaller such files, after the delete_temporary_training_data() call - but any written must be kept for reading.)
The syn1neg file is absolutely required for inference: it's the model's hidden-to-output weights, needed to perform new forward-predictions (and thus also backpropagated inference-adjustments). The wv.vectors file is definitely needed in default dm=1 mode, where word-vectors are part of the doc-vector calculation. (It might be optional in dm=0 mode, but I'm not sure the code is armored against them being absent - not via in-memory trimming, and definitely not against the expected file being deleted out-of-band.)
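If the main goal is simply low memory use at inference time, a gentler option than deleting files is to memory-map the big arrays when loading. A minimal sketch (assuming all the .npy side files written by .save() are still next to the main model file; the path is the one from the question):

from gensim.models.doc2vec import Doc2Vec

# Memory-map the large numpy arrays instead of reading them fully into RAM.
model = Doc2Vec.load(r'C:\model/model', mmap='r')

# Inference only needs a tokenized text, preprocessed like the training data.
vector = model.infer_vector(['some', 'preprocessed', 'tokens'])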
For anyone who wants to load gensim.Doc2Vec without doc vectors for inference, I found this works:
import gensim
fname = "path/to/model.pickle"
d2v = gensim.utils.unpickle(fname)
d2v.__recursive_saveloads = ["wv"] # Would normally be ['wv', 'dv']
d2v._load_specials(fname, None, *gensim.utils.SaveLoad._adapt_by_suffix(fname))
I know it's a pretty ugly hack, but in my case the doc vectors ended up being a 28GB file, and I didn't want to retrain.
I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK

# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in a String and then deserialize, that would work too. The file is basically one big line, so I can't even read it line by line; the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
#mat = Marshal.load(mat_dump)
Remember that Marshaling is Ruby version dependent. It's only compatible under specific circumstances with other Ruby versions. So keep that in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
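If in doubt, both the dump's embedded format version and the running interpreter's Marshal version can be inspected (a small sketch, not from the original answer):

# The first two bytes of any Marshal dump are the format's major and minor version.
dumped = Marshal.dump([1, 2, 3])
p dumped.bytes.first(2)                             # e.g. [4, 8]
p [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]  # Marshal version of the running Ruby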
This problem has bugged me for a while.
I have a jpeg file that is 34.6 kilobytes. Let's call it Image A. Using Ruby, when I copy each line of Image A to a newly created file, called Image B, it is copied exactly. It is exactly the same size as Image A and is accessible.
Here is the code I used:
image_a = File.open('image_a.jpg', 'r')
image_b = File.open('image_b.jpg', 'w+')

image_a.each_line do |l|
  image_b.write(l)
end

image_a.close
image_b.close
This code generates a perfect copy of image_a into image_b.
When I try to copy Image A into Image B, byte by byte, it copies successfully but the file size is 88.9 kilobytes rather than the 34.6 kilobytes. I can't access Image B. My mac system alerted me it may be damaged or is using a file format that isn't recognized.
The related code:
# same as before
image_a.each_byte do |b|
  image_b.write(b)
end
# same as before
Why is Image B, when copied into byte by byte, larger than Image A? Why is it also damaged in some way, shape, or form? Why is Image A the same size as B, when copied line by line, and accessible?
My guess is that the problem is an encoding issue. If so, why does the encoding format matter when copying byte by byte if they translate into the correct code points? Are code points jumbled up into each other so that the parser is unable to differentiate between them?
Do \s and \n matter? It seems like it. I did some more research and I found that Image A had 128 lines of code whereas Image B had only one line.
Thanks for reading!
IO#each_byte iterates over bytes (aka Integers). IO#write, however, takes a string as an argument. So it converts the integer to a string via to_s.
Given that the first byte in your image is 255 [1], you'd write the string "255" into image_b. This is why your image_b gets larger: you write number-strings into it.
Try the following when writing back bytes:
image_a.each_byte do |b|
  image_b.write(b.chr)  # convert the Integer back into a one-byte String
end
[1] As @stefan pointed out, JPEG images start with FF D8, so the first byte is 255.
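As a side note (not part of the original question): if the goal is just an exact copy, reading and writing in binary mode, or streaming the file, sidesteps the Integer-to-String conversion entirely. A small sketch:

# Copy the whole file at once in binary mode...
IO.binwrite('image_b.jpg', IO.binread('image_a.jpg'))

# ...or stream it without loading everything into memory.
IO.copy_stream('image_a.jpg', 'image_b.jpg')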
Is there a Ruby library that will allow me to either calculate the checksum of an MP3 file's audio data (minus the metadata) or allow me to read in an MP3's audio data to calculate the checksum myself?
I'm looking for something like this:
mp3 = Mp3Lib::MP3.new('/path/to/song.mp3')
mp3.audio.sha1sum # => the sha1 checksum of _only_ the audio, minus the metadata
I found Mp3Info, but it seems a bit tedious. When initializing an Mp3Info object, you can get the frames where the actual audio data begins and ends.
Extracting the MP3 data without its metadata is fairly easy to do yourself.
ID3v1
The metadata is the last 128 bytes of the file. It always begins with the 3 bytes "TAG", if it exists. Just ignore these last 128 bytes.
ID3v2
The metadata can be stored at the beginning or the end of the file. Most implementations only support the beginning. ID3v2 has a header where the size is stored. The header is always located at the beginning of the metadata. There is an optional footer, which is a copy of the header, at the end of the metadata. If the metadata is at the end of the file, the footer is required.
The header has the following form
ID3v2/file identifier "ID3"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The footer has the following form
ID3v2/file identifier "3DI"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The d bit says whether the footer is present. The size is measured without header and footer. Every byte of the size always has its highest bit cleared, so only 28 of the actual 32 bits represent the size.
Just compute which part of the file is not metadata and use that for your hashing; a sketch follows below.
Be aware that if both ID3v1 and ID3v2 are located at the end of the file, ID3v1 is located behind ID3v2.
The spec can be found at http://www.id3.org/id3v2.4.0-structure.
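A minimal Ruby sketch along these lines (assuming at most one ID3v2 tag at the start and one ID3v1 tag at the end, and ignoring the optional ID3v2 footer/appended-tag case):

require 'digest'

# Hash only the audio bytes of an MP3, skipping a leading ID3v2 tag
# and a trailing ID3v1 tag.
def audio_sha1(path)
  data = IO.binread(path)

  # Skip a leading ID3v2 tag: 10-byte header, size stored in 4 "synchsafe"
  # bytes (only the low 7 bits of each byte are used).
  if data[0, 3] == 'ID3'
    size = data[6, 4].bytes.inject(0) { |sum, b| (sum << 7) | (b & 0x7f) }
    data = data[(10 + size)..-1]
  end

  # Drop a trailing ID3v1 tag: the last 128 bytes, starting with "TAG".
  data = data[0...-128] if data.length >= 128 && data[-128, 3] == 'TAG'

  Digest::SHA1.hexdigest(data)
end

puts audio_sha1('/path/to/song.mp3')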
Isn't the ID3 tag stored either at the end of the file (ID3v1) in a 128-byte block, or in a block at the start of the file (ID3v2.3 and v2.4)? (id3.org)
You could use the audio_content method from Mp3Info, and read that much data from the file, though it's probably not much more complicated to have a look in the file yourself and work out where the headers aren't.