'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte - utf-8

I want to use Word2Vec, and i have download a Word2Vec's corpus in indonesian language, but when i call it, it was give me an error, this is what i try :
Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
and it was give me an error, like this :
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-73-219e152ee7d9> in <module>()
----> 1 Model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Feature Extraction Lexicon Based/Word2Vec/idwiki_word2vec_100_new_lower.model.wv.vectors.npy', binary=True,)
2 frames
/usr/local/lib/python3.7/dist-packages/gensim/utils.py in any2unicode(text, encoding, errors)
353 if isinstance(text, unicode):
354 return text
--> 355 return unicode(text, encoding, errors=errors)
356
357
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

A file named idwiki_word2vec_100_new_lower.model.wv.vectors.npy is unlikely to be in the format needed by load_word2vec_format().
The .npy suggests it is a raw numpy array, which is not the format expected.
Also, the .wv.vectors. section suggests this could be part of a full, multi-file Gensim .save() of a complete Word2Vec model. That's more than just the vectors, & requires all associated files to re-load.
You should double-check the source of the vectors and what their claims are about its format and the proper ways to load. (If you're still having problems & need more guidance, you should specify more details about the origin of the file – for example a link to the website where it was obtained – to support other suggestions.)

Related

EOFError when converting gensim word2vec to binary format

I have a pretrained embeddings with word2vec format in txt. I loaded it and then saved it to .bin. But I cannot load this embeddings as an EOFError: unexpected end of input; is count incorrect or file otherwise damaged?
My original code is:
model = KeyedVectors.load_word2vec_format(wordfile)
model.save_word2vec_format("file.bin",binary=True,write_header=True)
bin_model = KeyedVectors.load_word2vec_format("file.bin",binary=True)
And I can load this file.bin with a limit arguement: KeyedVectors.load_word2vec_format("file.bin",binary=True, limit=10000).
Is there some other process needed when I save embeddings?
There's a good chance that your .bin file has an incorrect leading-count, or the file has been otherwise been damaged/truncated – because that error means the file declared in its header (1st line) a larger number of word-vectors than were found during attempted-load.
So, if you downloaded it or copied it from somewhere, check the original source, to make sure you've got the full file.
Is there a reason you're performing this conversion? The formats are essentialy equivalent, and result in the exact same object-in-Python after loading.
If there's any tiny on-disk size savings in binary-format, you could probably save more by GZIPping the file (which the .load_word2vec_format() will also happily decompress, if it sees a trailing .gz on the filename).

How to load pre-trained glove model with gensim load_word2vec_format?

I am trying to load a pre-trained glove as a word2vec model in gensim. I have downloaded the glove file from here. I am using the following script:
from gensim import models
model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
but get the following error
ValueError Traceback (most recent call last)
<ipython-input-38-e0b48b51f433> in <module>()
1 from gensim import models
----> 2 model = models.KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=True)
2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <genexpr>(.0)
171 with utils.smart_open(fname) as fin:
172 header = utils.to_unicode(fin.readline(), encoding=encoding)
--> 173 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
174 if limit:
175 vocab_size = min(vocab_size, limit)
ValueError: invalid literal for int() with base 10: 'the'
What is the underlying problem? Does gensim need a specific format to be able to load it?
The GLoVe format is slightly different – missing a 1st-line declaration of vector-count & dimensions – than the format that load_word2vec_format() supports.
There's a glove2word2vec utility script included you can run once to convert the file:
https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Also, starting in Gensim 4.0.0 (currentlyu in prerelease testing), the load_word2vec_format() method gets a new optional no_header parameter:
https://radimrehurek.com/gensim/models/keyedvectors.html?highlight=load_word2vec_format#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format
If set as no_header=True, the method will deduce the count/dimensions from a preliminary scan of the file - so it can read a GLoVe file with that option – but at the cost of two full-file reads instead of one. (So, you may still want to re-save the object with .save_word2vec_format(), or use the glove2word2vec script, to make future loads faster.)

How to interpret a binary Bitmap picture, so I know which color on which pixel i change when changing something in the Code?

I just used this code to convert a picture to binary:
import io
tme = input("Name: ")
with io.open(tme, "rb") as se:
print(se.read())
se.close()
Now it looks like this:
5MEMMMMMMMMMMMMM777777777777777777777777777777777\x95\x95\x95\x95\x95\x95\x95\x95\x95\x95\x95\x95\x95MEEMMMMEEMM\x96\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97\x97
And now I want to be able to interpret what this binary code is telling me exactly... I know it ruffly but not enough to be able to do anything on purpose. I searched the web but I didn't find anything that could help me on that point. Can you tell me how is it working or send me a link where I can read how it's done?
You can't just change random bytes in an image. There's a header at the start with the height and width, probably the date and a palette and information about the number of channels and bits per pixel. Then there is the image data which is often padded, and/or compressed.
So you need an imaging library like PIL/Pillow and code something like this:
from PIL import Image
im = Image.open('image.bmp').convert('RGB')
px = im.load()
# Look at pixel[4,4]
print (px[4,4])
# Make it red
px[4,4] = (255,0,0)
# Save to disk
im.save('result.bmp')
Documentation and examples available here.
The output should be printed in hexadecimal format. The first bytes in bitmap file are 'B' and 'M'.
You are attempting to print the content in ASCII. Moreover, it doesn't show the first bytes because the content has scrolled further down. Add print("start\n") to make sure you see the start of the output.
import io
import binascii
tme = 'path.bmp'
print("start") # make sure this line appears in console output
with io.open(tme, "rb") as se:
content = se.read()
print(binascii.hexlify(content))
Now you should see something like
start
b'424d26040100000000003...
42 is hex value for B
4d is hex value for M ...
The first 14 bytes in the file are BITMAPFILEHEADER
The next 40 bytes are BITMAPINFOHEADER
The bytes after that are color table (if any) and finally the actual pixels.
See BITMAPFILEHEADER and BITMAPINFOHEADER

Image Copy Issue with Ruby File method each_byte

This problem has bugged me for a while.
I have a jpeg file that is 34.6 kilobytes. Let's call it Image A. Using Ruby, when I copy each line of Image A to a newly created file, called Image B, it is copied exactly. It is exactly the same size as Image A and is accessible.
Here is the code I used:
image_a = File.open('image_a.jpg', 'r')
image_b = File.open('image_b.jpg', 'w+')
image_a.each_line do |l|
image_b.write(l)
end
image_a.close
image_b.close
This code generates a perfect copy of image_a into image_b.
When I try to copy Image A into Image B, byte by byte, it copies successfully but the file size is 88.9 kilobytes rather than the 34.6 kilobytes. I can't access Image B. My mac system alerted me it may be damaged or is using a file format that isn't recognized.
The related code:
//same as before
image_a.each_byte do |b|
image_b.write(b)
end
//same as before
Why is Image B, when copied into byte by byte, larger than Image A? Why is it also damaged in some way, shape, or form? Why is Image A the same size as B, when copied line by line, and accessible?
My guess is the problem is an encoding issue. If so, Why does encoding format matter when copying byte by byte if they translate into the correct code points? Are code points jumbled up into each other so the parser is unable to differentiate between them?
Do \s and \n matter? It seems like it. I did some more research and I found that Image A had 128 lines of code whereas Image B had only one line.
Thanks for reading!
IO#each_byte iterates over bytes (aka Integers). IO#write, however, takes a string as an argument. So it converts the integer to a string via to_s.
Given the first byte in your image is 2551, you'd write the string "255" into image_b. This is why your image_b gets larger. You write number-strings into it.
Try the following when writing back bytes:
image_a.each_byte do |l|
image_b.write l.chr
end
1 As #stefan pointed out jpeg images start with FF D8. So the first byte is 255.

Reading MP3 audio data, or calculating the checksum thereof

Is there a Ruby library that will allow me to either calculate the checksum of an MP3 file's audio data (minus the metadata) or allow me to read in an MP3's audio data to calculate the checksum myself?
I'm looking for something like this:
mp3 = Mp3Lib::MP3.new('/path/to/song.mp3')
mp3.audio.sha1sum # => the sha1 checksum of _only_ the audio, minus the metadata
I found Mp3Info, but it seems a bit tedious. When initializing an Mp3Info object, you can get the frames where the actual audio data begins and ends.
Extracting the mp3file without it's metadata is fairly easy done by yourself.
ID3v1
The metadata are the last 128 bytes of the file. The metadata always begins with the 3 bytes "TAG" if it exists. Just ignore this last 128bytes.
ID3v2
The metadata can be stored at the beginning or the end of the file. Most implemantations only support the beginning. ID3v2 has a header where the size is stored. The header is always loacted at the beginning of the metadata. There is an optional footer, which is a copy of the header at the end of the metadata. If the metadata is at the end of the file, the footer is required.
The header has the folloing form
ID3v2/file identifier "ID3"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The footer has the following form
ID3v2/file identifier "3DI"
ID3v2 version $04 00
ID3v2 flags %abcd0000
ID3v2 size 4 * %0xxxxxxx
The d bit says, wheter the footer is present. The size is measured without header and footer. Every byte of the size has always the highest bit set. So only 28 of the acutal 32 bits represent the size.
Just compute, which part of the file is not the metadata, and use it for your hashing.
Be aware, if both ID3v1 and ID3v2 are located at the end of the file, ID3v1 is located behind IDv2
The spec can be found at http://www.id3.org/id3v2.4.0-structure.
Isn't the ID3 tag stored either at the end of the file (ID3 v1) in a 128-byte block, or in a block at the start of the file (ID3v2.3 and v2.4) ? (id3.org)
You could use the audio_content method from Mp3Info, and read that much data from the file, though it's probably not much more complicated to have a look in the file yourself and work out where the headers aren't.

Resources