Embeddings (word or other) standard file format - word-embedding

I am creating my own word embeddings and I have various versions of them.
What is the standard way (if there is one) to save embeddings to files, so that others could easily read and use them later?
If there are several accepted methods, I'd appreciate an answer that explains each method.

I have found what the standard text format for word embeddings looks like:
<vocabulary_length> <embedding_dimensions>
<word1> <emb1_dim1> <emb1_dim2> ...... <emb1_dim_n>
<word2> <emb2_dim1> <emb2_dim2> ...... <emb2_dim_n>
.
.
<word_m> <embm_dim1> <embm_dim2> ...... <embm_dim_n>
Where, in this example, vocabulary_length is m and embedding_dimensions is n.
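For completeness, here is a minimal sketch of writing and reading that plain-text format with ordinary Python; the function names and the use of NumPy are my own choices, not part of any standard:
import numpy as np

def save_embeddings(path, vectors):
    "write a {word: vector} mapping in the plain-text format shown above"
    dims = len(next(iter(vectors.values())))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(f"{len(vectors)} {dims}\n")
        for word, vec in vectors.items():
            f.write(word + ' ' + ' '.join(f"{x:.6f}" for x in vec) + '\n')

def load_embeddings(path):
    "read the same format back into a {word: numpy array} mapping"
    vectors = {}
    with open(path, encoding='utf-8') as f:
        vocab_len, dims = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors
If you use gensim, its KeyedVectors.save_word2vec_format() and KeyedVectors.load_word2vec_format() write and read this same format (with an optional binary variant), which makes it easy for others to consume your files.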

Related

Best Data Structure For Text Processing

Given a sentence like the one below, and this data structure (a dictionary of lists, with UNIQUE keys):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process this text to convert the synonyms of the word cat to the word CAT, such that the output is like this:
A fearless cat was recently spotted in the neighborhood protecting her little cat
The algorithm I want is something that can process the initial text and convert each synonym into its ROOT word (the key in the dictionary); the lists of keywords and synonyms will also grow longer over time.
Hence, first, I want to ask whether the data structure I am using can perform efficiently, and whether there is a more efficient structure.
For now, I can only think of looping through each list in the dictionary, searching for the synonyms, and then mapping each one back to its keyword.
edit: Refined the question
Your dictionary is organised the wrong way around. It lets you quickly look up a target word, but that is not helpful when your input does not contain the target word itself, only some synonym of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
To make the replacements, you could create a regular expression and call re.sub with a callback function that will look up the translation of the found word:
import re
# build one alternation of all synonym keys, anchored at word boundaries
regex = re.compile(rf"\b(?:{'|'.join(map(re.escape, d))})\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
# replace each matched synonym with its root word from the dictionary
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".
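If you prefer to keep maintaining the original {root: [synonyms]} dictionary, you can derive the reversed lookup from it programmatically; a small sketch (the variable names are mine):
synonyms = {'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
# invert {root: [synonyms]} into {synonym: root} for fast lookups during substitution
d = {syn: root for root, syns in synonyms.items() for syn in syns}
# d == {'feline': 'cat', 'kitten': 'cat', 'courageous': 'brave', 'fearless': 'brave', 'mini': 'little'}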

total_words must be provided alongside corpus_file argument

I am training Doc2Vec with a corpus file, which is very large.
model = Doc2Vec(dm=1, vector_size=200, workers=cores, comment='d2v_model_unigram_dbow_200_v1.0')
model.build_vocab(corpus_file=path)
model.train(corpus_file=path, total_examples=model.corpus_count, epochs=model.iter)
I want to know how to get the value of total_words.
Edit:
total_words=model.corpus_total_words
Is this right?
According to the current (gensim 3.8.1, October 2019) Doc2Vec.train() documentation, you shouldn't need to supply both total_examples and total_words, only one or the other:
To support linear learning-rate decay from (initial) alpha to
min_alpha, and accurate progress-percentage logging, either
total_examples (count of documents) or total_words (count of raw words
in documents) MUST be provided. If documents is the same corpus that
was provided to build_vocab() earlier, you can simply use
total_examples=self.corpus_count.
But, it turns out the new corpus_file option does require both, and the doc-comment is wrong. I've filed a bug to fix this documentation oversight.
Yes, the model caches the number of words observed during the most-recent build_vocab() inside model.corpus_total_words, so total_words=model.corpus_total_words should do the right thing for you.
When using the corpus_file space-delimited text input option, then the numbers given by corpus_count and corpus_total_words should match the line- and word- counts you'd also see by running wc your_file_path at a command-line.
(If you were using the classic, plain Python iterable corpus option (which can't use threads as effectively), then there would be no benefit to supplying both total_examples and total_words to train() – it would only use one or the other for estimating progress.)
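Putting that together, a minimal sketch of the corrected call for the corpus_file path (using the same model and path as in the question; on current gensim the epochs attribute is model.epochs rather than the deprecated model.iter):
model.build_vocab(corpus_file=path)
model.train(corpus_file=path,
            total_examples=model.corpus_count,
            total_words=model.corpus_total_words,
            epochs=model.epochs)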

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors, and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like those on relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if you're choosing to use FastText, you may be interested in using it "fully".)
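For example, a minimal sketch (the .bin filename is my assumption, based on the Facebook download that corresponds to the .vec file in the question):
from gensim.models.fasttext import load_facebook_vectors

# load vectors from the Facebook-native .bin file, which keeps the subword
# information needed to synthesize vectors for out-of-vocabulary words
wv = load_facebook_vectors('cc.it.300.bin')
print(wv.most_similar(positive=['sole'], topn=10))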
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Script for Large CSV File to Convert IPv6 addresses to Number (or String)

So I have a big CSV file, over 1 GB. There's a column with IP addresses in IPv4 and IPv6. I want to convert the IPv6 addresses into numbers, but there are too many rows for LibreOffice Calc. So I'm wondering if it's possible to use Python in the terminal to convert all the IPv6 addresses.
Also, I could split the file up into smaller pieces and then use LibreOffice Calc, but it's the same problem--I wouldn't know how to script that either.
EDIT:
I don't mind, it might get more complicated though. Also not sure how this should be formatted, but I hope people get the idea...So I have one table with IPv6 addresses like these examples:
2001:db8::cafe:1111
2001:db8:0:a:1:2:3:4
2001:db8:aaaa::c
2001:db8:0:0:1::4
There are a bunch of different rules that govern the formatting--way too hard for me. I've heard that Python has a function that will specifically return the conversion, but I'm not sure about the rest (how to get the returned values back into the CSV correctly, with formatting unbroken, etc.). Anyway, here's a row from the other table:
"58569107296622255421594597096899477504","58569107375850417935858934690443427839","NG","Nigeria","Abuja Federal Capital Territory","Abuja","9.057350","7.489760"
So the part I need to match is the first two numbers (first two columns), where there are several ranges from
"0","340282366920938463463374607431768211455"
So I wanted to take the IPv6 addresses, convert them to IP numbers, then sort them into their respective ranges.
Yes, this is something you can do in Python. I'll demonstrate with a few short snippets and links to documentation that will fall short of a full solution in favor of empowering you with the resources that you need to put the pieces together yourself.
First off, if you want to load one CSV file line-by-line and write to a second one this is how you would do it:
>>> import csv
>>> with open('eggs.csv', newline='') as f_in, open('omelette.csv', 'w', newline='') as f_out:
...     reader = csv.reader(f_in)
...     writer = csv.writer(f_out)
...     for row in reader:
...         print(', '.join(row))          # print unmodified
...         row[0] = str(ipToNum(row[0]))  # convert the first column
...         row[1] = str(ipToNum(row[1]))  # convert the second column
...         print(', '.join(row))          # print modified
...         writer.writerow(row)
The original example on which this one was based, along with additional information about Python's built-in CSV capabilities, can be found here:
https://docs.python.org/3/library/csv.html
You will probably need to make adjustments depending on the exact formatting of your particular CSV file. Now, to convert IP addresses to numbers you can do something like the following:
import socket, struct
def ipToNum(ip):
    "convert an IPv4/IPv6 address string to an integer"
    try:
        return int.from_bytes(socket.inet_pton(socket.AF_INET6, ip), 'big')
    except OSError:
        return struct.unpack('>L', socket.inet_pton(socket.AF_INET, ip))[0]
def numToDottedip(n):
    "convert an integer back to an IPv4/IPv6 address string"
    if n > 0xFFFFFFFF:
        return socket.inet_ntop(socket.AF_INET6, n.to_bytes(16, 'big'))
    return socket.inet_ntop(socket.AF_INET, struct.pack('>L', n))
This example is adapted from what I found here:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch10s06.html
You will have to modify it to fit your exact data and needs.
Also, if you want to learn more about the socket and struct modules here is the documentation:
https://docs.python.org/3/library/socket.html
https://docs.python.org/3/library/struct.html
You shouldn't need to split the file up, since the CSV reader object will only return one line at a time rather than reading in the whole file at once. Of course, you probably also want to actually do something with those numbers once you've read them in, but since you didn't specify, I'll leave figuring that out to you.
Also note that I haven't tried any of this code. It's worth repeating here in the form of a metaphor: I'm trying to teach you to fish rather than just giving you fish. It's in your best interest to take this advice and wrestle with getting it to work yourself as that would be your first step toward actually being a programmer.
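That said, if it helps to see one possible direction for the last step of sorting converted addresses into their ranges, here is a rough sketch using the bisect module; the data layout and names are my own assumptions:
import bisect

# hypothetical list of (start, end, label) ranges, sorted by start; in practice
# these would come from the second table after converting its first two columns
ranges = [
    (0, 999, "example range A"),
    (58569107296622255421594597096899477504,
     58569107375850417935858934690443427839, "NG"),
]
starts = [r[0] for r in ranges]

def find_range(ip_num):
    "return the (start, end, label) range containing ip_num, or None"
    i = bisect.bisect_right(starts, ip_num) - 1
    if i >= 0 and ranges[i][0] <= ip_num <= ranges[i][1]:
        return ranges[i]
    return None

print(find_range(ipToNum('2001:db8::cafe:1111')))  # None: this address falls in neither example range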

Wiktionary/MediaWiki Search & Suffix Filtering

I'm building an application that will hopefully use Wiktionary words and definitions as a data source. In my queries, I'd like to be able to search for all Wiktionary entries that are similar to user provided terms in either the title or definition, but also have titles ending with a specified suffix (or one of a set of suffixes).
For example, I want to find all Wiktionary entries that contain the words "large dog", like this:
https://en.wiktionary.org/w/api.php?action=query&list=search&srsearch=large%20dog
But further filter the results to only contain entries with titles ending with "d". So in that example, "boarhound", "Saint Bernard", and "unleashed" would be returned.
Is this possible with the MediaWiki search API? Do you have any recommendations?
This is mostly possible with ElasticSearch/CirrusSearch, but it is disabled for performance reasons. You can still use it on your own wiki, or attempt smart search queries.
Usually for Wiktionary I use yanker, which can access the page table of the database. Your example (one-letter suffix) would be huge, but for instance .*hound$ finds:
Afghan_hound
Bavarian_mountain_hound
Foxhound
Irish_Wolfhound
Mahound
Otterhound
Russian_Wolfhound
Scottish_Deerhound
Tripehound
basset_hound
bearhound
black_horehound
bloodhound
boarhound
bookhound
boozehound
buckhound
chowhound
coon_hound
coonhound
covert-hound
covert_hound
coverthound
deerhound
double-nosed_andean_tiger_hound
elkhound
foxhound
gazehound
gorehound
grayhound
greyhound
harehound
heckhound
hell-hound
hell_hound
hellhound
hoarhound
horehound
hound
limehound
lyam-hound
minkhound
newshound
nursehound
otterhound
powder_hound
powderhound
publicity-hound
publicity_hound
rock_hound
rockhound
scent_hound
scenthound
shag-hound
sighthound
sleuth-hound
sleuthhound
slot-hound
slowhound
sluthhound
smooth_hound
smoothhound
smuthound
staghound
war_hound
whorehound
wolfhound
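If you would rather stay with the public MediaWiki search API, one workable compromise is to run the full-text search from the question and filter the returned titles by suffix on the client side; a rough sketch with the requests library (everything beyond the parameters in the question's URL is my own choice):
import requests

def search_with_suffix(term, suffixes, limit=500):
    "full-text search on en.wiktionary, keeping only titles that end with one of the suffixes"
    params = {
        'action': 'query', 'list': 'search', 'srsearch': term,
        'srlimit': limit, 'format': 'json',
    }
    r = requests.get('https://en.wiktionary.org/w/api.php', params=params)
    hits = r.json()['query']['search']
    return [h['title'] for h in hits if h['title'].endswith(tuple(suffixes))]

print(search_with_suffix('large dog', ['d']))
Note that this only filters whatever the search returns in one batch, so for very common suffixes (like a single letter) you would still need to page through further results via the API's continuation parameters.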
