Extracting vectors from Doc2Vec - gensim

I am trying to extract the document vectors to feed into a regression model for prediction.
I have fed around 1,400,000 labelled sentences into Doc2Vec for training; however, I was only able to retrieve 10 vectors using model.docvecs.
This is a snapshot of the labelled sentences I used to train the Doc2Vec model:
In : documents[0]
Out: TaggedDocument(words=['descript', 'yet'], tags='0')
In : documents[-1]
Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')
This is the code used to train the Doc2Vec model:
model = gensim.models.Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=len(documents), epochs=1)
This is the dimension of the documents vectors:
In : model.docvecs.doctag_syn0.shape
Out: (10, 100)
On which part of the code did I mess up?
Update:
Adding on to the comment from sophros, it appears that I made a mistake when creating the TaggedDocument objects prior to training, which resulted in 1.4 million documents appearing as only 10 documents.
Courtesy of Irene Li and her tutorial on Doc2Vec, I have made some slight edits to the code she used to generate TaggedDocument objects:
import re
import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def get_doc(data):
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = stopwords.words('english')
    p_stemmer = PorterStemmer()
    taggeddoc = []
    texts = []
    for index, i in enumerate(data):
        # for tagged doc
        wordslist = []
        tagslist = []
        i = str(i)
        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]
        # remove numbers
        number_tokens = [re.sub(r'[\d]', ' ', i) for i in stopped_tokens]
        number_tokens = ' '.join(number_tokens).split()
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in number_tokens]
        # remove empty
        length_tokens = [i for i in stemmed_tokens if len(i) > 1]
        # add tokens to list
        texts.append(length_tokens)
        # tags passed as a plain string here -- this is the bug discussed below
        td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(), str(index))
        taggeddoc.append(td)
    return taggeddoc
The mistake was fixed when I made the change from
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
to this
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),[str(index)])
It appears that the tags of a TaggedDocument must be in the form of a list for TaggedDocument to work properly. For more details as to why, please refer to this answer by gojomo:

The gist of the error was: the tags for each individual TaggedDocument were being provided as plain strings, like '101' or '456'.
But, tags should be a list-of-separate tags. By providing a simple string, it was treated as a list-of-characters. So '101' would become ['1', '0', '1'], and '456' would become ['4', '5', '6'].
Across any number of TaggedDocument objects, there were thus only 10 unique tags, single digits ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']. Every document just caused some subset of those tags to be trained.
Correcting tags to be a list-of-one tag, e.g. ['101'], allows '101' to be seen as the actual tag.
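To illustrate, here is a minimal sketch using gensim's TaggedDocument (the words and tag values below are made up for the example):

from gensim.models.doc2vec import TaggedDocument

# tags given as a plain string gets iterated character by character
bad = TaggedDocument(words=['new', 'tag', 'red'], tags='1482534')
print(list(bad.tags))    # ['1', '4', '8', '2', '5', '3', '4']

# tags given as a list-of-one tag keeps the full tag intact
good = TaggedDocument(words=['new', 'tag', 'red'], tags=['1482534'])
print(list(good.tags))   # ['1482534']

With the list form, model.docvecs.doctag_syn0.shape should come out as one row per unique tag (roughly 1,482,535 here) rather than (10, 100).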

Related

Elixir/Phoenix sum of the column

I'm trying to get the sum of a particular column.
I have an orders schema with a total field that stores the total price.
Now I'm trying to create a query that will sum the total value of all the orders; however, I'm not sure if I'm doing it right.
Here is what I have so far:
def create(conn, %{"statistic" => %{"date_from" => %{"day" => day_from, "month" => month_from, "year" => year_from}}}) do
  date_from = Ecto.DateTime.cast!({{year_from, month_from, day_from}, {0, 0, 0, 0}})
  revenue = Repo.all(from p in Order, where: p.inserted_at >= ^date_from, select: sum(p.total))
  render(conn, "result.html", revenue: revenue)
end
And I'm just calling it like <%= @revenue %> in the html.eex template.
As of right now, it doesn't return errors; it just renders a random symbol on the page instead of the total revenue.
I think my query is wrong, but couldn't find good information about how to make it work properly. Any help appreciated, thanks!
Your query returns just 1 value, and Repo.all wraps it in a list. When you print a list using <%= ... %>, it treats integers inside the list as Unicode codepoints, and you get the character with that codepoint as output on the page. The fix is to use Repo.one instead, which will return the value directly, which in this case is an integer.
revenue = Repo.one(from p in Order, where: p.inserted_at >= ^date_from, select: sum(p.total))
@Dogbert's answer is correct. It is worth noting that if you are using Ecto 2.0 (currently in release candidate), you can use Repo.aggregate/4:
revenue = Repo.aggregate(from p in Order, where: p.inserted_at >= ^date_from, :sum, :total)

Slicing in PyTables

What is the fastest way to slice arrays that are saved in h5 using PyTables?
The scenario is the following:
The data was already saved (no need to optimize here):
filters = tables.Filters(complib='blosc', complevel=5)
h5file = tables.open_file(hd5_filename, mode='w',
                          title='My Data',
                          filters=filters)
group = h5file.create_group(h5file.root, 'Data', 'Data')
X_atom = tables.Float32Atom(shape=[50, 50, 50])
X = h5file.create_carray(group, 'X', atom=X_atom, title='XData',
                         shape=(1000,), filters=filters)
The data is opened:
h5file = tables.openFile(hd5_filename, mode="r")
node = h5file.getNode('/', data_node)
X = getattr(node, X_str)
This is where I need optimization: I need to do a lot of the following kind of array slicing, which cannot be sorted, for many, many indexes and different min/max locations:
for index, min_x, min_y, min_z, max_x, max_y, max_z in my_very_long_list:
    current_item = X[index][min_x:max_x, min_y:max_y, min_z:max_z]
    do_something(current_item)
The question is:
Is this the fastest way to do the task?

Rails where not equals *any* of array of values

I've tried to do stuff like this
not_allowed = ['5', '6', '7']
sql = not_allowed.map{|n| "col != '#{n}'"}.join(" OR ")
Model.where(sql)
and
not_allowed = ['5', '6', '7']
sql = not_allowed.map{|n| "col <> '#{n}'"}.join(" OR ")
Model.where(sql)
but both of these just return my entire table which isn't accurate.
So I've done this and it works:
shame = values.map{|v| "where.not(:col => '#{v}')" }.join(".")
eval("Model.#{shame}")
And I'm not even doing this for an actual web application; I'm just using Rails for its model stuff, so there aren't any actual security concerns for me. But this is an awful fix, and I felt obligated to post this question.
Your first pieces of code do not work because the OR condition makes the entire where clause always true. That is, if the value of col is 5, then 5 is not different from 5, but it is different from 6 and 7; therefore, the where clause evaluates to false OR true OR true, which returns true.
I think in this case you can use the NOT IN clause instead, as follows:
not_allowed = ['1','2', '3']
Model.where('col not in (?)', not_allowed)
This will return all records except the ones where col matches any of the elements in your array.

Write actual values to bar chart using Gruff within Ruby

I am generating a bar chart with values [1,5,10,23]. Currently, I have no way of knowing those exact values when looking at the image generated by Gruff. I just know that 23 falls somewhere between the lines of 20 and 25.
Is it possible to write the exact values within the image?
I think you are looking for labels
g = Gruff::Bar.new
g.title = 'Wow! Look at this!'
g.data("something", [1, 5, 10, 23])
g.labels = { 0 => '1', 1 => '5', 2 => '10', 3 => '23'}
Read the documentation for more info on labels
I think this is what you are looking for:
g.show_labels_for_bar_values = true

Searching CSV with 1.6 Million lines (150MB) file?

I have a CSV of around 150 MB containing 1.6 million lines of product data. I have another CSV containing 2,000 lines, which is a list of products from the big CSV. They relate to each other by a unique ID. The idea is to add the product data to the CSV with 2,000 lines.
The databank.csv has headers ID, Product Name, Description, Price.
The sm_list.csv has header ID.
The result should be a CSV with the products in sm_list.csv and the corresponding data from databank.csv... 2,000 rows long.
My original solution reads in all of sm_list and reads databank line by line. It searches sm_list for the ID in the line read from databank. This leads to 2,000 x 1.6 million = 3.2 billion comparisons!
Could you please provide a basic algorithm outline to complete this task in the most efficient way?
Assuming you know how to read/write CSV files in MATLAB (several questions here on SO show how), here is an example:
%# this would be read from "databank.csv"
prodID = (1:10)'; %'
prodName = cellstr( num2str(prodID, 'Product %02d') );
prodDesc = cellstr( num2str(prodID, 'Description %02d') );
prodPrice = rand(10,1)*100;
databank = [num2cell(prodID) prodName prodDesc num2cell(prodPrice)];
%# same for "sm_list.csv"
sm_list = [2;5;7;10];
%# find matching rows
idx = ismember(prodID,sm_list);
result = databank(idx,:)
%# ... export 'result' to CSV file ...
The result of the above example:
result =
[ 2] 'Product 02' 'Description 02' [19.251]
[ 5] 'Product 05' 'Description 05' [14.651]
[ 7] 'Product 07' 'Description 07' [4.2652]
[10] 'Product 10' 'Description 10' [ 53.86]
Do you have to be using MATLAB? If you just input all that data into a database, it'll be easier. A simple SELECT tableA.ID, tableB.productname... WHERE tableA.id = tableB.id will do it.
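For comparison, the same hash-lookup idea outside MATLAB is a single pass over each file: build a set of the wanted IDs, then stream the big file once. A rough Python sketch, assuming the file names and the ID header from the question (the output name result.csv is made up):

import csv

# collect the 2,000 wanted IDs into a set for O(1) membership tests
with open('sm_list.csv', newline='') as f:
    wanted = {row['ID'] for row in csv.DictReader(f)}

# stream the 1.6-million-line databank once, keeping only matching rows
with open('databank.csv', newline='') as src, open('result.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row['ID'] in wanted:
            writer.writerow(row)

That is roughly 1.6 million + 2,000 row reads with constant-time set lookups, instead of 3.2 billion pairwise comparisons.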
