I've tried to do stuff like this
not_allowed = ['5', '6', '7']
sql = not_allowed.map{|n| "col != '#{n}'"}.join(" OR ")
Model.where(sql)
and
not_allowed = ['5', '6', '7']
sql = not_allowed.map{|n| "col <> '#{n}'"}.join(" OR ")
Model.where(sql)
but both of these just return my entire table which isn't accurate.
So I've done this and it works:
shame = values.map{|v| "where.not(:col => '#{v}')" }.join(".")
eval("Model.#{shame}")
I'm not even doing this for an actual web application; I'm just using Rails for its model layer, so there aren't any real security concerns for me. But this is an awful fix, and I felt obligated to post this question.
Your first two snippets do not work because joining the conditions with OR makes the where clause true for every row. For example, if col is '5', then col != '5' is false, but col != '6' and col != '7' are both true, so the clause evaluates to false OR true OR true, which is true.
I think in this case you can use the NOT IN clause instead, as follows:
not_allowed = ['1','2', '3']
Model.where('col not in (?)', not_allowed)
This will return all records except the ones where col matches any of the elements in your array.
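The difference between the OR version and NOT IN can be checked against a throwaway database. The following is a minimal sketch in Python using the standard sqlite3 module rather than ActiveRecord; the table name models and the sample values are assumptions made up for the illustration.

```python
import sqlite3

# Hypothetical table and values, chosen only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (col TEXT)")
conn.executemany(
    "INSERT INTO models (col) VALUES (?)",
    [(str(n),) for n in range(1, 10)],
)

not_allowed = ["5", "6", "7"]
placeholders = ", ".join("?" for _ in not_allowed)


def fetch(where_sql):
    """Return the col values matching where_sql, in insertion order."""
    query = f"SELECT col FROM models WHERE {where_sql} ORDER BY rowid"
    return [row[0] for row in conn.execute(query, not_allowed)]


# Inequalities joined with OR: every row differs from at least one value.
or_rows = fetch(" OR ".join("col != ?" for _ in not_allowed))

# Inequalities joined with AND: equivalent to NOT IN.
and_rows = fetch(" AND ".join("col != ?" for _ in not_allowed))

# NOT IN with one placeholder per excluded value.
not_in_rows = fetch(f"col NOT IN ({placeholders})")

print(or_rows)      # all nine values come back
print(and_rows)     # 5, 6 and 7 are excluded
print(not_in_rows)  # same rows as the AND version
```

So the original approach would also have worked had the conditions been joined with " AND " instead of " OR "; NOT IN is simply the more compact way to write it.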
I am trying to extract the document vectors to feed into a regression model for prediction.
I fed around 1,400,000 labelled sentences into doc2vec for training, but I was only able to retrieve 10 vectors using model.docvecs.
This is a snapshot of the labelled sentences I used to train the doc2vec model:
In : documents[0]
Out: TaggedDocument(words=['descript', 'yet'], tags='0')
In : documents[-1]
Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')
This is the code used to train the doc2vec model:
model = gensim.models.Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples =len(documents), epochs=1)
This is the shape of the document vectors:
In : model.docvecs.doctag_syn0.shape
Out: (10, 100)
On which part of the code did I mess up?
Update:
Following up on the comment from sophros, it appears that I made a mistake when creating the TaggedDocument objects prior to training, which resulted in 1.4 million documents appearing as 10 documents.
Courtesy of Irene Li and her tutorial on Doc2vec, I made some slight edits to the code she used to generate TaggedDocument:
import re

import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

def get_doc(data):
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = stopwords.words('english')
    p_stemmer = PorterStemmer()
    taggeddoc = []
    texts = []
    for index, i in enumerate(data):
        # for tagged doc
        wordslist = []
        tagslist = []
        i = str(i)
        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]
        # remove numbers
        number_tokens = [re.sub(r'[\d]', ' ', i) for i in stopped_tokens]
        number_tokens = ' '.join(number_tokens).split()
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in number_tokens]
        # remove empty
        length_tokens = [i for i in stemmed_tokens if len(i) > 1]
        # add tokens to list
        texts.append(length_tokens)
        # note: the tag is passed here as a plain string, not a list
        td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
        taggeddoc.append(td)
    return taggeddoc
The mistake was fixed when I made the change from
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
to this
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),[str(index)])
It appears that the tags of a TaggedDocument must be given as a list for TaggedDocument to work properly. For more details as to why, please refer to this answer by gojomo.
The gist of the error was: the tags for each individual TaggedDocument were being provided as plain strings, like '101' or '456'.
But, tags should be a list-of-separate tags. By providing a simple string, it was treated as a list-of-characters. So '101' would become ['1', '0', '1'], and '456' would become ['4', '5', '6'].
Across any number of TaggedDocument objects, there were thus only 10 unique tags, single digits ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']. Every document just caused some subset of those tags to be trained.
Correcting tags to be a list of one tag, e.g. ['101'], allows '101' to be seen as the actual tag.
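The effect can be reproduced without gensim. The sketch below is only an illustration in plain Python: the (words, tags) tuples are hypothetical stand-ins for TaggedDocument objects, and unique_tags is a made-up helper that iterates over each tags value the way any consumer expecting a list would.

```python
def unique_tags(tagged):
    """Collect the distinct tags seen when iterating over each `tags` value."""
    seen = set()
    for _words, tags in tagged:
        for tag in tags:  # iterating over a str yields its characters
            seen.add(tag)
    return seen


# Hypothetical (words, tags) pairs standing in for TaggedDocument objects.
string_tagged = [(["w"], str(i)) for i in range(1000)]    # tags='0', '1', ...
list_tagged = [(["w"], [str(i)]) for i in range(1000)]    # tags=['0'], ['1'], ...

print(list("101"))                      # ['1', '0', '1']
print(len(unique_tags(string_tagged)))  # 10 (only the digits '0' to '9')
print(len(unique_tags(list_tagged)))    # 1000 (one tag per document)
```

However many documents are passed in, string tags can never yield more than the ten digit characters, which matches the (10, 100) shape seen in the question.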
I'm trying to get the sum of a particular column.
I have an orders schema with a total field that stores the total price.
Now I'm trying to create a query that will sum the total value of all the orders, but I'm not sure if I'm doing it right.
Here is what I have so far:
def create(conn, %{"statistic" => %{"date_from" => %{"day" => day_from, "month" => month_from, "year" => year_from}}}) do
  date_from = Ecto.DateTime.cast!({{year_from, month_from, day_from}, {0, 0, 0, 0}})
  revenue = Repo.all(from p in Order, where: p.inserted_at >= ^date_from, select: sum(p.total))
  render(conn, "result.html", revenue: revenue)
end
And I'm just calling it like <%= @revenue %> in the html.eex.
As of right now, it doesn't return errors; it just renders a random symbol on the page instead of the total revenue.
I think my query is wrong, but couldn't find good information about how to make it work properly. Any help appreciated, thanks!
Your query returns just 1 value, and Repo.all wraps it in a list. When you print a list using <%= ... %>, it treats integers inside the list as Unicode codepoints, and you get the character with that codepoint as output on the page. The fix is to use Repo.one instead, which will return the value directly, which in this case is an integer.
revenue = Repo.one(from p in Order, where: p.inserted_at >= ^date_from, select: sum(p.total))
@Dogbert's answer is correct. It is worth noting that if you are using Ecto 2.0 (currently in release candidate) then you can use Repo.aggregate/4:
revenue = Repo.aggregate(from(p in Order, where: p.inserted_at >= ^date_from), :sum, :total)
I am using OrderBy, and I have figured out that I have to call OrderBy last or it will not work. The Distinct operator does not guarantee that it will maintain the original order of values, and if I use Include, it cannot sort the child collection.
Is there any reason why I shouldn't always call OrderBy last, so that I don't have to worry about whether order is preserved?
Edit:
In general, is there any reason, such as a performance impact, why I should not call OrderBy last? It doesn't matter whether I use Entity Framework to query a database or just query some collection.
dbContext.EntityFramework.Distinct().OrderBy(o=> o.Something); // this will give me ordered result
dbContext.EntityFramework.OrderBy(o=> o.Something).Distinct(); // this will not, because Distinct doesn't preserve order.
Let's say that I want to select only one property.
dbContext.EntityFramework.Select(o=> o.Selected).OrderBy(o=> o.Something);
Will ordering be faster if I order the collection after selecting one property? In that case I should call OrderBy last. I am just asking whether there is any situation where ordering shouldn't be done as the last command.
Is there any reason why I shouldn't do OrderBy always last
There may be reasons to use OrderBy not as the last statement. For example, the sort property may not be in the result:
var result = context.Entities
.OrderBy(e => e.Date)
.Select(e => e.Name);
Or you want a sorted collection as part of the result:
var result = context.Customers
    .Select(c => new
    {
        Customer = c,
        Orders = c.Orders.OrderBy(o => o.Date),
        Address = c.Address
    });
Will order be faster if I order collection after one property selection?
Your examples show that you're working with LINQ to Entities, so the statements will be translated into SQL. You will notice that...
context.Entities
.OrderBy(e => e.Name)
.Select(e => e.Name)
... and ...
context.Entities
.Select(e => e.Name)
.OrderBy(s => s)
... will produce exactly the same SQL. So there is no essential difference between both OrderBy positions.
Doesn't matter if I use Entity Framework to query a database or just querying some collection.
Well, that does matter. For example, if you do...
context.Entities
.OrderBy(e => e.Date)
.Select(e => e.Name)
.Distinct()
... you'll notice that the OrderBy is completely ignored by EF and the order of names is unpredictable.
However, if you do ...
context.Entities
.AsEnumerable() // Continue as LINQ to objects
.OrderBy(e => e.Date)
.Select(e => e.Name)
.Distinct()
... you'll see that the sort order is preserved in the distinct result. LINQ to objects clearly has a different strategy than LINQ to Entities. OrderBy at the end of the statement would have made both results equal.
To sum it up, I'd say that as a rule of thumb, try to order as late as possible in a LINQ query. This will produce the most predictable results.
I don't know if you misunderstood the meaning of Distinct. According to its definition, it:
Returns distinct elements from a sequence by using the default equality comparer to compare values.
So if you have a list of int and you want to remove repeated values, you use Distinct. Distinct uses the default equality comparer, and in LINQ to Objects it compares each element against the elements it has already returned, so you do not have to sort first for the duplicates to be removed.
As for the OrderBy method, it does the actual sorting. So if you want to sort something and remove duplicates, either order of calls works on an in-memory list:
List<int> myNumbers = new List<int>{ 102, 2817, 82, 2, 1, 2, 1, 9, 4 };
Sorting and removing duplicated numbers
// returns 1, 2, 4, 9, 82, 102, 2817
var sortedUniques = myNumbers.OrderBy(n => n).Distinct();
Removing duplicated numbers and sorting
// returns 1, 2, 4, 9, 82, 102, 2817
// Distinct removes the repeated 2 and 1 before OrderBy sorts what is left
var sortedUniques = myNumbers.Distinct().OrderBy(n => n);
Just removing duplicated numbers
// in practice returns 102, 2817, 82, 2, 1, 9, 4 (first occurrences, in their original order)
var uniques = myNumbers.Distinct();
Just sorting
// returns 1, 1, 2, 2, 4, 9, 82, 102, 2817
var sorted = myNumbers.OrderBy(n => n);
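For an in-memory list, the four combinations can be checked with a rough Python analogue. This is only a sketch of the same idea, not LINQ itself: dict.fromkeys stands in for Distinct (it keeps the first occurrence of each value, in order, on Python 3.7+) and sorted stands in for OrderBy.

```python
my_numbers = [102, 2817, 82, 2, 1, 2, 1, 9, 4]

# Remove duplicates only: first occurrences, original order kept.
uniques = list(dict.fromkeys(my_numbers))

# Sort only: duplicates stay in the result.
sorted_all = sorted(my_numbers)

# Sort first, then remove duplicates.
sorted_then_unique = list(dict.fromkeys(sorted(my_numbers)))

# Remove duplicates first, then sort.
unique_then_sorted = sorted(dict.fromkeys(my_numbers))

print(uniques)             # [102, 2817, 82, 2, 1, 9, 4]
print(sorted_all)          # [1, 1, 2, 2, 4, 9, 82, 102, 2817]
print(sorted_then_unique)  # [1, 2, 4, 9, 82, 102, 2817]
print(unique_then_sorted)  # [1, 2, 4, 9, 82, 102, 2817]
```

Both orders give the same result in memory. The difference discussed in this thread only shows up when the query is translated to SQL, where a sort applied before DISTINCT may be dropped.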
I hope it helps you \o/
I need to find and update a number of records in a Rails 3.2, Ruby 2 application. The following code successfully finds the records I want. What I need to do though is add " x" (including the space) to the email address of every user and I can't figure out how to do it.
This finds the records
User.joins(:account)
.where("users.account_id NOT IN (?)", [1955, 3083, 3869])
.where("accounts.partner_id IN (?)", [23,50])
.where("users.staff = '0'")
.where("users.admin = '0'")
.where("users.api_user = '0'")
.where("users.partner_id is null")
.update_all(email: :email.to_s << " X")
but it's the last line I'm having problems with. Is this possible, or do I need to find the records another way?
The update_all method updates a collection of records, but unless you write your own SQL expression, it can only set one value. For example, if you wanted to overwrite all the email addresses with "X", you could do it easily:
User.joins(:account)
.where("users.account_id NOT IN (?)", [1955, 3083, 3869])
# ...other scopes...
.update_all(email: "X")
In your case, what you really need to do is make individual updates to all these records. One way to do it is to find the records, then loop over them and update them one at a time:
users_to_update = User.joins(:account)
.where("users.account_id NOT IN (?)", [1955, 3083, 3869])
.where("accounts.partner_id IN (?)", [23,50])
.where("users.staff = '0'")
.where("users.admin = '0'")
.where("users.api_user = '0'")
.where("users.partner_id is null")
users_to_update.each do |user|
user.update_attribute(:email, "#{user.email} X")
end
Another solution would be to use a SQL expression with update_all, as in Zoran's answer.
Try writing the last line like so:
.update_all("email = email || ' X'")
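This uses SQL's string concatenation operator to append the X to the end of the emails. Note that || concatenates strings in PostgreSQL, SQLite and Oracle, but in MySQL it is a logical OR by default, so there you would write "email = CONCAT(email, ' X')" instead.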
Hope that helps!
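To see what the single-statement version does at the SQL level, here is a small sketch using Python's standard sqlite3 module in place of Rails. The users table, its columns, and the sample rows are assumptions invented for the illustration; only the email = email || ' X' expression comes from the answer above.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, staff TEXT)")
conn.executemany(
    "INSERT INTO users (email, staff) VALUES (?, ?)",
    [("a@example.com", "0"), ("b@example.com", "1"), ("c@example.com", "0")],
)

# One UPDATE statement: each row's new email is computed from its own old value.
cursor = conn.execute(
    "UPDATE users SET email = email || ? WHERE staff = '0'",
    (" X",),
)
updated = cursor.rowcount

emails = [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]
print(updated)  # 2
print(emails)   # ['a@example.com X', 'b@example.com', 'c@example.com X']
```

The database does the per-row work in one statement, which avoids loading and saving each record individually. The trade-off in Rails is that update_all skips validations and callbacks.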
In the Sequel ORM for Ruby, the Dataset class has an all method which produces an Array of row hashes: each row is a Hash with column names as keys.
For example, given a table T:
a b c
--------------
0 22 "Abe"
1 35 "Betty"
2 58 "Chris"
then:
ds = DB['select a, b, c from T']
ah = ds.all # Array of row Hashes
should produce:
[{"a":0,"b":22,"c":"Abe"},{"a":1,"b":35,"c":"Betty"},{"a":2,"b":58,"c":"Chris"}]
Is there a way built in to Sequel to instead produce an Array of row Arrays, where each row is an array of only the values in each row in the order specified in the query? Sort of how select_rows works in ActiveRecord? Something like this:
aa = ds.rows # Array of row Arrays
which would produce:
[[0,22,"Abe"],[1,35,"Betty"],[2,58,"Chris"]]
Note: the expression:
aa = ds.map { |h| h.values }
produces an array of arrays, but the order of values in the rows is NOT guaranteed to match the order requested in the original query. In this example, aa might look like:
[["Abe",0,22],["Betty",1,35],["Chris",2,58]]
Old versions of Sequel (pre 2.0) had the ability in some adapters to return arrays instead of hashes. But it caused numerous issues, nobody used it, and I didn't want to maintain it, so it was removed. If you really want arrays, you need to drop down to the connection level and use a connection specific method:
DB.synchronize do |conn|
rows = conn.exec('SQL Here') # Hypothetical example code
end
The actual code you need will depend on the adapter you are using.
DB[:table].where().select_map(:id)
If you want just an array of array of values...
DB['select * from T'].map { |h| h.values }
seems to work
UPDATE given the updated requirement of the column order matching the query order...
cols= [:a, :c, :b]
DB[:T].select{cols}.collect{ |h| cols.collect {|c| h[c]}}
not very pretty but guaranteed order is the same as the select order.
There does not appear to be a builtin to do this.
You could make a request for the feature.
I haven't yet found a built-in method to return an array of row arrays where the values in the row arrays are ordered by the column order in the original query. The following function does* although I suspect an internal method could be more efficient:
def rows(ds)
  column_keys = ds.columns # guaranteed to match query order?
  # Build one array per row, picking values in column_keys order
  ds.map do |row_hash|
    column_keys.map { |column_key| row_hash[column_key] }
  end
end
*This function depends on the order of the array returned by Dataset.columns. If this order is undefined, then this rows function isn't very useful.
Have you tried this?
ds = DB['select a, b, c from T'].to_a
I'm not sure if it works, but give it a shot.