Identifying duplicates in specific CSV output - ruby

Ruby newbie here. I've got a product CSV where the first column is a unique SKU and the second column is a product ID that can be duplicated across multiple products (plus many other columns, but these are the pertinent ones). Like:
SKU | Prod ID
99 | 10384
100 | 10385
101 | 10385
102 | 10386
103 | 10386
104 | 10387
In the script I'm writing, the first row where a product ID appears becomes a 'parent', and any subsequent rows with that product ID get treated differently (i.e., as different sizes).
Currently I'm reading in the whole CSV rather than processing it line by line with foreach, as I assumed I'd need all the data available to find the duplicates.
The issue is that I'm not sure how to identify the first time a product ID is used and then detect any further instances of its use.
My first thought was to somehow identify the duplicates (uniq?), then create a new column and put a 1 in it if this is the first time the product ID has occurred and a 0 if it has occurred previously. After looking at uniq, I'm not sure how I then go back to the main list and mark my 1s and 0s.
Can someone please point me in the direction of the classes/methods I need to be looking at?
Thanks,
Liam
Edit for John D: This gives me the hashes, but in a 1:1 format rather than one prod ID mapped to all of its instances:
CSV.foreach(INPUT, :headers => true, :header_converters => :symbol, :col_sep => "|", :quote_char => "\x00") do |csv_obj|
  items[csv_obj.fields[0]] = [csv_obj.fields[1]]
end
which gives:
"230709"=>["88507"], "109064"=>["9019"]

You're thinking of the Sku as the unique identifier, which it may in fact be. But if you turn that on its head and think of the ProductID as the unique identifier, then you can build a Hash where the key is the ProductID and the value is an Array of Skus. Then you'll be able to track which Skus are associated with which ProductID.
Of course you'll read this in some other way, but the end result would be similar to:
products = {
  10384 => [99],
  10385 => [100, 101],
  10386 => [102, 103],
  10387 => [104]
}
Here's an example of how to construct this Hash:
#!/usr/bin/env ruby
require 'csv'

source = [
  "99|10384",
  "100|10385",
  "101|10385",
  "102|10386",
  "103|10386",
  "104|10387"
].join("\n")

source = CSV.parse(source, :col_sep => "|")

hh = source.inject({}) do |memo, row|
  sku = row[0]
  prod = row[1]
  memo[prod] = [] unless memo.include?(prod)
  memo[prod] << sku
  memo
end

puts hh
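Once hh is built, the parent/variant distinction from the question falls out of it: the first Sku stored for each ProductID is the parent, and anything after it is a subsequent size. A rough sketch, just working from the hh above:

parents  = hh.map { |prod, skus| [prod, skus.first] }
variants = hh.flat_map { |prod, skus| skus.drop(1).map { |sku| [prod, sku] } }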

.group_by is relatively new in core Ruby (though it has an older counterpart in Rails), but it's awfully convenient and should do most of your heavy lifting.
If you create a class to hold each row and put them in an Array, then you can call the group_by method with a block that just checks each object's Product ID field.
That gives you a Hash, which you can iterate through with .keys.each.
Assuming a whole bunch of things about your program that are hopefully semi-obvious, something like:
transactionHash = transactions.group_by { |x| x.productId }
Then, you can go through your transaction lists per product with:
transactionHash.each do |prodId, transList|
  # transList has all of your transaction objects per product
end
Again, that assumes you're keeping your transactions in a list of objects. The x.productId would be something like x[1] if you store each transaction in an array, for example.
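Putting that together with the CSV from the question, a rough sketch of the group_by approach on plain rows (INPUT and the pipe separator are borrowed from the question's own snippet):

require 'csv'

rows = CSV.read(INPUT, :headers => true, :col_sep => "|")
by_product = rows.group_by { |row| row[1] }   # row[1] is the Prod ID column

by_product.each do |prod_id, product_rows|
  parent, *variants = product_rows
  # parent is the first row seen for this Prod ID; variants are the later ones (other sizes)
end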

Related

How to sort what you've sorted from a .csv file

How can I unique-sort a .csv file first by ID, then by PRICE, and then, if possible, by DATE? Here is an example file:
"Date","other","Id","other","Price"
"01/01/2016","3","6452312546-232","a","4.5"
"01/03/2016","1","4375623416-345","b","56.25"
"01/03/2016","6","4375623416-345","c","0"
"01/03/2016","5","4375623416-345","d","0"
"02/01/2016","4","6452312546-232","e","34.21"
I want the output sorted so that everything is grouped by ID first; within each group, the rows are sorted by PRICE, and then the PRICE-sorted groups are ordered by the most recent date in each group. So I'd get this as output:
"Date","other","Id","other","Price"
"02/01/2016","4","6452312546-232","e","34.21"
"01/01/2016","3","6452312546-232","a","4.5"
"01/03/2016","1","4375623416-345","b","56.25"
"01/03/2016","6","4375623416-345","c","0"
"01/03/2016","5","4375623416-345","d","0"
Is that clear? Let me know if it's not.
Assuming this CSV file is small enough to safely load into memory, you can read the file into a hash and sort it from there.
require 'csv'
require 'date'

table = CSV.read('file.csv', headers: true).map(&:to_h)
sorted = table.sort_by { |row| [row["Id"], row["Price"].to_f, Date.parse(row["Date"])] }
Although, if you're initially sorting by ID you're going to lose any granularity of the price and date fields further down the line.
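If you do need the grouped ordering from the example output (rows grouped by Id, price descending within each group, groups ordered by their most recent date), here is a rough sketch along the same lines, assuming the dates are US-style mm/dd/yyyy:

require 'csv'
require 'date'

table = CSV.read('file.csv', headers: true).map(&:to_h)

groups = table.group_by { |row| row["Id"] }
ordered_groups = groups.sort_by { |_id, rows|
  rows.map { |r| Date.strptime(r["Date"], "%m/%d/%Y") }.max   # most recent date in the group
}.reverse
sorted = ordered_groups.flat_map { |_id, rows| rows.sort_by { |r| -r["Price"].to_f } }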

Which is the most used name?

I am working on a Ruby on Rails site and I want to check its database for the most frequent first name among the registered users.
There is a column called "First Name" that I will go through. I don't care about case sensitivity right now.
Is there a convenient way to check, for example, which is the most popular name, then the second most popular, the third most popular, and so on?
What I thought of is to get all the users in an array, do @users.each do |user|, record the names in another array, and then count the duplicates of each name that appears more than once. I am not sure if it's the proper way though.
Here is how you can do it using ActiveRecord:
User.group(:first_name).order('popularity desc').pluck(:first_name, 'count(*) as popularity')
This code translates to the SQL:
SELECT "users.first_name", count(*) as popularity FROM "users"
GROUP BY first_name
ORDER BY popularity
and you get something like:
[["John", 2345], ["James", 1986], ["Sam", 1835], ...]
If you want only the top ten names, you can limit the number of results simply by adding limit:
User.group(:first_name).order('popularity desc').limit(10).pluck(:first_name, 'count(*) as popularity')
Another option is to use the count API:
User.group(:first_name).count
=> {"Sam" => 1835, "Stefanos" => 2, ...}
# ordered
User.group(:first_name).order('count_all desc').count
=> {"John" => 2345, "James" => 1986, "Sam" => 1835, ...}
# top 3
User.group(:first_name).order('count_all desc').limit(3).count
=> {"John" => 2345, "James" => 1986, "Sam" => 1835 }
You could do the following SQL statement
select count(*) as count from users group by users.first_name order by count desc
This will return the names with the highest counts at the top. As Boris said, using just SQL is the right way to go here.
Otherwise, if you want to load all the users anyway, you could do it in memory with group_by:
@users.group_by(&:first_name).sort_by { |_name, users| users.count }.reverse
This gives you an array of [name, users] pairs sorted descending by how many users share each first name.
Another way using ActiveRecord:
User.group(:first_name).count
Generated SQL is:
SELECT COUNT(*) AS count_all, first_name AS first_name FROM `users` GROUP BY first_name
This will output a hash of { first_name => number_of_occurrences }, e.g.
{"John" => 29, "Peter" => 87, "Sarah" => 2}

Using Ruby to tag records that contain repeat phrases in a table

I'm trying to use Ruby to 'tag' records in a CSV table, based on whether or not a particular field contains a certain phrase that is repeated. I'm not sure if there are libraries to assist with this kind of job, and I recognize that Ruby might not be the most efficient language to do this sort of thing.
My CSV table contains a unique ID and a text field that I want to search:
ID,NOTES
1,MISSING DOB; ID CANNOT BE BLANK
2,INVALID MEMBER ID - unable to verify
3,needs follow-up
4,ID CANNOT BE BLANK-- additional info needed
From this CSV table, I've extracted keywords and assigned them a tag, which I've stored in another CSV table.
PHRASE,TAG
MISSING DOB,BLANKDOB
ID CANNOT BE BLANK,BLANKID
INVALID MEMBER ID,INVALIDID
Note that the NOTES column in my source contains punctuation and other phrases in addition to the phrases I have identified and want to map. Additionally, not all records have phrases that will match.
I want to create a table that looks something like this:
ID, TAG
1, BLANKDOB
1, BLANKID
2, INVALIDID
4, BLANKID
Or, alternately with the tags delimited with another character:
ID, TAG
1, BLANKDOB; BLANKID
2, INVALIDID
4, BLANKID
I have loaded the mapping table into a hash, with the phrase as the key.
require 'csv'

phrase_hash = {}
CSV.foreach("phrase_lookup.csv") do |row|
  phrase, tag = row
  next if phrase == "PHRASE" # skip the header row
  phrase_hash[phrase] = tag
end
The keys of the hash are then the search phrases that I want to iterate through. I'm having trouble expressing what I want to do next in Ruby, but here's the idea:
Load the NOTES table into an array. For each phrase (i.e. key), select the records from the array that contain the phrase, gather the IDs associated with these rows, and output them with the associated tag for that phrase, as above.
Can anyone help?
I'll give you an example using hash inputs instead of CSV:
notes = { 1 => "MISSING DOB; ID CANNOT BE BLANK",
          2 => "INVALID MEMBER ID - unable to verify",
          3 => "needs follow-up",
          4 => "ID CANNOT BE BLANK-- additional info needed" }

tags = { "MISSING DOB"        => "BLANKDOB",
         "ID CANNOT BE BLANK" => "BLANKID",
         "INVALID MEMBER ID"  => "INVALIDID" }

output = {}
tags.each_pair do |tags_key, tags_value|
  notes.each_pair do |notes_key, notes_value|
    if notes_value.match(tags_key)
      output[notes_key] ||= []
      output[notes_key] << tags_value
    end
  end
end

puts output.map { |k, v| "#{k}, #{v.join("; ")}" }.sort
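To do the same thing straight from the CSV files, the loop keeps the same shape and only the input changes. A sketch, using the phrase_lookup.csv from the question and a hypothetical notes.csv for the ID/NOTES table:

require 'csv'

phrase_hash = {}
CSV.foreach("phrase_lookup.csv", headers: true) do |row|
  phrase_hash[row["PHRASE"]] = row["TAG"]
end

output = {}
CSV.foreach("notes.csv", headers: true) do |row|   # "notes.csv" is a made-up filename
  phrase_hash.each do |phrase, tag|
    (output[row["ID"]] ||= []) << tag if row["NOTES"].include?(phrase)
  end
end

puts output.map { |id, tags| "#{id}, #{tags.join("; ")}" }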

Queries on ActiveRecord Association collection object

I have a set of rows that I've fetched from a table; let's say the model is Rating. After fetching, I have, say, 100 entries from the database.
The Rating object might look like this:
table_ratings
t.integer :num
So what I now want to do is perform some calculations on these 100 rows without issuing any other queries. Right now the only way I know involves running an additional query:
r = Rating.all
good = r.where('num = 2') # this runs an additional query
"#{good}/#{r.length}"
This is a very rough idea of what I want to do, and I have other more complex output to build. Let's imagine I have over 10 different calculations I need to perform on these rows, and so aliasing these in a sql query might not be the best idea.
What would be an efficient way to replace the r.where query above? Is there a Ruby method or a gem which would give me a similar query api into ActiveRecord Association collection objects?
Rating.all returns an array of all Rating objects. From there, shift your focus to selecting and mapping from the array, e.g.:
@perfect_ratings = r.select { |x| x.a == 100 }
See:
http://www.ruby-doc.org/core-1.9.3/Array.html
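Applied to the original example, a sketch that issues only the one query and then works in memory (num is the attribute from the question):

ratings = Rating.all.to_a                    # single query; everything below is plain Ruby
good    = ratings.select { |r| r.num == 2 }
"#{good.length}/#{ratings.length}"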
ADDITIONAL COMMENTS:
Going over the list of methods available for array, I find myself using the following frequently:
To check a variable against multiple values:
%w[dog cat llama].include?(@pet_type) # returns true if @pet_type == 'cat'
To create another array (map and collect are aliases):
%w[dog cat llama].map { |pet| pet.capitalize } # ["Dog", "Cat", "Llama"]
To sort and drop duplicates:
%w[dog cat llama dog].sort.uniq # ["cat", "dog", "llama"]
<< to add an element, + to add arrays, flatten to flatten embedded arrays into a single level array, count or length or size for number of elements, and join are the others I tend to use a lot.
Finally, here is an example of join:
%w[dog cat llama].join(' or ') # "dog or cat or llama"

Complex date find and inject

I am building a financial reporting app with Ruby on Rails. In the app I have monthly financial statement objects (with 'revenue' as an attribute). For any single financial statement, I want show (1) year-to-date revenue, and (2) last-year-to-date revenue (calendar years are fine). Each object also has a Date (not Datetime) attribute called 'month' (watch out for 'month' variable name vs. 'month' method name confusion...maybe I should change that variable name).
So...
I think I need to (1) 'find' the array of financial statements (i.e., objects) in the appropriate date range, then (2) sum the 'revenue' fields. My code so far is...
def ytd_revenue
  # Get all financial_statements for year-to-date.
  financial_statements_ytd = self.company.financial_statements.find(:all,
    :conditions => ["month BETWEEN ? and ?", "self.month.year AND month.month = 1",
                    "self.month.year AND self.month.month"])
  # Sum the 'revenue' attribute
  financial_statements_ytd.inject(0) { |sum, revenue| sum + revenue }
end
This does not break the app, but returns '0' which cannot be correct.
Any ideas or help would be appreciated!
This statement may do what you want:
financial_statements_ytd.inject(0) {|sum, statement| sum + statement.revenue }
You can also look into ActiveRecord's sum class method - you can pass in the field name and conditions to get the sum.
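For instance, a rough sketch of the sum approach, using the old-style :conditions the question already uses (Date.new(month.year, 1, 1) is just the start of the statement's calendar year):

def ytd_revenue
  company.financial_statements.sum(:revenue,
    :conditions => ["month BETWEEN ? AND ?", Date.new(month.year, 1, 1), month])
end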
What is the name of the field in the financial_statement object that holds the value you want?
Supposing the field name is amount, then just modify the inject statement to be:
financial_statements_ytd.inject(0) { |sum, statement| sum + statement.amount }
