Searching through a CSV with a variable number of columns - Ruby

I could use some help thinking about a puzzle I'm solving. I was going to tackle it in Ruby, but it could be in another language like JavaScript/Node. I need help breaking down the problem and designing a solution.
I'm currently working on a command-line program that reads in a CSV, searches the CSV based on the arguments, and then produces output based on what it finds.
The CSV rows have one of two formats. One is a simple list of restaurants, food items, and their prices:
restaurant ID, price, item label
But for restaurants offering combo meals, where there can be any number of items in a value meal, the format is:
restaurant ID, price, item 1 label, item 2 label, ...
So the idea is that you run this program with the CSV file to read and the food items you want to eat as arguments, and it outputs the restaurant you should go to and the total price it will cost. It is okay to purchase extra items, as long as the total cost is minimized.
Sample data.csv
1, 4.00, burger
1, 8.00, tofu_log
2, 5.00, burger
2, 6.50, tofu_log
$ foodfinder.rb data.csv burger tofu_log
=> 2, 11.5
Likewise with the rows that have multiple food items:
5, 4.00, extreme_fajita
5, 8.00, fancy_european_water
6, 5.00, fancy_european_water
6, 6.00, extreme_fajita, jalapeno_poppers, extra_salsa
$ foodfinder.rb data.csv fancy_european_water extreme_fajita
=> 6, 11.0
Since data normalization isn't an option (I can't shove these into a DB), I was wondering how I might go about parsing the CSV in an efficient way. The fact that some rows have multiple food items also has me unsure how to store those. I'm guessing I'd want to import the rows into a hash and then search through the hash in some fashion. Any guidance, wizards?

With Ruby, I'd skip the standard CSV libraries and just load the rows, split each one into at most three pieces, and convert the third piece into an array. From that point you have all you need:
records = File.readlines('data.csv', chomp: true).map { |row|
  row.split(/,\s?/, 3)
}.map { |arr|
  [arr[0].to_i, arr[1].to_f, arr[2].split(/,\s?/)]
}
Now your records will be:
[
  [5, 4.00, ["extreme_fajita"]],
  [5, 8.00, ["fancy_european_water"]],
  [6, 5.00, ["fancy_european_water"]],
  [6, 6.00, ["extreme_fajita", "jalapeno_poppers", "extra_salsa"]]
]
You can now apply your knowledge of NP-complete problems (this is essentially a minimum-cost set-cover search) to data that is already in the form you need.
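To make that concrete, here is a minimal brute-force sketch built on the records array above (the cheapest_restaurant name and the overall approach are my own illustration, not something from the question): group the rows by restaurant, try every combination of that restaurant's rows, and keep the cheapest combination that covers all the wanted items.
def cheapest_restaurant(records, wanted)
  best = nil
  records.group_by { |id, _price, _items| id }.each do |id, rows|
    (1..rows.size).each do |n|
      rows.combination(n).each do |combo|
        covered = combo.flat_map { |_id, _price, items| items }
        next unless (wanted - covered).empty?   # combo must cover every wanted item
        total = combo.inject(0.0) { |sum, (_id, price, _items)| sum + price }
        best = [id, total] if best.nil? || total < best[1]
      end
    end
  end
  best
end

cheapest_restaurant(records, ['fancy_european_water', 'extreme_fajita'])
# => [6, 11.0]
This is exponential in the number of rows per restaurant, which is fine for menu-sized data; for anything bigger you would want a smarter set-cover strategy.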

That data is easy to parse with the standard CSV library and a tiny bit of array wrangling:
require 'csv'

data = CSV.open('data.csv')
          .map { |r| [r[0], r[1], r[2..-1].map(&:strip)] }
That gives you this in data:
data = [
  ['5', '4.00', ['extreme_fajita']],
  ['5', '8.00', ['fancy_european_water']],
  #...
]
From there it is easy to build whatever indexed structure you need.
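For example, one handy index (purely my own illustration, not something this answer requires) is a hash keyed by item label, so you can jump straight to the rows that offer a given item:
by_item = Hash.new { |h, k| h[k] = [] }
data.each do |id, price, items|
  items.each { |item| by_item[item] << [id, price, items] }
end
by_item['extreme_fajita']   # => the rows for restaurants 5 and 6 that include extreme_fajita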
However, if you're just interested in finding rows with 'extra_salsa' then use select instead of map:
want = CSV.open('x.csv')
          .select { |r| r[2..-1].map(&:strip).include?('extra_salsa') }
and clean up want for printing however you need to.
You're going to be spinning through the whole CSV every time your script runs, so you should search it while you're scanning it; building intermediate indexed data structures is just a waste of time if you're only doing one search per run.

Related

Performance-wise for Lua table selection

I'm a bit new to Lua. I have a game where I need to capture entities and insert them into a table. The maximum number of entities that can exist at the same time is 14, so I read that an array-based solution is good.
But I noticed that the table size keeps incrementing even if I delete some values. For example, with 10 values in the table, if I delete the value at index 9, the size doesn't automatically shift down when I go to insert value number 11.
Example:
local Table = {"hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello"}
-- Current Table size = 10
-- Perform delete at index 9
Table[9] = nil
-- Have new Entity to insert
Table[#Table + 1] = "New Value"
-- The table size keeps growing as the game goes on.
So for this kind of situation, will an array-based table with nil holes inside, which keeps growing as new values are inserted, perform better, or should I move to a keyed table?
Or should I just stick with the array-based table and perform a full cleanup whenever the table isn't in use?
If you set an element in a table to nil, then that just stays there as a "hole" in your array.
tab = {1, 2, 3, 4}
tab[2] = nil
-- tab == {1, nil, 3, 4}
-- #tab is actually undefined and could be either 1 or 4 (or something completely unexpected)!
What you need to do is set the field to nil, then shift all the following fields to fill that hole. Luckily, Lua has a function for that, which is table.remove(table, index).
tab = {1, 2, 3, 4}
table.remove(tab, 2)
-- tab == {1, 3, 4}
-- #tab == 3
Keep in mind that this can get very slow as there's lots of memory access involved, so don't go applying this solution when you have a few million elements some day :)
While table.remove(Table, 9) will do the job in your case (removing the field from the "array" table and shifting the remaining fields to fill the hole), you should first consider using a "set" table instead.
If you:
- often remove/add elements
- don't care about their order
- often check if table contains a certain element
then the "set" table is your choice. Use it like so
local tab = {
["John"] = true,
["Jane"] = true,
["Bob"] = true,
}
Your elements will be stored as indices in a table.
Remove an element with
tab["Jane"] = nil
Test if table contains an element with
if tab["John"] then
-- tab contains "John"
Advantages compared to array table:
- this will eliminate performance overhead when removing an element because other elements will remain intact and no shifting is required
- checking whether an element exists in this table (which I assume is the main purpose of this table) is also faster than with an array table, because it no longer requires iterating over all the elements to find a match; a hash lookup is used instead
Note however that this approach doesn't let you have duplicate values as your elements, because tables can't contain duplicate keys. In that case you can use numbers as values to store the number of times an element appears in your set, e.g.
local tab = {
["John"] = 1,
["Jane"] = 2,
["Bob"] = 35,
}
Now you have 1 John, 2 Janes and 35 Bobs
https://www.lua.org/pil/11.5.html

RethinkDB: Can I group by fields between dates efficiently?

I'd like to group by multiple fields, between two timestamps.
I tried something like:
r.table('my_table').between(r.time(2015, 1, 1, 'Z'), r.now(), {index: "timestamp"}).group("field_a", "field_b").count()
Which takes a lot of time since my table is pretty big. I started thinking about using an index in the 'group' part of the query, but then I remembered it's impossible to use more than one index in the same ReQL query.
Can I achieve what I need efficiently?
You could create a compound index, and then efficiently compute the count for any of the groups without computing all of them:
r.table('my_table').indexCreate('compound', function(row) {
  return [row('field_a'), row('field_b'), row('timestamp')];
})

r.table('my_table').between(
  [field_a_val, field_b_val, r.time(2015, 1, 1, 'Z')],
  [field_a_val, field_b_val, r.now()],
  {index: 'compound'}
).count()

How to sort by two values in MongoDB-Ruby?

So I have a Ruby script where I find all the documents with the type "homework" in the "grades" collection of a "students" DB (MongoDB). Following these instructions:
http://api.mongodb.org/ruby/current/file.TUTORIAL.html
I try to sort by score and then by student id (or vice versa) with:
homeworks.sort(:score, 1).sort(:student_id, 1).to_a
Running the file ("mongo.rb") I get an output of homeworks sorted by score (ascending) but not by student ID (they're scrambled). If I switch the values, I get the array ordered by student_id (ascending) but not by score (in that case, the score values are scrambled).
How can I sort ascending by two arguments in Mongo using Ruby?
Per the documentation, try:
homeworks.sort([[:score, 1], [:student_id, 1]]).to_a
How about this:
c = db['grades']
x = c.find({}, {:sort=>[[:student_id, 1], [:score, 1]]}).to_a
This works for me in irb.
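Tying that back to the original question (a sketch only; it assumes the same driver usage as the answer above and the "homework" type field described in the question):
homeworks = db['grades'].find(
  { 'type' => 'homework' },
  { :sort => [[:score, 1], [:student_id, 1]] }
)
homeworks.each { |doc| puts "#{doc['student_id']}  #{doc['score']}" }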

Queries on ActiveRecord Association collection object

I have a set of rows which I've fetched from a table; let's call the object Rating. After fetching, I have, say, 100 entries from the database.
The Rating object might look like this:
table_ratings
t.integer :num
So what I now want to do is perform some calculations on these 100 rows without running any other queries. Right now I can only do this by running an additional query:
r = Rating.all
good = r.where('num = 2') # this runs an additional query
"#{good}/#{r.length}"
This is a very rough idea of what I want to do, and I have other, more complex output to build. Imagine I have over 10 different calculations to perform on these rows, so aliasing these in a SQL query might not be the best idea.
What would be an efficient way to replace the r.where query above? Is there a Ruby method or a gem which would give me a similar query API on ActiveRecord association collection objects?
Rating.all returns an array of all Rating objects. From there, shift your focus to selecting and mapping from the array, e.g.:
@good_ratings = r.select { |x| x.num == 2 }
See:
http://www.ruby-doc.org/core-1.9.3/Array.html
ADDITIONAL COMMENTS:
Going over the list of methods available for array, I find myself using the following frequently:
To check a variable against multiple values:
%w[dog cat llama].include?(@pet_type) # returns true if @pet_type == 'cat'
To create another array(map and collect are aliases):
%w[dog cat llama].map { |pet| pet.capitalize } # ["Dog", "Cat", "Llama"]
To sort and drop duplicates:
%w[dog cat llama dog].sort.uniq # ["cat", "dog", "llama"]
<< to add an element, + to concatenate arrays, flatten to flatten nested arrays into a single-level array, count/length/size for the number of elements, and join are the others I tend to use a lot.
Finally, here is an example of join:
%w[dog cat llama].join(' or ') # "dog or cat or llama"
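Putting those pieces together for the original Rating example (a sketch only; the num == 2 condition and the ratio output come from the question, everything else is illustrative):
ratings = Rating.all.to_a                     # one query; everything is now in memory
good    = ratings.count { |r| r.num == 2 }    # plain Array#count, no extra query
by_num  = ratings.group_by(&:num)             # e.g. { 1 => [...], 2 => [...] } for further calculations
puts "#{good}/#{ratings.length}"
Each additional calculation is then just another select/count/group_by over the same in-memory array.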

MongoDB ranged pagination

It's said that using skip() for pagination in a MongoDB collection with many records is slow and not recommended.
Ranged pagination (based on a > _id comparison) can be used instead:
db.items.find({_id: {$gt: ObjectId('4f4a3ba2751e88780b000000')}});
It's good for displaying prev. & next buttons, but it's not very easy to implement when you want to display actual page numbers like 1 ... 5 6 7 ... 124: you need to pre-calculate which "_id" each page starts from.
So I have two questions:
1) When should I start worrying about this? When are there "too many records", with a noticeable slowdown for skip()? 1,000? 1,000,000?
2) What is the best approach to show links with actual page numbers when using ranged pagination?
Good question!
"How many is too many?" - that, of course, depends on your data size and performance requirements. I, personally, feel uncomfortable when I skip more than 500-1000 records.
The actual answer depends on your requirements. Here's what modern sites do (or, at least, some of them).
First, navbar looks like this:
1 2 3 ... 457
They get the final page number from the total record count and the page size. Let's jump to page 3. That will involve some skipping from the first record. When the results arrive, you know the id of the first record on page 3.
1 2 3 4 5 ... 457
Let's skip some more and go to page 5.
1 ... 3 4 5 6 7 ... 457
You get the idea. At each point you see the first, last, and current pages, plus two pages forward and backward from the current page.
Queries
var current_id; // id of the first record on the current page

// go to page current+N
db.collection.find({_id: {$gte: current_id}}).
  skip(N * page_size).
  limit(page_size).
  sort({_id: 1});

// go to page current-N
// note that due to the nature of skipping back,
// this query will get you records in reverse order
// (last records on the page being first in the result set)
// You should reverse them in the app.
db.collection.find({_id: {$lt: current_id}}).
  skip((N - 1) * page_size).
  limit(page_size).
  sort({_id: -1});
It's hard to give a general answer because it depends a lot on what query (or queries) you are using to construct the set of results that are being displayed. If the results can be found using only the index and are presented in index order then db.dataset.find().limit().skip() can perform well even with a large number of skips. This is likely the easiest approach to code up. But even in that case, if you can cache page numbers and tie them to index values you can make it faster for the second and third person that wants to view page 71, for example.
In a very dynamic dataset where documents will be added and removed while someone else is paging through data, such caching will become out-of-date quickly and the limit and skip method may be the only one reliable enough to give good results.
I recently encountered the same problem when trying to paginate a request on a field that isn't unique, for example "FirstName". The idea of this query is to implement pagination on a non-unique field without using skip().
The main problem is querying on a field that is not unique ("FirstName"), because the following happens:
{"FirstName": {$gt: "Carlos"}} -> this skips all the records where the first name is "Carlos"
{"FirstName": {$gte: "Carlos"}} -> always returns the same set of data
Therefore the solution I came up with was to make the $match portion of the query unique by combining the targeted search field with a secondary field, so that it becomes a unique search.
Ascending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$gt: 'Carlos'}}]}},
{$sort: {'FirstName': 1, '_id': 1}},
{$limit: 10}
])
Descending order:
db.customers.aggregate([
{$match: { $or: [ {$and: [{'FirstName': 'Carlos'}, {'_id': {$gt: ObjectId("some-object-id")}}]}, {'FirstName': {$lt: 'Carlos'}}]}},
{$sort: {'FirstName': -1, '_id': 1}},
{$limit: 10}
])
The $match part of this query is basically behaving as an if statement:
if firstName is "Carlos" then it needs to also be greater than this id
if firstName is not equal to "Carlos" then it needs to be greater than "Carlos"
The only problem is that you cannot navigate to a specific page number (it can probably be done with some code manipulation), but other than that it solved my problem with pagination on non-unique fields without having to use skip, which eats a lot of memory and processing power as you get toward the end of whatever dataset you are querying.
