Find phrases in a single document - ruby

I have a web app that allows users to upload text documents (of roughly 2,000-3,000 words), and a database table with about 50,000 phrases (stored as strings).
How can I most efficiently find out which phrases appear in each uploaded document? (i.e. is there anything better than brute-forcing it by checking each phrase separately?)
Ideally, by the time the page loads after an upload, the app should already know which phrases it found in that one document.
I'd prefer a solution in Ruby, but suggestions of other technologies or data structures would be a real help.

I don't know which database you're using, so here is a MySQL solution:
require 'mysql2'

content = File.read('/path/to/document.txt')
client = Mysql2::Client.new(:host => "localhost", :username => "root")
sql = "SELECT phrase FROM phrases ORDER BY LENGTH(phrase)"

appeared = client.query(sql, as: :array, stream: true).each.with_object([]) do |row, array|
  # Delete every occurrence of the phrase; gsub! returns nil when nothing matched
  array << row[0] if content.gsub!(%r[#{Regexp.escape(row[0])}]i, '')
end
The idea is to shrink the content after each match so that the next search will be faster.
DISCLAIMER: Not tested.
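As a hedged alternative sketch (not part of the answer above; the same table and column names are assumed): load the phrases once and scan the document with a combined Regexp.union pattern, so the document is scanned once per batch of phrases instead of once per phrase:
require 'mysql2'

content = File.read('/path/to/document.txt')
client  = Mysql2::Client.new(:host => "localhost", :username => "root")
phrases = client.query("SELECT phrase FROM phrases", as: :array).map { |row| row[0] }

found = phrases.each_slice(500).flat_map do |batch|
  pattern = Regexp.union(batch.map { |p| /#{Regexp.escape(p)}/i })
  content.scan(pattern) # matched substrings, in the document's casing
end.map(&:downcase).uniq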

Related

How do I lookup Industry by Symbol on Yahoo using ruby?

I am trying to get company information for a given symbol, and I have gotten quotes data using a wonderful 'yahoo-finance' gem, but now I need to get company's industry information, and can't find a way.
Any ideas?
Just add :industry to the list of fields you want returned. available_fields gives you the full list. E.g.,
require 'yahoo_finance'
stocks = YahooFinance::Stock.new(['AAPL'], [:industry, :sector])
# use stocks.available_fields to search for the fields that you want
results = stocks.fetch; nil
results['AAPL'][:industry]
# "Electronic Equipment"
results['AAPL'][:sector]
# "Consumer Goods"

DRY search every page of a site with nokogiri

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
require 'nokogiri'
require 'open-uri'

page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
  visited_links = [] # new array of what is visited
  np = Nokogiri::HTML(open(page + ml)) # load the first main_link
  visited_links.push(ml) # push the page we're on
  np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq # grab all links on this page pointing to the current domain
  main_links.push(np_links).compact.uniq # remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'

def crawl_site( starting_at, &each_page )
  files = %w[png jpeg jpg gif svg txt js css zip gz]
  starting_uri = URI.parse(starting_at)
  seen_pages = Set.new                      # Keep track of what we've seen

  crawl_page = ->(page_uri) do              # A re-usable mini-function
    unless seen_pages.include?(page_uri)
      seen_pages << page_uri                # Record that we've seen this
      begin
        doc = Nokogiri.HTML(open(page_uri)) # Get the page
        each_page.call(doc,page_uri)        # Yield page and URI to the block

        # Find all the links on the page
        hrefs = doc.css('a[href]').map{ |a| a['href'] }

        # Make these URIs, throwing out problem ones like mailto:
        uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact

        # Pare it down to only those pages that are on the same site
        uris.select!{ |uri| uri.host == starting_uri.host }

        # Throw out links to files (this could be more efficient with regex)
        uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }

        # Remove #foo fragments so that sub-page links aren't differentiated
        uris.each{ |uri| uri.fragment = nil }

        # Recursively crawl the child URIs
        uris.each{ |uri| crawl_page.call(uri) }

      rescue OpenURI::HTTPError # Guard against 404s
        warn "Skipping invalid link #{page_uri}"
      end
    end
  end

  crawl_page.call( starting_uri ) # Kick it all off!
end

crawl_site('http://phrogz.net/') do |page,uri|
  # page here is a Nokogiri HTML document
  # uri is a URI instance with the address of the page
  puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page (a short illustration follows below).
Use recursion to keep crawling every link on every page, bailing out if you've already seen the page.
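To make the URI.join point concrete (the example.com URLs are placeholders):
require 'uri'

base = 'http://example.com/articles/2013/index.html'
URI.join(base, '/about')     #=> #<URI::HTTP http://example.com/about>
URI.join(base, 'part2.html') #=> #<URI::HTTP http://example.com/articles/2013/part2.html>
URI.join(base, '../2012/')   #=> #<URI::HTTP http://example.com/articles/2012/>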
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search throughout the document for anchor tags being used as JavaScript buttons and parse the URL out of them.
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
nf.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward: the conditional inside map leaves nil entries in the resulting array. Use select or reject to reduce the set of links to those that meet your criteria, and then use map to transform them. In your case, pre-filtering with ^= in the CSS selector makes it even easier; a tiny before/after follows.
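A small sketch of that point, using the links node set from above:
# map with a trailing `if` leaves nil holes that then need compact:
links.map { |l| l['href'] if l['href'] =~ %r{\A/} }.compact.uniq

# filter first, then transform -- nothing to compact away:
links.select { |l| l['href'] =~ %r{\A/} }.map { |l| l['href'] }.uniq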
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
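A minimal sketch of that idea, assuming the sqlite3 gem and a made-up pages table:
require 'sqlite3'

db = SQLite3::Database.new('crawl.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS pages (
    href    TEXT UNIQUE,       -- the UNIQUE constraint stops repeat visits
    visited INTEGER DEFAULT 0
  )
SQL

# Queue a link; duplicates are silently ignored thanks to INSERT OR IGNORE
db.execute("INSERT OR IGNORE INTO pages (href) VALUES (?)", ['http://example.com/about'])

# Pull the next unvisited page and mark it done
href = db.get_first_value("SELECT href FROM pages WHERE visited = 0 LIMIT 1")
db.execute("UPDATE pages SET visited = 1 WHERE href = ?", [href]) if href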
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
Many sites use session IDs in the URL query string to identify the visitor. That ID can make every link look different if you start, stop, and start again, or if you're not returning the cookies the site sets, so you have to return cookies, and you have to figure out which query parameters are significant and which will throw off your code. Keep the significant ones and throw away the rest when you store the links for later parsing.
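A hedged sketch of stripping one such parameter before storing a link; PHPSESSID here is only an example of an insignificant parameter:
require 'uri'

uri    = URI.parse('http://example.com/page?id=42&PHPSESSID=abc123')
params = URI.decode_www_form(uri.query).reject { |key, _| key == 'PHPSESSID' }
uri.query = params.empty? ? nil : URI.encode_www_form(params)
uri.to_s #=> "http://example.com/page?id=42"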
Use an HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
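A rough sketch of the parallel-fetch idea with Typhoeus/Hydra (the URLs are placeholders and the database hand-off is elided):
require 'typhoeus'

urls  = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
hydra = Typhoeus::Hydra.new(max_concurrency: 10)

requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end

hydra.run # blocks until every queued request has finished

requests.each do |request|
  body = request.response.body
  # ...store `body` in the database for the parsing process to pick up
end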
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs, and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed and then banned, at which point your crawler's throughput drops to zero.
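A crude sketch of both points; it ignores the per-agent sections of robots.txt and uses a fixed one-second delay, so treat it only as a starting point:
require 'open-uri'
require 'uri'

# Collect Disallow paths from robots.txt (no User-agent handling here)
robots = begin
  URI.parse('http://example.com/robots.txt').read
rescue OpenURI::HTTPError
  ''
end
disallowed = robots.scan(/^Disallow:\s*(\S+)/i).flatten

url = URI.parse('http://example.com/private/page')
if disallowed.none? { |path| url.path.start_with?(path) }
  sleep 1 # simple throttle between requests
  html = url.read
end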
It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using Windows (like me), you might want to look into Anemone.

Easy way to get the LON LAT of a list of cities

I have a list of cities in CSV format for which I need the longitude and latitude.
This is my CSV
GeoSearchString,Title
"New York", "Manhatten"
"Infinite Loop 1, San Francisco", "Apple Headquarter"
Now I am looking for an easy way to get coordinates for those places in JSON format.
unless ARGV[0]
  puts 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]
  exit
end
Can be rewritten as:
abort 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]
I'd replace:
CSV.foreach(File.read(file), :headers => true) do |row|
results = []
csv.each do |row|
with:
results = []
CSV.foreach(File.read(file), :headers => true) do |row|
Be VERY careful with:
search = row['GeoSearchString'] rescue continue
title = row['Title'] rescue ''
A single in-line rescue is a loaded, very large caliber, gun pointed at your foot. You have two. In this particular case it might be safe, without unintended side effects, but in general you want to go there very carefully.
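A tiny illustration of how the modifier can hide a real bug:
row = { 'Title' => 'Apple Headquarter' }

title = rwo['Title'] rescue '' # typo: `rwo` raises NameError, which is silently swallowed
title                          #=> "" -- the bug never surfaces

# Safer: only fall back when the value is genuinely missing
title = row['Title'] || ''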
I came up with the following script (gist):
#!/usr/bin/env ruby
# encoding: utf-8
require 'geocoder'
require 'csv'
require 'json'
require 'yaml'

abort 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]

def main(file)
  puts "Loading file #{file}"
  csv = CSV.parse(File.read(file), :headers => true)
  results = []
  CSV.foreach(File.read(file), :headers => true) do |row|
    # Hacky way to skip the current search string if no result is found
    search = row['GeoSearchString'] rescue continue
    # The title is optional
    title = row['Title'] rescue ''
    geo = Geocoder.search(search).first
    if geo
      results << {search: search, title: title, lon: geo.longitude, lat: geo.latitude}
    end
  end
  puts JSON.pretty_generate(results)
end

main ARGV[0]
If you want your own database of cities, states, and other interesting areas (let's call them places), you can get one for free from the US Geological Survey website. It's called their Topical Gazetteer, and it has a massive number of "places" along with geocodes. You can get a full national file, which is 80MB, or just the Populated Places file, which is 8MB. Additionally, you can download just the states you are interested in.
These places include:
Populated Places – Named features with human habitation—cities, towns, villages, etc. Subset of National file above.
Historical Features – Features that no longer exist on the landscape or no longer serve the original purpose. Subset of National file above.
Concise Features – Large features that should be labeled on maps with a scale of 1:250,000. Subset of National file above. (Last updated October 2, 2009.)
All Names – All names, both official and nonofficial (variant), for all features in the nation.
Feature Description/History – Includes the following additional feature attributes: Description and History. This file is not a standard topical gazetteer file. If you need these additional feature attributes, you will need to associate the data, using the feature id column, with the data in one of our other files, such as those under the "States, Territories, Associated Areas of the United States" section.
Antarctica Features – Features in Antarctica approved for use by the US government.
Government Units – Official short names, alphabetic, and numeric codes of States
This is not ZIP codes, but actual cities and towns. Data that comes from the USPS has lat/lon coordinates based on the centroid of a ZIP code (or a group of ZIP codes that, when averaged, represent a city). Because it is ZIP-based, that lat/lon will differ from the USGS data; the USGS isn't interested in ZIP codes. Also keep in mind that ZIP codes change monthly, whenever the USPS needs to revise their delivery routes, whereas actual city locations really don't change (ignoring nominal tectonic movement). So a definitive center-point lat/lon may be best derived from the USGS data rather than the USPS ZIP-based weighted centroid.
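As a rough sketch of consuming one of those files: they are pipe-delimited with a header row. The column names below are recalled from the GNIS layout and the filename is a placeholder, so check both against the file you actually download:
require 'csv'

places = {}
CSV.foreach('POP_PLACES.txt', col_sep: '|', headers: true) do |row|
  next unless row['FEATURE_CLASS'] == 'Populated Place'
  key = "#{row['FEATURE_NAME']}, #{row['STATE_ALPHA']}"
  places[key] = [row['PRIM_LAT_DEC'].to_f, row['PRIM_LONG_DEC'].to_f]
end

places['New York, NY'] #=> [lat, lon] for that populated place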

What's the right database for this? Mongo, SQL, Couch or something else?

Let's say I've got a collection of 10 million documents that look something like this:
{
  "_id": "33393y33y63i6y3i63y63636",
  "Name": "Document23",
  "CreatedAt": "5/23/2006",
  "Tags": ["website", "shopping", "trust"],
  "Keywords": ["hair accessories", "fashion", "hair gel"],
  "ContactVia": ["email", "twitter", "phone"],
  "Body": "Our website is dedicated to making hair products that are..."
}
I would like to be able to query the database for an arbitrary number (including zero) of the 3 attributes Tags, Keywords, and ContactVia. I need to be able to select via ANDs (this document includes BOTH attributes X and Y) or ORs (this document includes attribute X OR Y).
Example queries:
Give me the first 10 documents that have the tags website and
shopping, with the keywords matching "hair accessories or fashion"
and with a contact_via including "email".
Give me the second 20 documents that have the tags "website" or
"trust", matching the keywords "hair gel" or "hair accessories".
Give me the 50 documents that have the tag "website".
I also need to order these by either other fields in the documents
(score-type) or created or updated dates. So there are basically four "ranges" that are queried regularly.
I started out SQL-based. Then I moved to Mongo because it supports arrays and hashes (which I love). But it doesn't support more than one range per query when using indexes, so my Mongo database is slow because it can't use indexes and has to scan 10 million documents.
Is there a better alternative? This is holding up moving this application into production (and the revenue that comes with it). Any thoughts as to the right database or alternative architectures would be greatly appreciated.
I'm in Ruby/Rails if that matters.
When we needed to do multiple queries on arrays, we found the best solution, at least for us, was to go with ElasticSearch. We get this, plus some other bonuses, and we can reduce the index requirements for Mongo, so it's a win/win.
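For illustration only, here is roughly what the first example query might look like as an Elasticsearch bool query, assuming the elasticsearch Ruby gem, an index name of 'documents', keyword mappings for the array fields, and a date mapping for CreatedAt (all assumptions):
require 'elasticsearch'

client = Elasticsearch::Client.new

results = client.search(
  index: 'documents',
  body: {
    size: 10,
    sort: [{ 'CreatedAt' => { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          { term:  { 'Tags' => 'website' } },                           # AND: both tags required
          { term:  { 'Tags' => 'shopping' } },
          { terms: { 'Keywords' => ['hair accessories', 'fashion'] } }, # OR within this list
          { term:  { 'ContactVia' => 'email' } }
        ]
      }
    }
  }
)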
My two cents are for MongoDB. Not only can your data be represented, saved, and loaded as raw Ruby hashes, but Mongo is modern and fast, and really, really easy to get to know. Here's all you need to do to start a Mongo server:
mongod --dbpath /path/to/dir/w/dbs
Then to get the console, which is just a basic JavaScript console, just invoke mongo. And using it from Ruby is just this simple:
require 'mongo'
db = Mongo::Connection.new['somedb']
db.stuff.find #=> []
db.stuff.insert({id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'})
db.stuff.find #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'}]
db.stuff.update({id: 'abcd'}, {'$set' => {says: 'Bork bork bork!!!! (Bork)!'}})
db.stuff.find #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!!!! (Bork)!'}]

Using the Rally Rest API for CRUD operations

At my company, we recently started using Rally as our project management tool. Initially, someone external to our team invested a lot of time manually creating iterations using a naming convention that just isn't going to jibe with our team's existing scheme. Instead of asking this poor soul to delete these empty iterations by hand, one by one, I would like to automate the process using Rally's REST API. In short, we need to delete these 100+ empty iterations, which span 3 different projects (all sharing a common parent).
I have spent some time looking at the rally-rest-api Ruby gem, and although I have some limited Ruby experience, the query interface of the API remains confusing to me, and I am having some trouble wrapping my head around it. I know what my regex would look like, but I just don't know how to supply it to the query.
Here is what I have so far:
require 'rubygems'
require 'rally_rest_api'

rally = RallyRestAPI.new(:username => "myuser",
                         :password => "mypass")

regex = /ET-VT-100/

# get all names that match criteria
iterations = rally.find(:iteration) { "query using above regex?" }

# delete all the matching iterations
iterations.each do |iteration|
  iteration.delete
end
Any pointers in the right direction would be much appreciated. I feel like I'm almost there.
I had to do something similar to this a few months back when I wanted to rename a large group of iterations.
First, make sure that the user you are authenticating with has at least the "Editor" role assigned in all projects from which you want to delete iterations. Also, if there are any projects in your workspace for which you do not have read permissions, you will have to first supply a project(s) element for the query to start from. (You might not even know about them; someone else in your organization could have created them.)
The following gets a reference to the projects and then loops through the iterations with the specified regex:
require 'rubygems'
require 'rally_rest_api'

rally = RallyRestAPI.new(:username => "myuser",
                         :password => "mypass")

# Assumes all projects contain "FooBar" in name
projects = rally.find(:project) { contains :name, "FooBar" }

projects.each do |project|
  project.iterations.each do |iteration|
    if iteration.name =~ /ET-VT-100/
      iteration.delete
    end
  end
end
Try:
iterations = rally.find(:iteration) { contains :name, "ET-VT-100" }
This assumes that the iteration has ET-VT-100 in the name; you may need to query against some other field. Regexes are not supported by the REST API, as far as I can tell.
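If you do need regex-level matching, one hedged option, building on the objects already used above, is to fetch a broader set with contains and then filter client-side before deleting; the "ET-VT" substring here is just an assumed common prefix:
# Fetch anything whose name contains the assumed prefix, then apply the real
# regex in Ruby before deleting.
iterations = rally.find(:iteration) { contains :name, "ET-VT" }

iterations.select { |iteration| iteration.name =~ /ET-VT-100/ }.each do |iteration|
  iteration.delete
end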
