I'm trying to scrape a CBS sports page for shot data in the NBA.
Here is the page I'm starting out with and using as a sample: http://www.cbssports.com/nba/gametracker/shotchart/NBA_20131115_MIL#IND
In the source, I found a string that contains all the data that I need. This string, in the webpage source code, is directly under var CurrentShotData = new.
What I want is to turn this string in the source into a string I can use in ruby. However, I'm having some trouble with the syntax. Here's what I have.
require 'nokogiri'
require 'mechanize'
a = Mechanize.new
a.get('http://www.cbssports.com/nba/gametracker/shotchart/NBA_20131114_HOU#NY') do |page|
  shotdata = page.body.match(/var currentShotData = new String\(\"(.*)\"\)\; var playerDataHomeString/m)[1].strip
  print shotdata
end
I know I must be doing this wrong... it seems so needlessly complex and on top of that it isn't working for me. Could someone enlighten me on the simple way to get this string into Ruby?
Try replacing:
shotdata = page.body.match(/var currentShotData = new String\(\"(.*)\"\)\; var playerDataHomeString/m)[1].strip
with:
shotdata = page.body.match(/var currentShotData = new String\(\"(.*?)\"\)\; var playerDataHomeString/m)[1].strip
Changing (.*) to (.*?) makes the quantifier lazy: it matches as few characters as possible, so the capture ends at the first occurrence of the text that follows it instead of the last. That is the behavior you want.
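To see the difference in isolation, here is a minimal sketch. The js string is a made-up stand-in for the page source and the pattern is shortened for readability, so treat both as assumptions rather than what the real page contains.

```ruby
# Hypothetical stand-in for the page source (the real page is much longer).
js = 'var a = new String("first"); var b = new String("second");'

# Greedy: (.*) runs to the LAST '")' in the string.
greedy = js[/new String\("(.*)"\)/, 1]

# Lazy: (.*?) stops at the FIRST '")' it can.
lazy = js[/new String\("(.*?)"\)/, 1]

puts greedy  # first"); var b = new String("second
puts lazy    # first
```

String#[] with a regex and a group index returns just that capture, which avoids the match(...)[1] step.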
I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started by scraping a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# Extracts a table's rows as an array of hashes (column_name => cell content).
# html        - the HTML document as a string
# xpath_table - XPath of the HTML table that holds the data to be extracted
def basic_table(html, xpath_table)
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath("#{xpath_table}/thead/tr/th").map(&:inner_text)

  row_contents = []
  # Double quotes are needed here so that #{xpath_table} is interpolated.
  html_doc.xpath("#{xpath_table}/tbody/tr").each do |table_row|
    cells = table_row.xpath('td').map(&:inner_text)
    row_content_hash = {}
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    # Append the hash itself so the result is an array of hashes.
    row_contents << row_content_hash
  end
  row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe scripts and only has to dive into writing actual code when a new way of extracting information is needed.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was hoping someone could tell me how they would approach this. Rules and rule engines come to mind, but I'm not sure whether that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.
We are looking to optimize images with a thumbnail version, which are stored under a funky version of the existing URL:
Original Image:
https://image.s3-us-west-2.amazonaws.com/8/flower.jpg
Thumbnail Image:
https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg
I was going to search from the end of the string for the last '/' and replace it with '/thumbnails/medium_'. In my case this is always safe, but I can't figure out how to do this kind of string manipulation in Ruby on Rails.
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.split('/')[-1] # should give 'flower.jpg'
The issue is getting everything before the last '/' so that I can inject 'thumbnails/medium_' after it. Any ideas?
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = s.insert(s.rindex('/')+1, 'thumbnails/medium_')
# The above approach modifies the original string; if that is undesirable, use:
img_url = s.dup.insert(s.rindex('/')+1, 'thumbnails/medium_')
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
img_url = "#{File.dirname(s)}/thumbnails/medium_#{File.basename(s)}"
# => "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
I would probably use URI and Pathname to work with URLs and file paths:
require 'uri'
require 'pathname'
url = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
uri = URI.parse(url)
path = Pathname.new(uri.path)
uri.path = "#{path.dirname}/thumbnails/medium_#{path.basename}"
uri.to_s
#=> "https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg"
s = "https://image.s3-us-west-2.amazonaws.com/8/flower.jpg"
s.sub(/([^\/]+)$/, 'thumbnails/medium_\1')
The second argument to s.sub should be in single quotes; with double quotes you have to escape the backslash in the \1 part.
UPDATE
s.sub(/([^\/]+?)(?=$|\?|#)/, 'thumbnails/medium_\1')
This handles the case where a query string, a fragment, or both follow the path and contain slashes.
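If query strings or fragments are a real possibility, another option is to let URI split the URL first and rewrite only the path. This is a sketch, not a drop-in: thumbnail_url is a hypothetical helper name, and the ?v=1/2#top suffix is invented just to exercise the edge case.

```ruby
require 'uri'

# Hypothetical helper (not from the answers above).
def thumbnail_url(url)
  uri = URI.parse(url)
  # Only the last path segment is rewritten; URI keeps the query and
  # fragment separately, so slashes in them cannot confuse the match.
  uri.path = uri.path.sub(%r{[^/]+\z}) { |file| "thumbnails/medium_#{file}" }
  uri.to_s
end

puts thumbnail_url('https://image.s3-us-west-2.amazonaws.com/8/flower.jpg?v=1/2#top')
# https://image.s3-us-west-2.amazonaws.com/8/thumbnails/medium_flower.jpg?v=1/2#top
```

The block form of sub avoids the single-quote versus double-quote question around \1 entirely.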
What you need is Array#[] with a Range:
# a small performance optimization: no need to split the string twice
parts = s.split('/')
img_url = parts[0..-2].join('/') + "/thumbnails/medium_" + parts[-1]
On a side note, if you are using a Rails plugin for handling images (CarrierWave or Paperclip), you should use its built-in mechanisms for URL interpolation.
I'm using Mechanize in a Rails 4 application. I created a new agent to scrape a page:
clienturl = @bid.mozs.where(is_main: true).first.attributes['url']
agent = Mechanize.new
@page = agent.get('http://' + clienturl)
@url = @page.uri
I can do things like get the uri, title and meta description. I'd like to now get the count of images on the page and how many of those images are missing alt attributes. Is this possible with Mechanize?
Do something like this:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.iana.org/domains/reserved')
doc = page.parser
img_count = doc.search('img').size # => 2
img_w_alt_count = doc.search('img[alt]').size # => 1
img_count - img_w_alt_count # => 1
Nokogiri is the parser inside Mechanize. parser returns an instance of the parsed DOM. From that we can ask Nokogiri to search for all nodes matching a selector. I used a CSS selector, but you can use XPath also; CSS tends to be more readable and less verbose.
search returns a NodeSet, so size tells us how many nodes matched.
I'm trying to extract a link from an element (.jobtitle a) using mechanize. I'm trying to do that in the link variable below. Anyone know how?
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://id.indeed.com/')
indeed_form = page.form('jobsearch')
indeed_form.q = ''
indeed_form.l = 'Indonesia'
page = agent.submit(indeed_form)
page.search(".row , .jobtitle a").each do |job|
  job_title = job.search(".jobtitle a").map(&:text).map(&:strip)
  company = job.search(".company span").map(&:text).map(&:strip)
  date = job.search(".date").map(&:text).map(&:strip)
  location = job.search(".location span").map(&:text).map(&:strip)
  summary = job.search(".summary").map(&:text).map(&:strip)
  link = job.search(".jobtitle a").map(&:text).map(&:strip)
end
I don't think you can select attributes with css paths.
From the mechanize documentation:
search()
Search for paths in the page using Nokogiri's search. The paths can be XPath or CSS and an optional Hash of namespaces may be appended.
See Nokogiri::XML::Node#search for further details.
You should check out XPaths instead. See e.g.:
Getting attribute using XPath
http://www.w3schools.com/xpath/
You may need to rewrite the way you iterate through the page.
I have a string containing a path:
/var/www/project/data/path/to/file.mp3
I need to get the substring starting with '/data' and delete everything before it. So, I need to get only /data/path/to/file.mp3.
What would be the fastest solution?
'/var/www/project/data/path/to/file.mp3'.match(/\/data.*/)[0]
=> "/data/path/to/file.mp3"
It could be as simple as:
string = '/var/www/project/data/path/to/file.mp3'
path = string[/\/data.*/]
puts path
=> /data/path/to/file.mp3
Using a regular expression is a good way. Although I am not familiar with Ruby, I think it should have a function like "substring()" (possibly under another name).
Here is a demo using JavaScript:
var str = "/var/www/project/data/path/to/file.mp3";
var startIndex = str.indexOf("/data");
var result = str.substring(startIndex );
And the link on jsfiddle demo
I think the code in Ruby is similar; you can check the documentation. Hope it's helpful.
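For reference, a rough Ruby counterpart of that JavaScript uses String#index to find the position and a range slice to take the rest. This is only a sketch; the nil check is an addition to cover the case where '/data' is absent.

```ruby
str = '/var/www/project/data/path/to/file.mp3'

# String#index returns the position of the first match, or nil if absent.
start_index = str.index('/data')

# Slice from that position to the end of the string.
result = start_index ? str[start_index..-1] : nil

puts result  # /data/path/to/file.mp3
```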
Please try this:
"/var/www/project/data/path/to/file.mp3".scan(/\/var\/www(\/.+)*/)
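It should return all occurrences. Note that because the pattern contains a capture group, scan returns each occurrence as an array of the captured text rather than the whole match.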