EventMachine not catching nearly simultaneous events - ruby

I'm using EventMachine to process incoming emails, which can at times be very high volume. The code I have so far works reliably for emails that arrive at least about 5 seconds apart, but somewhere below that threshold, only one email out of however many arrive gets processed. I've tried adding EM.defer statements in a few different places that I thought would help, but to no avail. I should also note, in case it makes a difference, that I'm using the em-imap gem in this example as well.
The relevant section of the code is here:
EM.run do
  client = EM::IMAP.new('imap.gmail.com', 993, true)
  client.connect.bind! do
    client.login('me@email.com', 'password123')
  end.bind! do
    client.select('INBOX')
  end.bind! do
    client.wait_for_new_emails do |response|
      client.fetch(response.data).callback do |fetched|
        currentSubjectLine = fetched.first.attr.values[1].subject
        desiredCommand = parseSubjectLine(currentSubjectLine)
        if desiredCommand == 0
          if fetched.first.attr.values[0].parts.length == 2
            if fetched.first.attr.values[0].parts[1].subtype.downcase != "pdf"
              puts 'Error: Missing attachment, or attachment of the wrong type.'
            else
              file_name = fetched.first.attr.values[0].parts[1].param.values[0]
              client.fetch(response.data, "BODY[2]").callback do |attachments|
                attachment = attachments[0].attr["BODY[2]"]
                File.new(file_name, 'wb+').write(Base64.decode64(attachment))
              end
            end...
Am I somehow blocking the reactor in this code segment? Is it possible that some library that I'm using isn't appropriate here? Could GMail's IMAP server have something to do with it? Do you need any more information about what happens in some given situation before you can answer with confidence? As always, any help is greatly appreciated. Thank you!
Update with Minimized Code
Just in case the way I've organized the code has anything to do with it, I'm including everything that I think might possibly be relevant.
module Processing
  def self.run
    EM.run do
      client = EM::IMAP.new('imap.gmail.com', 993, true)
      client.connect.bind! do
        client.login('me@email.com', 'password123')
      end.bind! do
        client.select('INBOX')
      end.bind! do
        client.wait_for_new_emails do |response|
          client.fetch(response.data).callback do |fetched|
            puts fetched[0].attr.values[1].subject
          end
        end
      end.errback do |error|
        puts "Something failed: #{error}"
      end
    end...
Processing.run

Don't hate me for saying this, but refactor that pyramid-of-doom spaghetti thingy that makes Demeter twitch into something readable, and the error will reveal itself :)
If it doesn't reveal itself, you will be able to boil it down to the simplest possible code that reproduces the problem and submit it as an issue to https://github.com/eventmachine/eventmachine
However, EM isn't really supported any more; the devs went a bit AWOL, so think about moving to https://github.com/celluloid/celluloid and https://github.com/celluloid/celluloid-io
PS: just saw this:
File.new(file_name, 'wb+').write(Base64.decode64(attachment))
is a blocking call, AFAIK. Try playing with this and you might be able to reproduce the issue. See https://github.com/martinkozak/em-files and http://eventmachine.rubyforge.org/EventMachine.html#defer-class_method for possible ways to work around this.
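For illustration, a minimal sketch of moving that write onto EM's thread pool with EM.defer, so the reactor stays free to handle the next email (it reuses file_name and attachment from the question, and is an assumption about the fix, not a confirmed one):

EM.defer do
  # Blocking disk I/O now runs on a worker thread, not the reactor thread.
  # The block form of File.open also guarantees the handle is closed.
  File.open(file_name, 'wb') do |f|
    f.write(Base64.decode64(attachment))
  end
end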

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I am successfully scraping building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @@key = key
    @@link_key = link_key
    @result_hash = {}  # initialize the hash that the crawl method fills in
  end

  def crawl_propshark_single
    agent = Mechanize.new{ |agent|
      agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = "#{@@key}"
    page = agent.submit(form)
    page.links.each do |link|
      if link.text.include?("#{@@link_key}")
        if link.text.include?("PropertyShark")
          property_page = link.click
        else
          next
        end
        if property_page
          data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
          data_name = property_page.css("div.cols").css("th")[4].text
          @result_hash["#{data_name}"] = data_value
        else
          next
        end
      end
    end
    return @result_hash
  end
end #endof: class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key, key_link)
puts spider.crawl_propshark_single
I get the following error, but in an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I run the above code in a loop over multiple addresses, I delay the process with sleep 80 between addresses.
The first thing you should do, before you do anything else, is to contact the website owner(s). Right now, your actions could be interpreted as anywhere between overly aggressive and illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to be depending on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all the grace of a bull in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural, human-emulating delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You may even consider using something like Selenium, which drives a real browser, instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page, but some random error page. A simple retry may be all you need to get that data in question. When scraping, a poorly-functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many online web-scraping/API-creation/data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem. So let's focus on that line of code for a second.
Say the first time around your columns are columns = [1, 2, 3, 4, 5]; then columns[4] will return 5 (the element at index 4).
Now, for fun, let's assume the next time around your columns are columns = ['a', 'b', 'c', 'd']; then columns[4] will return nil, because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, leading to nil.text and the error you are receiving.
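A minimal defensive sketch, using the names from the question (the skip-on-missing behavior is an assumption about what you want; a retry would work too):

# Guard against the missing column instead of calling .text on nil.
cell = property_page.css("div.cols").css("td.r_align")[4]
if cell
  data_value = cell.text
else
  next  # this page didn't have a fifth column; skip it (or retry)
end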

@driver.find_element(:id=>"body").text.include?(textcheck) not verifying the text, only the id

I am using Selenium-WebDriver for Ruby and I am trying to verify that text is present on a page. I have done many searches and tried many things and the best answer I have found is to use something like
def check_page(textcheck)
  if verify { @driver.find_element(:id => "body").text.include?(textcheck) }
    yield it_to "fail"
  else
    yield it_to "pass"
  end
end
The expected outcome, if the value of textcheck is present in the body, would be pass, and if it is not present, fail. What is actually happening is that if :id => "body" is present then it is pass, and if it is not present then it is fail, regardless of .text.include?(textcheck).
If anyone could point me in the right direction for how to verify text is present on a page using Selenium-WebDriver in Ruby it would be greatly appreciated. I have found workarounds for certain cases where I can do
verify { @driver.find_element(:tag_name, 'h1').text != textcheck }
but the element I am trying to verify I can't get to so easily. I looked into CSS locators and was very confused about how to simplify the tag so I could use it. Any help would be greatly appreciated. Thank you very much. If you require any more information from me, please let me know and I will provide it as soon as possible.
I am using Ruby 1.9.3 with Selenium-WebDriver 2.25, testing in Firefox 14.0.1.
I do it this way:
@wait = Selenium::WebDriver::Wait.new(:timeout => 30)
begin
  @wait.until { @driver.find_element(:tag_name => "body").text.include?("your text") }
rescue
  puts "Failure! text is not present on the page"
  # Or do one of the options below:
  # raise
  # assert_match "true", "false", "The text is not present"
end
UPDATE
Answer to your question in the comments section.
There are two kinds of "waits": implicit wait and explicit wait. You can read more about them here. The reason your code failed is that you were searching by :id => "body" and not by :tag_name => "body". Usually all text is enclosed within the body HTML tags in your DOM.
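For contrast, a minimal sketch of the two styles (the 10-second timeouts are arbitrary values picked for illustration):

# Implicit wait: every find_element call now polls up to 10 seconds
# for the element to appear before raising an error.
@driver.manage.timeouts.implicit_wait = 10

# Explicit wait: poll one specific condition for up to 10 seconds.
wait = Selenium::WebDriver::Wait.new(:timeout => 10)
wait.until { @driver.find_element(:tag_name => "body").text.include?("your text") }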

Uploading and parsing text document in Rails

In my application, the user must upload a text document, the contents of which are then parsed by the receiving controller action. I've gotten the document to upload successfully, but I'm having trouble reading its contents.
There are several threads on this issue. I've tried more or less everything recommended on these threads, and I'm still unable to resolve the problem.
Here is my code:
file_data = params[:file]
contents = ""
if file_data.respond_to?(:read)
  contents = file_data.read
elsif file_data.respond_to?(:path)
  File.open(file_data.path, 'r').each_line do |line|
    elts = line.split
    #
    #
  end
end
So here are my problems:
file_data doesn't 'respond_to?' either :read or :path. According to some other threads on the topic, if the uploaded file is less than a certain size, it's interpreted as a string and will respond to :read. Otherwise, it should respond to :path. But in my code, it responds to neither.
If I try to take out the if statements and straight away attempt File.open(file_data, 'r'), I get an error saying that the file wasn't found.
Can someone please help me find out what's wrong?
PS, I'm really sorry that this is a redundant question, but I found the other threads unhelpful.
Are you actually storing the file? Because if you are not, of course it can't be found.
First, find out what you're actually getting for file_data by adding debug output of file_data.inspect. It may be something you don't expect, especially if the form isn't set up correctly (i.e. with :multipart => true).
Rails should wrap an uploaded file in a special object providing a uniform interface, so that something as simple as this should work:
file_data.read.each_line do |line|
  elts = line.split
  #
  #
end
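If the form turns out to be the problem, a minimal sketch of a multipart upload form (the :upload action and :file field name are stand-ins for whatever your app actually uses):

<%# Without :multipart => true, the browser posts the filename as a plain
    string instead of the file contents. %>
<%= form_tag({ :action => :upload }, :multipart => true) do %>
  <%= file_field_tag :file %>
  <%= submit_tag "Upload" %>
<% end %>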

EOF with Nokogiri

I have the following line in a long loop
page = Nokogiri::HTML(open(topic[:url].first)).xpath('//ul[@class = "pages"]//li').first
Sometimes my Ruby application crashes, raising an "End of file reached" exception on this line.
How can I resolve this problem? Just a begin/rescue/end block?
This is a script that performs a forum backup, so it's important that it doesn't skip any thread.
Thanks in advance.
In addition to @Phrogz's excellent advice (in particular about at_css with the simpler expression), I would pull the raw content separately:
page = if (content = open(topic[:url].first).read).strip.length > 0
  Nokogiri::HTML(content).xpath('//ul[@class = "pages"]//li').first
end
I would suggest that you first fix the underlying issue so that you do not get this error.
Does the same URL always cause the problem? (Output it in your log files.) If so, perhaps you need to URI encode the URL.
Is it random, and therefore likely related to a connection hiccup or server problem? If so, you should rescue the specific error and then retry one or more times to get the crucial data.
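A minimal sketch of that rescue-and-retry pattern (the error classes and the three-attempt limit are assumptions; adjust them to whatever your logs actually show):

require 'open-uri'

attempts = 0
begin
  html = open(topic[:url].first).read
rescue EOFError, OpenURI::HTTPError
  attempts += 1
  retry if attempts < 3  # try again a couple of times before giving up
  raise                  # re-raise so the backup never silently skips a thread
end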
Secondarily, you should know that the CSS syntax for that query is far simpler:
page = Nokogiri.HTML(...).at_css('ul.pages li')
Not only is this less than half the bytes, it allows for cases like <ul class="foo pages"> that the XPath would miss.
Using at_css (or at_xpath) is the same as .css(...).first, but is faster and simpler.

How to read someone else's forum

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts in her forum, and come to conclusions. At the moment she reviews posts by clicking through her forum, and generates a not necessarily accurate picture of the data (in her brain) from which she makes conclusions. My thought today was that I could probably bang out a quick Ruby script that would parse the necessary HTML to give her a real idea of what the data is saying.
I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no trouble viewing my friend's forum, it seems that the method Net::HTTP.new("forumname.net") produces the following error:
No connection could be made because the target machine actively refused it. - connect(2)
Googling that error, I have learned that it has to do with MySQL (or something like that) not wanting nosy guys like me remotely poking around in there, for security reasons. This makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum, but my little Ruby script gets no poking rights? Is there some way for my script to tell the server that it is not a threat? That I only want reading rights and not writing rights?
Thanks guys,
z.
Scraping a web site? Use mechanize:
#!/usr/bin/ruby1.8
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get("http://xkcd.com")
page = page.link_with(:text => 'Forums').click
page = page.link_with(:text => 'Mathematics').click
page = page.link_with(:text => 'Math Books').click
# puts page.parser.to_html # If you want to see the html you just got
posts = page.parser.xpath("//div[@class='postbody']")
for post in posts
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body = post.xpath("div[@class='content']//text()").collect do |div|
    div.to_s
  end.join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end
The first part of the output:
----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, fr\
om high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten.\
We used Stewart's Multivariable Calculus as a baseline but I was unable to pur\
chase the text for financial reasons at the time. I figured some things may hav\
e changed in the last 12 years, so if anyone can suggest some good texts on thi\
s subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus \
really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something li\
ke Widder's Advanced Calculus from Dover. it is much easier to carry around tha\
n Stewart. It is also written in a style that a mathematician might consider no\
rmal. If you think that you might want to move on to real math at some point, i\
t might serve as an introduction to the associated style of writing.
Some sites can only be accessed with the "www" subdomain, so that may be causing the problem.
To create a GET request, you would want to use Net::HTTP::Get:
require 'net/http'

url = URI.parse('http://www.forum.site/')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) { |http|
  http.request(req)
}
puts res.body
You might also need to set the user agent at some point, by passing it as a header when building the request:
req = Net::HTTP::Get.new(url.path, {'User-Agent' =>
  'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'})
