How to scrape all reviews if they are on different pages? - ruby

How can I scrape all reviews from a Walmart review page (e.g. http://www.walmart.com/ip/Ematic-9-Dual-Screen-Portable-DVD-Player-with-Dual-DVD-Players-ED929D/28806789) when they are spread across several pages? I am scraping with Mechanize (Nokogiri), but it can't click the button below: the button is not part of a form, so I can't submit it.
<button class="paginator-btn paginator-btn-next"><span
class="visuallyhidden">Next Page</span></button>
So I can't get to the next page. How can I solve this problem?

Updated answer (post question edit):
I think it may be easier than that. If you pay attention to the product url, you see that there is some kind of ID at the end of the url:
http://www.walmart.com/ip/Ematic-9-Dual-Screen-Portable-DVD-Player-with-Dual-DVD-Players-ED929D/28806789
If you get that ID, you can take the reviews root page (https://www.walmart.com/reviews/product/) and append the product ID to it:
https://www.walmart.com/reviews/product/28806789
Now, you can iterate over the products, take the trailing ID, and go to each reviews page to get all the reviews.
Hope it helped.
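A minimal sketch of the ID step described above, in plain Ruby. The constant REVIEWS_ROOT and the helper name reviews_url_for are my own names for illustration, and the code assumes the product ID is always the last, purely numeric path segment, as in the example URL:

```ruby
require "uri"

# Reviews root page mentioned in the answer; the constant name is hypothetical.
REVIEWS_ROOT = "https://www.walmart.com/reviews/product/"

# Build the reviews URL from a product URL by taking the trailing ID.
# Assumes the ID is the last path segment and consists only of digits.
def reviews_url_for(product_url)
  product_id = URI.parse(product_url).path.split("/").last.to_s
  unless product_id.match?(/\A\d+\z/)
    raise ArgumentError, "no numeric ID at the end of #{product_url}"
  end
  REVIEWS_ROOT + product_id
end

puts reviews_url_for(
  "http://www.walmart.com/ip/Ematic-9-Dual-Screen-Portable-DVD-Player-with-Dual-DVD-Players-ED929D/28806789"
)
# prints https://www.walmart.com/reviews/product/28806789
```

The resulting URL can then be handed to whatever fetches the page (Mechanize, Watir, etc.); whether the reviews page itself needs JavaScript to paginate is not something this sketch addresses.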
Old answer (pre question edit):
The page you posted is empty for me. However, what I see is that the element is a button, therefore, what you need to do is look for the form and then submit it.
Example taken from Clicking a button with Ruby mechanize (in case the link stops working for some reason):
# get the form
form = agent.page.form_with(:name => "my-form")
# get the button you want from the form
button = form.button_with(:value => "Search")
# submit the form using that button
agent.submit(form, button)
Credit to @flaviu and @serabe from the question mentioned above.
To do the scraping, you should save the root url and go to the review pages, get the reviews, go back to the root url, and so on.

I solved this task with the watir gem. Mechanize can't interact with JavaScript.

Related

How does capybara/selenium grab current URL? Issue with single page site

I am using Ruby and Capybara (which leverages Selenium) to automate walking through a website. After navigating to a new page I verify that the new page URL is what I'm expecting. My issue comes when I walk through an order funnel that is a single page but loads different views.
Some code...
I create my session instance, then have additional code (which I won't include) that opens the browser and walks to a certain point in the website:
$session = Capybara::Session.new(:selenium)
My line for checking the browser URL without the search params (i.e. dropping everything after '?'):
if url == $session.current_url.to_s.split("?")[0]
  urlCorrect = true
end
This code works fine when my URL is
https://www.homepage.com
Then I click on a link that takes me to my order funnel ... https://www.homepage.com/order#/orderpage1?option1=something&option2=somethingelse
My function still matches the expected URL. But the issue comes when I move to the second page of the order funnel :
https://www.homepage.com/order#/orderpage2?option1=something&option2=somethingelse
My Capybara code to get the current URL still returns the URL from orderpage1. I'm guessing it's because there is no postback when moving from orderpage1 to orderpage2, but I don't know how to force a postback or tell Capybara to re-grab the URL.
Any advice will be appreciated.
Thanks
Quick Edit: I forgot to mention this behavior is only in IE. Chrome and Firefox both work correctly with the exact same code
Capybara gets current_url by querying the browser; it doesn't cache it. The issue you're probably running into is that clicking the link doesn't wait for the page change to happen, so if you call current_url before the new page has loaded you'll still get the original URL. There are two solutions: (1) use Capybara to look for content that doesn't appear until the new page is loaded (have_content), or (2) use the has_current_path? method, which will wait and retry for a period of time until the current path/URL matches:
$session.has_current_path?('expected path')
There are options if you want to match against the full url, and you can use a regex to match as well - http://www.rubydoc.info/gems/capybara/Capybara/SessionMatchers#has_current_path%3F-instance_method
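To make the "wait/retry" idea concrete, here is a hand-rolled polling helper in plain Ruby. It is only a sketch of the general technique, not how Capybara implements has_current_path?; the helper name wait_for_url and its timeout/interval parameters are my own. It keeps the asker's approach of comparing everything before the '?':

```ruby
# Poll a block that returns the current URL until the part before "?"
# equals `expected`, or until `timeout` seconds have passed.
# Returns true on a match, false on timeout.
def wait_for_url(expected, timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    current = yield.to_s.split("?").first
    return true if current == expected
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Intended use with the session from the question (needs a real browser):
#   urlCorrect = wait_for_url(url) { $session.current_url }
```

In practice the built-in has_current_path? is the better choice, since it already does the waiting for you; the helper above is mainly useful for seeing why a single immediate call to current_url can return the old URL.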
Thanks to Tom Walpole for finding the bug report for this issue. This link sums up the root of the issue and provides a few workarounds if anyone else is encountering this issue.
https://github.com/angular/protractor/issues/132

Loading a page from url does not work, but page work when I click Refresh

I have a URL that displays a customer list like this:
http://domain.com/pls/apex/f?p=724:2:820875406836801:::::
The list of customers is displayed with each title linked to page 3, and the request carries the customer ID.
When I click the URL http://domain.com/pls/apex/f?p=724:3:21712451478201::NO:RP,3:P3_CUSTOMER_ID:82, Page 3 is loaded correctly with details of selected customer. But the "Update" and "Delete" action buttons never work.
But, if I click the browser refresh button and then try to perform an update or delete, it works.
I don't know where I could be going wrong. Can someone give me hints?
I am not using BRANCH_TO_PAGE_ACCEPT in my URL link definition.
It looks like you have the session ID hardcoded in the URL on page 2:
http://domain.com/pls/apex/f?p=724:2:820875406836801:::::
The session ID is 820875406836801, whereas:
http://domain.com/pls/apex/f?p=724:3:21712451478201::NO:RP,3:P3_CUSTOMER_ID:82
The session ID has mysteriously been changed to 21712451478201. I'm not sure, but I suspect that you've hardcoded the session ID in your report on page 2. This has the effect of causing a new login session to be created when page 3 is opened (and maybe this is why the update/delete buttons don't work - but you haven't told us what the error message is so I'm not sure); refreshing the page may be restoring the session.
If I'm right, what you need to solve this issue is to use the session variable (&SESSION.) in your report on page 2 instead of hardcoding it, e.g.:
http://domain.com/pls/apex/f?p=724:3:&SESSION.::NO:RP,3:P3_CUSTOMER_ID:82
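To see which part of each URL is the session ID, it can help to split the f?p= value on colons. The sketch below is plain Ruby and only parses the two URLs from the question; the helper name apex_fields and the field labels are my own, following the usual f?p=App:Page:Session:Request:Debug:ClearCache:itemNames:itemValues order:

```ruby
# Labels for the colon-separated positions of an APEX f?p= value.
APEX_FIELDS = %i[app page session request debug clear_cache item_names item_values].freeze

# Split the part after "f?p=" on ":" (keeping empty fields) and label each position.
def apex_fields(url)
  values = url.split("f?p=", 2).last.split(":", -1)
  APEX_FIELDS.zip(values).to_h
end

list_page   = apex_fields("http://domain.com/pls/apex/f?p=724:2:820875406836801:::::")
detail_page = apex_fields("http://domain.com/pls/apex/f?p=724:3:21712451478201::NO:RP,3:P3_CUSTOMER_ID:82")

puts list_page[:session]    # prints 820875406836801
puts detail_page[:session]  # prints 21712451478201
puts list_page[:session] == detail_page[:session]  # prints false
```

The third field differs between the two URLs, which is the mismatch pointed out above.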
The issue was with the way the URL was created. First of all, I should not have made only one thing (the title) a link; the link should wrap the entire content of the div, like below.
<li><div style="">
<a href="f?p=&APP_ID.:2:&SESSION.::NO::P2_PK_PROJECT_ID:#LINK#" rel="external">
<h3>#TITLE#</h3>
<p><strong>#BOLD_TEXT#</strong></p>
<p>#PLAIN_TEXT#</p>
</a></div>
</li>
A report row template with the above code was created. This template is used in my Customers List page. Now each customer is a link (Title, Name, etc.). The link href is also hard-coded. Note that I am passing ProjectID:#LINK#, where #LINK# refers to a value like 1, 2, etc.
Now clicking this loads page 2 correctly, and the Apply Changes and Delete buttons are now clickable.

How can I automate a google link on the page I am working on using watir-webdriver?

Page link I am working on is http://www.whatcar.com/car-news/subaru-xv-review/260397
I am trying to automate 'clicking the google link' but am having no luck and keep receiving an error.
Link HTML:
<a tabindex="0" role="button" title="" class="s5 JF Uu" id="button" href="javascript:void(0);" aria-pressed="false" aria-label="Click here to publicly +1 this."></a>
My code:
@browser.link(:class, "s5 JF Uu").click
Error message:
unable to locate element, using {:class=>"s5 JF Uu", :tag_name=>"a"} (Watir::Exception::UnknownObjectException)
./step_definitions/11.rb:12:in `/^On the page I click 'Twitter' , Facebook and Google button$/'
11.feature:8:in `When On the page I click 'Twitter' , Facebook and Google+ button'
The link is inside a frame. To make it even more fun, frame id is different every time the page is refreshed.
browser.frames.collect {|frame| frame.id}
=> ["I1_1323429988509", "f3593c4f374d896", "f4a5e09c20624c", "stSegmentFrame", "stLframe"]
browser.refresh
=> []
browser.frames.collect {|frame| frame.id}
=> ["I1_1323430025052", "fccfdf9410ef34", "f11036dad706668", "stSegmentFrame", "stLframe"]
I1_1323429988509 and I1_1323430025052 are the IDs of the frame you want. Since the I1_ part is always the same, and no other frame has it, you can access the frame like this:
browser.frame(:id => /I1_/)
Since there is only one link inside the frame:
browser.frame(:id => /I1_/).as.size
=> 1
You can click the link like this:
browser.frame(:id => /I1_/).a.click
Or, if you prefer to be more explicit:
browser.frame(:id => /I1_/).a(:id => "button").click
That will open a new browser window, and a new challenge is here! :)
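If you want to convince yourself that the pattern really singles out one frame, you can filter the ID lists shown above in plain Ruby, no browser needed. The constant name FRAME_ID and the \A anchor are my additions; the answer itself uses the unanchored /I1_/:

```ruby
# Frame IDs copied from the two irb sessions above (before and after the refresh).
before_refresh = ["I1_1323429988509", "f3593c4f374d896", "f4a5e09c20624c", "stSegmentFrame", "stLframe"]
after_refresh  = ["I1_1323430025052", "fccfdf9410ef34", "f11036dad706668", "stSegmentFrame", "stLframe"]

# Anchored variant of /I1_/ so that only IDs *starting* with "I1_" match.
FRAME_ID = /\AI1_/

[before_refresh, after_refresh].each do |ids|
  matches = ids.grep(FRAME_ID)
  puts "#{matches.size} match: #{matches.first}"
end
# prints:
#   1 match: I1_1323429988509
#   1 match: I1_1323430025052
```

The same regular expression can then be passed as the :id value when locating the frame.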
The technical answer:
The class of the button on the page that you linked is different for me than the class that you list. It looks like it behaves differently based on the cookies on your local machine (which would be absent during a Watir-driven Firefox or IE session).
You would need to find a different element that is not dynamic to hook into.
The ethical answer:
It is questionable that you are attempting to automate the promotion of online articles through social media. Watir/Watir-Webdriver is not a spam bot, and the services you are using specifically prohibit the use of automation/bots.
That 'button' link is inside an iframe. Read the Watir wiki on how to deal with elements in frames. If that's not enough to get it working, please edit the question with your revised code, the error, etc., and we can work forward from that point.

Validator plugin firing with links and on submit

I have a web form which is split up into several HTML pages.
I am using the validation plug-in to check fields on submit and this is working great.
The spec says that users should be able to navigate through the form, both linearly (just using the submit buttons to go from page to page) and also to skip to any particular page.
I have an unordered list with the links at the top of each page. I'm looking to fire the validation both on submit and when one of these links is clicked, but I don't know if this is possible.
For info, I'm currently firing the validation this way:
$("form#courseDetails").validate({
  rules: {
    studiedBefore: "required" // Have you studied with us before
  },
  messages: {
    studiedBefore: "Please indicate whether you have studied with us before."
  }
});
Each form has an ID and validation for all the forms is in one JS file.
Not that it really matters, but the navigation is in <ul id="tabNav">
Any help much appreciated.
Thanks,
Phil
Check the .valid() method it provides. If you call that in click handlers attached to your links, you should be ok.

is it possible to change page before ajax?

for example:
The user submits a comment, I add the comment to the page with JavaScript, and then make the Ajax request. If the Ajax POST fails, I tell the user that something went wrong.
This way, the user experience can be improved, and the probability of the Ajax request failing is not high. But I haven't seen any site using this technique, so is this approach possible?
Actually, I'd say that stackoverflow uses this technique :
Make sure you are using Firebug, and have the console displayed at the bottom of your browser screen.
Click on (for instance) the arrow to upvote.
You will see the arrow immediately become orange (to indicate you have upvoted),
but looking at Firebug's console, you will see that the Ajax request starts only after the arrow has changed color, or at least that it is not finished yet when the arrow has changed color.
Considering that the probability of the Ajax request failing is pretty low, changing the arrow immediately tells the user that their vote has been taken into account... even if that's not true for a couple of milliseconds ;-)
You can add the comment via Javascript but you've also pointed out exactly why you shouldn't: what if it fails? Do you then remove the content?
In my opinion, adding it to the page implies to the user that it has worked. I would leave the comment in a form field until the AJAX submit succeeds. If that fails you can tell the user and they can try to submit again or whatever.
Of course, there is no functional reason why you couldn't do this.
Yes there is nothing stopping you doing this.
You add the comment in an element you create in JavaScript, post the data, and get the response code back from the Ajax POST.
