How to parse malformed HTML with Ruby and Mechanize - ruby

I am using Mechanize to navigate a site with badly malformed HTML. In particular, I have a page with checkboxes outside of any form, whose submissions the server nonetheless handles sanely.
I would like to check these boxes and click a "Submit" button, which is also outside the form. However, I can't use Form.checkbox_with because I don't have a Form object, only the Page. I can locate the checkbox on the page with
page.search("//input[@name='silly-checkbox']")
but I can't check it afterwards because Nokogiri is only used for scraping and does not track state. Is that incorrect?
How can I get a Mechanize::Form::Checkbox object when my checkbox is not in a form?

You could manually load the remote page using Nokogiri, fix the markup by finding the checkboxes outside the form and wrapping them, and then construct the Mechanize classes yourself from the fixed HTML.

You can also modify an existing form by deleting fields and merging in new ones:
form.add_field!('gender', 'male')
See the Mechanize::Form rdoc for details.

Related

How to click an element on the page without loading it with capybara?

Is it possible to click on the element without loading the page using
capybara and Ruby?
No, that is not possible. The page needs to be loaded for the browser to know what to do when the element is clicked.
If the element being clicked just triggers a web request, and you know what that request is, then you could make the request directly using any of the Ruby networking libraries. You may then run into issues with referrers, cookies, etc., though.
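For instance, a sketch with Net::HTTP, where the endpoint, referrer, and cookie values are all made up for illustration:

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint the click would have triggered.
uri = URI('https://example.com/api/expand?item=42')
req = Net::HTTP::Get.new(uri)
req['Referer'] = 'https://example.com/items'  # some servers check the referrer...
req['Cookie']  = 'session=abc123'             # ...and expect session cookies

# Actually send it (commented out here since the endpoint is fictional):
# res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
```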

crawling dynamic content using ruby

I am using Ruby gems (Nokogiri and Mechanize) to build a crawler for a website, but the site contains Bootstrap modals (popup windows) that are generated dynamically on button click.
The modal's content appears on a button click that issues a GET request to some URL.
I am crawling the URL associated with the button, but I just get the same page source back.
How can I get that dynamic content using Ruby?
The modal you're describing is, with high probability, rendered with JavaScript. So what you're looking for is not possible, because the libraries you mention do not execute JavaScript.
To parse pages whose content depends on JavaScript, you need other tools, e.g. Puppeteer.

How can I use mechanize to click button to webscrape a page to get information?

I'm looking to scrape the contents of a page that requires pressing an arrow button, after which information appears via jQuery rather than a new page being loaded. Since a button click is needed, I'm using Mechanize for this part instead of Nokogiri. What I have so far is
url = "http://brokercheck.finra.org/Individual/Summary/1327992"
mechanize = Mechanize.new
page = mechanize.get(url)
button = page.at('.ArrowExpandDsclsr.faangledown')
new_page = mechanize.click(button)
new_page.at('#disclosuredetails')
It appears that new_page still doesn't show the page with the newly loaded information. Anyone know why that is?
The button you're trying to get Mechanize to click is not a "regular" button; it's a bit more dynamic: it uses JavaScript/AJAX to fetch the relevant data when clicked.
Mechanize doesn't render the DOM of a web page, nor does it provide a way for JavaScript to interact with the page. It is therefore not suitable for interacting with dynamic pages that depend on JavaScript for their functionality.
For such cases, I'd suggest PhantomJS, either standalone or through Capybara/Poltergeist if you'd prefer to drive it from Ruby.

MVC3 AjaxHelper choosing selected DDL values without custom javascript

I'm creating a site where I don't want anything to be done via custom JavaScript/jQuery code at all, and I'm not sure that's going to be possible, so I need some advice.
The things that I want to be able to do are:
Load a jQuery (or jQuery-style) dialog box containing a partial view.
Have a button that will select the "SelectedValue" from a dropdown list and render a partial view. (e.g. select a user from a dropdown and then click a button to add them to a list)
Append a partial view to an existing div.
I'm sure that all the above can be done using custom javascript, but what I want to is to use the standard Ajax and Html helpers that come with MVC3.
Main reason is that I've been struggling to get to grips with jQuery, but I also thought it would be nice to see if it can all be done without having to add a new script to the site.
As always, any help is greatly appreciated (even if it's just pointing me to articles).
The standard Ajax and Html helpers that come with MVC3 are for handling server-side concerns. Even the @Html.ValidationMessageFor helper usually relies on the unobtrusive validation library built on jQuery Validate.
Keep trying at jQuery, you'll get it. There is a reason it is so popular!
Update
I believe you can do #3 using @Ajax.ActionLink, but I don't think you can do #1 or #2 out of the box with the Ajax helpers.
It seems that you can use the Ajax.BeginForm method to do this.
There are issues with the fact that it has to be in a separate form, though, as I can't nest forms.

How do I fetch AJAX-loaded content from an another site using Nokogiri?

I was trying to parse some HTML content from a site. Nokogiri works perfectly for content loaded the first time.
Now the issue is how to fetch content that is loaded via AJAX: for example, a "see more" link that fetches more items via AJAX, or AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce its content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS, on the other hand, is a web browser, albeit a special kind of browser ;) Take a look at it and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, then you will need to study the code, figure out what URL is being used for the AJAX request and whether any session IDs or cookies have been set, and then create a new URL that reproduces what the AJAX call is doing. Request that, and you should get the new content back.
That can be difficult to do, though. As @Nuby said, Mechanize could be a good help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page with Mechanize, you can run Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can then use regexes or substring matches to get at the particular parameters you need, construct the new URL, and ask Mechanize to get it.
