Crawl a website that has an Ajax table using R - ajax

I'm new to R and have been trying to crawl this website: http://rera.rajasthan.gov.in/ProjectSearch
I'm trying to get the list of all projects in the table including the url to the "View" button but have been failing miserably.
The table appears once you've clicked the Search button below the form.
So far I've tried rvest without success, because I can't find a URL or a pagination variable that would let me crawl the table on the site.
Is there a way to crawl all 788 items in the table?
Should I be using RSelenium or some other tool?

You can combine RSelenium and rvest. This is a code snippet to get the links on the first page.
1) Start Selenium. The best tutorial is found here on StackOverflow: can't execute rsDriver (connection refused).
In short, install Docker and the headless browser, start docker in terminal with docker run -d -p 4445:4444 selenium/standalone-chrome
2) Then, in RStudio, use these lines to start RSelenium, open the page, click the Search button and harvest the links:
library(RSelenium)
library(rvest)
library(tidyverse)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("http://rera.rajasthan.gov.in/ProjectSearch")
# find the Search button and trigger it
search <- remDr$findElement(using = "id", value = "btn_SearchProjectSubmit") # get search button
search$sendKeysToElement(list("\uE007")) # send the Enter key to the button, which submits the search
# parse the rendered page source (the table is loaded via Ajax, so it may need a moment to appear)
html <- remDr$getPageSource()[[1]] %>% read_html()
# get the links
links <- html %>% html_nodes("#OuterProjectGrid td a") %>% html_attr("href")
Then you should implement the pagination, e.g. with map from purrr.

Related

Loop through drop down with dynamic html table

I'm hard stuck with this one so any advice welcome!
I've been trying to create a flow that goes to this website https://dlv.tnl-uk-uni-guide.gcpp.io/ and scrapes the data from each table in the Subject Areas drop-down list. My knowledge of HTML is sketchy at best, but from what I understand it's a dynamic HTML table that just refreshes with new data rather than going to a new URL. I can extract the subject list as a variable, and in my head I think I just need to add this to a UI selector action, but despite numerous attempts I've got absolutely nowhere. Has anyone got any ideas as to how I could fix this or work around it?
Because it is not a conventional drop-down, using the "Set drop-down list value on web page" action doesn't work all that well.
You can use a bit of JavaScript and variables for this.
Hit F12 to show the developer tools. You will see there is a list of hidden items with class="gug-select-items gug-select-hide"; you will use this in the JavaScript.
Then add a 'Press button on web page' action and add the 'drop-down' element, which is a <div>.
Then edit the element selector and switch it to the text editor.
Then change the selector to use the nth-child() selector, with a variable for the index (nth-child is 1-based, so the index starts at 1),
so it looks something like #gug-overall-ranking-select > div.gug-select-items > div:nth-child(%ddIdx%)
Use the "Run JavaScript function on web page" function to get the number of options available to the drop-down. (child elements)
The returned result is text, so convert it to a number that can be used in the loop.
function ExecuteScript() { /*your code here, return something (optionally); */
    var firstDDlist = document.querySelector("#gug-overall-ranking-select > div.gug-select-items.gug-select-hide");
    return firstDDlist.children.length;
}
In the loop each element will be pressed and cause the table to reload.
The table data extraction can then also be done in the loop, but this code only shows the looping through the options.
The full flow 'code' (copy this and paste it in power automate).
WebAutomation.LaunchEdge.LaunchEdge Url: $'''https://dlv.tnl-uk-uni-guide.gcpp.io/?taxonomyId=36&/#gug-university-table''' WindowState: WebAutomation.BrowserWindowState.Normal ClearCache: False ClearCookies: False WaitForPageToLoadTimeout: 60 Timeout: 60 BrowserInstance=> Browser
WebAutomation.ExecuteJavascript BrowserInstance: Browser Javascript: $'''function ExecuteScript() { /*your code here, return something (optionally); */
var firstDDlist = document.querySelector(\"#gug-overall-ranking-select > div.gug-select-items.gug-select-hide\");
return firstDDlist.children.length;
}''' Result=> numberOfItems
Text.ToNumber Text: numberOfItems Number=> itemCount
LOOP ddIdx FROM 1 TO itemCount STEP 1
WebAutomation.PressButton.PressButton BrowserInstance: Browser Control: appmask['Web Page \'h ... sity-table\'']['Div \'gug-select-selected\''] WaitForPageToLoadTimeout: 60
END
Power Automate Desktop (PAD) is meant to be a low-code solution. Of course, knowing HTML is a bonus and will help with tricky web pages or problems, but not knowing much is usually all right. I'm not really comfortable going to the web page you shared, but you could try the option below.
PAD has a built in function in the action pane:
'Browser automation' > 'Web data extraction' > 'Extract data from web page'
Try using that and when asked to add UI Element select the table/dropdown list to see what information you get back. If that doesn't work you might need to try out JavaScript or another method.

Link directly to a notebook page in a view

I have a view that extends the current project view, where we add multiple tabs (notebook pages) to show information from other parts of a project.
One of these pages is an overview page that summarizes what is under the other tabs, and I'd like to link the headlines for each section directly to each displayed page. I've currently solved this by using the index of each tab and calling bootstrap's .tab('show') method on the link within the tab:
$(".overview-link").click(function (e) {
    e.preventDefault();
    var sel = '.nav-tabs a:eq(' + $(this).data('tab-index') + ')';
    $(sel).tab('show');
});
This works since I've attached a data-tab-index="<int>" to each header link in my widget code, but it's brittle - if someone adds a tab later, the current indices will be broken. Earlier I relied on the anchor on each tab, but that broke as well (and would probably break if a new notebook page were inserted as well).
Triggering a web client redirect / form link directly works, but I want to show a specific page in the view:
this.do_action({
    type: 'ir.actions.act_window',
    res_model: 'my.model.name',
    res_id: 'my.object.id',
    view_mode: 'form',
    view_type: 'form',
    views: [[false, 'form']],
    target: 'current'
});
Is there any way to link / redirect the web client directly to a specific notebook page tab through the do_action method or similar on FormWidget?
If I understood correctly, you want to select the tab from the JavaScript (jQuery) FormWidget, taking into account that the id could change if anybody installs another module that adds another tab.
Solution 0
You can add a class to the page in the xml form view. You can use the id of the element selected by this class name in order to call the right anchor and select the right tab item. This should happen when the page is completely loaded:
<page class="nb_page_to_select">
$('a[href="#' + $('.nb_page_to_select').attr('id') + '"]').click()
NOTE: Since you wrote the following paragraph, I assume that you know where to run this instruction. The solution I suggest is independent of the index.
This works since I've attached a data-tab-index="<int>" to each
header link in my widget code, but it's brittle - if someone adds a
tab later, the current indices will be broken. Earlier I relied on the
anchor on each tab, but that broke as well (and would probably break
if a new notebook page were inserted as well).
Solution 1
When the page is loaded you can get the tab list DOM object like this:
var tablist = $('ul[role="tablist"]')
And then you can click on the specific tab, selecting it by the text inside the anchor, so you don't depend on the tab index:
tablist.find('a:contains("Other Information")').click()
Having two tabs with the same text would not make any sense, so this should be sufficient.
Solution 2
If you want to be more specific, you can add a class to the notebook to make sure you are in the correct notebook:
<notebook class="nt_to_change">
Now you can use one of these expressions to select the tab list:
var tablist = $('div.nt_to_change ul.nav-tabs[role="tablist"]')
// or
var tablist = $('div.nt_to_change ul[role="tablist"]')
Solution 3
If the contains selector doesn't convince you because the text should match exactly, you can filter and compare instead:
tablist.find('a').filter(function() {
    return $.trim($(this).text()) === "Other Information";
}).click();
where "Other Information" is the title of the notebook page.
I haven't tried the solutions I'm giving you, but even if they don't work, maybe they will give you some ideas.
There's an attribute for XML elements named autofocus (for buttons and fields it is default_focus, which takes 1 or 0 as its value). If you add autofocus="autofocus" to a page in XML, that page will be the one displayed when you open the view.
So you can try to add this through JavaScript when the user clicks on the respective link, although honestly I don't know how to achieve that right now. Alternatively, you can add a distinctive context parameter to each link in XML, for example context="{'page_to_display': 'page x'}". When you click on the link, these context keys will hopefully arrive in your JS method.
If not, you can also override the fields_view_get method (I described how to do that here: Odoo - Hide button for specific user) to check whether the context you've added to your links is present and add the autofocus attribute to the respective page.
As you said:
This works since I've attached a data-tab-index="" to each header
link in my widget code, but it's brittle - if someone adds a tab
later, the current indices will be broken.
I assume that your app allows multi-user interaction in real time, so you have to integrate an update function somewhere in your code.
This function will trigger whenever something has changed, clear out the data and rebuild the index, so that the current indices are not broken.

Ruby Watir -- Trying to loop through links in cnn.com and click each one of them

I have created this method to loop through the links in a certain div on the website. The purpose of the method is to collect the links, insert them in an array, then click each one of them.
require 'watir-webdriver'
require 'watir-webdriver/wait'
site = Watir::Browser.new :chrome
url = "http://www.cnn.com/"
site.goto url
box = Array.new
container = site.div(class: "column zn__column--idx-1")
wanted_links = container.links
box << wanted_links
wanted_links.each do |link|
  link.click
  site.goto url
  site.div(id: "nav__plain-header").wait_until_present
end
site.close
So far it seems like I am only able to click on the first link then I get an error message stating this:
unable to locate element, using {:element=>#<Selenium::WebDriver::Element:0x634e0a5400fdfade id="0.06177683611003881-3">} (Watir::Exception::UnknownObjectException)
I am very new to ruby. I appreciate any help. Thank you.
The problem is that once you navigate to another page, all of the element references (ie those in wanted_links) become stale. Even if you return to the same page, Watir/Selenium does not know it is the same page and does not know where the stored elements are.
If you are going to navigate away, you need to collect all of the data you need first. In this case, you just need the href values.
# Collect the href of each link
wanted_links = container.links.map(&:href)
# You have each page URL, so you can navigate directly without returning to the homepage
wanted_links.each do |link|
  site.goto link
end
In the event that the links do not directly navigate to a page (eg they execute JavaScript when clicked), you will need to collect enough data to re-locate the elements later. What you use as the locator will depend on what is known to be static/unique. As an example, I will assume that the link text is a good locator.
# Collect the text of each link
wanted_links = container.links.map(&:text)
# Iterate through the links
wanted_links.each do |link_text|
  container = site.div(class: "column zn__column--idx-1")
  container.link(text: link_text).click
  site.back
end

How does capybara/selenium grab current URL? Issue with single page site

I am using Ruby and Capybara (which leverages Selenium) to automate walking through a website. After navigating to a new page, I verify that the new page URL is what I'm expecting. My issue comes when I walk through an order funnel that is a single page but loads different views.
Some code...
I create my session instance, then have additional code (not included here) that opens the browser and walks to a certain point in the website:
$session = Capybara::Session.new(:selenium)
This is my line for checking the browser URL without search params, i.e. everything after '?' is dropped:
if url == $session.current_url.to_s.split("?")[0]
  urlCorrect = true
end
This code works fine when my URL is
https://www.homepage.com
Then I click on a link that takes me to my order funnel ... https://www.homepage.com/order#/orderpage1?option1=something&option2=somethingelse
My function still matches the expected URL. But the issue comes when I move to the second page of the order funnel :
https://www.homepage.com/order#/orderpage2?option1=something&option2=somethingelse
My Capybara code to get the current URL still returns the URL from orderpage1. I'm guessing it's because there is no postback when moving from orderpage1 to orderpage2, but I don't know how to force a postback or tell Capybara to re-grab the URL.
Any advice will be appreciated.
Thanks
Quick Edit: I forgot to mention this behavior is only in IE. Chrome and Firefox both work correctly with the exact same code
Capybara gets current_url by querying the browser; it doesn't cache it. The issue you're probably running into is that clicking the link doesn't wait for the page change to happen, so if you call current_url before the new view has loaded you'll still get the original URL. There are two solutions:
1. Use Capybara to look for content that doesn't appear until the new page is loaded (have_content).
2. Use the has_current_path? method, which will wait/retry for a period of time until the current path/URL matches:
$session.has_current_path?('expected path')
There are options if you want to match against the full url, and you can use a regex to match as well - http://www.rubydoc.info/gems/capybara/Capybara/SessionMatchers#has_current_path%3F-instance_method
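To make the wait/retry idea concrete, here is a minimal plain-Ruby sketch. It is not Capybara's implementation: strip_query, wait_for_url, the timeout values and the lambda standing in for the browser are all hypothetical names made up for illustration.

```ruby
# Hypothetical helpers, not part of Capybara: they only illustrate
# polling the browser URL until it matches or a timeout expires.

# Drop everything after the first "?" (same approach as the question).
def strip_query(url)
  url.to_s.split("?", 2).first
end

# Repeatedly call url_reader (any callable returning the current URL)
# until the query-less URL equals `expected` or `timeout` seconds pass.
def wait_for_url(expected, url_reader, timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if strip_query(url_reader.call) == expected
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Stand-in for a browser that is slow to update its URL.
urls = [
  "https://www.homepage.com/order#/orderpage1?option1=something",
  "https://www.homepage.com/order#/orderpage1?option1=something",
  "https://www.homepage.com/order#/orderpage2?option1=something"
]
fake_reader = -> { urls.length > 1 ? urls.shift : urls.first }

puts wait_for_url("https://www.homepage.com/order#/orderpage2", fake_reader)
```

With a real session, the reader would presumably be something like -> { $session.current_url }. In practice has_current_path? already does this kind of retrying for you, so a hand-rolled loop is only useful for understanding what is going on.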
Thanks to Tom Walpole for finding the bug report for this issue. This link sums up the root of the issue and provides a few workarounds if anyone else is encountering this issue.
https://github.com/angular/protractor/issues/132

How to get focus to a new popup window which doesnt have name,id or unique title in selenium

I want to focus or select a window that pops up when a link is clicked. The link has the following HTML:
<a target="_restaurant_50" href="/impersonate/50">View Dashboard</a>
If I use _restaurant_50 as the window name, the IDE gives the following error:
Window does not exist. If this looks like a Selenium bug, make sure to read http://seleniumhq.org/docs/04_selenese_commands.html#alerts-popups-and-multiple-windows for potential workarounds.
How can I get focus on this window? Please help.
I tried all the approaches I could find online, such as get all windows*, select by title (but it gives the parent title only), WebDriver PHP switch window, etc.
The number 50 corresponds to the number in the database list.
I am using Selenium with PHP.
Details:
Array Names before click
(
[0] => selenium_main_app_window
)
Array Ids before Click
(
[0] => undefined
)
Array Names After click
(
[0] => selenium_main_app_window
)
Array Ids After click
(
[0] => undefined
)
Update:
I tried opening the URL with openWindow("url", Windowname), the same URL that opens on click. It worked, but the new page doesn't keep the logged-in session and asks me to log in again.
Thanks in Advance
I am not familiar with the PHP selenium drivers but the way I did this with Java was to get a list of Window Names (or IDs) before the click, perform the click, get the new list and select the window that wasn't there before. Note that there is a selenium command to get all known window names and ids.
Also, if the issue is just that the number is generated from the database, you could get the target attribute from the link, and then use that to select the window it opens.
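The before/after comparison described above is plain list logic, so it can be sketched independently of the Selenium driver. The following is a minimal illustration in Ruby (not PHP); new_windows is a hypothetical helper, and the "after" list is invented sample data that assumes the driver actually reports the popup, which was not the case in the output posted in the question.

```ruby
# Hypothetical helper: given the window names (or IDs) captured before
# and after the click, return the ones that are new.
def new_windows(before, after)
  after - before
end

# Sample data for illustration only; real values would come from the
# Selenium command that lists all window names or IDs.
before = ["selenium_main_app_window"]
after  = ["selenium_main_app_window", "_restaurant_50"]

p new_windows(before, after) # prints ["_restaurant_50"]
```

If the difference comes back empty, the driver did not register the new window, and reading the link's target attribute is probably the more reliable route.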
I was able to do the above with the following steps.
We have
<a target="_restaurant_50" href="/impersonate/50">View Dashboard</a>
Use
as = getAttribute("//xpath@href");
openWindow(as, "MyWindow");
selectWindow("MyWindow");
windowFocus();
This also works for the dynamic links.
Thanks to ALL for your valuable help.

Resources