ImportXML function in Google Dynamic XML path - google-sheets-formula

I am trying to import the headlines and landing page URLs from the "New + Updated" section of this page:
https://www.nytimes.com/wirecutter/
The issue is that the class "_988f698c" keeps changing as the headline is being replaced with a new headline/topic.
I need a workaround to use IMPORTXML function which will dynamically capture the class of that object in that position. The current formula is:
=IMPORTXML("https://www.nytimes.com/wirecutter/","//*[@class='_988f698c']")
Here is the html tag for example. The class "_988f698c" refreshes every hour or so with new headlines coming in.
<li class="e9a6bea7">
<a class="_988f698c" href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
<p class="_9d1f22a9">today
</p>
</li>
Is there a way I can do this?

Step back a level and look for a stable path that does not depend on the randomly generated class names.
For the title, use:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a"
)
For the URL attached to the title:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a/@href"
)
For the text indicating the day of publication:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/p"
)
If you want to collect everything together, use | to combine the paths:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a |
//ul[@data-testid='new-and-updated']/li/a/@href |
//ul[@data-testid='new-and-updated']/li/p"
)
Only use the combined formula if you are absolutely sure that every value will always exist. If one is missing, the results below it shift up, and any formulas that depend on a value being in a fixed cell will point at the wrong row.
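The same idea can be sketched outside Sheets. The Python below is only an illustration under stated assumptions: `SAMPLE` is a hypothetical snippet modeled on the HTML in the question (the second list item and its URL are invented), and the standard-library `xml.etree.ElementTree` parser only accepts well-formed markup. It selects by the stable `data-testid` attribute and reads the title, URL and date per `li`, so a missing value leaves a gap in its own row instead of shifting the rows below it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the question's markup. The class names are
# random on the real page, so they are deliberately not used for selection.
SAMPLE = """
<div>
  <ul data-testid="new-and-updated">
    <li class="e9a6bea7">
      <a class="_988f698c" href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
      <p class="_9d1f22a9">today</p>
    </li>
    <li class="e9a6bea7">
      <a class="_1a2b3c4d" href="https://example.com/second">Second Headline</a>
    </li>
  </ul>
</div>
"""


def new_and_updated(markup):
    """Return one (title, url, date) tuple per list item; missing parts are None."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//ul[@data-testid='new-and-updated']/li"):
        link = item.find("a")
        date = item.find("p")
        rows.append((
            link.text.strip() if link is not None and link.text else None,
            link.get("href") if link is not None else None,
            date.text.strip() if date is not None and date.text else None,
        ))
    return rows


rows = new_and_updated(SAMPLE)
for row in rows:
    print(row)
```

Reading the three values per list item is what keeps them aligned; the single combined XPath returns one flat sequence with no placeholder for a missing node.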

Related

scrapy xpath : selector with many <tr> <td>

Hello, I want to ask a question.
I scrape a website with XPath, and the result looks like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use XPath to analyze these results again.
I want to save the first cell as the address, the second as the telephone, and the last one as the map.
But I can't get it to work.
Please guide me. Thank you!
Here is my code. It is wrong: it picks up other elements as well.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As you can see in the Scrapy documentation, to work with relative XPaths you have to prefix them with a dot (.//) so that elements are extracted relative to the previous selector; otherwise you get all matching elements from the whole document again. Here is the sample from the Scrapy documentation:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
    print(p.extract())
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
    print(p.extract())
So I think in your case your code must be something like the following (each s here is already a <tr>, so the paths start from its own <td> children):
for s in store:
    address = s.xpath("./td[1]/text()").extract()
    tel = s.xpath("./td[2]/text()").extract()
    map = s.xpath("./td[3]/text()").extract()
Hope this helps,
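For a version that runs without Scrapy, here is a minimal sketch of the same per-row extraction using the standard-library `xml.etree.ElementTree`. The `TABLE` markup is a hypothetical two-row stand-in for the scraped page, and ElementTree supports only a subset of XPath, so this illustrates the relative-path idea rather than replacing the Scrapy code.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table standing in for the scraped page.
TABLE = """
<table>
  <tr><td>address1</td><td>phone1</td><td>map1</td></tr>
  <tr><td>address2</td><td>phone2</td><td>map2</td></tr>
</table>
"""

root = ET.fromstring(TABLE)
stores = []
for row in root.findall(".//tr"):
    # Each path is relative to the current row, so a lookup only sees
    # the cells of that row and never the rest of the document.
    stores.append({
        "address": row.findtext("./td[1]"),
        "tel": row.findtext("./td[2]"),
        "map": row.findtext("./td[3]"),
    })

print(stores)
```

Building one dictionary per row keeps the address, telephone and map of a store together, instead of producing three long lists that have to be matched up by position afterwards.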

Finding an Image Icon Next to a Text Item in Watir-WebDriver

The context is I'm using watir-webdriver and I need to locate if an image appears prior to a particular item in a list.
More specifically, there is a section of the site that has articles uploaded to it. Those articles appear in a list. The structure looks like this:
<div id="article-resources">
<ul class="components">
...
<li>
<div class="component">
<img src="some/path/article.png">
<div class="replies">
<label>Replies</label>
</div>
<div class="subject">
Saving the Day
</div>
</div>
</li>
...
</ul>
</div>
Each article appears as a separate li item. (The ellipses above are just meant to indicate I can have lots of list items.)
What I want our automation to do is find out if the article has been appropriately given the image article.png. The trick is I need to make sure the actual article -- in the above case, "Saving the Day" -- has the image next to it. I can't just check for the image because there will be multiples.
So I figured I had to use xpath to solve this. Using Firefox to help look at the xpath gave me this:
id("article-resources")/x:ul/x:li[2]/x:div/x:img
That does me no good, though, because the key discriminator seems to be the li[2], but I can't count on this article always being the second in the list.
So I tried this:
article_image = '//div[@class="component"]/a[contains(.,"Saving the Day")]/../img'
@browser.image(:xpath => article_image).exist?.should be_true
The output I get is:
expected: true value
got: false (RSpec::Expectations::ExpectationNotMetError)
So it's not finding the image which likely means I'm doing something wrong since I'm certain the test is on the correct page.
My thinking was I could use the above to get any link (a) tags in the div area referenced as class "component". Check if the link has the text and then "back up" one level to see if an image is there.
I'm not even checking the exact image, which I probably should be. I'm just checking if there's an image at all.
So I guess my questions are:
What am I doing wrong with my XPath?
Is this even the best way to solve this problem?
Using Watir
There are a couple of approaches possible.
One way would be find the link, go up to the component div and then check for the image:
browser.link(:text => 'Saving the Day').parent.parent.image.present?
or
browser.div(:class => 'subject', :text => 'Saving the Day').parent.image.present?
Another approach, which is a little more robust to changes, is to find the component div that contains the link:
browser.divs(:class => 'component').find { |component|
component.div(:class => 'subject', :text => 'Saving the Day').exists?
}.image.present?
Using XPath
The above could of course be done through xpath as well.
Here is your corrected xpath:
article_image = '//div[@class="component"]//a[contains(.,"Saving the Day")]/../../img'
puts browser.image(:xpath => article_image).present?
Or alternatively:
article_image = '//a[contains(.,"Saving the Day")]/../../img'
browser.image(:xpath => article_image).present?
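Again, there is also the top-down approach (note the leading dot in .//a, which keeps the inner test relative to each component div instead of matching any link in the document):
article_image = '//div[@class="component"][.//a[contains(.,"Saving the Day")]]/img'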
browser.image(:xpath => article_image).present?
You can read more about these approaches and other options in the book Watirways.
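The top-down approach can also be sketched without a browser. The Python below is an illustration only: `PAGE` is a hypothetical, well-formed copy of the structure in the question (the `img` tag is self-closed and a second article without an icon is added), and `article_has_image` is a made-up helper name. It finds the component whose subject text matches and then checks that component for the expected image, which also covers the point about checking the exact image rather than any image.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed copy of the question's markup, plus one
# invented article that has no icon.
PAGE = """
<div id="article-resources">
  <ul class="components">
    <li>
      <div class="component">
        <img src="some/path/article.png"/>
        <div class="replies"><label>Replies</label></div>
        <div class="subject">
          Saving the Day
        </div>
      </div>
    </li>
    <li>
      <div class="component">
        <div class="subject">No Icon Here</div>
      </div>
    </li>
  </ul>
</div>
"""


def article_has_image(root, title, image_name="article.png"):
    """Return True if the component whose subject is `title` holds the image."""
    for component in root.findall(".//div[@class='component']"):
        subject = component.find("./div[@class='subject']")
        if subject is None or (subject.text or "").strip() != title:
            continue
        img = component.find("./img")
        return img is not None and img.get("src", "").endswith(image_name)
    return False


root = ET.fromstring(PAGE)
print(article_has_image(root, "Saving the Day"))
print(article_has_image(root, "No Icon Here"))
```

Starting from the container and testing its children avoids counting `..` steps, so the check keeps working if the subject is later wrapped in another element.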

Xpath/HtmlAgilityPack: Getting the specific attributes from href tag

I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
                  where
                      lnks.Attributes["href"] != null &&
                      lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText
                  };
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and url separately, and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could lead me in the right direction.
Thanks,
Miles
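One observation: in the query above, the anonymous type already holds `Url` and `Text` as two separate properties, so the values are only joined wherever they are later displayed. As a language-neutral illustration of keeping them apart, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree` rather than HtmlAgilityPack. The `PAGE` markup is hypothetical (only the first product and its URL come from the question), and the same filters are applied: an `href` must exist and the text must be non-empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: only the first product is taken from the question.
PAGE = """
<div>
  <h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
  <h3 class="product-name"><a>No Link Product</a></h3>
  <h3 class="other"><a href="http://www.somewebsite.com/skip">Skipped</a></h3>
</div>
"""

root = ET.fromstring(PAGE)

# Keep only anchors that have an href and non-empty text.
links = [
    a for a in root.findall(".//h3[@class='product-name']//a")
    if a.get("href") is not None and (a.text or "").strip()
]

# Two separate lists of plain strings, aligned by index.
names = [a.text.strip() for a in links]
urls = [a.get("href") for a in links]

print(names)
print(urls)
```

Filtering the nodes once and then projecting twice keeps the two lists the same length, so `names[i]` always belongs to `urls[i]`.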

How to read id value of a DIV element using Selenium WebDriver?

<div id="ctl00_ContentHolder_vs_ValidationSummary" class="errorblock">
<p><strong>The following errors were found:</strong></p>
<ul><input type="hidden" Name="SummaryErrorCmsIds" Value="E024|E012|E014" />
<li>Please select a title.</li>
<li>Please key in your first name.</li>
<li>Please key in your last name.</li>
</ul>
</div>
Here is my snippet, for example. I want to get the value of the id, i.e. ctl00_ContentHolder_vs_ValidationSummary, using Selenium WebDriver.
You can try this:
String id = driver.findElement(By.xpath("//div[@class='errorblock']")).getAttribute("id");
But for this to work, the class of this div must be unique on the page.
Use the following code to extract the id of the first div:
WebElement div = driver.findElement(By.tagName("div"));
div.getAttribute("id");
This is the code for all divs available on the page:
List<WebElement> divs = driver.findElements(By.tagName("div"));
for (WebElement e : divs) {
    // Read the id from the current element, not from the list
    System.out.println(e.getAttribute("id"));
}
I know this answer is really late but I wanted to put this here for those who come later. Searching by XPath should be avoided unless absolutely necessary because it is more complicated, more error prone, and slower. In this case you can easily do what the accepted answer did without having to use XPaths:
String id = driver.findElement(By.cssSelector("div.errorblock")).getAttribute("id");
Some explanation... this line finds the first element (.findElement vs .findElements) using a CSS Selector. The CSS Selector, div.errorblock, locates all div elements with the class (symbolized by the period .) errorblock. Once it is located, we get the ID using .getAttribute().
CSS Selectors are a great tool that all automators should have in their toolbox. There's a great CSS Selector reference here: http://www.w3.org/TR/selectors/#selectors.
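To see the two lookups side by side without a browser, here is a minimal Python sketch using the standard-library `xml.etree.ElementTree` instead of Selenium. The `PAGE` markup is a hypothetical, trimmed copy of the snippet in the question with one extra invented div in front of it, which shows why taking the first div on the page is fragile.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed page: the "header" div is invented for illustration.
PAGE = """
<body>
  <div id="header" class="banner">Sign up</div>
  <div id="ctl00_ContentHolder_vs_ValidationSummary" class="errorblock">
    <p><strong>The following errors were found:</strong></p>
    <ul>
      <li>Please select a title.</li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(PAGE)

# Look the div up by its class, then read the id attribute.
error_div = root.find(".//div[@class='errorblock']")
error_id = error_div.get("id") if error_div is not None else None

# Ids of every div on the page, in document order.
all_div_ids = [div.get("id") for div in root.iter("div")]

print(error_id)
print(all_div_ids)
```

One difference worth keeping in mind: the XPath predicate `[@class='errorblock']` compares the whole attribute value, while the CSS selector `div.errorblock` also matches an element whose class attribute lists several classes.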

JSP: Function call inside loop becomes very slow. Help me optimize

In my JSP, I loop through an object containing a list of employees and display it.
For each employee row, I also provide a link so that the user can view the employee's details. The link calls a Javascript function where the employee ID is passed.
The problem I am having is that the response time dramatically increases with the number of rows in my object. When my object contains over a thousand records, it takes at least 30 secs to render the page. Here's the code:
function getDetail(empID) {
    // AJAX call to get employee details
}
.
.
<table>
<c_rt:forEach var="emp" items="${employeeListObj}">
<tr>
<td>
<c:out value="${emp.lastName}" />
</td>
</tr>
</c_rt:forEach>
</table>
I have narrowed down the culprit to the employee ID parameter being dynamic, that is, evaluated at runtime. I initially thought it was a JSTL c:out issue, but I also tried changing to an ordinary JSP variable (i.e. getDetail('<%=ctr%>')), and the response time is still slow.
But when I changed it to a static string (i.e. getDetail('some static string')), the response time becomes fine.
I also tried passing it as a function (i.e. onClick="getDetail(function () {return ''})") but response time still didn't improve.
Is there a better (more optimized) way of doing this that will result in a better response time?
Thanks for the replies, but I have figured out a simple solution. It is not the most elegant, but it's a simple change and it serves my end users' needs (they don't want pagination, just a scrollable DIV area).
Instead of using this statement inside the loop:
<a href="#" onClick="getData('<c:out value="${emp.id}"/>')">
I used the employee ID as the ID of the anchor tag and passed that one instead:
<a id='<c:out value="${emp.id}"/>' href="#" onClick="getData(this.id)">
I don't know why, but the difference was night and day in terms of page rendering time. It now renders in just under 5 seconds, compared to over a minute when passing the c:out value directly. I was dealing with 10,000 records, by the way.
If the problem is in the browser, you could replace the inline click handlers with attributes containing the employee number and a single click handler added through JavaScript (e.g., jQuery).
EDIT: For example:
$('table.SomeClass a').click(function() {
    getDetail($(this).attr('data-employeeId'));
    return false;
});
