I'm really struggling with XPath. I've read a couple of guides and I just can't seem to get this right.
Basically, I want to extract all URLs that contain "/ro_ro/".
<link rel="alternate" href="https://www.stackoverflow.com/pl_pl/" hreflang="pl-PL">
<link rel="alternate" href="https://www.stackoverflow.com/pt_br/" hreflang="pt-BR">
<link rel="alternate" href="https://www.stackoverflow.com/pt_pt/" hreflang="pt-PT">
<link rel="alternate" href="https://www.stackoverflow.com/ro_ro/" hreflang="ro-RO">
<link rel="alternate" href="https://www.stackoverflow.com/fi_fi/" hreflang="fi-FI">
Ideally, the XPath query would return: https://www.stackoverflow.com/ro_ro/.
I've come close, but the page contains several other links to the same URL. None of those carry an hreflang attribute.
I need to do this at scale, so the URLs I want to extract will also include deep pages such as: https://www.stackoverflow.com/ro_ro/xpath-help-for-a-noob/
Edit: Any ideas why this got downvotes?
Try below XPath to get desired href from link element that contains hreflang attribute:
//link[@hreflang and contains(@href, 'ro_ro')]/@href
In Google Sheets, you can get the href of every link element that has an hreflang attribute with the following formula:
=importxml("https://example.org", "//link[@hreflang]/@href")
You should be able to get those link elements with the expression
descendant::link[contains(@href, 'ro_ro')]
evaluated with the document's root node as the current node.
The descendant axis tells XPath to search all descendant nodes (children, grandchildren, and so on). ::link restricts the selection to nodes named link, and the expression in square brackets means "select only those nodes whose href attribute contains 'ro_ro'". Append /@href to return the URL itself rather than the element.
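The filter these answers describe (link elements that carry an hreflang attribute and whose href contains /ro_ro/) can also be sketched without an XPath engine. The example below is a minimal illustration using only Python's standard html.parser module; the class name, function name, and sample markup are assumptions made for the example, not part of the original question.

```python
from html.parser import HTMLParser


class HreflangLinkCollector(HTMLParser):
    """Collect href values of <link> elements that carry an hreflang attribute."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle  # substring the href must contain, e.g. "/ro_ro/"
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        # Mirrors the predicate [@hreflang and contains(@href, '/ro_ro/')]
        if "hreflang" in attributes and self.needle in href:
            self.matches.append(href)


def extract_locale_urls(html, needle="/ro_ro/"):
    """Return the matching href values in document order."""
    collector = HreflangLinkCollector(needle)
    collector.feed(html)
    return collector.matches


# Hypothetical sample: one matching link, plus links to the same locale
# that lack hreflang and therefore must be ignored.
sample = """
<link rel="alternate" href="https://www.stackoverflow.com/pt_pt/" hreflang="pt-PT">
<link rel="alternate" href="https://www.stackoverflow.com/ro_ro/" hreflang="ro-RO">
<a href="https://www.stackoverflow.com/ro_ro/">Romanian site</a>
<link rel="stylesheet" href="https://www.stackoverflow.com/ro_ro/style.css">
"""

print(extract_locale_urls(sample))
```

Because the check is a plain substring test on href, deep page URLs under /ro_ro/ match in the same way as the locale root.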
I'm currently going through a tutorial on Scrapy. I've run into the following issue when using XPath to select certain elements from an HTML file, for example:
<html>
<head>
<title>Title of the page</title>
</head>
<body>
<h1>H1 Tag</h1>
<h2>H2 Tag with link</h2>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</body>
</html>
The output for the line response.xpath('/html/head/title').extract() returned a list as such:
['<title>Title of the page</title>\n </head>\n <body>\n <h1>H1 Tag</h1>\n <h2>H2 Tag with link</h2>\n <p>First Paragraph</p>\n <p>Second Paragraph</p>\n </body>\n</html>\n'].
It seems like it was able to start from the correct tag but it doesn't stop at the closing tag. Using Visual Studio Code v.1.65.1. Any help would be greatly appreciated.
As you have not provided links or specific HTML that you are trying to parse, it is not possible to reproduce the problem. You do not have a problem in this XPath or this HTML that you posted. See below my results:
In [1]: response.xpath('/html/head/title').extract()
Out[1]: ['<title>Title of the page</title>']
That being said, there is a separate issue worth addressing here. extract always returns a list, even if there is only one match. To get the first match as a string, the method is extract_first.
Because of this, Scrapy now recommends using get to get the first match as a string, and getall to get the list of strings. See the Scrapy documentation on selectors.
How can I parse a Meta Tag such as
<meta itemprop="email" content="email@example.com" class="">
..and extract the email out of it.
When I copy the xPath of this tag, I get the following, which doesn't work
//*[@id="businessDetailsPrimary"]/div[2]/div/meta
Please advise.
Many thanks
The likelihood is that the itemprop="email" attribute will be unique across the webpage. In this case, you can select the email by accessing the content attribute via its XPath as follows:
//meta[@itemprop="email"]/@content
In case itemprop="email" is not unique, you can make your XPath more specific by selecting the element with id equal to businessDetailsPrimary first:
//*[@id="businessDetailsPrimary"]//meta[@itemprop="email"]/@content
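As a rough illustration of the two selections above, here is a sketch using Python's standard xml.etree.ElementTree module. ElementTree supports only a subset of XPath and cannot return attribute nodes, so the element is located with the path and the content attribute is read in Python. The surrounding markup is hypothetical, and the meta tag is self-closed so that the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup: the div structure is hypothetical and the
# meta tag is self-closed so that an XML parser accepts it.
snippet = """
<body>
  <div id="businessDetailsPrimary">
    <div>
      <meta itemprop="email" content="email@example.com" class=""/>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(snippet)

# Counterpart of //meta[@itemprop="email"]/@content:
# find the element, then read the attribute.
meta = root.find(".//meta[@itemprop='email']")
email = meta.get("content") if meta is not None else None

# Counterpart of the more specific query scoped to the element with the given id.
scoped = root.find(".//*[@id='businessDetailsPrimary']//meta[@itemprop='email']")
scoped_email = scoped.get("content") if scoped is not None else None

print(email)
print(scoped_email)
```

Real-world HTML is often not well-formed XML, so a page fetched from the web would usually need an HTML-aware parser before a query like this can run.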
Is it legitimate to set the canonical link to the pound symbol as shown below, or am I required to enter a physical page name?
<link rel="canonical" href="#">
When testing this, the pound setting does not generate a validation error (ala #development=1). In my scenario, the page using this layout file will not have an alternate "regular HTML" version. The only version will be the AMP HTML version.
For additional context, I'm experimenting with an MVC site that will use AMP HTML. To keep my layout file simple, I'd prefer to use the pound symbol rather than extracting the child page name and applying that to the href attribute. I know how to apply the URL to the partial view via code like so:
<link rel="canonical" href="@HttpContext.Current.Request.Url.AbsoluteUri">
I'm just curious if it's legitimate AMP HTML to use the pound symbol instead. Thank you.
From the documentation:
Required markup
AMP HTML documents MUST:
contain a <link rel="canonical" href="$SOME_URL" /> tag inside their head that points to the regular HTML version of the AMP HTML document or to itself if no such HTML version exists.
So instead of using href="#", you should have it point to itself in order to stay consistent with the AMP specifications.
Validation is still evolving, and the validator doesn't catch every issue today. The problem with using "#" or any relative URL is that when this document is served from elsewhere, such as cdn.ampproject.org, that relative URL no longer points to your intended canonical. You should instead use an absolute URL: <link rel=canonical href="URL">.
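To make the relative-URL problem concrete, here is a small sketch using Python's standard urllib.parse module. It resolves the same href against two base URLs. Both URLs are hypothetical, and the cache path layout is illustrative only, not taken from the AMP documentation.

```python
from urllib.parse import urljoin, urlsplit

canonical_href = "#"  # the relative value from the question

# Hypothetical URLs: the page on its own origin, and the same page
# served from a cache host (this path layout is an assumption).
origin_url = "https://www.example.com/articles/page.amp.html"
cached_url = "https://cdn.ampproject.org/c/s/www.example.com/articles/page.amp.html"

# A relative reference is resolved against whatever URL served the document.
resolved_on_origin = urljoin(origin_url, canonical_href)
resolved_on_cache = urljoin(cached_url, canonical_href)

print(urlsplit(resolved_on_origin).netloc)  # www.example.com
print(urlsplit(resolved_on_cache).netloc)   # cdn.ampproject.org

# An absolute href resolves to the same target regardless of the base.
absolute_href = "https://www.example.com/articles/page.amp.html"
resolved_absolute = urljoin(cached_url, absolute_href)
print(resolved_absolute)
```

The host of the resolved canonical changes with the serving location in the relative case, which is the failure described above. The absolute href is unaffected.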
I have a jsp with the following (relevant) setup:
<s:url value="/res" var="res_url" />
<link href="${res_url}/less/bootstrap.less" rel="stylesheet/less">
<link href="${res_url}/less/responsive.less" rel="stylesheet/less">
...
I've noticed a problem with this technique: on the first page load of a new session, my res_url variable has ";jsessionid=xxxxxxxxx" appended. The ID then ends up in the middle of my stylesheet URL, so the stylesheets are not loaded.
I realize that I'm probably not using the URL tag the way it's intended, and that you can include param tags inside the URL tag to get around this, but I don't like that idea and think the way I did it is much cleaner. Is it possible to somehow tell it to ignore the jsessionid? Or is there any other way of doing this?
I don't see the benefit of using Spring's URL tag over the standard JSTL tag. What about
<c:url value="/res/less/bootstrap.less" var="lessBootstrap" />
<link href="${lessBootstrap}" rel="stylesheet/less">
If you want to define the /res/less path in a variable instead of repeating it, you can do it like this:
<c:set var="resDir" value="/res/less" scope="request" />
The right way to do it is
<link href="<s:url value="/res/less/bootstrap.less"/>" rel="stylesheet/less">
<link href="<s:url value="/res/less/responsive.less"/>" rel="stylesheet/less">
I don't see any simpler way to do it.
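The difference between the two patterns can be modelled in a few lines of Python. The encode_url function below is a hypothetical stand-in for the URL rewriting that the tag performs on the first request of a session. It only appends the session ID to whatever path it is given, which is a simplification of what a real container does.

```python
def encode_url(path, session_id=None):
    """Hypothetical stand-in for URL rewriting: append the session ID to the path."""
    if session_id is None:
        return path
    return f"{path};jsessionid={session_id}"


session_id = "ABC123"  # placeholder value for illustration

# Pattern from the question: encode the directory, then append the file name.
res_url = encode_url("/res", session_id)
broken = res_url + "/less/bootstrap.less"

# Pattern from the answers: encode the complete resource path in one call.
working = encode_url("/res/less/bootstrap.less", session_id)

print(broken)   # /res;jsessionid=ABC123/less/bootstrap.less
print(working)  # /res/less/bootstrap.less;jsessionid=ABC123
```

In the first pattern the session ID lands in the middle of the path. In the second it stays at the end, where it does not interfere with the file name.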
Example CSS
#wrap{margin:20px}
Google Code Prettify wraps the whole line in a span with the class com, so it is styled as a comment:
<span class="com">#wrap{margin:20px}</span>
Somebody had a similar issue here, where someone answered "Are you loading lang-css.js?".
Here's what I'm loading in the footer.
<script src="/js/google-code-prettify/lang-css.js"></script>
<script src="/js/google-code-prettify/prettify.js"></script>
I can see both of them with web inspector. I tried changing the order and loading them from the header. I'm using the latest version.
All help is greatly appreciated :)
Thanks!
The order in which you load the JavaScript files matters. Load the base code (prettify.js) first, followed by the CSS-specific code (lang-css.js). You can place the script tags either in the head section or at the end of the document. Both work, but placing them at the end of the document will speed up the page load.
<script src="/js/google-code-prettify/prettify.js"></script>
<script src="/js/google-code-prettify/lang-css.js"></script>
You will also need to ensure that you are linking the stylesheet in the head of your document.
<link rel="stylesheet" type="text/css" href="/css/prettify.css">
You also need to add the correct classes to your pre tag(s). The syntax-highlighting functions contained in lang-css.js will not be called unless the class "lang-css" is added to the <pre> tag.
<pre class="prettyprint lang-css linenums">
Finally, make sure you call the "prettyPrint()" function on page load.
<body onload="prettyPrint()">