I want the xmlns:fb attribute from:
<html xmlns:fb="http://www.facebook.com/2008/fbml"></html>
It doesn't work if I use:
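hxs.select('//html/@xmlns:fb')
You can use //html/namespace::fb instead of //html/@xmlns:fb to get the URI 'http://www.facebook.com/2008/fbml'. In the XPath data model, namespace declarations are not attribute nodes, which is why the @ form selects nothing.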
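If XPath is not a requirement, the same declaration can be read with Python's standard library. The sketch below is not the Scrapy approach from the answer above; it swaps in xml.etree.ElementTree's "start-ns" parse events, and the helper name namespace_declarations is made up for illustration. It assumes the markup is well-formed XML.

```python
import io
import xml.etree.ElementTree as ET

doc = '<html xmlns:fb="http://www.facebook.com/2008/fbml"></html>'


def namespace_declarations(xml_text):
    """Return a {prefix: uri} dict of the xmlns declarations in xml_text.

    Hypothetical helper: relies on the "start-ns" event, which iterparse
    emits once per namespace declaration as a (prefix, uri) tuple.
    """
    found = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        found[prefix] = uri
    return found


namespaces = namespace_declarations(doc)
print(namespaces["fb"])  # http://www.facebook.com/2008/fbml
```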
I have a page whose <title> tag changes depending on the language selected by the user, e.g.
<title>English</title> or <title>Italiano</title>
I'm trying to select that page among many others with an XPath selector:
//*[contains(@title, 'English') or contains(@title, 'Italiano')]
but it doesn't work at all.
I also tried
(//*[contains(@title, 'English')] | //*[contains(@title, 'Italiano')])[1]
with no positive result.
title is not an attribute, so there is no need to add @:
//*[contains(title, 'English') or contains(title, 'Italiano')]
This will return the parent node. If you want to select the title node itself, try
//title[.='English' or .='Italiano']
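As a rough cross-check of the same condition outside XPath, here is a minimal Python sketch using only the standard library. The sample pages, the WANTED set, and the helper has_wanted_title are assumptions made for illustration, and xml.etree.ElementTree only accepts well-formed markup, so real-world HTML may need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample pages, keyed by an arbitrary label.
pages = {
    "en": "<html><head><title>English</title></head><body/></html>",
    "it": "<html><head><title>Italiano</title></head><body/></html>",
    "de": "<html><head><title>Deutsch</title></head><body/></html>",
}

WANTED = {"English", "Italiano"}


def has_wanted_title(markup):
    """True if the document's <title> text is exactly one of WANTED."""
    root = ET.fromstring(markup)
    title = root.find(".//title")  # title is an element, not an attribute
    return title is not None and (title.text or "").strip() in WANTED


matching = [label for label, markup in pages.items() if has_wanted_title(markup)]
print(matching)  # ['en', 'it']
```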
I want to select the div with class "bmBidderButtonText" and with "Low" as inner text, what should I do?
<div class="bmBidderButtonText"><div class="bmBidderButtonArrow"></div>Low</div>
<div class="bmBidderButtonText"><div class="bmBidderButtonArrow"></div>High</div>
Merely //div[@class="bmBidderButtonText"] will select both divs, but how should I include "Low" as inner text as a condition within the XPath?
You can use . to reference the current context element, so implementing the additional criterion "...and with 'Low' as inner text" in XPath is as simple as adding and .='Low' to the predicate of your initial XPath:
//div[@class="bmBidderButtonText" and .="Low"]
Try the XPath below:
//div[@class='bmBidderButtonText'][text() ='Low']
Explanation: it combines the class attribute of the <div> tag with the text() node test, which matches the div's own text nodes.
Use and:
//div[@class="bmBidderButtonText" and contains(., "Low")]
You can use contains() for this purpose:
//div[contains(text(), 'Low')]
Additional resources:
Choosing Effective XPaths
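The answers above differ in what they compare: . is the element's whole string value (all descendant text joined together), while text() looks only at the element's own text nodes. The sketch below mimics the .="Low" test with Python's standard library; it is an illustration rather than an XPath engine, the wrapping <root> element and the helper divs_with_text are assumptions, and the markup must be well-formed for xml.etree.ElementTree to parse it.

```python
import xml.etree.ElementTree as ET

# The two divs from the question, wrapped in a hypothetical <root>.
markup = """<root>
<div class="bmBidderButtonText"><div class="bmBidderButtonArrow"></div>Low</div>
<div class="bmBidderButtonText"><div class="bmBidderButtonArrow"></div>High</div>
</root>"""


def divs_with_text(root, css_class, wanted):
    """Return divs whose class equals css_class and whose string value equals wanted."""
    matches = []
    for div in root.iter("div"):
        if div.get("class") != css_class:
            continue
        # itertext() walks all descendant text, like the XPath string value of "."
        string_value = "".join(div.itertext())
        if string_value == wanted:
            matches.append(div)
    return matches


root = ET.fromstring(markup)
low_divs = divs_with_text(root, "bmBidderButtonText", "Low")
print(len(low_divs))  # 1
```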
I'm trying to scrape content after the occurrence of a particular keyword/string.
Suppose the HTML is as follows:
<meta property="og:url" content="https://www.example.com/tshirt/pcid111-31">
<meta property="og:url" content="https://www.example.com/tshirt/pcid3131-33">
<meta property="og:url" content="https://www.example.com/tshirt/pcid545424524-84">
1) How can I extract all the data inside the content attribute of elements whose property="og:url"?
2) I also want to extract anything that comes after the pcid. Can someone suggest a way to do this?
Not sure if this would work:
item["example"] = sel.xpath("//meta[@property='og:url']/text()").extract()[0].replace("*pcid","")
Does replace accept wildcard characters?
This will extract the content attributes of elements whose property="og:url":
og_urls = response.xpath("//meta[@property='og:url']/@content").extract()
For extracting parts of the URL it's usually best to use a regex. In your case it would be:
import re

for url in og_urls:
    # "pcid(.+)" captures any characters after 'pcid' (greedy)
    ids = re.findall(r"pcid(.+)", url)
    # re.findall() returns a list; you probably want only the first match,
    # and there will most likely be only one anyway
    id = ids[0] if ids else ''
    print(id)
Or you can split the URL at 'pcid' and take the last part, e.g.
for url in og_urls:
    id = url.split('pcid')[-1]
    print(id)
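On the wildcard question: Python's str.replace only replaces literal substrings, so "*pcid" is searched for character by character and never matches these URLs. A minimal sketch of the pattern-based alternative with re.sub follows; the og_urls list is hard-coded from the question's sample data, and strip_through_pcid is a hypothetical helper name.

```python
import re

# Sample values copied from the question's <meta> tags.
og_urls = [
    "https://www.example.com/tshirt/pcid111-31",
    "https://www.example.com/tshirt/pcid3131-33",
    "https://www.example.com/tshirt/pcid545424524-84",
]


def strip_through_pcid(url):
    """Remove everything up to and including the last 'pcid' in url."""
    return re.sub(r".*pcid", "", url)


ids = [strip_through_pcid(url) for url in og_urls]
print(ids)  # ['111-31', '3131-33', '545424524-84']

# str.replace treats "*pcid" literally, so the URL comes back unchanged.
print(og_urls[0].replace("*pcid", "") == og_urls[0])  # True
```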
Try this:
x=len(hxs.select("//meta/@content").extract())
for i in range(x):
    print hxs.select("//meta/@content").extract()[i].split('pcid')[1]
Output:
111-31
3131-33
545424524-84
If I have a string like this:
<f = x1106f='0'>something
and the values inside <> can change, how would I use regular expressions to isolate "something" and replace the tag?
EDIT:
The pattern <(.*?)> worked.
What you need is
string =~ />([^<]+)/
and the something will be captured in $1.
Use the following regex to capture "something":
(?<=>)(.*?)(?=<)
assuming that after "something" there's a closing tag.
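For comparison, here is how the patterns above behave in Python's re module on the exact string from the question. This is only a sketch in a different language from the answers (which use Perl/Ruby style syntax); the variable names are made up for illustration. Note that the lookaround version finds nothing here, because the sample string has no closing tag after "something".

```python
import re

text = "<f = x1106f='0'>something"

# Capture everything after the first '>' up to the next '<' (or end of string).
match = re.search(r">([^<]+)", text)
captured = match.group(1) if match else None
print(captured)  # something

# Remove the tag itself, leaving only the trailing text.
without_tag = re.sub(r"<[^>]*>", "", text)
print(without_tag)  # something

# The lookaround pattern requires a '<' after the text, so it does not match.
lookaround = re.search(r"(?<=>)(.*?)(?=<)", text)
print(lookaround)  # None
```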
I have a web page. The HTML source contains this text:
<meta property="og:title" content="John"/>
John is an example, the name may vary.
I am sure that og:title will appear only once in the text.
This is my code:
$browser.goto( url )
x = $browser.html.gsub( /^.*<meta property="og:title" content="(.+?)".>/m, '\1' )
I expected to find the name John in my variable x
The '\1' should give me the first part I put in the parenthesis, i.e. (.+?), i.e. John, right?
Also, I used a dot . to match the slash /. Is there a better way?
Using the Watir API:
x = browser.meta.attribute_value "content"
I was not able to access the meta element using either CSS or XPath.
If you only want the value of content:
html = '<meta property="og:title" content="John"/>'
=> "<meta property=\"og:title\" content=\"John\"/>"
html[/property="og:title" content="([^"]+)"/, 1]
=> "John"
If you're not familiar with regex, "([^"]+)" might throw you. It means "from the first ", grab everything until the next "". In effect it means "grab everything inside the double quotes".
The gsub call in the question will return all of the HTML, with the matched part (everything from the start of the string up to and including the />) replaced by 'John'. So that comes down to "John", followed by the HTML that was after the /> of that meta property.
If you only want to extract the name, and that tag occurs only once, you can use something like:
@browser.html =~ /<meta property="og:title" content="(.+?)"/
x = $1
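The difference between substituting and capturing is the same in other regex engines. The Python sketch below illustrates it on a made-up page (the html string is an assumption, not the real page source): a search with a capture group returns just the name, while a substitution keeps everything the pattern did not match.

```python
import re

# Hypothetical page source containing the tag from the question.
html = '<html><head><meta property="og:title" content="John"/></head><body>Hello</body></html>'

# Capturing: only the group is returned.
match = re.search(r'<meta property="og:title" content="([^"]+)"', html)
name = match.group(1) if match else None
print(name)  # John

# Substituting: the matched prefix is replaced, the rest of the page remains.
replaced = re.sub(
    r'^.*<meta property="og:title" content="(.+?)"/>', r"\1", html, flags=re.S
)
print(replaced)  # John</head><body>Hello</body></html>
```

Here the slash is matched literally; in a Ruby /.../ literal the equivalent is to escape it as \/ rather than using a dot.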