Detect link with no href - preg-match

$string='<dt class="comment-author " id="c6290261621627103206">
<a name="c6290261621627103206"></a>
bob
said...
</dt>';
I want to be able to detect whether an <a> tag is missing its href attribute. I have tried:
\<a(?!href)(.*?)\<\/a\>
and various preg_match variations on that, but sadly to no avail. I should state that I only want to detect; I do not need to output any values from the string.

Using your RegEx, your matches will be very limited.
Try this RegEx; it finds the value of an href attribute wherever it falls on the page. If no value is returned, there is no href attribute!
(?<=href\=")[^"]+?(?=")
The lookbehind match of href\=" will look for/match the href at any position instead of RIGHT AFTER the a only.
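For the original goal of flagging an <a> tag that has no href at all, a negative lookahead scoped to the opening tag may be closer to what was asked. The sketch below is in Ruby rather than PHP, but it uses only constructs that PCRE also supports, so the same pattern should carry over to preg_match. The constant and method names are made up for illustration, and a regex like this is only a rough check: it does not handle a > inside an attribute value, and it treats names such as data-href as an href.

```ruby
# Matches an opening <a ...> tag whose attribute list contains no href=.
# The negative lookahead scans only up to the closing '>' of the same tag.
A_WITHOUT_HREF = /<a\b(?![^>]*\bhref\s*=)[^>]*>/i

# Hypothetical helper: true if any <a> tag in the HTML lacks an href.
def anchor_missing_href?(html)
  html.match?(A_WITHOUT_HREF)
end

sample = <<~HTML
  <dt class="comment-author " id="c6290261621627103206">
  <a name="c6290261621627103206"></a>
  bob
  said...
  </dt>
HTML

puts anchor_missing_href?(sample)                       # => true
puts anchor_missing_href?('<a href="/home">Home</a>')   # => false
```

For anything beyond a quick detection, an HTML parser is usually the more robust choice.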

Related

XPATH - how to get the text if an element contains a certain class

How do I grab this text here?
I am trying to grab the text here, given that the href contains "#faq-default".
I tried this first of all but it doesn't grab the text, only the actual href name, which is pointless:
//a/@href[contains(., '#faq-default-2')]
There will be many of these hrefs, such as default-2, default-3 so I need to do some kind of contains query, I'd guess?
You are selecting the @href node value instead of the a element value. So try this instead:
//a[contains(@href, '#faq-default-2')]
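A small way to try this out is with REXML, which ships with Ruby (as a bundled gem since Ruby 3.0). The markup below is invented, since the question did not show the page, and the shorter substring '#faq-default' is used so that default-2, default-3 and so on all match.

```ruby
require 'rexml/document'

# Hypothetical markup; the original page was not shown in the question.
xml = <<~XML
  <div>
    <a href="#faq-default-2">Shipping times</a>
    <a href="#faq-default-3">Returns</a>
    <a href="/contact">Contact</a>
  </div>
XML

doc = REXML::Document.new(xml)

# Select the <a> elements themselves, filtered by their href attribute,
# then read the text of each selected element.
links = REXML::XPath.match(doc, "//a[contains(@href, '#faq-default')]")
texts = links.map(&:text)

puts texts.inspect
```

Because the predicate filters the a element rather than stepping into @href, the result is the element, and its text is available.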

How to find xpath of an element under a heading

In a web page:
<h3 class="xh-highlight">Units Currently On Bed List</h3>
"[total beds=0]
"
I want to find the XPath of total beds=0.
How can I do that?
Your question and your comment are a bit contradictory. Do you want to find the text after a heading or do you want to find the element containing the text [total beds=0]? Also, how exact do you want to navigate your document?
To find a text after any h3 element you can use this: //h3/following-sibling::text()[1] (see XPath - select text after certain node).
To find a text after an h3 element with the class "xh-highlight" you can use this: //h3[@class='xh-highlight']/following-sibling::text()[1]
To be even more precise you can also look for the heading text: //h3[@class='xh-highlight' and text()='Units Currently On Bed List']/following-sibling::text()[1]
This doesn't match the html in your first comment however, so you might want to adjust the header class and text values. Also, it will find any first text even if there are other elements between it and the h3 element.
Now, your second comment makes it seem you actually want to find the element containing the text. The reason //*[text()='[total beds=0]'] doesn't work is because of the newline in the text. If you can get rid of that in the source it should match, otherwise you can "ignore" it in the xpath by using //*[normalize-space(text())='[total beds=0]']. (This is assuming the quotes around the text in your question aren't actually in the document.)
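As a rough illustration of the following-sibling approach, here is a sketch using REXML from Ruby. The wrapping div is an assumption; only the h3 and the text come from the question, and the quotes are assumed not to be part of the document.

```ruby
require 'rexml/document'

# Hypothetical wrapper <div>; only the <h3> and the text come from the question.
xml = <<~XML
  <div>
    <h3 class="xh-highlight">Units Currently On Bed List</h3>
    [total beds=0]
  </div>
XML

doc = REXML::Document.new(xml)

# First text node that follows the heading at the same level.
path = "//h3[@class='xh-highlight']/following-sibling::text()[1]"
node = REXML::XPath.first(doc, path)

# The raw text node keeps its surrounding newlines and indentation,
# so strip it before comparing or printing.
found = node.value.strip
puts found   # => [total beds=0]
```

The strip call plays the same role here that normalize-space plays inside the XPath expression.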

How to get href with Watir using Ruby

I'm trying to use Watir to grab a specific link on a page:
Screenshot: Here is the href I am trying to grab.
My guess is I need to specify the ancestor element biz-website(?) then traverse down to the a tag and grab its href somehow, but I'm not sure what the syntax of my code would need to be to do that.
Any ideas or tips?
You should be able to get the value of the href with
browser.span(:class, 'biz-website').a.href
If the class 'biz-website' is not unique for spans on your page, you can also use 'biz-website js-add-url-tagging'. If that is still not unique, you could also try
browser.span(:text, 'Business website').parent.a.href

Ruby/Regex: Dealing with strings containing forward slashes and parentheses using gsub and regex

Hi, I am using Watir to click through some links. I go to a page, click a link based on its text, and then do it again, clicking a new link. I am locating the links based on their text (it is the only way I can, given their HTML) and need to match the text I pulled from the page to the link. The text that I get contains some extra text that is not part of the link, so I need to gsub it out. Here is my issue:
String: text = "Nuclear Launch Codes (Levels One/Two)"
Link: Nuclear Launch Codes (Levels One/Two) Blah Blah Blah
Because the links do not always have the exact text I need to locate them like so: /#{text}/
Problem is that returns "Nuclear Launch Codes (Levels One\/Two)"
I thought I would gsub the first parenthesis and everything after it, but I need to keep it because I can also have Nuclear Launch Codes (Levels Four/Five).
Is there any way to modify the string to match the link while ignoring the rest of the link text?
If I understand you correctly, try:
/#{Regexp.escape(text)}/
Or equivalently, if you prefer:
Regexp.new(Regexp.escape(text))
This will automatically escape parentheses, slashes and so on in the text so they are not treated as special regexp characters.
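A quick check of the difference, using the strings from the question (the variable names link_text, naive and escaped are only for illustration):

```ruby
text      = 'Nuclear Launch Codes (Levels One/Two)'
link_text = 'Nuclear Launch Codes (Levels One/Two) Blah Blah Blah'

# Unescaped, the parentheses become a capture group, so the pattern
# looks for "Codes Levels One/Two" and the literal "(" never matches.
naive = /#{text}/
puts link_text.match?(naive)     # => false

# Escaped, every special character is matched literally.
escaped = /#{Regexp.escape(text)}/
puts link_text.match?(escaped)   # => true
```

Since the pattern is not anchored, the trailing "Blah Blah Blah" in the link text is simply ignored.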

Referring to "title" of <a> in Xpath request

I'm making an Xpath as part of a scraping project I'm working on. However, the only defining feature of the text I want is the title attribute of the enclosing <a> tag like so:
This is what I want to scrape
Is it at all possible to refer to that title and create a path like this?
//tr/td[style='vertical-align:top']/a[title='Vacancy details']
Attributes in XPath expressions need to be prefixed with the @ symbol...
//tr/td/a[@title='Vacancy details']
//tr/td/a[@title='Vacancy details']/@title
You can grab just the title if that's all you want
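Both expressions can be tried with REXML from Ruby. The table row below is a guess at the markup, since the question's HTML did not survive, and the href value is a placeholder.

```ruby
require 'rexml/document'

# Hypothetical table row; the href value is a placeholder.
xml = <<~XML
  <table>
    <tr>
      <td style="vertical-align:top">
        <a href="#" title="Vacancy details">This is what I want to scrape</a>
      </td>
    </tr>
  </table>
XML

doc = REXML::Document.new(xml)

# Select the link by its title attribute and read its text.
link = REXML::XPath.first(doc, "//tr/td/a[@title='Vacancy details']")
puts link.text

# Step into the attribute itself to get only the title.
title = REXML::XPath.first(doc, "//tr/td/a[@title='Vacancy details']/@title")
puts title.value
```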
