Scrapy - How can I handle a random number of elements? - xpath

I have a Scrapy crawler with which I can comfortably acquire the first desired paragraph, but sometimes there is a second or third paragraph as well.
response.xpath(f"string(//h2[contains(text(), '{card}')]/following-sibling::p)").get()
is the xpath code I am using to acquire said paragraph.
response.xpath(f"string(//h2[contains(text(), '{card}')]/following-sibling::p[1])").get() acquires the same paragraph, but sometimes, I need response.xpath(f"string(//h2[contains(text(), '{card}')]/following-sibling::p[2])").get().
How might I go about taking this varying number of paragraphs into account when scraping?

You could try to use a wildcard *.
EDIT: With the string() function you'll only get the first paragraph, because string() converts only the first node of the set it is given.
Just remove string() from your XPath expression to get all the paragraphs (assuming they are in the same node) and store the result in a variable.
//h2[contains(text(), '{card}')]/following-sibling::p/text()
Alternative: If you know the maximum possible number of paragraphs, you can use concat().
concat(//h2[contains(text(), '{card}')]/following-sibling::p[1],'|',//h2[contains(text(), '{card}')]/following-sibling::p[2])
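The "drop string() and keep every match" suggestion could look roughly like this in a spider. This is only a sketch: card_paragraphs is a hypothetical helper name, and response is assumed to be a Scrapy response (or any object exposing the same xpath()/get() selector interface).

```python
def card_paragraphs(response, card):
    """Return the text of every <p> sibling that follows the matching <h2>.

    `response` is assumed to be a Scrapy response (or a selector with the
    same .xpath()/.get() interface); `card` is the heading text to look for.
    """
    query = f"//h2[contains(text(), '{card}')]/following-sibling::p"
    # Iterating the selector list yields one selector per <p>, so string(.)
    # is evaluated once per paragraph instead of once for the whole set.
    return [p.xpath("string(.)").get().strip() for p in response.xpath(query)]
```

The list has one entry per paragraph, however many there are, so " ".join(card_paragraphs(response, card)) gives a single string if that is what you need. Two caveats: following-sibling::p also matches paragraphs that come after later headings under the same parent, and a card value containing an apostrophe would break the quoted XPath literal.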

Related

How to get multiple occurrences of an element with XPath when using normalize-space and substring-before

I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matching element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'] - both actions.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching definition.
I don't understand how adding normalize-space and/or substring-before influences the "main" expression so that it stops recognizing multiple occurrences of the targeted element and gets only the first. Without the addition it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can only accept one occurrence of the element you are trying to normalize or find a substring of. Each of your expressions passes them a sequence of 3 nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea).
In light of that, try:
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the apply operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
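If only XPath 1.0 is available, one workaround is to select the nodes with XPath and apply the string functions to each node in the host language. The sketch below uses Python's standard xml.etree.ElementTree; the SAMPLE markup and the helper names are made up for illustration and are not taken from the page in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the three matching <div> elements.
SAMPLE = """
<root>
  <div class="col-md-9 col-xs-12">  Very   good
     12 Bewertungen </div>
  <div class="col-md-9 col-xs-12">Good 3 Bewertungen</div>
  <div class="col-md-9 col-xs-12">No reviews yet</div>
</root>
"""

def normalize_space(text):
    """Trim the ends and collapse whitespace runs, like XPath normalize-space().
    (str.split() treats a few more characters as whitespace than XPath does.)"""
    return " ".join(text.split())

def substring_before(text, marker):
    """Like XPath substring-before(): empty string when the marker is absent."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""

def texts_before(root, marker):
    """Apply both functions to every matching <div>, one node at a time."""
    divs = root.findall(".//div[@class='col-md-9 col-xs-12']")
    return [substring_before(normalize_space("".join(d.itertext())), marker)
            for d in divs]

results = texts_before(ET.fromstring(SAMPLE), "Bewertung")
```

Here results holds one string per matching element, which is the "sequence of strings" that XPath 1.0 itself cannot represent.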

How do I write a regex for Excel cell range?

I need to validate that something is an Excel cell range in Ruby, e.g. "A4:A6". By looking at it, the requirement I am looking for is:
<Alphabetical, Capitalised><Integer>:<Alphabetical, Capitalised><Integer>
I am not sure how to form a RegExp for this.
I would appreciate a small explanation for a solution, as opposed to purely a solution.
A bonus would be to check that the range is restricted to within a row or column. I think this would be out of scope of Regular Expressions though.
I have tried /[A-Z]+[0-9]+:[A-Z]+[0-9]+/. This works, but it is unanchored, so it allows extra characters at the beginning or end.
"HELLOAA3:A7".match(/\A[A-Z]+[0-9]+:[A-Z]+[0-9]+\z/) also returns a match, but is more on the right track.
How would I limit the number range to 10000?
How would I limit the number of characters to 3?
This is my solution:
(?:(?:\'?(?:\[(?<wbook>.+)\])?(?<sheet>.+?)\'?!)?(?<colabs>\$)?(?<col>[a-zA-Z]+)(?<rowabs>\$)?(?<row>\d+)(?::(?<col2abs>\$)?(?<col2>[a-zA-Z]+)(?<row2abs>\$)?(?<row2>\d+))?|(?<name>[A-Za-z]+[A-Za-z\d]*))
It includes named ranges, but the R1C1 notation is not supported.
The pattern is written in the Perl-compatible regex dialect (i.e. it can also be used with C#). I'm not familiar with Ruby, so I can't tell the difference, but you may want to look here: What is the difference between Regex syntax in Ruby vs Perl?
This will do both: match an Excel range and check that it stays within the same row or column.
^([A-Z]+)(\d+):(\1\d+|[A-Z]+\2)$
A4:A6 // ok
A5:B10 // not ok
B5:Z5 // ok
AZ100:B100hello // not ok
The magic here is the back-reference:
([A-Z]+)(\d+) -- column is in capture group 1, row in group 2
(\1\d+|[A-Z]+\2) -- either the first column followed by any row number,
-- or any column letters followed by the first row
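The two follow-up limits (at most 3 column letters, rows up to 10000) are easier to enforce with a bounded quantifier plus a numeric check than with a regex alone. The sketch below is in Python rather than Ruby; is_valid_range and max_row are illustrative names, and the pattern itself should carry over to Ruby when wrapped in \A ... \z.

```python
import re

# Up to 3 capital letters for each column, digits for each row.
CELL_RANGE = re.compile(r"([A-Z]{1,3})([0-9]+):([A-Z]{1,3})([0-9]+)")

def is_valid_range(text, max_row=10000):
    """True for ranges like "A4:A6" that stay within one row or one column."""
    # fullmatch anchors both ends, so leading or trailing extras are rejected.
    m = CELL_RANGE.fullmatch(text)
    if m is None:
        return False
    col1, row1, col2, row2 = m.groups()
    # Numeric limits are simpler to check as integers than inside the regex.
    rows_ok = all(1 <= int(r) <= max_row for r in (row1, row2))
    same_line = col1 == col2 or int(row1) == int(row2)
    return rows_ok and same_line
```

For example, is_valid_range("A4:A6") and is_valid_range("B5:Z5") are true, while "A5:B10", "AZ100:B100hello" and "HELLOAA3:A7" are rejected.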

Is it possible to get the last occurrence of a newline in a text node with XPath 1.0?

I think the answer is "no" but I'll ask anyway. Can you find the last occurrence of a newline in text node using XPath 1.0?
E.g. Given the following XML I want to find the last newline (immediately after "second") in order to get the text "third".
<element> first
second
third </element>
If I knew the position of the last newline it would be trivial to get the text after it. I don't actually want to return the value, just test against it.
As far as I can tell XPath 1.0 doesn't have any of:
reverse text functions
loops
character axis/node
regex
string split
Any of the above would be enough to solve this problem!
Can you find the last occurrence of a newline in text node using XPath 1.0?
No. XPath generally has not been designed to do string processing.
Of course in XPath 2.0 you can do it by tokenizing the input into sequence and then getting the last element from that sequence. But strictly speaking that does not qualify as text processing, it's sequence handling. In other words, it won't actually give you the position of that last newline character either.
With XPath 1.0 you will have to do this bit of work in the host language.
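As an illustration of handing this off to the host language, here is a minimal Python sketch using the standard library. The XML is the example from the question; text_after_last_newline is a made-up helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<element> first\nsecond\nthird </element>"

def text_after_last_newline(text):
    """Return whatever follows the final newline, or the whole text if there is none."""
    return text.rpartition("\n")[2]

# Select the node with the XML library, then do the string work in Python.
node_text = ET.fromstring(SAMPLE).text
last_line = text_after_last_newline(node_text).strip()
```

With the sample above, last_line is "third", and the test against it can then be done in ordinary code.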

Verifying searched text displayed is in a single line

How can I test whether a sentence (combination of four or five words) is displayed in a single line?
I have to search with a name or some other fields. After search results are displayed, I should test whether the displayed text is a single line. For example, the code below is used to verify the search result link:
//ol[contains(@class,'search results')]/li[contains(@class,'mod result') and contains(@class,'XXXXXX')]//a[contains(@href,'trk=XXXXXX')]
I am not familiar with Ruby, but the following Java approach should work in any language.
Assuming that your "sentence" is entirely contained in one element, you could find all occurrences with something like:
driver.findElements(By.xpath("//*[text()='your sentence']"))
Then simply test for the size of the array.
Assuming that a single or multiple lines will be contained within a single DOM element, you could use the vertical component of the element size to check for the multiple line condition.
webElement.getSize()
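The height comparison could be sketched as follows in Python. It assumes element is a Selenium WebElement (whose size property is a dict with "height" and "width") and that you already know the rendered line height in pixels; is_single_line and line_height_px are hypothetical names, and padding or borders on the element may require a looser threshold.

```python
def is_single_line(element, line_height_px):
    """True if the element's rendered height is less than two text lines.

    `element` is assumed to expose Selenium's WebElement.size dict;
    `line_height_px` is the known height of one line of text, in pixels.
    """
    height = element.size["height"]
    # Text that wraps onto a second line makes the box at least two lines tall.
    return height < 2 * line_height_px
```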

XPath different in IE and Firefox. Why?

I used Firebug's Inspect Element to capture the XPath in a webpage, and it gave me something like:
//*[@id="Search_Fields_profile_docno_input"]
I used the Bookmarklets technique in IE to capture the XPath of the same object, and I got something like:
//INPUT[@id='Search_Fields_profile_docno_input']
Notice that the first one does not have INPUT but an asterisk (*) instead. Why am I getting different XPath expressions? Does it matter which one I use for my tests like:
Selenium.Click(//*[@id="Search_Fields_profile_docno_input"]);
OR
Selenium.Click(//INPUT[@id='Search_Fields_profile_docno_input']);
//*[@id=...] denotes that it can be any element, while the second one clearly tells Selenium to look ONLY for INPUT elements whose id is Search_Fields_profile_docno_input. The second XPath is better for the following reasons:
It takes more time to find the element using * as IDs of all elements should be matched.
If your HTML code is not "well written" there could be other elements which have the same id and this could cause your test to fail.
The first one matches any element with a matching ID, whereas the second one restricts matches to <input> elements. If these were CSS expressions it'd be the difference between #Search_Fields_profile_docno_input and input#Search_Fields_profile_docno_input.
Assuming you only use this ID once in your web page, the two XPaths are effectively equivalent. They'll both match the <input id="Search_Fields_profile_docno_input"> element and no other.
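The difference between the two expressions only shows up when the ID is not unique. The sketch below uses Python's standard xml.etree.ElementTree on a made-up fragment with a deliberately duplicated id; the markup is an assumption for illustration, and the tag name is written in lowercase because XML matching is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment where the same id appears on two elements.
PAGE = ET.fromstring("""
<html><body>
  <form>
    <input id="Search_Fields_profile_docno_input" type="text" />
  </form>
  <div id="Search_Fields_profile_docno_input">duplicate id (invalid HTML, but it happens)</div>
</body></html>
""")

ANY_TAG = ".//*[@id='Search_Fields_profile_docno_input']"
INPUT_ONLY = ".//input[@id='Search_Fields_profile_docno_input']"

# The wildcard matches every element carrying the id; the tag test narrows it.
any_matches = [el.tag for el in PAGE.findall(ANY_TAG)]
input_matches = [el.tag for el in PAGE.findall(INPUT_ONLY)]
```

Here the wildcard form matches both the input and the div, while the tag-specific form matches only the input. With a unique id, both lists would contain the same single element.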
There are some good answers to your "why?" question here, but for Selenium use, there's an even better alternative. Since your page element has an ID attribute, use Selenium's ID locator instead of XPath or CSS:
Selenium.Click("id=Search_Fields_profile_docno_input");
This will go directly to the element, and will run quicker than just about any other locator. Note that the syntax is id=value, not id="value".
Given any element in your document, there's an infinite number of XPath expressions that will select it uniquely. Therefore it's entirely reasonable for two different products to generate two different paths.
Google has just released Wicked Good XPath, a rewrite of Cybozu Lab's famous JavaScript-XPath. Link: https://code.google.com/p/wicked-good-xpath/ The rewritten version is 40% smaller and about 30% faster than the original implementation.
You can check this out and replace the one being used in Selenium.

Resources