I'm crawling a webpage using XPath and I need to write the deposit as a number.
The deposit needs to be ("monthly rent" x "amount of prepaid rent").
The result should be 15450 in this case:
<table>
<tr>
<td>monthly rent: </td>
<td>5.150,00</td>
</tr>
<tr>
<td>deposit: </td>
<td>3 mdr.</td>
</tr>
</table>
I am currently using the following XPath to find the info:
//td[contains(.,'Depositum') or contains(.,'Husleje ')]/following-sibling::td/text()
But I don't know how to remove "mdr." from the deposit, or how to multiply the two numbers and return only one number to the database.
You can use the following query which is compatible with XPath 1.0 and upwards:
substring-before(//td[contains(.,'deposit:')]/following-sibling::td/text(), ' mdr.') * translate(//td[contains(.,'monthly rent:')]/following-sibling::td/text(), ',.', '') div 100
Output:
15450
Step by Step Explanation:
// Get the deposit and remove mdr. from it using substring-before
substring-before(//td[contains(.,'deposit:')]/following-sibling::td/text(), ' mdr.')
// Arithmetic multiply operator
*
// The number format 5.150,00 can't be used for arithmetic calculations.
// Therefore we get the monthly rent and remove the . and , characters from it.
// Note that this is equivalent to multiplying it by 100. That's why we divide
// by 100 later on.
translate(//td[contains(.,'monthly rent:')]/following-sibling::td/text(), ',.', '')
// Divide by 100
div 100
You can refer to the List of Functions and Operators supported by XPath 1.0 and 2.0
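If you post-process in Python instead, the same arithmetic can be sketched with the standard library only. This is a minimal sketch, not part of the XPath answer: the helper names value_cell and deposit are made up for illustration, and the table is the sample from the question.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question.
HTML = """<table>
<tr><td>monthly rent: </td><td>5.150,00</td></tr>
<tr><td>deposit: </td><td>3 mdr.</td></tr>
</table>"""


def value_cell(table, label):
    """Return the text of the <td> that follows the <td> containing `label`."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and label in (cells[0].text or ""):
            return cells[1].text or ""
    raise ValueError(f"no row labelled {label!r}")


def deposit(table):
    # Same idea as substring-before(..., ' mdr.')
    months = int(value_cell(table, "deposit:").split(" mdr.")[0])
    # Same idea as translate(..., ',.', ''): "5.150,00" becomes "515000"
    cents = int(value_cell(table, "monthly rent:").replace(".", "").replace(",", ""))
    # Removing the separators multiplied the rent by 100, so divide again.
    return months * cents // 100


result = deposit(ET.fromstring(HTML))
print(result)  # 15450
```

Working in whole cents keeps the calculation in integers, which mirrors why the XPath version divides by 100 at the end.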
Pure XPath solution:
translate(
/table/tr/td[contains(., 'monthly rent')]/following-sibling::td[1],
',.',
'.'
)
*
substring-before(
/table/tr/td[contains(., 'deposit')]/following-sibling::td[1],
' mdr'
)
I ended up with a solution quite similar to hek2mgl's correct answer, with two differences. There is no need to divide by 100, because the comma is converted to a dot and the dot is removed. The <td> elements containing numeric data also have positional predicates, to avoid matching more elements if the actual table is not as simple as the given example. The XPath number format requires a dot as the decimal separator and no thousands separators.
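To see why translate(..., ',.', '.') turns the comma into a dot and drops the original dot, here is a small Python emulation of the XPath 1.0 translate rules. The function xpath_translate is a hypothetical helper written for this illustration, not a library call.

```python
def xpath_translate(value, source, target):
    """Mimic XPath 1.0 translate(): source[i] maps to target[i];
    characters of `source` without a counterpart in `target` are removed."""
    mapping = {}
    for i, ch in enumerate(source):
        if ch not in mapping:  # the first occurrence in `source` wins
            mapping[ch] = target[i] if i < len(target) else None
    out = []
    for ch in value:
        if ch not in mapping:
            out.append(ch)            # untouched character
        elif mapping[ch] is not None:
            out.append(mapping[ch])   # replaced character
        # else: character is dropped
    return "".join(out)


rent = xpath_translate("5.150,00", ",.", ".")   # "," -> ".", "." removed
months = "3 mdr.".split(" mdr")[0]              # like substring-before(..., ' mdr')
total = float(rent) * float(months)
print(rent, total)  # 5150.00 15450.0
```

The comma is the first character of the second argument, so it maps to the first character of the third argument (the dot); the dot is second and has no counterpart, so it is deleted.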
Related
I am having trouble getting my head around how to achieve the following. I have gotten this far:
//*[@id="main"]/div[2]/section/div[2]/h1/span[1][starts-with(.,"IDENTIFIER")]/following::span[1]/text()
This will return a response such as:
Foo1 Foo2 Foo3 Foo4
I am trying to make this return only Foo1 & Foo2, where Foo1 & Foo2 can be any length of characters and there may be any number of additional Foo's following them.
I have tried looking at
substring-before(//*[@id="main"]/div[2]/section/div[2]/h1/span[1][starts-with(.,"IDENTIFIER")]/following::span[1]/text(), ' ')
to extract up to the first space, but I have hit a wall and can't see what I am doing wrong.
I am using the xpath within a Scrapy spider. Any help is appreciated
Example:
<table>
<td>Pierre Paul Jacques Marie Maurice Jeanne</td>
</table>
XPath expression:
substring(//td,1,string-length(substring-before(//td," "))+string-length(substring-before(substring-after(//td," ")," "))+1)
Output:
Pierre Paul
The XPath works in three steps. First, we get the length of the first term with two functions (substring-before and string-length), using the space as delimiter. Then we get the length of the second term with three functions (substring-after, substring-before, and string-length), again using the space as delimiter. Finally, we use substring to extract what we need. Syntax: substring(content of the element, starting position (1), number of characters to extract (length of the first term + length of the second term + 1 for the space delimiter)).
You can replace //td with your XPath selector (remove the /text() at the end and try to find a shorter expression).
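Since the spider is written in Python anyway, the length arithmetic can be checked there too. The sketch below mirrors the three XPath functions with hypothetical helpers (substring_before, substring_after, first_two_words); it is an illustration of the expression above, not Scrapy API.

```python
def substring_before(s, sep):
    """Like XPath substring-before(): text before the first `sep`, '' if absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Like XPath substring-after(): text after the first `sep`, '' if absent."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def first_two_words(text):
    first = substring_before(text, " ")
    second = substring_before(substring_after(text, " "), " ")
    length = len(first) + len(second) + 1   # +1 for the space between them
    return text[:length]                    # like substring(text, 1, length)


print(first_two_words("Pierre Paul Jacques Marie Maurice Jeanne"))  # Pierre Paul
```

Note that the expression relies on a space after the second word: with exactly two words, the inner substring-before finds no delimiter and returns an empty string, so only the first word and the space come back. If you can post-process the extracted string in the spider, " ".join(text.split()[:2]) avoids that edge case.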
Let's say we have this:
<a href="www.something/page/1">something</a>
Now is there a way to return the @href like "www.something/page/2"? Basically, to return the @href value, but with the substring-after(.,"page/") incremented by 1. I've been trying something like
//a/@href[number(substring-after(.,"page/"))+1]
but it doesn't work, and I don't think I can use
//a/@href/number(substring-after(.,"page/"))+1
It's not really a paging thing where I could use the pagination; I just picked that as an example. The point is just to find a way to increment a value in XPath 1.0. Any help?
What you can do is
concat(
translate(//a/#href, '0123456789', ''),
translate(//a/#href, translate(//a/#href, '0123456789', ''), '') + 1
)
This concatenates the href attribute with all digits removed and the sum of 1 and the href with everything but digits removed.
That might suffice if all digits in your URLs occur at the end of the URL. But generally XPath 1.0 is good at selecting nodes in your input and bad at constructing new values based on parts of node values.
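The nested translate is the usual XPath 1.0 idiom for "keep only the digits". A short Python sketch of the same two steps shows both the result and the caveat; remove_chars and increment_trailing_number are hypothetical names, and the input href www.something/page/1 is assumed from the expected result in the question.

```python
DIGITS = "0123456789"


def remove_chars(value, chars):
    """Like translate(value, chars, ''): delete every character listed in `chars`."""
    return "".join(ch for ch in value if ch not in chars)


def increment_trailing_number(href):
    non_digits = remove_chars(href, DIGITS)    # translate(href, '0123456789', '')
    digits = remove_chars(href, non_digits)    # translate(href, <non-digits>, '')
    return non_digits + str(int(digits) + 1)


print(increment_trailing_number("www.something/page/1"))   # www.something/page/2
# Caveat: digits elsewhere in the URL are pulled into the number as well.
print(increment_trailing_number("www.site2.com/page/1"))   # www.site.com/page/22
```

The second call shows why this only works when every digit in the URL belongs to the number at the end.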
There is a simpler way to achieve this, just take the substring after the page, add 1, and then munge it all back together:
This XPath is based on the current node being the @href attribute:
concat(substring-before(.,'page/'),
'page/',
substring-after(.,'page/')+1
)
Your order of operations is a little, well, out of order. Use something like this:
substring-after(//a/#href, 'page/') + 1
Note that it is not necessary to explicitly convert the string value to a number. From the spec:
The numeric operators convert their operands to numbers as if by
calling the number function.
Putting it all together:
concat(
substring-before(//a/#href, 'page/'),
'page/',
substring-after(//a/#href, 'page/') + 1)
Result:
www.something/page/2
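For comparison, the same split, increment and reassemble steps in Python. This is only a sketch: next_page is a hypothetical helper and the href value is assumed from the result shown above.

```python
def next_page(href, marker="page/"):
    """Split at the first `marker`, add 1 to what follows, and glue it back."""
    head, found, tail = href.partition(marker)   # substring-before / substring-after
    if not found:
        raise ValueError(f"{marker!r} not found in {href!r}")
    return f"{head}{marker}{int(tail) + 1}"      # concat(head, 'page/', tail + 1)


print(next_page("www.something/page/1"))  # www.something/page/2
```

One difference to keep in mind: when the part after page/ is not a number, the XPath addition evaluates to NaN, while this sketch raises a ValueError.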
In an XPath expression I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the unique values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and perform another XPath on all the distinct values: count how many fields start with each of them and retrieve the ones whose count is bigger than 1.
A matching field is either equal to that particular value, or starts with that value followed by " (". The problem is that in the second part of that XPath I would have to refer to the context of that part itself and to the former context at the same time.
In the following XPath I will, instead of using "." as the context, use c_outer and c_inner:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both for obvious reasons. But how could I reference a particular, or the current, distinct value from the outer expression within the inner expression?
Would that even be possible?
XQuery can do it e.g.
(: ' (' rather than '(' so that $s carries no trailing space, e.g. "aaa" instead of "aaa " :)
for $s in distinct-values(
    //field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., ' ('))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s
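If the field values are available outside an XQuery processor, a similar analysis can be sketched in Python by grouping on the base name. This is a different technique from the FLWOR expression (it counts every base name, not only those that occur with an enumeration), the field list is the sample from the question, base_name is a hypothetical helper, and the pattern accepts multi-digit enumerations, which is an assumption beyond the original regex.

```python
import re
from collections import Counter

# Field values from the example in the question.
fields = ["aaa", "bbb", "aaa (1)", "aaa (2)", "ccc", "ddd (7)"]


def base_name(text):
    """Strip a trailing ' (n)' enumeration and surrounding whitespace."""
    return re.sub(r"\s*\(\d+\)$", "", text.strip())


counts = Counter(base_name(f) for f in fields)
repeated = sorted(name for name, n in counts.items() if n > 1)
unique = sorted(name for name, n in counts.items() if n == 1)

print(repeated)  # ['aaa']
print(unique)    # ['bbb', 'ccc', 'ddd']
```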
I'm trying to format a column of numbers in Rails using number_to_currency. I want to display negative numbers in parentheses, which is easy to do with the negative_format option. However, when I do this, the decimal points in a column of numbers don't line up. I want to add a trailing space to the format for positive numbers (%u%n), but I don't know how to do that. Can anyone give me the right way to format in a trailing space?
There are a couple of ways you can pad the positive numbers with a little bit of space, either:
Use a fixed width font and a nonbreaking space
Apply a class to the enclosing element of your positive numbers (or wrap in a span) and then use internal padding to add space to the right.
The first approach mixes presentation with content, but has the advantage of working with any font size (or user resized fonts). With the second solution your HTML is cleaner and in well-behaved browsers things will work well, but your mileage will vary with less modern ones.
Here's a quick implementation of the first option:
def pad_positives(number_string)
  # Negative amounts start with "(" and end with ")"; pad everything else
  unless number_string[0,1] == '('
    number_string += '&nbsp;'
  end
  number_string
end
You could drop this in your appropriate helper file and then do something like this in your view:
<%= pad_positives(number_to_currency(number, ...)) %>
Note this function expects a string, so it will choke if you pass it a number. Hope this helps!
Let's say you have the variable amount in your loop:
<td class="amount">
<%= (amount >= 0) ? "#{number_to_currency(amount)} " : "(#{number_to_currency(amount.abs)})" %>
</td>
And in your CSS:
.amount {
font-family:monospace;
}
I'm trying to select an anchor element by first containing the text "To Be Coded", then extracting a number from a string using substring, then using the greater than comparison operator (>0). This is what I have thus far:
/a[number(substring(text(),???,string-length()-1))>0]
An example of the HTML is:
<a class="" href="javascript:submitRequest('getRec','30', '63', 'Z')">
To Be Coded (23)
</a>
My issue right now is I don't know how to find the first occurrence of the open parenthesis. I'm also not sure how to combine what I have with the contains(text(),"To Be Coded") function.
So my criteria for the selection is:
Must be an anchor element
Must include the text "To Be Coded"
Must contain a number greater than 0 in the parentheses
Edit: I suppose I could just "hard code" the starting position for the substring, but I'm not sure what that would be - will XPath count the white space before the text in the element? How would it handle/count the characters?
Try this:
a[contains(., 'To Be Coded') and number(substring-before(substring-after(., '('), ')')) > 0]
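On the whitespace question: XPath does count the whitespace inside the element, so a hard-coded start position would be fragile, which is why the expression above cuts at the parentheses instead. A rough Python equivalent of the predicate, using only the standard library, is sketched below. The wrapping div and the second and third anchors are made up to give the filter something to reject; count_in_parens and matches are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div>
<a class="" href="javascript:submitRequest('getRec','30', '63', 'Z')">
To Be Coded (23)
</a>
<a class="" href="#">
To Be Coded (0)
</a>
<a class="" href="#">Coded (5)</a>
</div>"""


def count_in_parens(text):
    """Like number(substring-before(substring-after(text, '('), ')')); None if not numeric."""
    _, opened, tail = text.partition("(")
    inner, closed, _ = tail.partition(")")
    if not (opened and closed):
        return None
    try:
        return float(inner)
    except ValueError:
        return None


def matches(anchor):
    text = "".join(anchor.itertext())   # string value of the element, whitespace included
    n = count_in_parens(text)
    return "To Be Coded" in text and n is not None and n > 0


root = ET.fromstring(SNIPPET)
selected = [a for a in root.iter("a") if matches(a)]
print(len(selected))  # 1
```

Only the first anchor survives: the second has a count of 0 and the third lacks the "To Be Coded" text.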