I am trying to extract the concatenated cells from an HTML table for each row using XPath. For example, if I have a table like:
<table>
<tr><th>FirstName</th><th>LastName</th><th>Title</th></tr>
<tr><td>First1</td><td>Last1</td><td>Title1</td></tr>
<tr><td>First2</td><td>Last2</td><td>Title2</td></tr>
<tr><td>First3</td><td>Last3</td><td>Title3</td></tr>
</table>
I want to extract this data so that I get the full name of the person in each row
First1 Last1
First2 Last2
First3 Last3
I can get each column separately and then merge them in my code later, but I would prefer to get this done in a single XPath query. I have tried concat, but can't figure out where to use it.
Thanks in advance.
The concatenation you tried only concatenates the XPath strings, not the nodes. If you want to select more than one set of nodes, you should use | between the expressions:
//tr//td[1] | //tr//td[2]
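The union returns the matching td nodes as one flat node-set, so the per-row pairing still has to happen somewhere. As a minimal sketch, assuming Python's standard xml.etree.ElementTree and markup that parses as well-formed XML (the helper name full_names is made up for illustration), the two cells can be joined row by row:

```python
import xml.etree.ElementTree as ET

TABLE = """
<table>
<tr><th>FirstName</th><th>LastName</th><th>Title</th></tr>
<tr><td>First1</td><td>Last1</td><td>Title1</td></tr>
<tr><td>First2</td><td>Last2</td><td>Title2</td></tr>
<tr><td>First3</td><td>Last3</td><td>Title3</td></tr>
</table>
"""

def full_names(markup):
    """Return 'First Last' for every row that has at least two td cells."""
    root = ET.fromstring(markup.strip())
    names = []
    for row in root.iter("tr"):
        cells = row.findall("td")  # the header row only has th, so it is skipped
        if len(cells) >= 2:
            first = "".join(cells[0].itertext()).strip()
            last = "".join(cells[1].itertext()).strip()
            names.append(f"{first} {last}")
    return names

print("\n".join(full_names(TABLE)))
```

With a plain XPath 1.0 engine the equivalent is to evaluate concat(td[1], ' ', td[2]) once per tr as the context node, because concat returns a single string rather than one string per row.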
As in this Stack Overflow answer, imagine that you need to select a particular table and then all of its rows. Due to the permissiveness of HTML, all three of the following are legal markup:
<table id="foo"><tr>...</tr></table>
<table id="foo"><tbody><tr>...</tr></tbody></table>
<table id="foo"><tr>...</tr><tbody><tr>...</tr></tbody></table>
You are worried about tables nested in tables, and so don't want to use an XPath like
table[@id="foo"]//tr.
If you could specify your desired XPath as a regex, it might look something like:
table[@id="foo"](/tbody)?/tr
In general, how can you specify an XPath expression that allows an optional element in the hierarchy of a selector?
To be clear, I'm not trying to solve a real-world problem or select a specific element of a specific document. I'm asking for techniques to solve a class of problems.
I don't see why you can't use this:
//table[@id='foo']/tr|//table[@id='foo']/tbody/tr
If you want one expression without node set union:
//tr[(.|parent::tbody)[1]/parent::table[@id='foo']]
In XPath 2.0, the optional step can be expressed as (tbody|.).
//table[@id="foo"]/(tbody|.)/tr
XPathTester.com demo
The pipe (|) denotes the union of two node-sets; the dot (.) denotes the identity step, which returns exactly what the previous step returned.
This can be expanded to include more optional elements at once:
//table[@id="foo"]/(thead|tbody|tfoot|.)/tr
Use:
//table[@id="foo"]/*[self::tbody or self::thead or self::tfoot]/tr
|
//table[@id="foo"]/tr
This selects any tr element that is a child of a tbody, thead, or tfoot that is itself a child of a table whose id attribute is "foo", or any tr element that is a direct child of such a table.
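To see the "direct child or one optional wrapper" idea in action, here is a rough sketch using Python's standard xml.etree.ElementTree. Its XPath subset has no | operator, so the union is emulated by concatenating the results of several child-only paths; the sample markup and the helper name rows_of are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<body>
  <table id="foo">
    <tr><td>direct</td></tr>
    <tbody>
      <tr><td>wrapped<table><tr><td>nested</td></tr></table></td></tr>
    </tbody>
  </table>
</body>
"""

def rows_of(root, table_id):
    """Rows that belong to the table itself, never rows of nested tables."""
    table = root.find(f".//table[@id='{table_id}']")
    if table is None:
        return []
    rows = table.findall("tr")  # tr as a direct child of the table
    for section in ("thead", "tbody", "tfoot"):
        rows += table.findall(f"{section}/tr")  # tr behind one optional wrapper
    return rows

root = ET.fromstring(DOC.strip())
for row in rows_of(root, "foo"):
    print(row.find("td").text)
```

Unlike a real XPath union, the list built this way is grouped by path rather than sorted in document order, which matters if a table mixes bare rows and section rows.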
I have a table in which some records sometimes don't have a value.
I am using these XPath expressions:
//table/tbody/tr/td[not(td[string-length(normalize-space(text()))=0])]
//td[not(td[string-length(normalize-space(text()))=0])]
but they select the whole table. How can I select only the td elements that are empty?
Thank you for all the help :)
Let's keep things simple. If you want to select tds without text try:
//table/tbody/tr/td[not(text())]
Demo
For completeness, here are two alternatives for selecting empty td elements. Both drop the unnecessary parts of your XPath expression (normalize-space(), text(), and the td[] inside the predicate):
//td[string-length()=0]
//td[.=""]
The first XPath will look for td elements where the content length is equal to 0.
The second XPath will look for td elements which contain nothing.
But judging by your XPath attempts, it seems you want to select td elements which are non-empty. If that's the case, just add a not inside the predicate:
//td[not(string-length()=0)]
//td[not(.="")]
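The predicates above are not interchangeable: not(text()) only checks for direct text node children, while .="" and string-length()=0 look at the whole string value of the cell. A small sketch of the difference, assuming Python's standard xml.etree.ElementTree, a made-up sample table, and hypothetical helper names:

```python
import xml.etree.ElementTree as ET

TABLE = """
<table><tbody>
<tr><td>Alice</td><td></td></tr>
<tr><td> </td><td>Bob</td></tr>
<tr><td><b>Carol</b></td><td/></tr>
</tbody></table>
"""

def string_value(elem):
    # XPath string value: all descendant text concatenated
    return "".join(elem.itertext())

def has_direct_text(elem):
    # True if the element has a text node as a direct child, like text()
    return bool(elem.text) or any(child.tail for child in elem)

root = ET.fromstring(TABLE.strip())
cells = list(root.iter("td"))

# like //td[.=""] or //td[string-length()=0]
strictly_empty = [td for td in cells if string_value(td) == ""]
# roughly like //td[normalize-space()=""]: whitespace-only cells count too
blank = [td for td in cells if string_value(td).strip() == ""]
# like //td[not(text())]: also matches the cell that only holds <b>Carol</b>
no_direct_text = [td for td in cells if not has_direct_text(td)]

print(len(strictly_empty), len(blank), len(no_direct_text))
```

If cells can hold whitespace or nested markup, the choice between these three tests changes which cells count as empty.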
The following two XPath expressions work fine to extract data from a table.
//*[@id="codeRow"]/td/strong[contains(text(),"Besnier")]
//*[@id="codeRow"]/td[contains(text(),"Besnier")]
I want to combine these two into a single XPath statement that can be used as needed.
I tried using or, but it did not work.
For example:
//*[@id="codeRow"]/td[contains(descendant::*/text() , "Besnier" )]
or [contains(text(),"Besnier")]
Please advise
Try this xpath:
//*[@id="codeRow"]/td[contains(., "Besnier")]
The XPath engine converts . (the current node) to its string value, then calls the contains() function on it.
The string value of an element is the concatenation of all its descendant text nodes, so both the td's own text and the text inside a child such as strong are searched for "Besnier". There is no need to use an axis to select all descendants and their text nodes.
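The string-value behaviour can be reproduced outside an XPath engine. Below is a minimal sketch, assuming Python's standard xml.etree.ElementTree and invented sample markup (the helper name cells_containing is hypothetical); itertext() plays the role of the string value:

```python
import xml.etree.ElementTree as ET

MARKUP = """
<table>
<tr id="codeRow">
<td><strong>Besnier</strong> SA</td>
<td>Groupe Besnier</td>
<td>Other</td>
</tr>
</table>
"""

def cells_containing(root, needle):
    """td cells under the codeRow element whose full string value contains needle."""
    hits = []
    for td in root.findall(".//*[@id='codeRow']/td"):
        # join the cell's own text with the text of every descendant
        if needle in "".join(td.itertext()):
            hits.append(td)
    return hits

root = ET.fromstring(MARKUP.strip())
for td in cells_containing(root, "Besnier"):
    print("".join(td.itertext()))
```

Both the cell with the name inside strong and the cell with the name as plain text match, which is the combined behaviour the two separate expressions were meant to provide.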
Let's say I have some sample rows of data:
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with those individual elements separated by pipes so I can do group bys on them, for example to show how many urls had a 2nd position of "4", which would be 2 rows in this case. So the "url" field has additional data that I'd like to parse out and use in my queries.
The question is, what is the best way to do that in hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacoll')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break that value on the pipe delimiter into an array. The pattern is a regular expression, so the pipe has to be escaped:
split(parse_url(url, 'QUERY', 'datacoll'), '\\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
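Since the Hive query above is unverified, it can help to pin down the expected result on the sample rows first. The following is only a sketch of the same extract-then-split logic in Python's standard library, not Hive code; the key name datacoll is taken from the sample data and the helper name datacoll_parts is made up.

```python
from urllib.parse import urlsplit, parse_qs

ROWS = [
    "site1^http://article1.com?datacoll=5|4|3|2|1&test=yes",
    "site1^http://article1.com?test=yes",
    "site1^http://article1.com?datacoll=5|4|3|2|1&test=yes",
]

def datacoll_parts(line, key="datacoll"):
    """Split a '^'-delimited row and return the pipe-separated values, or None."""
    _sitename, url = line.split("^", 1)
    values = parse_qs(urlsplit(url).query).get(key)  # similar role to parse_url(..., 'QUERY', key)
    return values[0].split("|") if values else None  # similar role to split(..., '\\|')

# count the rows whose 2nd position is "4"
second_is_4 = 0
for line in ROWS:
    parts = datacoll_parts(line)
    if parts is not None and len(parts) > 1 and parts[1] == "4":
        second_is_4 += 1

print(second_is_4)
```

Rows without the parameter yield None here, so they have to be filtered out before indexing into the array; the Hive query needs an equivalent guard for the case where the key is missing.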
This looks very similar to something I did a couple of weeks ago. I think the best approach in your case would be to apply a pre-processing step (possibly with Hadoop streaming) and change the definition of your table to:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that you can easily manipulate your data in Hive using lateral views and the builtin explode. The following code should help you get the counts of URLs per col.
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col
I'm trying to create an XPath expression which will work with Selenium, using the following HTML snippet.
Below is a table containing various rows, each added with a uniquely generated id (for example, in the following snippet that id is 1000).
Selenium created the following expressions when the row with id 1000 was added to the table. However, instead of using the id, I want to create an XPath that uses the 3rd data element in the row, which is (MyName) in the HTML snippet.
A possible suggestion is to not use xpath whenever possible.
http://saucelabs.com/blog/index.php/2011/05/why-css-locators-are-the-way-to-go-vs-xpath/
You need to convert the places in the XPath where it refers to the row by its ID so that they refer to its relative position in the table.
In all of your XPaths, you would change tr[@id='1000'] to tr[3]
Your first example XPath would look like this:
//tr[3]/td[1]/a[1]/img //tr[@id='1000']/td[1]/span/a/img
Your second example would follow similarly:
//tr[3]/td[1]/span/a/img
As would your third:
//tr[3]/td[1]/a[2]/img
Hopefully you are now able to change the rest of them.