Find dynamic words through patterns in LINQ

Here is how the HTML starts:
BUSINESS DOCUMENTATION
<p>Some company</p>
<p>
<p>DEPARTMENT: Legal Process</p>
<p>FUNCTION: Computer Department</p>
<p>PROCESS: Process Server</p>
<p>PROCEDURE: ABC Process Server</p>
<p>OWNER: Some User</p>
<p>REVISION DATE: 06/10/2013</p>
<p>
<p>OBJECTIVE: To ensure that the process server receive their invoices the following day.</p>
<p>
<p>WHEN TO PERFORM: Daily</p>
<p>
<p>WHO WILL PERFORM? Computer Team</p>
<p>
<p>TIME TO COMPLETE: 5 minutes</p>
<p>
<p>TECHNOLOGY REQUIREMENT(S): </p>
<p>
<p>SOURCE DOCUMENT(S): N/A</p>
<p>
<p>CODES AND DEFINITIONS: N/A</p>
<p>
<table border="1">
<tr>
<td>
<p>KPI’s: </p>
</td>
</tr>
</table>
<p>
<table border="1">
<tr>
<td>
<p>RISKS: </p>
</td>
</tr>
</table>
After this there is a whole bunch of text. What I need to do is parse specific data out of the above:
the Department, Function, Process, Procedure, Objective, When to Perform, Who Will Perform, Time to Complete, Technology Requirements, Source Documents, Codes and Definitions, and Risks.
I then need to delete this information from the Html column while leaving everything else intact. Is this possible in LINQ?
Here is the LINQ query I am using:
var result = (from d in IPACS_Documents
              join dp in IPACS_ProcedureDocs on d.DocumentID equals dp.DocumentID
              join p in IPACS_Procedures on dp.ProcedureID equals p.ProcedureID
              where d.DocumentID == 4
                 && d.DateDeleted == null
              select d.Html);
Console.WriteLine(result);

This regex worked just fine for me on your input data
(DEPARTMENT|FUNCTION|OBJECTIVE):\s*(?<value>.+)\<
The result is multiple Matches with two groups each: the first is the key and the second is the value. I have only handled a few of the cases, but you can add the rest easily enough.
To remove the information thus parsed, you can do a Regex.Replace with this regex
(?<start>\<p\>(DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)
and replacement string as
${start}${end}
leaving out value.
In code, this looks kinda like this (quickly typed this out in Notepad++ - may have minor errors).
private static readonly Regex ParseDocRegex = new Regex(
    @"(?<start>\<p\>(?<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)",
    RegexOptions.ExplicitCapture | RegexOptions.Compiled);
...
from html in result
let matches = ParseDocRegex.Matches(html)
where matches.Count > 0
select new
{
    // key/value pairs for each recognized heading
    namesAndValues = from m in matches.Cast<Match>()
                     select new KeyValuePair<string, string>(m.Groups["name"].Value, m.Groups["value"].Value),
    // the same HTML with the parsed values stripped out
    strippedHtml = ParseDocRegex.Replace(html, "${start}${end}")
};
This ought to give you the desired output.
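If you then want to push the stripped HTML back into the Html column, a minimal sketch might look like the following. This is an assumption on my part: it presumes a LINQ to SQL style context (as LINQPad provides) where the table and SubmitChanges() are directly available, and it reuses the ParseDocRegex field from above.
// Hypothetical follow-up: write the stripped HTML back to the Html column.
// Assumes a LINQ to SQL / LINQPad-style context where IPACS_Documents and
// SubmitChanges() are available directly, as in the question's query.
foreach (var doc in IPACS_Documents.Where(d => d.DocumentID == 4 && d.DateDeleted == null))
{
    doc.Html = ParseDocRegex.Replace(doc.Html, "${start}${end}");
}
SubmitChanges();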

It could be done with many LINQ statements, but using regular expressions you need only a few lines of code.

For HTML, you need an HTML parser. Try HTML Agility Pack or CsQuery.
Regular expressions can handle simple matches against HTML, but they are not sufficient for hierarchical structures, and the resulting queries are less precise.
Any HTML extraction is going to be fragile as the structure of the HTML changes. HTML is a presentation format and creators seldom care about machine interpretation. At least with a parser, you'll get an accurate model of the presentation markup (assuming it is valid HTML). You'll also get translation of entities into characters and the ability to extract all the descendant text of an element without internal markup elements like bold or italics.
You can use arbitrary assemblies in LINQPad simply by adding a reference, and for expression-based scripts you can import designated namespaces automatically.
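For instance, a rough sketch of the same extraction done with HTML Agility Pack might look like the code below. It is untested and only illustrative: the class, method, and Labels array are names I invented, the label list is copied from the question, and the KPI's and RISKS values live inside tables so they would need separate handling.
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class DocScraper
{
    // Labels to pull out, taken from the question above.
    static readonly string[] Labels =
    {
        "DEPARTMENT", "FUNCTION", "PROCESS", "PROCEDURE", "OBJECTIVE",
        "WHEN TO PERFORM", "WHO WILL PERFORM", "TIME TO COMPLETE",
        "TECHNOLOGY REQUIREMENT(S)", "SOURCE DOCUMENT(S)", "CODES AND DEFINITIONS"
    };

    // Returns the parsed label/value pairs plus the HTML with those <p> elements removed.
    public static (Dictionary<string, string> Values, string StrippedHtml) Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new Dictionary<string, string>();
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (var p in paragraphs.ToList())
            {
                var text = p.InnerText.Trim();
                var label = Labels.FirstOrDefault(l =>
                    text.StartsWith(l + ":", StringComparison.OrdinalIgnoreCase) ||
                    text.StartsWith(l + "?", StringComparison.OrdinalIgnoreCase));
                if (label == null) continue;

                values[label] = text.Substring(label.Length + 1).Trim();
                p.Remove();   // strip the parsed paragraph from the document
            }
        }

        return (values, doc.DocumentNode.OuterHtml);
    }
}
The XPath here is just //p; a stricter expression could target only the header block at the top of the document if the same labels appear again later in the text.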

Related

How to select from between header elements in Ruby

I'm working on a Ruby script that uses Nokogiri and CSS selectors. I'm trying to scrape some data from HTML that looks like this:
<h2>Title 1</h2>
(Part 1)
<h2>Title 2</h2>
(Part 2)
<h2>Title 3</h2>
(Part 3)
Is there a way to select from Part 2 only by specifying the text of the h2 elements that represent the start and end points?
The data of interest in Part 2 is a table with tr and td elements that don't have any class or id identifiers. The other parts also have tables I'm not interested in. Something like
page.css('table tr td')
on the entire page would select from all of those other tables in addition to the one I'm after, and I'd like to avoid that if at all possible.
I'd probably use this as a first attempt:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h2>Title 1</h2>
(Part 1)
<h2>Title 2</h2>
<table>
<tr><td>(Part 2)</td></tr>
</table>
<h2>Title 3</h2>
(Part 3)
EOT
doc.css('h2')[1].next_element.to_html # => "<table>\n <tr><td>(Part 2)</td></tr>\n </table>"
Alternately, rather than use css('h2')[1], I could pass some of the task to the CSS selector:
doc.at('h2:nth-of-type(2)').next_element
.to_html # => "<table>\n <tr><td>(Part 2)</td></tr>\n </table>"
next_element is the trick used to find the node following the current one. There are many "next" and "previous" methods, so read up on them as they're very useful for this sort of situation. to_html is used above to show what Nokogiri returned in a more friendly output; you wouldn't use it unless you actually needed the HTML. Once you have the table it's easy to grab data from it; there are lots of examples of how to do that out there.
According to "Is there a CSS selector for elements containing certain text?", I'm afraid there is no CSS selector that works on element text. How about first extracting "(Part 2)" and then using Nokogiri to select the table elements inside it?
text = "" # your string, or content read from a file
part2 = text.scan(/<h2>Title 2<\/h2>\s+(.+?)<h2>/m).first.first
doc = Nokogiri::HTML(part2)
# continue selecting table elements from doc
(Part 2) cannot contain any h2 tag, or the regex would have to be different.
If you know that the tables will be static and the data you require will always be in the second table, you can do something like:
page.css('table')[1].css('tr')[3].css('td')
This gets the second table on the page, accesses the fourth row of that table, and returns all the cells in that row.
I haven't tested this, but this would be the way I would do it if the table I require doesn't have a class or identifier.

scrapy xpath : selector with many <tr> <td>

Hello I want to ask a question
I scrape a website with XPath, and the result is like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use XPath to analyze these results again. I want to save the first cell to address, the second to telephone, and the last one to map, but I can't get it to work. Please guide me. Thank you!
Here is my code; it's wrong and it catches other things:
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As you can see in the Scrapy documentation, to work with relative XPaths you have to use the .// notation to extract elements relative to the previous XPath; if not, you're again getting all elements from the whole document. You can see this sample in the Scrapy documentation that I referenced above:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case your code should be something like:
for s in store:
    address = s.xpath(".//tr/td[1]/text()").extract()
    tel = s.xpath(".//tr/td[2]/text()").extract()
    map = s.xpath(".//tr/td[3]/text()").extract()
Hope this helps,

XPath get only first Parent of nested HTML

I am a newbie in XPath. Can someone explain how to resolve this problem:
<table>
<tr>
<td>
<table>
<tr>
<td>
<table>
<tr>
<td>Label</td>
<td>value</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
I am trying to get the <tr> which contains the Label value, but it does not work for me.
Here is my code :
//td[contains(.,'Label')]/ancestor::tr[1]
Desired result:
<tr>
<td>Label</td>
<td>value</td>
</tr>
Can someone help me?
This expression matches the tr that you want:
//tr[contains(td/text(), 'Label')]
Like yours, this starts by scanning all tr elements in the document, but this version uses just a single predicate. The td/text() limits the test to actual text nodes which are grandchildren of the row. If you just used td, then all of the td's descendant text nodes would be collected and concatenated, and the outer tr would match.
UPDATE: Also, for what it's worth, the reason your expression isn't working is that //td[contains(.,'Label')] matches the outer td elements as well: contains(., 'Label') tests a td's entire string value, and the outer tds contain 'Label' through their descendants. Each matched td then contributes its nearest ancestor tr, so you get the outer rows in addition to the one you want. To make your approach work, restrict the test to the td's own text nodes:
//td[contains(text(),'Label')]/ancestor::tr[1]
instead of
//td[contains(.,'Label')]/ancestor::tr[1]
I had the same issue, except that the text 'Label' was sometimes in a nested span, or even further nested in the td. For example:
<td><span>Label</span></td>
The previous answer only finds 'Label' if it is in a text node that is a direct child of the td. This issue is a bit harder because we need to search for a td that contains the text 'Label' in any of its children. Since the tds are nested, all of the tds qualify as having a descendant that contains the text 'Label'. So, the only way I found to overcome this is to add a check that makes sure the td we select does not itself contain a td with the search text.
//td[contains(., 'Label') and not(.//td[contains(., 'Label')])]/ancestor::tr[1]
This says: give me all of the tds that have descendant text containing 'Label', but exclude any td that contains a td with descendant text containing 'Label' (the nesting ancestors). This returns the innermost td that contains the text. Then you can go back to the tr that contains this td using ancestor.
Also, if you just want the lowest table that contains text use this:
//table[contains(., 'Label') and not(.//table[contains(., 'Label')])]
or you can select the tr directly:
//tr[contains(., 'Label') and not(.//tr[contains(., 'Label')])]
This seems like a common problem, but I didn't see a solution anywhere. So, I decided to post to this old unanswered question in hopes that it helps somebody.
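As an aside, if you happen to be evaluating these expressions from .NET (as in the LINQ question at the top of this page), a minimal, untested sketch with HTML Agility Pack could look like this; nestedTableHtml is just a placeholder for the markup from the question:
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(nestedTableHtml);   // placeholder: the nested-table markup from the question

// Innermost td whose text contains 'Label', then its nearest ancestor tr.
var row = doc.DocumentNode.SelectSingleNode(
    "//td[contains(., 'Label') and not(.//td[contains(., 'Label')])]/ancestor::tr[1]");

Console.WriteLine(row.OuterHtml);   // <tr> <td>Label</td> <td>value</td> </tr>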

xpath with multiple contains statements do not function correctly

I have HTML code as shown below, and I am trying to access it with Selenium. If I do
//*[contains(text(),'Add OfficeContract (Portal)')]
it finds several matches (there is more HTML with more occurrences), so I want to find one specific instance. But when I try
//*[contains(text(),'Add OfficeContract (Portal)') and contains(text(),'7121995')]
there are no matches found. Simply doing
//*[contains(text(),'7121995')]
finds all sorts of things (the HTML is full of that string).
HTML CODE
<tr class="pd" valign="top"><br>
<td> </td><br>
<td nowrap="">SQAAUTO</td><br>
<td nowrap="">01/30/2014 9:47:48 AM</td><br>
<td><br>
<b>Add OfficeContract (Portal)</b><br>
<br><br>
Office Id 7121995<br>
<br><br>
Contract ID added: "8976504"<br>
<br><br>
Term Date added: "12/31/9999"<br>
<br><br>
</td><br>
</tr>
I believe the issue here is that the two strings are never found together in the same text node of a single element (based on your sample).
For the above xpath to return a result you would need an element like this:
<b>Add OfficeContract (Portal) 7121995</b>
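As a possible workaround (my suggestion, not part of the answer above): because contains(., '...') tests an element's entire string value rather than a single text node, you can put both conditions on the parent td instead. A short, untested Selenium (C#) sketch:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

IWebDriver driver = new ChromeDriver();
// driver.Navigate().GoToUrl(...);  // load the page under test

// contains(., '...') looks at the td's whole string value (all descendant text),
// so both strings can be required on the same <td> even though they live in
// different text nodes.
IWebElement cell = driver.FindElement(By.XPath(
    "//td[contains(., 'Add OfficeContract (Portal)') and contains(., '7121995')]"));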

<td> does not display full contents (Mozilla Firefox)

The code goes like this
<div id='blogbook'></div>
...
<script>
...
var z="<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td></table>";
function disp(){
document.getElementById('blogbook').innerHTML=z;
}
disp();
</script>
The display comes out like this:
Blog title and date
A very long string consisting of
...(many many lines)...
many paragraphs, sa
The whole blog does not display; instead it stops long before the actual end of the blog. Questions:
Why does this happen?
How does one solve this?
This problem occurs in Firefox (I'm using v7), but IE displays it just fine, that is, the complete blog.
Your HTML markup is incorrect.
var z="<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td></table>";
That code is this:
<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td>
</table>
It should be:
<table>
<tr>
<td>Blog title and date<br><hr></td>
</tr>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td>
</tr>
</table>
What's going on with this line: <div id='blogbook'></td>? You need to close the div. It's not semantically correct and may cause the browser to display incorrectly, e.g.
<div id='blogbook'></div></td>
Plus, you're not closing the table above, or you're not opening a new td if you're nesting tables.
