<td> does not display full contents (Mozilla Firefox) - firefox

The code goes like this
<div id='blogbook'></div>
...
<script>
...
var z="<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td></table>";
function disp(){
document.getElementById('blogbook').innerHTML=z;
}
disp();
</script>
The display comes out like this..
Blog title and date
A very long string consisting of
...(many many lines)...
many paragraphs, sa
The whole of the blog does not display, instead stops long before the actual end of the blog. Questions:
Why does this happen?
How does one solve this?
This problem occurs in Firefox (I'm using v7), but IE displays the complete blog just fine.

Your HTML markup is incorrect.
var z="<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td></table>";
Written out as markup, that string is:
<table>
<td>Blog title and date<br><hr></td>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td>
</table>
The first <td> is not inside a <tr>, and the second <tr> is never closed. It should be:
<table>
<tr>
<td>Blog title and date<br><hr></td>
</tr>
<tr>
<td>A very long string consisting of many paragraphs, say, a blog</td>
</tr>
</table>
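A quick way to spot this kind of nesting mistake before handing the string to innerHTML is to walk the tags. The following is a minimal sketch using Python's standard html.parser; the class name TableChecker and the helper check_table are hypothetical, and the check is deliberately stricter than HTML itself (which allows the </tr> end tag to be omitted) so that it matches the fully closed form shown above.

```python
from html.parser import HTMLParser


class TableChecker(HTMLParser):
    """Collects simple nesting problems for <tr>/<td> inside a table."""

    def __init__(self):
        super().__init__()
        self.in_row = False   # True while a <tr> is open
        self.problems = []    # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            if self.in_row:
                self.problems.append("<tr> opened before the previous row was closed")
            self.in_row = True
        elif tag == "td" and not self.in_row:
            self.problems.append("<td> outside of any <tr>")

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False
        elif tag == "table":
            if self.in_row:
                self.problems.append("<table> closed while a <tr> was still open")
            self.in_row = False


def check_table(markup):
    """Return the list of problems found in the given table markup."""
    checker = TableChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


BROKEN = "<table><td>Blog title and date<br><hr></td><tr><td>A blog</td></table>"
FIXED = "<table><tr><td>Blog title and date<br><hr></td></tr><tr><td>A blog</td></tr></table>"

print(check_table(BROKEN))
print(check_table(FIXED))
```

The broken string from the question yields two findings, while the corrected markup yields none.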

What's going on with this line: <div id='blogbook'></td>? The div is never closed, which is invalid markup and may cause the browser to render the page incorrectly. Close it, e.g.:
<div id='blogbook'></div></td>
Plus, you're either not closing the table above or, if you're nesting tables, not opening a new td.

Related

correct way to scrape this table (using scrapy / xpath)

Given a table (unknown number of <tr> but always three <td>, and sometimes containing a strikethrough (<s>) of the first element which should be captured as additional item (with value 0 or 1))
<table id="my_id">
<tr>
<td>A1</td>
<td>A2</td>
<td>A3</td>
</tr>
<tr>
<td><s>B1</s></td>
<td>B2</td>
<td>B3</td>
</tr>
...
</table>
Where scraping should yield [[A1,A2,A3,0],[B1,B2,B3,1], ...], I currently try along those lines:
my_xpath = response.xpath("//table[@id='my_id']")
for my_cell in my_xpath.xpath(".//tr"):
    print('record 0:', my_cell.xpath(".//td")[0])
    print('record 1:', my_cell.xpath(".//td")[1])
    print('record 2:', my_cell.xpath(".//td")[2])
And in principle it works (e.g. by adding a pipeline after add_xpath()), just I am sure there is a more natural and elegant way to do this.
Try contains:
my_xpath = response.xpath("//table[contains(@id, 'my_id')]").getall()
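The row-by-row result the question asks for, including the 0/1 strikethrough flag, can be sketched as follows. This is a minimal sketch under the assumption that the table is well-formed enough for Python's standard xml.etree.ElementTree, which stands in here for Scrapy's selectors so the example runs without any dependencies; rows_with_strike_flag is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<table id="my_id">
  <tr><td>A1</td><td>A2</td><td>A3</td></tr>
  <tr><td><s>B1</s></td><td>B2</td><td>B3</td></tr>
</table>
"""


def rows_with_strike_flag(table_markup):
    """Return one list per <tr>: the cell texts plus a 0/1 strikethrough flag."""
    table = ET.fromstring(table_markup)
    records = []
    for row in table.findall(".//tr"):
        cells = row.findall("td")
        # itertext() also picks up text nested inside <s>.
        texts = ["".join(cell.itertext()).strip() for cell in cells]
        # The flag is 1 only when the first cell wraps its text in <s>.
        struck = 1 if cells and cells[0].find("s") is not None else 0
        records.append(texts + [struck])
    return records


print(rows_with_strike_flag(SAMPLE))
# prints [['A1', 'A2', 'A3', 0], ['B1', 'B2', 'B3', 1]]
```

In Scrapy itself the same shape should carry over: loop over response.xpath("//table[@id='my_id']//tr"), read each cell with td.xpath("string(.)").get(), and use bool(row.xpath("./td[1]/s")) for the flag.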

XPath to select specific text inside of text block

I am trying to figure out a way to pull specific values out of a big long text block.
So far I have //td[@class="PadLeft10"], which returns a big long value starting with the company name and ending with the "View More Info" piece.
I am trying to break my results up into segments, so for example I want my code to look for the words "Primary Contact:" and then return the text that follows that, ending at the <br/>.
I need to get the Company Name, which is always the first bit of text, then the Primary Contact, then the Address, then the Phone and Fax, then the Website, and the Organization type.
The problem is that not every record has all the values. As you can see, the second entry has the address and website, but the first one doesn't.
I am using the Dataminer Chrome Plugin, for anyone familiar with that. It has separate xpath for rows and columns, so I am going to try to make a bunch of different columns that correspond to each of the fields that I am looking for.
Any direction would be greatly appreciated.
<td align="left" valign="top" width="2%">
<script>
if (0 == 1) document.write('<img src="https://website.com" border="0" alt=""/>');
</script>
<br/><br/></td>
<td class="PadLeft10" align="left" valign="top" width="32%" style="padding-left: 15px;">
<span style="font-weight: bold;font-size: 12pt;"><br/>Company Name Here</span><br/>Primary Contact: Mr. Eric Cartman <br/>Phone: (555) 555-5555<br/>Fax: (333) 333-3333<span style="text-decoration: underline;color: #0000ff"></span><br/>Organization Type: Distributor Branch
<br/>
» View More Info<br/>
<br/>
</td>
<td align="left" valign="top" width="2%">
<script>
if (0 == 1) document.write('<img src="https://website.com" border="0" alt=""/>');
</script>
<br/><br/></td>
<td class="PadLeft10" align="left" valign="top" width="32%" style="padding-left: 15px;">
<span style="font-weight: bold;font-size: 12pt;"><br/>Other Company</span><br/>Primary Contact: Mr. Jimmy Valmer<br/>100 N Ohio St 2rd Fl<br/>Rochester, IN 54225<br/>United States<br/>Phone: (888) 888-8888<br/>Fax: (999) 999-9999<span style="text-decoration: underline;color: #0000ff"><br/>Web Site: http://www.companywebsite.com</span><br/>Organization Type: Financial Service
<br/>
» View More Info<br/>
<br/>
</td>
</tr>
<tr>
I am new to XPath, but the least I can say is this: if you are the author of the HTML, you really should change it to be more structured,
like: Primary Contact:<span id/class='primaryContact'>..</span>
Otherwise, you can get the elements with this selector (adjust as needed) //td[@class="PadLeft10"]//child::span//following-sibling::text()[1], split the result on ':' and proceed from there, but this remains a makeshift workaround.
Any direction would be greatly appreciated.
As far as a direction, the sections within table cell that you mention are neither nested DOM items, nor sibling-type DOM nodes. Those are sequential html elements that require special processing.
<br/>Company Name Here</span>
<br/>Primary Contact: Mr. Eric Cartman
<br/>Phone: (555) 555-5555
<br/>...
Both xpath and regex can be leveraged for such a case.
You can select the text node you're looking for using a predicate and the contains function:
//td[@class="PadLeft10"]/text()[contains(., "Primary Contact:")]
Then you can get the substring using the substring-after function:
substring-after(
    //td[@class="PadLeft10"]/text()[contains(., "Primary Contact:")],
    'Primary Contact:'
)
And remove leading and trailing whitespace using normalize-space:
normalize-space(
    substring-after(
        //td[@class="PadLeft10"]/text()[contains(., "Primary Contact:")],
        'Primary Contact:'
    )
)
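The same "find the labelled text node, then take what follows the label" idea can be sketched outside of XPath. This is a minimal sketch, assuming the cell can be parsed as well-formed XML with Python's standard xml.etree.ElementTree; the trimmed CELL markup, the LABELS tuple, and the helper extract_fields are illustrative names, not part of Dataminer or the original page. Fields that are absent from a record come back as None, which covers the case where not every record has every value.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the second record from the question.
CELL = (
    '<td class="PadLeft10">'
    '<span><br/>Other Company</span><br/>Primary Contact: Mr. Jimmy Valmer<br/>'
    '100 N Ohio St 2rd Fl<br/>Phone: (888) 888-8888<br/>Fax: (999) 999-9999'
    '<span><br/>Web Site: http://www.companywebsite.com</span><br/>'
    'Organization Type: Financial Service<br/></td>'
)

LABELS = ("Primary Contact", "Phone", "Fax", "Web Site", "Organization Type")


def extract_fields(cell_markup):
    """Map each known label to the text following it, or None if missing."""
    cell = ET.fromstring(cell_markup)
    # Each <br/> separates one text segment from the next.
    segments = [text.strip() for text in cell.itertext() if text.strip()]
    record = {"Company Name": segments[0] if segments else None}
    for label in LABELS:
        prefix = label + ":"
        record[label] = None
        for segment in segments:
            # Matching on the prefix avoids splitting URLs at their own ':'.
            if segment.startswith(prefix):
                record[label] = segment[len(prefix):].strip()
                break
    return record


print(extract_fields(CELL))
```

Matching on a known label prefix, rather than splitting every segment on ':', keeps values such as the website URL intact.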

How can I have Sphinx tables fit to width?

Consider this table, where pyeval is a macro that evaluates an expression and replaces it with its value (so I can avoid hardcoding values in the documentation):
======================= ===========================================
Subsystem               Default path
======================= ===========================================
:pyeval:`constants.FOO` :pyeval:`pathutils.DEFAULT_FOO_STORAGE_DIR`
:pyeval:`constants.BAR` :pyeval:`pathutils.DEFAULT_BAR_STORAGE_DIR`
:pyeval:`constants.BAZ` :pyeval:`pathutils.DEFAULT_BAZ_STORAGE_DIR`
======================= ===========================================
This renders with this HTML:
<table border="1" class="docutils">
<colgroup>
<col width="40%">
<col width="60%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd">
<th class="head">Subsystem</th>
<th class="head">Default storage path</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even">
<td><tt class="docutils literal"><span class="pre">foo</span></tt></td>
<td><tt class="docutils literal"><span class="pre">/srv/badp/foo-path/</span></tt></td>
</tr>
<tr class="row-odd">
<td><tt class="docutils literal"><span class="pre">bar</span></tt></td>
<td><tt class="docutils literal"><span class="pre">/srv/badp/bar-path/</span></tt></td>
</tr>
<tr class="row-even"><td><tt class="docutils literal">
<span class="pre">baz</span></tt></td>
<td><tt class="docutils literal"><span class="pre">/var/run/badp/baz-path/</span></tt></td>
</tr>
</tbody>
</table>
Because of the macro, the width I have to give the Subsystem column in the source is only slightly smaller than the width the Default path column gets, even though its rendered contents are much shorter. Since Sphinx tries to be "helpful", it transfers the ratio of widths in the source file to the HTML page (notice the colgroup tag), and the result is quite uneven:
Notice that Chrome (just like Firefox) "helpfully" breaks at the hyphenation point and, since this is a path, I can't change the hyphens to non-breaking hyphens; people are just too likely to copy-paste these values.
If I remove the colgroup element, however, I get the table I want.
How can I tell Sphinx to please be less smart with my table?
I too have run into this problem. Reading the docutils source, it appears that the colgroup widths are calculated using the number of dashes for the column in the separator lines for grid tables and the number of characters in the longest column entry for the column in simple tables as used here.
An attempt to write a custom directive to generate a table without a colgroup ran into what appears to be a bug in docutils in that later processing of the generated elements expects a colgroup to be present.
One technique I have used is to use aliases to create data items that are closer in length to their real text. For example:
.. |FOO| replace:: :pyeval:`constants.FOO`
which helps but isn't perfect.
An experiment disabling the colgroup element using the following css
colgroup { display: none; }
worked perfectly in Firefox but hid the entire table in IE9, so clearly this isn't an acceptable solution either.
What seems to work (at least in Firefox) is to reset the col widths:
table.docutils col {
    width: auto;
}

xpath with multiple contains statements do not function correctly

I have html code as follows below. I am trying to access it with selenium. If I do a
//*[contains(text(),'Add OfficeContract (Portal)')]
it finds several (there is more html that has more occurrences). So I want to find a specific instance but when I try
//*[contains(text(),'Add OfficeContract (Portal)') and contains(text(),'7121995')]
no matches are found. Simply doing
//*[contains(text(),'7121995')]
finds all sorts of stuff (the HTML is full of that string).
HTML CODE
<tr class="pd" valign="top"><br>
<td> </td><br>
<td nowrap="">SQAAUTO</td><br>
<td nowrap="">01/30/2014 9:47:48 AM</td><br>
<td><br>
<b>Add OfficeContract (Portal)</b><br>
<br><br>
Office Id 7121995<br>
<br><br>
Contract ID added: "8976504"<br>
<br><br>
Term Date added: "12/31/9999"<br>
<br><br>
</td><br>
</tr>
I believe the issue here is that the two strings are not found together in the same element (based on your sample).
For the above xpath to return a result you would need an element like this:
<b>Add OfficeContract (Portal) 7121995</b>
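In the sample, 'Add OfficeContract (Portal)' lives inside the <b> element while '7121995' is a text node of the enclosing <td>, and in XPath 1.0 contains(text(), ...) only looks at the first text child of an element. Testing the string value of the cell instead, for example //td[contains(., 'Add OfficeContract (Portal)') and contains(., '7121995')], sees both strings at once. The following is a minimal sketch of that idea, assuming a simplified, well-formed copy of the row; it uses Python's standard xml.etree.ElementTree, which has no contains() support, so the string-value test is emulated with itertext(). The helper name cells_containing_all is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the row from the question.
ROW = (
    '<tr class="pd"><td>SQAAUTO</td>'
    '<td><b>Add OfficeContract (Portal)</b><br/>Office Id 7121995<br/>'
    'Contract ID added: "8976504"<br/></td></tr>'
)


def cells_containing_all(row_markup, *needles):
    """Return the <td> elements whose full text contains every needle."""
    row = ET.fromstring(row_markup)
    matches = []
    for cell in row.iter("td"):
        # Equivalent of XPath's string value '.': all descendant text joined.
        full_text = "".join(cell.itertext())
        if all(needle in full_text for needle in needles):
            matches.append(cell)
    return matches


found = cells_containing_all(ROW, "Add OfficeContract (Portal)", "7121995")
print(len(found))
```

With Selenium, the XPath form above can be passed to the usual element lookup directly; the Python loop is only there to make the matching rule explicit.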

Find dynamic words through patterns in LINQ

Here is how the html starts
BUSINESS DOCUMENTATION
<p>Some company</p>
<p>
<p>DEPARTMENT: Legal Process</p>
<p>FUNCTION: Computer Department</p>
<p>PROCESS: Process Server</p>
<p>PROCEDURE: ABC Process Server</p>
<p>OWNER: Some User</p>
<p>REVISION DATE: 06/10/2013</p>
<p>
<p>OBJECTIVE: To ensure that the process server receive their invoices the following day.</p>
<p>
<p>WHEN TO PERFORM: Daily</p>
<p>
<p>WHO WILL PERFORM? Computer Team</p>
<p>
<p>TIME TO COMPLETE: 5 minutes</p>
<p>
<p>TECHNOLOGY REQUIREMENT(S): </p>
<p>
<p>SOURCE DOCUMENT(S): N/A</p>
<p>
<p>CODES AND DEFINITIONS: N/A</p>
<p>
<table border="1">
<tr>
<td>
<p>KPI’s: </p>
</td>
</tr>
</table>
<p>
<table border="1">
<tr>
<td>
<p>RISKS: </p>
</td>
</tr>
</table>
After this there is a whole bunch of text. What I need to do is from the above I need to parse out specific data.
I need to parse out the Department, Function, Process, Procedure, Objective, When to Perform, Who Will Perform, Time To Complete, Technology Requirements, Source Documents, Codes and Definitions, and Risks.
I then need to delete this information from the Html column while leaving everything else intact. Is this possible in LINQ?
Here is the LINQ query I am using:
var result = (from d in IPACS_Documents
              join dp in IPACS_ProcedureDocs on d.DocumentID equals dp.DocumentID
              join p in IPACS_Procedures on dp.ProcedureID equals p.ProcedureID
              where d.DocumentID == 4
                 && d.DateDeleted == null
              select d.Html);
Console.WriteLine(result);
This regex worked just fine for me on your input data
(DEPARTMENT|FUNCTION|OBJECTIVE):\s*(?<value>.+)\<
The result is multiple Matches with 2 groups each: the first is the key and the second is the value. I have only handled three keys, but you can add the rest easily enough.
To remove the information thus parsed, you can do a Regex.Replace with this regex
(?<start>\<p\>(?<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)
and replacement string as
${start}${end}
leaving out value.
In code, this looks kinda like this (quickly typed this out in Notepad++ - may have minor errors).
// Named groups: start = "<p>KEY: ", value = the text to extract, end = "</p>".
private static readonly Regex ParseDocRegex = new Regex(
    @"(?<start>\<p\>(?<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)(?<value>.+)(?<end>\</p\>)",
    RegexOptions.ExplicitCapture | RegexOptions.Compiled);
...
from html in result
let matches = ParseDocRegex.Matches(html)
where matches.Count > 0
select new
{
    // One key/value pair per matched field.
    namesAndValues = from m in matches.Cast<Match>()
                     select new KeyValuePair<string, string>(m.Groups["name"].Value, m.Groups["value"].Value),
    // Keep the surrounding markup and the key, drop the value.
    strippedHtml = ParseDocRegex.Replace(html, "${start}${end}")
};
This ought to give you the desired output.
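The extract-then-strip technique is independent of LINQ, and a small runnable version may make the two steps easier to follow. This is a minimal sketch in Python rather than C#, under a few stated assumptions: Python spells named groups (?P<name>...) and group references in the replacement \g<name>, the value group is made lazy (.+?) so it stops at the first closing tag, and DOC is a shortened copy of the input from the question. The helper name parse_and_strip is hypothetical.

```python
import re

# Shortened copy of the document from the question.
DOC = """<p>Some company</p>
<p>DEPARTMENT: Legal Process</p>
<p>FUNCTION: Computer Department</p>
<p>OBJECTIVE: To ensure that the process server receive their invoices the following day.</p>
<p>WHEN TO PERFORM: Daily</p>
"""

FIELD_RE = re.compile(
    r"(?P<start><p>(?P<name>DEPARTMENT|FUNCTION|OBJECTIVE):\s*)"
    r"(?P<value>.+?)(?P<end></p>)"
)


def parse_and_strip(html):
    """Return (fields, stripped_html) for the keys handled by FIELD_RE."""
    # Step 1: collect key -> value for every matching paragraph.
    fields = {m.group("name"): m.group("value") for m in FIELD_RE.finditer(html)}
    # Step 2: rebuild each match from its start and end groups only,
    # which drops the value and leaves the rest of the markup untouched.
    stripped = FIELD_RE.sub(r"\g<start>\g<end>", html)
    return fields, stripped


fields, stripped = parse_and_strip(DOC)
print(fields)
print(stripped)
```

Keys that are not listed in the alternation, such as WHEN TO PERFORM here, pass through unchanged until they are added to the pattern.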
It can be done with many LINQ statements but using regular expressions you need only a few lines of code.
For HTML, you need an HTML parser. Try HTML Agility Pack or CsQuery.
Regular expressions can handle simple matches against HTML but are not sufficient for hierarchical structures and queries would be less precise.
Any HTML extraction is going to be fragile as the structure of the HTML changes. HTML is a presentation format and creators seldom care about machine interpretation. At least with a parser, you'll get an accurate model for the presentation markup (assuming it is valid HTML). You'll also get translation of entities into characters and the ability to extract all the descendant text of an element without internal markup elements like bold or italics.
You can use arbitrary assemblies in LINQPad simply by adding a reference, and for expression-based script, you can import designated namespaces automatically.

Resources