How to remove an HTML element (selected by class) from a string in Go? - go

Given the example below:
content := `<p>https://github.com/</p>
<div class="extract">
<p>hello1</p>
</div>
<div>hello2</div>
<div class="extract"><p>hello3</p></div>`
I want to remove every <div> that has class="extract", including all of its child elements.
I want to get the result below:
content := `<p>https://github.com/</p>
<div>hello2</div>`
I tried to use a regex, but it's not working.

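You can use goquery to parse and modify your HTML: load the string into a document, select the elements with Find("div.extract"), and call Remove() on the selection.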

Related

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> elements with id="f5". With my current XPath selector that's not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to be directly underneath the div, as in the original file.
Am I missing some basics? I haven't found anything on the web about parsing two elements in parallel. Most posts are concerned with parsing two classes of a div, for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5", in document order.
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

xpath how to skip a node

<article class='article-contents'>
<div class='summary'>xxxx</div>
<p>xxxxxx</p>
<table>...</table>
<p>....</p>...
</article>
I have an HTML structure like the one above. I'd like to skip past <div class='summary'> and get the rest of the content inside the article element using XPath.
You could use a query like this:
//article[@class='article-contents']/node()[not(local-name()='div' and @class='summary')]
This should select all child nodes of the article excluding the summary div.

How to check if a node is inside a form tag?

Using XPath, how do I determine if a node is within a form tag? I guess I am trying to locate the form tag of its ancestor/preceding (but I couldn't get it to work).
example 1:
<form id="doNotKnowIDofForm">
<div id="level1">
<span id="mySpan">someText</span>
</div>
</form>
example 2:
<form id="doNotKnowIDofForm">
This is a closed form.
</form>
<div id="level1">
<span id="mySpan">someText</span>
</div>
</form>
I can use xpath "//span[id='mySpan']" to locate the span node. But I would like to know if mySpan is inside a form (I do not know the id of the form). I have tried "//span[id='mySpan']/preceding::form/" and "//span[id='mySpan']/ancestor::form/"
Thanks in advance.
EDIT: I would like the XPath to select the myForm form tag in Example1 but NOT in Example2
I'm not 100% sure from your description whether you're looking to select the form element, or the span element. It seems more likely that you're going for the form, so I'll address that first.
Your XPath with the ancestor::form would have been ok if it didn't have the slash at the end, but it's more roundabout than it needs to be. I think this is a better way:
//form[.//span/@id = 'mySpan']
or this:
//form[descendant::span/@id = 'mySpan']
To produce an XPath that locates certain nodes only if they are within a form, you would put the ancestor::form inside the predicate:
//span[@id = 'mySpan' and ancestor::form]
or you can do this, which would again be more straightforward:
//form//span[@id = 'mySpan']
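To make the ancestor::form test concrete, here is a minimal Go sketch of the same check done by hand: it walks the markup once, keeps a stack of the currently open elements, and looks for a form on that stack when the target span is reached. It uses only the standard library, not an XPath engine, and assumes well-formed markup; insideForm is a hypothetical helper name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// insideForm (hypothetical helper) reports whether the first <tag> element
// with the given id has a <form> ancestor. It returns false if no such
// element exists. Assumes well-formed markup.
func insideForm(src, tag, id string) (bool, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false

	var open []string // names of the currently open elements, outermost first
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return false, nil
		}
		if err != nil {
			return false, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == tag {
				for _, a := range t.Attr {
					if a.Name.Local == "id" && a.Value == id {
						// Target found: check its ancestors.
						for _, name := range open {
							if name == "form" {
								return true, nil
							}
						}
						return false, nil
					}
				}
			}
			open = append(open, t.Name.Local)
		case xml.EndElement:
			if len(open) > 0 {
				open = open[:len(open)-1]
			}
		}
	}
}

func main() {
	example1 := `<form id="doNotKnowIDofForm"><div id="level1"><span id="mySpan">someText</span></div></form>`
	example2 := `<form id="doNotKnowIDofForm">This is a closed form.</form><div id="level1"><span id="mySpan">someText</span></div>`

	for _, src := range []string{example1, example2} {
		ok, err := insideForm(src, "span", "mySpan")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ok)
	}
}
```

The first input nests the span inside the form, so the check succeeds; in the second the form is already closed when the span is reached, so it fails. The second input drops the stray closing form tag from the question's example 2 so that it stays well-formed.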
Your own attempt
//span[id='mySpan']/ancestor::form/
looks fine to me.
You can simply use,
"form//span[id='mySpan']"

Selenium WebDriver ruby accessing span value

What is the best way to select this element using Selenium WebDriver?
I am trying to access the <span> element through the class mapResultNumber. This is the actual HTML:
<div class="mapResultInner">
<div class="mapResultNumber">
<span>4</span>
</div>
You could use the XPath //div[@class='mapResultNumber']/span
Using a CSS selector is a more readable way:
element = @driver.find_element(:css => "div.mapResultNumber span")
A dot (period) after a tag name introduces the class to select.
A single space after the first selector (i.e. "div.mapResultNumber") indicates that the next tag is found anywhere inside the previous one.
You could also use div.mapResultNumber > span to indicate that the span tag is a direct child of the div.

Need php function to pull value from string

The string
<div id="main">
content (is INT)
<div>some more content (is not INT) other content (also INT)</div>
</div>
I need to get the content which is an INT. A simple strip-all-non-INT function will not work, since the other content is sometimes also an INT. I cannot use a select-child solution, since the content is always outside the inner div, and selecting the content of <div id="main"> will also select the other div.
So is there a solution that can search the string from the start for the first < after the content and remove the rest of the string once it is found?
(The structure cannot be altered)
If that's the exact format, you could just use substr and strpos.
Something like:
$html = '<div id="main">
12345
<div>foobar6789</div>
</div>
';
$content_1 = substr($html,15,strpos($html,'<div>')-15); //the first INT content
$subdiv = str_replace("</div>","",substr($html,strpos($html,'<div>')+5));
preg_match('/(?P<noint>[^0-9]+)(?P<digit>\d+)/', $subdiv, $matches);
echo $matches['noint'];//the NO INT content
echo $matches['digit'];//the second INT
It's not a good idea to parse HTML using regular expressions, but maybe you could do it using only preg_match.
Good luck!
