Nokogiri: how to parse text fragment?

I have this example:
html= <<EOT
<div>Some text1
<p>Some text2</p>
</div>
EOT
doc = Nokogiri::HTML(html)
puts doc.css('div').text
This prints:
Some text1
Some text2
But I need only "Some text1".

doc.css('div').children.first.text
# => "Some text1\n "
doc.css('div').children.first.text.rstrip
# => "Some text1"

One XPath expression and a strip will get you there:
some_text1 = doc.xpath('//div/text()[1]').text.strip

Related

Xpath get text with <br>

Lets say that I have html:
<div class="html">
<div class="offer">
text1
<br>
text2
<br>
text3
<br>
text4
<br>
text5
</div>
</div>
I'm trying to get the full text with the XPath //div[@class='offer']/text(), but the result contains only text1.
I tried adding [preceding-sibling::br], but the result is the same.
What I need:
text1 text2 text3 text4 text5

I fixed it by using the extract() method.

Replace token in a file with any content

As a final result I want to copy several lines of text from file input.html to output.html.
input.html
<body>
<h1>Input File</h1>
<!-- START:TEMPLATES -->
<div>
<p>Lorem Ipsum & Lorem Ipsum</p>
<span>Path: /home/users/abc.txt</span>
</div>
<!-- END:TEMPLATES -->
<body>
template.html
<body>
<h1>Template File</h1>
<!-- INSERT:TEMPLATES -->
<p>This is a Text with & /</p>
<body>
I tried different things in PowerShell and Bash to get this done, but without success.
Getting the input into a variable works with:
content="$(sed -e '/START:TEMPLATES/,/END:TEMPLATES/!d' input.html)"
But I could not get the replacement in the other file to work. I tried sed and awk. Both have a lot of problems if the variable contains special characters such as & or /.
output.html
<body>
<h1>Output File</h1>
<div>
<p>Lorem Ipsum & Lorem Ipsum</p>
<span>Path: /home/users/abc.txt</span>
</div>
<p>This is a Text with & /</p>
<body>
Thank you for any input that helps solve my problem.

If the START/END comments are on separate lines I'd build a simple parser for the input file like this:
$inTemplate = $false
$template = switch -Wildcard -File '.\input.html' {
    '*<!-- START:TEMPLATES -->*' {
        $inTemplate = $true
    }
    '*<!-- END:TEMPLATES -->*' {
        $inTemplate = $false
    }
    default {
        if ($inTemplate) {
            $_
        }
    }
}
Now we can do the same thing for the template file:
$output = switch -Wildcard -File '.\template.html' {
    '*<!-- INSERT:TEMPLATES -->*' {
        # return our template input
        $template
    }
    default {
        # otherwise return the input string as is
        $_
    }
}
# output to file
$output | Set-Content output.html

A solution with awk:
All lines from input.html between <!-- START:TEMPLATES --> and <!-- END:TEMPLATES --> are stored in an array, insert_var.
In the END section, template.html is read and printed line by line in a while loop. If a line contains <!-- INSERT:TEMPLATES -->, the contents of insert_var are printed instead.
The output is redirected to output.html.
As far as I know, awk does not mangle those special characters.
awk -v temp_file="template.html" '
BEGIN{input_line_num=1}
/<!-- END:TEMPLATES -->/{linestart=0}
{ if(( linestart >= 1)) {insert_var[input_line_num]=$0; input_line_num++}}
/<!-- START:TEMPLATES -->/{linestart=1}
END{ while ((getline<temp_file) > 0)
{if (( $0 ~ "<!-- INSERT:TEMPLATES -->"))
{for ( i = 1;i < input_line_num; i++) {print insert_var[i]}}
else { print } }}
' input.html > output.html
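
For comparison, the same two-pass idea (collect the lines between the markers, then swap them in for the placeholder line) can be sketched in Ruby. This is only an illustration under some assumptions: the helper names extract_block and fill_template are made up for this example, the file contents are inlined as heredocs instead of being read from input.html and template.html, and each marker is assumed to sit on its own line. Because the lines are copied as plain strings and never used as a regex or a replacement pattern, characters such as & and / need no escaping.

```ruby
# Collect the lines strictly between the START and END marker lines.
def extract_block(lines, start_marker, end_marker)
  inside = false
  lines.each_with_object([]) do |line, block|
    if line.include?(end_marker)
      inside = false
    elsif line.include?(start_marker)
      inside = true
    elsif inside
      block << line
    end
  end
end

# Replace every line that contains insert_marker with the collected block.
def fill_template(template_lines, insert_marker, block)
  template_lines.flat_map do |line|
    line.include?(insert_marker) ? block : [line]
  end
end

# Inlined stand-ins for input.html and template.html from the question.
input = <<~HTML.lines
  <body>
  <h1>Input File</h1>
  <!-- START:TEMPLATES -->
  <div>
  <p>Lorem Ipsum & Lorem Ipsum</p>
  <span>Path: /home/users/abc.txt</span>
  </div>
  <!-- END:TEMPLATES -->
  <body>
HTML

template = <<~HTML.lines
  <body>
  <h1>Template File</h1>
  <!-- INSERT:TEMPLATES -->
  <p>This is a Text with & /</p>
  <body>
HTML

block = extract_block(input, '<!-- START:TEMPLATES -->', '<!-- END:TEMPLATES -->')
output = fill_template(template, '<!-- INSERT:TEMPLATES -->', block)
puts output.join
```

With real files, File.readlines('input.html') and File.write('output.html', output.join) could take the place of the heredocs and the final puts.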

Extract string matching regex

I wanted to extract the value between "#" and ":" as well as after ":" within the following string:
str =
"this is some text
Text#7789347: 4444
some text
text # 7789348 : 666,555
some text
"
Output:
"7789347", " 4444"
"7789348", " 666,555"
I am using the following regex:
(\s)*[t|T][e|E][x|X][t|T](\s)*#(\s)*(\d)*(\s)*:.*
I can select the required field, but I don't know how to get the values.

In case you have to match only floating digits, you can use the /(?mi)^\s*\btext\b.*?#\s*(\d+(?:,\d+)?)\s*:\s*(\d+(?:,\d+)?)$/ regex:
str="""this is some text
Text#7789347: 4444
some text
text # 7789348 : 666,555
some text
"""
puts str.scan(/(?mi)^\s*\btext\b.*?#\s*(\d+(?:,\d+)?)\s*:\s*(\d+(?:,\d+)?)$/)
Output of the demo:
7789347
4444
7789348
666,555

You can scan it like this:
str.each_line{ |line|
a = line.scan(/#(.*):(.*)$/)
puts a[0].inspect if !a.empty?
}
# ["7789347", " 4444"]
# [" 7789348 ", " 666,555"]

To get the values you can use: #\s*(.*?)\s*:\s*(\d+(?:,\d+)*)
if line =~ /#\s*(.*?)\s*:\s*(\d+(?:,\d+)*)/
match1 = $~[1]
match2 = $~[2]
else
match = ""
end

The regex below may help:
#\s*(\d+)\s*:\s*([0-9,]*)
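
As a rough sketch of how this pattern could be applied in Ruby, String#scan returns one array of capture groups per match, so each id and value arrives as a pair. The constant name PAIR and the variable names are just illustrative choices. Note that the surrounding \s* consumes the whitespace, so the values come back without the leading space shown in the question's expected output.

```ruby
str = <<~TEXT
  this is some text
  Text#7789347: 4444
  some text
  text # 7789348 : 666,555
  some text
TEXT

# Group 1: the digits after "#"; group 2: digits and commas after ":".
PAIR = /#\s*(\d+)\s*:\s*([0-9,]*)/

pairs = str.scan(PAIR)
pairs.each { |id, value| puts "#{id} => #{value}" }
```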

Multiple occurrences in sed substitution

I am trying to retrieve some data within a specific div tag in my html file.
My current html code is in the following format.
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract the text of just the div with class2, using this bash code
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
the output HTML file contains:
some text some text </div> Some more text </div> Too much text
I want everything after the first </div> to be removed, but instead only the final one is removed.
Can someone please explain my mistake?

You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file loop through the array, skipping the first and last lines.
Instead of making an array, you could check for the first and last lines using a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
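
As for why the original attempt keeps too much: a greedy .* always extends to the last place where the rest of the pattern can still match, so a pattern that ends in a closing </div> stops at the final one, not the first. A minimal Ruby sketch of the difference follows, assuming the markup from the question and a class2 div that contains no nested divs (for anything more complex, a real HTML parser is the safer choice). The variable names lazy and greedy are illustrative only.

```ruby
html = <<~HTML
  <div class = "class0">
  <div class = "class1">
  <div class = "class2">
  some text some text
  </div>
  Some more text
  </div>
  Too much text
  </div>
HTML

# The /m flag lets . match newlines.
# .*? (lazy) stops at the first </div> after the opening tag.
lazy = html[/<div class = "class2">(.*?)<\/div>/m, 1]
# .* (greedy) runs on to the last </div> in the document.
greedy = html[/<div class = "class2">(.*)<\/div>/m, 1]

puts lazy.strip
puts greedy.split.join(' ')
```

The greedy capture, once its whitespace is collapsed, is the same string as the unwanted output reported in the question.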

This works for retrieving text inside the div class = "class2" tags
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'

extract pattern with text editors

I have a URL source page like:
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
<td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>
I want to use a text-processing tool like sed or awk to extract exactly those URLs that end in .bz2,
like:
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Could you help me?

Sed and grep:
sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'

$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
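
The same idea, a capture group that cannot cross a double quote and must end in .bz2, could be sketched in Ruby with String#scan. The sample is the first four lines of the page from the question, inlined as a heredoc; the variable names are illustrative.

```ruby
page = <<~'HTML'
  href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
  <td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
  <td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
  <td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
HTML

# [^"]* keeps the match inside one attribute value; \.bz2 must come right before the closing quote.
urls = page.scan(/href="([^"]*\.bz2)"/).flatten
puts urls
```

Unlike the line-based sed command, scan would also pick up several matching links on a single line.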

Use a proper parser. For example, using xsh:
open :F html input.html ;
for //a/@href['bz2' = xsh:matches(., '\.bz2$')]
echo (.) ;
