Delete HTML code from log file in SHELL - bash

I have a log file containing HTML code. I need to delete everything between the html tags (tags included) for every match in the file. How can I do that using filters?
Example of my file:
some text here
<html>
code
</html>
some text there
<html>
code
</html>
some other text
The output should be:
some text here
some text there
some other text

This awk should do it:
awk '/<html>/{f=1;next} !f; /<\/html>/{f=0}' file
some text here
some text there
some other text

Why not just:
sed '/<html>/,/<\/html>/d' file
It works for your example.
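To check that both answers behave the same, here is a minimal sketch that rebuilds the sample log from the question in a temporary file and runs each command on it. The variable names (log, awk_out, sed_out) are only illustrative. Both commands assume the <html> and </html> tags sit on lines of their own, as in the example.

```shell
#!/bin/sh
# Hypothetical sample log, mirroring the example in the question.
log=$(mktemp)
printf '%s\n' \
  'some text here' '<html>' 'code' '</html>' \
  'some text there' '<html>' 'code' '</html>' \
  'some other text' > "$log"

# awk: set a flag on <html>, print only while the flag is off, clear it on </html>.
awk_out=$(awk '/<html>/{f=1;next} !f; /<\/html>/{f=0}' "$log")

# sed: delete every line range from <html> through </html>.
sed_out=$(sed '/<html>/,/<\/html>/d' "$log")

printf '%s\n' "$sed_out"
[ "$awk_out" = "$sed_out" ] && echo "awk and sed agree"
rm -f "$log"
```

If a tag shares a line with text you want to keep, both commands drop that whole line, so this only fits logs shaped like the example.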

Related

Asciidoctor-pdf no parsing

First, thank you for this great resource. Beautiful pdf files it creates.
I have a bunch of text files with all kinds of text, some of which is gibberish. Some text lines start with a dot, etc.
Asciidoctor-pdf barfs on many pages, and correctly so. I've spent days trying to clean the text files with sed, but it's a never-ending game.
Is there a way, using Asciidoctor-pdf command options, to tell it to simply convert the text document to PDF without parsing it?
You could create a new AsciiDoc file where you include the text files using the include macro. If you want the converter to ignore the syntax you should use a passthrough block. If you want to display a fileA.txt and fileB.txt inside an allfiles.adoc it could look like this:
allfiles.adoc
= all files
== content of fileA.txt
++++
include::fileA.txt[]
++++
== content of fileB.txt
++++
include::fileB.txt[]
++++
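With many text files, writing allfiles.adoc by hand gets tedious. The following is a minimal sketch that generates it with a shell loop. The two sample files and the temporary directory are hypothetical stand-ins for your own files, and the generated layout simply mirrors the example above.

```shell
#!/bin/sh
# Hypothetical input files; replace with your own directory of .txt files.
workdir=$(mktemp -d)
printf '.a line starting with a dot\n' > "$workdir/fileA.txt"
printf 'plain text\n' > "$workdir/fileB.txt"

out="$workdir/allfiles.adoc"
{
  printf '= all files\n'
  for f in "$workdir"/*.txt; do
    name=$(basename "$f")
    # One section per file, with the include wrapped in a passthrough block.
    printf '\n== content of %s\n\n++++\ninclude::%s[]\n++++\n' "$name" "$name"
  done
} > "$out"

adoc=$(cat "$out")
printf '%s\n' "$adoc"
rm -rf "$workdir"
```

The include paths are written relative to the generated file, so allfiles.adoc should live next to the text files it includes.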

Pandoc: no line wrapping when converting to HTML

I am converting from Markdown to HTML like so:
pandoc --columns=70 --mathjax -f markdown input.pdc -t html -Ss > out.html
Everything works fine, except for the fact that the text doesn't get wrapped. I tried different column lengths, with no effect. I removed options, and still no go. Whatever I tried, the HTML just doesn't get wrapped. I searched the bug tracker, but there don't seem to be any open bugs relating to this issue. I also checked the documentation, and as far as I could glean, the text ought to be line-wrapped... So, have I stumbled into a bug?
I'm using pandoc version 1.12.4.2.
Thanks in advance for your help!
Pandoc puts newlines in the HTML so the source code is easier to read. By default, it doesn't insert <br>-tags.
If you want to preserve line breaks from markdown input:
pandoc -f markdown+hard_line_breaks input.md -o output.html
However, usually a better approach to limit the text width when opening the HTML file in the browser is to adapt the HTML template (pandoc -D html5) and add some CSS, like:
<!DOCTYPE html>
<html$if(lang)$ lang="$lang$"$endif$>
<head>
<style>
body {
width: 46em;
}
</style>
...
It is not clear which text should get wrapped but does not, as you did not provide a sample.
Pandoc supports several line-breaking scenarios in markdown documents.
What you may be looking for is the hard_line_breaks extension.
If so, your command should look like this:
pandoc --columns=70 --mathjax -f markdown+hard_line_breaks input.pdc -t html -Ss > out.html
I'd recommend reading about all the markdown-relevant options and configuring pandoc to match your input markdown flavor.

Using bash in order to extract data from a HTML forum list

I'm looking to create a quick script, but I've run into some issues.
<li type="square"> Y </li>
I'm basically using wget to download an HTML file, and then trying to search the file for the above snippet. Y is dynamic and changes each time, so in one file it might be "Dave", and in another "Chris". So I'm trying to get the bash script to find
<li type="square"> </li>
and tell me what is in between the two. The general formatting of the file is very messy:
<html stuff tags><li type="square">Dave</li><more html stuff>
<br/><html stuff>
<br/><br/><li type="square">Chris</li><more html stuff><br/>
I've been unable to come up with anything that works for parsing the file, and would really appreciate someone to give me a push in the right direction.
EDIT -
<div class="post">
<hr class="hrcolor" width="100%" size="1" />
<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>
</div>
is the block of code that I'm looking to extract the names from. The "-" symbol is something added onto the list to minimize its scope, so I just get that list. The problem I'm having is that:
awk '{print $2}' FS='(<[^>]*>)+-' 4287022.html > output.txt
only outputs the first list item, and not the rest.
You generally should not use regex to parse html files.
Instead you can use my Xidel to perform pattern matching on it:
xidel 4287022.html -e '<li type="square">{.}</li>*'
Or with traditional XPath:
xidel 4287022.html -e '//li[@type="square"]'
You could use grep -Eo "<li type=\"square\">-?(\w+)</li>" ./* for this.
Using sed:
sed -n 's/.*<li type="square"> *\([^<]*\).*/\1/p' input.html
awk '{print $2,$3,$4,$5}' FS='(<[^>]*>)+' 4287022.html
This treats each line of the HTML page as a table row. Instead of runs of whitespace, runs of HTML tags act as the field separator. The first field is the empty string before the first tag, so the names are in fields 2 through 5, which is what we print.
Result
-dave -chris -sarah -amber
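The awk command above hard-codes four fields and prints the names on one line. If the number of list items varies, a sketch along the following lines prints one name per line instead. It assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX), and the html variable is a shortened stand-in for the downloaded page. Like the other regex-based answers, it is only suitable for markup this regular.

```shell
#!/bin/sh
# Shortened stand-in for the relevant line of 4287022.html.
html='<div class="inner" id="msg_4287022"><ul class="bbc_list"><li type="square">-dave</li><li type="square">-chris</li><li type="square">-sarah</li><li type="square">-amber</li></ul><br /></div>'

# 1. grep -o emits each <li type="square">...</li> match on its own line.
# 2. sed strips the tags, then the leading "-" marker.
names=$(printf '%s\n' "$html" \
  | grep -o '<li type="square">[^<]*</li>' \
  | sed -e 's/<[^>]*>//g' -e 's/^-//')

printf '%s\n' "$names"
```

On a real page you would replace the printf with the file name as an argument to grep.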

chunking a file with awk or a shell script

This feels like it should be a simple task, but somehow I can't wrap my brain around it. I have HTML files with headers from H1 to H4. I would like to get the content between H3 tags: not the text between <H3> and </H3>, but rather the text between two H3s.
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
...
Thank you in advance
I've been asked to describe a sample output, which I thought I did in a comment below. I will restate the same, and if something is not clear, please let me know.
input: long file with many H3 headings
output: many small files each containing a fragment that starts with the line containing an H3 heading, and ends on the line before the next H3 heading.
Without you posting your expected output we're just guessing but if you literally want the text between </H3> and <H3>, here's one way with GNU awk:
$ cat file
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file
<p> more text that I would like to grab</p>
<p> some more text that I'd like to get </p>
$
$ cat file
<H3>some text</H3><p>more text that I would like to grab</p><H3>some other text</H3><p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file
<p>more text that I would like to grab</p><p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" 'NR>1{print $NF}' file
<p>more text that I would like to grab</p>
<p> some more text that I'd like to get </p>
You need GNU awk for that so you can have a multi-character RS.
Note that when there are newlines included in the text between your blocks those are reproduced in the output just like any other characters.
If the above is not what you want, again, tell us more....
The problem is that HTML syntax is quite flexible. For example:
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
And
<H3>
some text
</H3>
<p>
more
text
that
I
would
like
to
grab</p>
<H3>
some other text
</H3>
<p>some more text that I'd like to get
</p>
will produce the same output. Extra whitespace is stripped, and tags can be scattered all about. You can't simply look for a particular tag to know what you're after.
The only really robust way to do this is to use a full-fledged scripting language like Perl or Python that has modules that can parse and organize HTML-formatted files for you. You can't reliably parse HTML or XML with Unix's regular expressions.
Unfortunately, you've tagged this as bash, shell, or awk, and none of those can really handle HTML input in a clean manner.
As a start, this shell line will extract the first H3 to H3 section. It reads standard input and assumes the first <H3 is not on line 1 of the input; if it is, the first range runs through the second heading instead.
$ sed -e '1,/<H3/d' -e '/<H3/,$d'
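The question asks for many small files, each starting at a line containing an H3 heading and ending on the line before the next one. A minimal awk sketch of that is shown below. It assumes each H3 opening tag starts a fragment on its own line, as in the first sample; the chunkNNN.html naming and the temporary directory are made up for illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# Hypothetical input, based on the sample in the question.
cat > "$workdir/input.html" <<'EOF'
<H1>title</H1>
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
EOF

awk -v prefix="$workdir/chunk" '
  # A line with an opening H3 tag starts a new output file.
  /<[Hh]3[ >]/ {
    if (out != "") close(out)
    n++
    out = sprintf("%s%03d.html", prefix, n)
  }
  # Lines before the first H3 are skipped; all others go to the current file.
  out != "" { print > out }
' "$workdir/input.html"

chunk_count=$(ls "$workdir"/chunk*.html | wc -l | tr -d ' ')
first_chunk=$(cat "$workdir/chunk001.html")
second_chunk=$(cat "$workdir/chunk002.html")
printf '%s\n---\n' "$first_chunk" "$second_chunk"
rm -rf "$workdir"
```

Closing each file before opening the next keeps the number of open files low on long inputs. Input where a heading shares a line with other content, as in the flexible-syntax example above, would need a real HTML parser.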

How to extract text between particular HTML tag in script

Given that I have some HTML in the form:
<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>
How can I extract this with a Unix script?
some other html stuff
You may check out html-xml-utils and its hxselect command, which allows you to extract elements that match a CSS selector (the -c option prints only the content of the matched element, without the surrounding tags):
hxselect -c '.c' < test.htm
This assumes that your input is a well-formed XML document. If it is not you might need to resort to regular expressions and the possible consequences of that.
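If you do fall back to a regular expression, a minimal sed sketch for the sample above could look like this. It only works under narrow assumptions: the whole div is on one line, the class attribute is written exactly as class="c", and there are no nested div elements. The html and text variable names are illustrative.

```shell
#!/bin/sh
# The sample document from the question.
html='<html>
<body>
<div id="1" class="c">some other html stuff</div>
</body>
</html>'

# Print only the text captured between the opening <div ... class="c" ...>
# tag and the closing </div> on the same line; other lines are suppressed by -n.
text=$(printf '%s\n' "$html" \
  | sed -n 's|.*<div[^>]*class="c"[^>]*>\(.*\)</div>.*|\1|p')

printf '%s\n' "$text"
```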
For simple uses, you can use Ex editor, for example:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -scq! file.html
some other html stuff
Here it finds the div tag, visually selects the inner HTML of that tag (vit), yanks it (y), replaces the buffer with the yanked text (%d, pu 0), prints it (%p), and quits (-cq!).
Another example, with a demo URL:
$ ex +'/<div/norm vity' +'%d|pu 0|%p' -Nscq! http://example.com/
The advantage is that ex is a standard Unix editor available in most Linux/Unix distributions.
See also:
How to jump between matching HTML/XML tags? at Vim SE
How to remove inner content of html tag conditionally? at Vim SE

Resources