Using GREP to find table tags - bash

I'm trying to search through a large directory for any .html files that contain any <table> tags. The grep command seems to be the most appropriate, but I'm having some trouble nailing down the parameters to pass.
Currently I have: grep -r -l "^<table>$" /directory_to_search_through
I used -r to search recursively through all files and -l to print only the file names. However, the current pattern matches only lines consisting of exactly <table>, whereas I want a more comprehensive search that also catches table tags carrying ids, classes, etc. Additionally, I want to search only .html files, but specifying the directory as /directory/*.html yields a 'No such file or directory' message. Any help would be much appreciated.

To do this reliably you really need to use a bona fide HTML parser. If it's XHTML, then an XML parser would be fine, too.
You could get a good approximation of your desired results with something like this:
find /directory/to/search -name '*.html' | xargs grep -l '<table[ \t>]'
That will check all the .html files in the directory tree rooted at /directory/to/search, identifying those that contain (the beginning of) a <table> start tag anywhere on the line. It can also report false positives, such as the text <table inside a CDATA section (if in fact the file contains XHTML).
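If you have GNU grep, its --include option can restrict a recursive search to .html files in a single command; a minimal sketch of that variant (using [[:space:]>] as a rough stand-in for "end of the tag name"):
grep -rl --include='*.html' '<table[[:space:]>]' /directory/to/search
The same caveat applies: grep knows nothing about HTML structure, so false positives are still possible.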

As you have already discovered, grep is not the ideal tool for the job. If your input is well-formed XHTML, you could use an XML parser such as xmlstarlet:
xmlstarlet sel -t -m //table -f -o " table id:" -v "@id" -o " class:" -v "@class" -n *.html
This simply selects all <table> elements and extracts their id, class and the name of the file that they were found in.
For example:
$ cat file.html
<html>
<body>
<table id="abc" class="something">
</table>
</body>
</html>
$ cat file2.html
<html>
<body>
<table id="def" class="something-else">
</table>
</body>
</html>
$ xmlstarlet sel -t -m //table -f -o " table id:" -v "@id" -o " class:" -v "@class" -n *.html
file.html table id:abc class:something
file2.html table id:def class:something-else
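If, as with grep -l, you only want the names of the files that contain at least one <table> element, a small variation using only the options shown above should do (a sketch, not tested against your files):
$ xmlstarlet sel -t -m "//table[1]" -f -n *.html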

Related

Bash script that downloads an RSS feed and saves each entry as a separate html file

I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate html file. Here's what I've been able to create so far:
curl -L https://news.ycombinator.com//rss > hacke.txt
grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/\n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/<\/description>//g' | while read description; do
title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
if [ ! -f "$title.html" ]; then
echo "$description" > "$title.html"
fi
done
Unfortunately, it doesn't work at all :( Please point out where my mistakes are.
"Please point out where my mistakes are."
Your single mistake is trying to parse XML with regular expressions. You can't parse XML/HTML with RegEx! Please use an XML/HTML-parser like xidel instead.
The first <item> element node:
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
--output-node-format=xml --output-node-indent
<item>
<title>Show HN: I made an Ethernet transceiver from logic gates</title>
<link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
<pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
<comments>https://news.ycombinator.com/item?id=34035628</comments>
<description><a href="https://news.ycombinator.com/item?id=34035628">Comments</a></description>
</item>
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
Comments
Note that while the output of the first command is XML, the output for the second command is ordinary text!
With the integrated EXPath File Module you could then save this text(!) to an HTML-file:
$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write-text(
replace(title,"[<>:""/\\\|\?\*]",())||".html", (: remove invalid characters :)
description
)
'
But you can also save it as proper HTML by parsing the <description>-element-node and using file:write() instead:
$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write(
replace(title,"[<>:""/\\\|\?\*]",())||".html",
parse-html(description),
{"indent":true()}
)
'
$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
<head/>
<body>
Comments
</body>
</html>
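If you would rather keep the shape of your original while-loop, a rough bash sketch of the same idea (assuming xidel is installed; the "|" separator is an arbitrary choice and would break if a title ever contained one) might look like this:
xidel -s "https://news.ycombinator.com/rss" -e '//item/concat(title,"|",description)' |
while IFS='|' read -r title description; do
  safe=$(printf '%s' "$title" | tr -d '<>:"/\\|?*')    # strip characters that are invalid in file names
  [ -f "$safe.html" ] || printf '%s\n' "$description" > "$safe.html"
done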

How to quickly add a line to all html files after a line with a given string

In Short
Is there a way, from the bash command line, to do something like echo "***some code***" >> "***all files of some type***", but inserting right after the line in each file that includes a certain string such as <body?
In Long
I have to add Google Analytics to a site. It has an archive of content going back a few years.
For some recent archived years I just added :
...
</head>
<body<?php if ($onpageload != "") { echo " onload=\"$onpageload\""; }?>>
<?php include_once("analyticstracking.php") ?>
...
into the common_header.php file
For some of the early content they didn't make a common_header.php file. So, for the folders covering years without a common header, there are a lot of HTML files that need this line:
<?php include_once("analyticstracking.php") ?>
Is there some way, from the bash command line, to echo "<?php include_once("analyticstracking.php") ?>" >> "*.html" right after the line in each file that includes a certain string such as <body?
The site is managed with php, python, and html (with some javascript css etc).
I am on Ubuntu. Also note I am new to this kind of web work, so if my question is somehow wrong or a stupid question, please just let me know.
If you, for example, want to add a line before the line containing </body>, editing all files in place, you can
sed -i -e '\#</body>#i\New stuff' *.html
The same, but after the line matching <!-- insert here -->
sed -i -e '/<!-- insert here -->/a\New stuff' *.html
It is rather cumbersome to span multiple lines in sed with an s command. The line-oriented commands are i\ (insert before) and a\ (insert after).
Your particular case would be
sed -i -e '/<body/a\<?php include_once("analyticstracking.php") ?>' *.html
but only if your body tag does not have attributes spanning multiple lines. It is always problematic to edit html/xml files with text tools which are not aware of their structure, but it can work if you are sure of the actual text in the files you are editing.
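If the HTML files are spread over subdirectories rather than sitting in one folder, the same sed command can be driven by find; a sketch, assuming GNU sed's -i:
find /path/to/site -name '*.html' -exec sed -i -e '/<body/a\<?php include_once("analyticstracking.php") ?>' {} +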
If it's bash you're looking for, then I'd recommend sed -e 's/<body/<body <blabla>/g' [html FILE]. If the file content is <body>, then the output would be <body <blabla>>. You just substitute blabla with the PHP code.
The syntax is
sed -e 's/[search]/[replace]/g' [file name]
where -e supplies the sed expression to run and g makes the substitution global, i.e. it replaces every match on the line rather than only the first.
As for applying it to many files, here's how you do it: Change multiple files

Parsing sub-element of enclosure tag from XML with bash script

I am trying to extract a URL to a file from an <enclosure> tag in an XML file. The issue is that the order of the sub-elements in the <enclosure> tags seems to change. Normally it looks like this:
<enclosure length="3026587648" url="2015-0805.mpeg" type="video/mpeg" />
But sometimes the URL comes first, which means using cut -f is not reliable.
I have come as far as to get the entire enclosure tag with grep -m 1 "enclosure", and the URL with cut -d " " -f 3.
But there must be a better way to extract the URL, regardless of where it appears?
I'm currently on a Slackware installation and xmllint and xmlstarlet don't seem to be available.
Thanks for any feedback!!
You can use this sed:
grep -m 1 "enclosure" yourfile.txt | sed -n 's/^.*\(url="[^"]*"\).*$/\1/p'
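That prints the whole url="..." attribute. If you only want the bare URL value, a small variation (still regex-based, so it assumes the attribute sits on a single line) is:
grep -m 1 "enclosure" yourfile.txt | sed -n 's/^.*url="\([^"]*\)".*$/\1/p'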

How can I find text after some string over bash

I have this bash script and it works:
DIRECTORY='1.20_TRUNK/mips-tuxbox-oe1.6'
# Download the HTML page and save it to the ump.tmp file
wget -O 'ump.tmp' "http://download.oscam.cc/index.php?&direction=0&order=mod&directory=$DIRECTORY&"
ft='index.php?action=downloadfile&filename=oscam-svn'
st="-webif-Distribution.tar.gz&directory=$DIRECTORY&"
The file ump.tmp contains, for example, three links.
I need a way to find only the number 10082 in the first "a" link of the page. This number changes; when you run the script, e.g. a month later, it may be different.
I do not have the cat command. I have a receiver, not a Linux box; the receiver runs the Enigma system and cat isn't implemented.
I tried the following sed command, but it does not work:
sed -n "/filename=oscam-svn/,/-mips-tuxbox-webif/p" ump.tmp
Using a proper XHTML parser:
$ xmllint --html --xpath '//a/@href[contains(., "downloadfile")]' ump.tmp 2>/dev/null |
grep -oP "oscam-svn\K\d+"
But that string is not in the given HTML file.
"Find" is kind of vague, but you can use grep to get the link with the number 10082 in it from the temp file.
$ grep "10082" ump.tmp
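If grep alone is not enough because the number changes every time, a sed-only approximation that pulls the digits after oscam-svn out of the file (assuming the filename always has the form implied by your ft and st variables, and that your busybox build provides sed and head) would be:
sed -n 's/.*filename=oscam-svn\([0-9][0-9]*\).*/\1/p' ump.tmp | head -n 1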

Extract .co.uk urls from HTML file

I need to extract .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that?
P.S.: I'm learning bash
edit:
code sample:
32
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note that some repeat.
Important: I need all links, broken or 404 ones too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
Thanks!
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax which is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+.[[:space:]]//' \
| grep '^https\?://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from links, line by line
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique urls, pipe the output into sort -u
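For example:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u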
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by giving wget essentially no time for DNS lookups, it will not resolve anything and will just print the URLs. You can then grep for the URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
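Put together, the whole pipeline might look something like this (the trailing sort -u is optional, if you only want unique URLs):
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*" | sed 's/.*-- //' | sort -u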
If you do not have wget, then you can get it here
