I have a file named foo that contains some text (shown below). How can I get the string "I have not created a home page." into a variable? I was using the command variable=`cat foo | cut -d ">" -f 3`, which outputs "I have not created a home page." with lots of newlines in it. Is there a way to obtain the string without any newlines? Thanks a lot.
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html>
<META HTTP-EQUIV="resource-type" CONTENT="document">
</HEAD>
<BODY>
I have not created a home page.
</BODY>
</HTML>
cut is the wrong tool. Use awk:
cat > _.awk << "EOF"
# start collecting after the opening tag
/<BODY>/ { found=1; next }
# stop at the closing tag
/<\/BODY>/ && found { exit }
# print non-empty lines in between
found && NF { print }
EOF
awk -f _.awk foo
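To get the text into a variable, the awk filter can be wrapped in a command substitution, which also drops the trailing newline. The following is a minimal sketch under assumptions: the temporary file stands in for foo and holds a shortened copy of the sample, and the variable name body is only illustrative.

```shell
# Hypothetical stand-in for the file "foo" from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<HTML>
<HEAD>
</HEAD>
<BODY>
I have not created a home page.
</BODY>
</HTML>
EOF

# Keep only the non-empty lines strictly between <BODY> and </BODY>.
# $(...) strips the trailing newline from the captured output.
body=$(awk '/<BODY>/ { found=1; next } /<\/BODY>/ { found=0 } found && NF' "$tmp")
rm -f "$tmp"

printf '%s\n' "$body"
```

If the body spans several lines, the variable keeps the newlines between them; piping the awk output through tr -d '\n' would remove those as well.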
Ideally you should use a real XML parser, such as a DOM parser.
Print only the lines that do not start with a tag:
grep '^[^<]' foo
To assign the result to a variable:
v=$(grep '^[^<]' foo)
{ xmlstarlet sel -N html='http://www.w3.org/1999/xhtml' -t -m //html:body -v . <(tidy -asxml input.html) | tr -d '\n' ; } 2> /dev/null
I'm fetching a web page's source with wget and then using pup to grab the <meta> tag that I need. Now I want to print only the value of its content attribute.
In this case, the output I want is: https://example.com/my/folder/first.jpg?foo=bar
# wget page to /tmp/output.html
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"]')
The output of echo "$IMAGE_URL" is:
<meta property="og:image" content="https://example.com/my/folder/first.jpg?foo=bar">
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup --plain 'meta[property*="og:image"]' | sed -n 's/.*content="\([^"]*\)".*/\1/p')
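The sed step can be checked on its own, without wget or pup, by feeding it the tag from the question. This is only a sketch: it assumes the whole tag is on one line and that the attribute value is double-quoted, and the variable meta is introduced here for illustration.

```shell
# The <meta> tag as printed in the question (stand-in for pup's output).
meta='<meta property="og:image" content="https://example.com/my/folder/first.jpg?foo=bar">'

# Capture everything between content=" and the next double quote.
IMAGE_URL=$(printf '%s\n' "$meta" | sed -n 's/.*content="\([^"]*\)".*/\1/p')

printf '%s\n' "$IMAGE_URL"
```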
You can use attr{content} to get only the value of the content attribute.
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"] attr{content}')
When a cat << here-document invokes a bash function that itself uses cat <<, the indentation is applied only to the first line of the function's output.
This is better explained using a simple example script:
#!/bin/bash
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
return
}
cat << _EOF_
<html>
    $(write_multiple_lines)
</html>
_EOF_
The result is as follows (the <p> doesn't follow <h1>'s indentation):
<html>
    <h1>Header</h1>
<p>Paragraph</p>
</html>
while the desired result is:
<html>
    <h1>Header</h1>
    <p>Paragraph</p>
</html>
I was expecting the indentation would be inherited if cat << is used. Is there any workaround for this (other than manually adding indentation to subsequent lines as pointed out by #bob dylan in the comment)?
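The only way to 'preserve' it is to change your input. The reason <h1> is indented is that you've indented the command substitution here:
    $(write_multiple_lines)
That leading whitespace is emitted once, before the first line of the function's output; the remaining lines are inserted as-is. Since you don't want to change your input, e.g.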
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
return
}
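You could change the function to print the leading spaces itself in front of each line, e.g.
#!/bin/bash
write_multiple_lines() {
    # IFS= and -r keep each input line intact; printf prepends the indentation
    while IFS= read -r p; do
        printf '    %s\n' "$p"
    done <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
    return
}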
cat << _EOF_
<html>
$(write_multiple_lines)
</html>
_EOF_
output:
<html>
    <h1>Header</h1>
    <p>Paragraph</p>
</html>
Though this is less dynamic and less obvious than formatting the text verbatim, so I'd stick with my original suggestion before doing something like this.
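Another option is to leave the function untouched and indent its output at the call site. The sketch below assumes a small helper, indent, which is not part of the original script; render_page is likewise just a wrapper added so the result can be inspected.

```shell
#!/bin/bash
# indent PREFIX: hypothetical helper that prepends PREFIX to every line on stdin.
# PREFIX is pasted into a sed replacement, so it must not contain "/", "&" or "\".
indent() {
    sed "s/^/$1/"
}

# Unchanged from the question: the here-document text starts at column 0.
write_multiple_lines() {
cat <<_EOF_
<h1>Header</h1>
<p>Paragraph</p>
_EOF_
}

# The substitution itself sits at column 0; the pipe adds the indentation
# to every line of the function's output, not just the first.
render_page() {
cat <<_EOF_
<html>
$(write_multiple_lines | indent '    ')
</html>
_EOF_
}

render_page
```

This keeps the indentation decision with the caller, so the same function can be reused at different nesting depths.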
My Template.html contains two <pre> tags, and content from two different files needs to be inserted into them. The following command inserts the file content after every match. How can I insert only into the 1st or the 2nd <pre> tag?
sed -i -e '/<pre>/r file1.txt' Template.html
Template.html:
<html>
<body>
<h1>
<pre>
</pre>
<div>
<pre>
</pre>
</body>
</html>
file1.txt:
hello
world
file2.txt:
may
june
Expected Result:
<html>
<body>
<h1>
<pre>
hello
world
</pre>
<div>
<pre>
may
june
</pre>
</body>
</html>
sed is for doing simple s/old/new, that is all. It sounds like what you want would be something like this (set tgt to 1 or 2 or whichever <pre> you want the block to be inserted after):
awk -v tgt=1 '
NR==FNR { rec = rec $0 ORS; next }
{ print }
/<pre>/ && (++cnt == tgt) { printf "%s", rec }
' file1.txt Template.html
but with neither an example of file1.txt nor the expected output it's just an untested guess.
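Since the question involves two files and two <pre> tags, the same idea can be extended to fill both in a single pass. This is a minimal sketch rather than a tested drop-in: fill_pre_blocks is a hypothetical helper, and it assumes each <pre> tag sits on its own line and that the two insert files have different paths.

```shell
# fill_pre_blocks TEMPLATE FILE1 FILE2 (hypothetical helper):
# print TEMPLATE with FILE1 after the 1st <pre> line and FILE2 after the 2nd.
fill_pre_blocks() {
    awk '
        FILENAME == ARGV[1] { a = a $0 ORS; next }   # buffer the first insert file
        FILENAME == ARGV[2] { b = b $0 ORS; next }   # buffer the second insert file
        { print }                                    # copy every template line
        /<pre>/ {
            n++
            if (n == 1) printf "%s", a
            else if (n == 2) printf "%s", b
        }
    ' "$2" "$3" "$1"
}

# Demo with throwaway copies of the files from the question.
dir=$(mktemp -d)
printf '%s\n' '<html>' '<body>' '<h1>' '<pre>' '</pre>' '<div>' '<pre>' '</pre>' '</body>' '</html>' > "$dir/Template.html"
printf '%s\n' hello world > "$dir/file1.txt"
printf '%s\n' may june > "$dir/file2.txt"

result=$(fill_pre_blocks "$dir/Template.html" "$dir/file1.txt" "$dir/file2.txt")
rm -rf "$dir"

printf '%s\n' "$result"
```

The helper writes to standard output, so the template is left unmodified; redirect to a new file and move it into place if an in-place update is wanted.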
This might work for you (GNU sed):
sed -e '/<pre>/{x;s/^/x/;/^x\{1\}$/{x;r file1.txt' -e 'x};x}' Template.html
On encountering a line with the required tag, increment a counter in the hold space.
If the counter matches the required number (in this case 1) append the text file.
Thus the following will append the file after the third occurrence of the tag.
sed -e '/<pre>/{x;s/^/x/;/^x\{3\}$/{x;r file1.txt' -e 'x};x}' Template.html
I am trying to write a bash script that downloads a YouTube page, finds the latest video, and extracts its URL. I have the part that downloads the page, but I cannot figure out how to isolate the text containing the URL.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate a single part of an element.
The element is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.
With sed:
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To extract just the video href (an unescaped ? is a literal question mark in a basic regular expression):
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' YoutubePage.txt | sort -u
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe the curl output straight into it:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch?[^"]*\)".*/\1/p' | sort -u
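If grep supports the -o option (GNU and BSD grep do), the same extraction can be sketched without the newline trick. The html string below is made-up sample input modeled on the anchor in the question, not real page content, and the approach assumes double-quoted href attributes.

```shell
# Made-up sample: two links to the same video plus one unrelated link.
html='<a class="yt-uix-tile-link" href="/watch?v=q6TNODqcHWA">Why</a><a href="/about">About</a><a href="/watch?v=q6TNODqcHWA">dup</a>'

# -o prints each match on its own line; cut keeps the text between the quotes;
# sort -u drops the duplicates.
links=$(printf '%s\n' "$html" | grep -o 'href="/watch?[^"]*"' | cut -d '"' -f 2 | sort -u)

printf '%s\n' "$links"
```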
You can use lynx, a terminal browser whose -dump mode outputs the rendered page as text with the URLs extracted into a numbered list. This makes it easier to grep for the URL:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
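The breakdown can be tried offline on a couple of lines shaped like the " 123. https://..." entries it describes. The sample text is an assumption about what the numbered reference lines look like, not captured lynx output.

```shell
# Assumed shape of two numbered reference lines.
sample='  12. https://www.youtube.com/watch?v=EBbLPnQ-CEw
  13. https://www.youtube.com/about'

# Only lines containing /watch? are edited and printed; the leading
# index ("  12. ") is removed by the substitution.
urls=$(printf '%s\n' "$sample" | sed -n '/\/watch?/s/^ *[0-9]*\. *//p')

printf '%s\n' "$urls"
```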
I've been trying to get this function to work without returning errors, but so far I'm unable to figure out what the problem is. I'm using $(report_home_space) to insert the output of the function into a small bit of HTML, but I keep getting the error: report_home_space: command not found on line 30.
report_home_space () {
cat <<- _EOF_
<H2>Home Space Utilization</H2>
<PRE>$(du -sh /home/*)</PRE>
_EOF_
}
I'm new to shell scripting, but I can't find anything wrong with the syntax of the function, and the spelling seems correct. Thanks in advance.
Full script is:
#!/bin/bash
# Program to output a system information page
TITLE="System Information Report For $HOSTNAME"
CURRENT_TIME=$(date +"%x %r %z")
TIMESTAMP="Generated $CURRENT_TIME, by $USER"
report_uptime () {
cat <<- _EOF_
<H2>System Uptime</H2>
<PRE>$(uptime)</PRE>
_EOF_
}
report_disk_space () {
cat <<- _EOF_
<H2>Disk Space Utilization</H2>
<PRE>$(df -h)</PRE>
_EOF_
}
report_home_space () {
cat <<- _EOF_
<H2>Home Space Utilization</H2>
<PRE>$(du -sh /home/*)</PRE>
_EOF_
}
cat << _EOF_
<HTML>
<HEAD>
<TITLE>$TITLE</TITLE>
<BODY>
<H1>$TITLE</H1>
<P>$TIMESTAMP</P>
$(report_uptime)
$(report_disk_space)
$(report_home_space)
</BODY>
<HTML>
_EOF_
BTW, your script works fine. Did you by any chance type it up in a Windows environment before uploading to a UNIX env?
Try running:
dos2unix script.sh
What this does is change the line endings from Windows to Unix format, i.e. it strips \r (CR) from the line endings to change them from \r\n (CR+LF) to \n (LF).
Also, on an HTML note, you're missing a closing </HEAD> tag after your title tags.
You can also run "od -c filename" or "grep pattern filename | od -c" to see if there are any hidden characters in there.
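Where dos2unix is not installed, the check and the fix can be approximated with grep and tr. This is a sketch under assumptions: has_cr and strip_cr are hypothetical helper names, and the demo file is a made-up fragment written with CR+LF endings.

```shell
# has_cr FILE: succeed if FILE contains at least one carriage return.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: print FILE with every carriage return removed
# (all of them, not only those at line ends).
strip_cr() {
    tr -d '\r' < "$1"
}

# Demo: a fragment saved with Windows line endings.
demo=$(mktemp)
fixed=$(mktemp)
printf 'report_home_space () {\r\necho hi\r\n}\r\n' > "$demo"

if has_cr "$demo"; then before=crlf; else before=lf; fi
strip_cr "$demo" > "$fixed"
if has_cr "$fixed"; then after=crlf; else after=lf; fi
rm -f "$demo" "$fixed"

echo "before: $before, after: $after"
```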