Wget page title - shell

Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5 <<code>>
output:
If it’s broke, fix it right - Keeping it Real Estate. Home

This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there is still ways to break it, including no head/title in the page.
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true xml parser is the right solution, but as your question is tagged shell, the above it the best I can come with.
The paste and the 2 sed can be merged in a single sed, but is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
Update:
As explain in the comments, the last sed above uses the T command which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As above still not working on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before the $ to avoid variable expansion.)
It seams that the :next does not like to be prefixed by a $, which could be a problem in some sed version.

The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'

Related

how to pipe multi commands to bash?

I want to check some file on the remote website.
Here is bash command to generate commands that calculate the file md5
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\;
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce | md5sum;
Then I pipe the outputed commands to bash, but only the first command was executed.
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\; | bash -v
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
3d2f0e76e04444f4ec456ef9f11289ec -
[root]#
Would you please try the following instead:
while read -r _ _ url _; do
wget -q -O - "$url"e | md5sum
done < <(head -n 3 zrcpathAll)
we should not put -i in front of "$url" here.
[Explanation about -i option]
Manpage of wget says:
-i file
--input-file=file
Read URLs from a local or external file. [snip]
If this function is used, no URLs need be present on the command line. [snip]
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html.
Furthermore, the file's location will be implicitly used as base
href if none was specified.
where the file will contain line(s) of url such as:
https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce
https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce
https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce
Whereas if we use the option as -i url, wget first
downloads the url as a file which contains the lines of urls
as above. In our case, the url is the target to download itself,
not the list of urls, wget causes an error: No URLs found in url.
Even if the wget fails, why the command outputs just one line, not
three lines as the result of md5sum?
This seems to be because the head command immediately flushes the remaining
lines when the piped subprocess fails.

Is it possible to comment inline in a multi-line newline escaped script?

When I have long multi-line piped commands in my scripts I would like to comment on what each line does, but I haven't found a way of doing so.
Given this snippet:
git branch -r --merged \
| grep " $remote" \
| egrep -v "HEAD ->" \
| util.esed -n 's/ \w*\/(.*)/\1/p' \
| egrep -v \
"$(skipped $skip | util.esed -e 's/,/|/g' -e 's/(\w+)/^\1$/g' )" \
| paste -s
Is it possible to insert comments in between the lines? It seems that using the backslash to escape the newline prevents me from adding comments at the end of the line, and I can't add the comment before the backslash, as that would hide the escaping.
Pseudo-code of what I would like the above script to look like
It seems I was unclear (?) of what I wanted by the above section, so to have a clue on what I am looking for, it should be in the similar vein of this:
git branch -r --merged \ # list merged remote branches
| grep " $remote" \ # filter out the ones for $remote
| egrep -v "HEAD ->" \ # remove some garbage
#strip some whitespace:
| util.esed -n 's/ \w*\/(.*)/\1/p' \
# remove the skipped branches:
| egrep -v \
"$(skipped $skip | util.esed -e 's/,/|/g' -e 's/(\w+)/^\1$/g' )" \
| paste -s # something else
It doesn't have to be exactly like this (obviously, it's not valid syntax), but something similar. If it's not possible directly, due to syntactical restrictions, perhaps it's possible to write self-modifying code that will have comments that are removed before executing it?
You can try something like that:
git branch --remote | # some comment
grep origin | # another comment
tr a-z A-Z

Bash: Parse Urls from file, process them and then remove them from the file

I am trying to automate a procedure where the system will fetch the contents of a file (1 Url per line), use wget to grab the files from the site (https folder) and then remove the line from the file.
I have made several tries but the sed part (at the end) cannot understand the string (I tried escaping characters) and remove it from that file!
cat File
https://something.net/xxx/data/Folder1/
https://something.net/xxx/data/Folder2/
https://something.net/xxx/data/Folder3/
My line of code is:
cat File | xargs -n1 -I # bash -c 'wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "#" -P /mnt/USB/ && sed -e 's|#||g' File'
It works up until the sed -e 's|#||g' File part..
Thanks in advance!
Dont use cat if it's posible. It's bad practice and can be problem with big files... You can change
cat File | xargs -n1 -I # bash -c
to
for siteUrl in $( < "File" ); do
It's be more correct and be simpler to use sed with double quotes... My variant:
scriptDir=$( dirname -- "$0" )
for siteUrl in $( < "$scriptDir/File.txt" )
do
if [[ -z "$siteUrl" ]]; then break; fi # break line if him empty
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "$siteUrl" -P /mnt/USB/ && sed -i "s|$siteUrl||g" "$scriptDir/File.txt"
done
#beliy answers looks good!
If you want a one-liner, you can do:
while read -r line; do \
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf \
--no-parent --restrict-file-names=nocontrol --user=test \
--password=pass --no-check-certificate "$line" -P /mnt/USB/ \
&& sed -i -e '\|'"$line"'|d' "File.txt"; \
done < File.txt
EDIT:
You need to add a \ in front of the first pipe
I believe you just need to use double quotes after sed -e. Instead of:
'...&& sed -e 's|#||g' File'
you would need
'...&& sed -e '"'s|#||g'"' File'
I see what you trying to do, but I dont understand the sed command including pipes. Maybe some fancy format that I dont understand.
Anyway, I think the sed command should look like this...
sed -e 's/#//g'
This command will remove all # from the stream.
I hope this helps!

Bash script - grep output

I need an output for multiple grep commands.
patterns: ([^"#]+)
wget -q -O - http://www.site1.com | grep -o -E -m 1 'site1content = "([^"#]+)"'
wget -q -O - http://www.site2.com | grep -o -E -m 1 'site2content"([^"#]+)"
.........
Output file:
http://www.site1.com***pattern
http://www.site2.com***pattern
Just redirect the output of your commands to a file.
wget -q -O - http://www.site1.com | grep -o -E -m 1 'site1content = "([^"#]+)"' > output.txt
wget -q -O - http://www.site2.com | grep -o -E -m 1 'site2content"([^"#]+)"' >> output.txt
> overwrites old content and >> appends to the end of the file.
Edit:
Not pretty but a quick and dirty solution might be
echo 'http://www.site1.com***'`wget -q -O - http://www.site1.com | grep -o -E -m 1 'site1content = "([^"#]+)"'` > output.txt
(untested)
As is the output you got from the above commends consists only of the pattern found because of the -o parameter:
http://explainshell.com/explain?cmd=grep+-o
I suggest using the above site for an explanation.

How to get the file size on Unix in a Makefile?

I would like to implement this as a Makefile task:
# step 1:
curl -u username:password -X POST \
-d '{"name": "new_file.jpg","size": 114034,"description": "Latest release","content_type": "text/plain"}' \
https://api.github.com/repos/:user/:repo/downloads
# step 2:
curl -u username:password \
-F "key=downloads/octocat/Hello-World/new_file.jpg" \
-F "acl=public-read" \
-F "success_action_status=201" \
-F "Filename=new_file.jpg" \
-F "AWSAccessKeyId=1ABCDEF..." \
-F "Policy=ewogIC..." \
-F "Signature=mwnF..." \
-F "Content-Type=image/jpeg" \
-F "file=#new_file.jpg" \
https://github.s3.amazonaws.com/
In the first part however, I need to get the file size (and content type if it's easy, not required though), so some variable:
{"name": "new_file.jpg","size": $(FILE_SIZE),"description": "Latest release","content_type": "text/plain"}
I tried this but it doesn't work (Mac 10.6.7):
$(shell du path/to/file.js | awk '{print $1}')
Any ideas how to accomplish this?
If you have GNU coreutils:
FILE_SIZE=$(stat -L -c %s $filename)
The -L tells it to follow symlinks; without it, if $filename is a symlink it will give you the size of the symlink rather than the size of the target file.
The MacOS stat equivalent appears to be:
FILE_SIZE=$(stat -L -f %z)
but I haven't been able to try it. (I've written this as a shell command, not a make command.) You may also find the -s option useful:
Display information in "shell output", suitable for initializing variables.
For reference, an alternative method is using du with -b bytes output and -s for summary only. Then cut to only keep the first element of the return string
FILE_SIZE=$(du -sb $filename | cut -f1)
This should return the same result in bytes as #Keith Thompson answer, but will also work for full directory sizes.
Extra: I usually use a macro for this.
define sizeof
$$(du -sb \
$(1) \
| cut -f1 )
endef
Which can then be called like,
$(call sizeof,$filename_or_dirname)
I think this is a case where parsing the output of ls is legitimate:
% FILE_SIZE=`ls -l $filename | awk '{print $5}'`
(no it's not: use stat, as noted by Keith Thompson)
For the type, you can use
% FILE_TYPE=`file --mime-type --brief $filename`

Resources