Saving files to separate html file before using grep/sed - bash

I'm working on a project that lets me navigate some urls. Right now I have:
#!/bin/bash
for file in $1
do
wget $1 >> output.html
cat output.html | grep -o '<a .*href=.*>' |
sed -e 's/<a /\n<a /g' |
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' |
grep 'http'
done
I want the user to be able to run the script as follows:
./navigator google.com
which will save the source of url into a new html file, which will then run my grep/seds and then save to a new file.
Right now I'm struggling with saving the url into a new html file. Help!

To create a new file for each URL, use url in your output filename for wget -O option:
#!/bin/bash
for url; do
out="output-$url.html"
wget -q "$url" -O "$out"
grep -o '<a .*href=.*>' "$out" |
sed -e 's/<a /\n<a /g' |
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d' |
grep 'http'
done
PS: As per comments above, added -q in wget to make it totally quiet.

Related

quoting in bash with ssh and grep

I don't get why this doesn't work:
filesToInclude="$(ssh -t $host ls -t /var/log/*.LOG | sort | egrep -A6 "$LastBootUp" | tr '\n' '[:space:]' | tr -s [:space:] ' ')"
allALL="$( ssh $host grep -Ev "$excludeSearch" $filesToInclude )"
on another server, which is capable of ag this works totally fine.
if I copy the output of filesToInclude to $filesToInclude manually, it works.
that is the output:
grep: o such file or directory
bash: 0m/var/log/A-MINI_23311_H007164M49_220419_1906_XX.LOG: No such file or directory

Sorted ouput, needs to have text inserted between string

I trying to add text (predefined) between a sorted output and saved to a new file.
I'm using a curl command to gather my info.
$ curl --user XXX:1234!## "http://......"
Then using grep to find IP addresses and sorting so they only appear once.
$ curl --user XXX:1234!## "http://......" | grep -E -o -m1 '([0-9]{1,3}[\.]){3}[0-9]{1,3}' | sort -u
I need to add <my_text_predefined> ([0-9]{1,3}[\.]){3}[0-9]{1,3} <my_text_predefined> between the regex ip address and then saved to a new file.
The script below only get my the ip address
$ curl --user XXX:1234!## "http://......" | grep -E -o -m1 '([0-9]{1,3}[\.]){3}[0-9]{1,3}' | sort -u
123.12.0.12
123.56.98.76
$ curl --user some_user:password "http://...." | grep -E -o -m1 '([0-9]{1,3}[\.]){3}[0-9]{1,3}' | sort -u | sed 's/.*/<prefix> -s & <suffix>/'
So if we need print some text for each IP ... try xargs
for i in {1..100}; do echo $i; done | xargs -n1 echo "Values are:"
if based on IP you would need to take decision put in a loop
for file $(curl ...) do ...
and check $file or do something with it ...

Bash: Parse Urls from file, process them and then remove them from the file

I am trying to automate a procedure where the system will fetch the contents of a file (1 Url per line), use wget to grab the files from the site (https folder) and then remove the line from the file.
I have made several tries but the sed part (at the end) cannot understand the string (I tried escaping characters) and remove it from that file!
cat File
https://something.net/xxx/data/Folder1/
https://something.net/xxx/data/Folder2/
https://something.net/xxx/data/Folder3/
My line of code is:
cat File | xargs -n1 -I # bash -c 'wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "#" -P /mnt/USB/ && sed -e 's|#||g' File'
It works up until the sed -e 's|#||g' File part..
Thanks in advance!
Dont use cat if it's posible. It's bad practice and can be problem with big files... You can change
cat File | xargs -n1 -I # bash -c
to
for siteUrl in $( < "File" ); do
It's be more correct and be simpler to use sed with double quotes... My variant:
scriptDir=$( dirname -- "$0" )
for siteUrl in $( < "$scriptDir/File.txt" )
do
if [[ -z "$siteUrl" ]]; then break; fi # break line if him empty
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "$siteUrl" -P /mnt/USB/ && sed -i "s|$siteUrl||g" "$scriptDir/File.txt"
done
#beliy answers looks good!
If you want a one-liner, you can do:
while read -r line; do \
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf \
--no-parent --restrict-file-names=nocontrol --user=test \
--password=pass --no-check-certificate "$line" -P /mnt/USB/ \
&& sed -i -e '\|'"$line"'|d' "File.txt"; \
done < File.txt
EDIT:
You need to add a \ in front of the first pipe
I believe you just need to use double quotes after sed -e. Instead of:
'...&& sed -e 's|#||g' File'
you would need
'...&& sed -e '"'s|#||g'"' File'
I see what you trying to do, but I dont understand the sed command including pipes. Maybe some fancy format that I dont understand.
Anyway, I think the sed command should look like this...
sed -e 's/#//g'
This command will remove all # from the stream.
I hope this helps!

curl complex usage with pattern

I'm trying to get 2 files using curl based on some pattern but that doesn't seem to work:
Files:
SystemOut_15.04.01_21.12.36.log
SystemOut_15.04.01_15.54.05.log
curl -f -k -u "login:password" https://myserver/cgi-bin/logviewer/index.cgi?getlogfile=SystemOut_15.04.01_21.12.36.log'&'server=qwerty123.com'&'numlines=100000000'&'appenv=MBL%20-%20PROD'&'directory=/app/WAS/was85/profiles/node/logs/mbl-server1
I know there is -A key but it doesn't work since my file is inside the link.
How can I extract those 2 files using a pattern?
Did that myself. One curl gets the list of logs on the webpage. Another downloads those files.
The code looks like:
for file in $(curl -f -k -u "user:pwd" https://selfservice.pwj.com/cgi-bin/logviewer/index.cgi?listdirectory=/app/smx_client_mob/data/log'&'appenv=MBL%20-%20PROD'&'server=xshembl04pap.she.pwj.com | grep href | sed 's/.*href="//' | sed 's/".*//' | sed 's/javascript:getLog//g' | sed "s/['();]//g" | grep -i 'service' | grep '^[a-zA-Z].*'); do
curl -o $file -f -k -u "user:pwd" https://selfservice.pwj.com/cgi-bin/logviewer/index.cgi?getlogfile="$file"'&'server=xshembl04pap.she.pwj.com'&'numlines=100000000'&'appenv=MBL%20-%20PROD'&'directory=/app/smx_client_mob/data/log; done

Wget page title

Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5 <<code>>
output:
If it’s broke, fix it right - Keeping it Real Estate. Home
This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there is still ways to break it, including no head/title in the page.
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true xml parser is the right solution, but as your question is tagged shell, the above it the best I can come with.
The paste and the 2 sed can be merged in a single sed, but is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
Update:
As explain in the comments, the last sed above uses the T command which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As above still not working on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before the $ to avoid variable expansion.)
It seams that the :next does not like to be prefixed by a $, which could be a problem in some sed version.
The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'

Resources