Bash: Parse URLs from a file, process them, and then remove them from the file

I am trying to automate a procedure in which the system reads a file (one URL per line), uses wget to fetch the files from each site (an HTTPS folder), and then removes that line from the file.
I have tried several times, but the sed part at the end does not seem to receive the string correctly (I tried escaping characters), and nothing is removed from the file.
cat File
https://something.net/xxx/data/Folder1/
https://something.net/xxx/data/Folder2/
https://something.net/xxx/data/Folder3/
My line of code is:
cat File | xargs -n1 -I # bash -c 'wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "#" -P /mnt/USB/ && sed -e 's|#||g' File'
It works up until the sed -e 's|#||g' File part..
Thanks in advance!

Don't use cat if you can avoid it. It's considered bad practice and may be a problem with big files. You can change
cat File | xargs -n1 -I # bash -c
to
for siteUrl in $( < "File" ); do
It is more correct, and it is simpler to use sed with double quotes. My variant:
scriptDir=$( dirname -- "$0" )
for siteUrl in $( < "$scriptDir/File.txt" )
do
if [[ -z "$siteUrl" ]]; then break; fi # stop if the entry is empty
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "$siteUrl" -P /mnt/USB/ && sed -i "s|$siteUrl||g" "$scriptDir/File.txt"
done

@beliy's answer looks good!
If you want a one-liner, you can do:
while read -r line; do \
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf \
--no-parent --restrict-file-names=nocontrol --user=test \
--password=pass --no-check-certificate "$line" -P /mnt/USB/ \
&& sed -i -e '\|'"$line"'|d' "File.txt"; \
done < File.txt
EDIT:
When a sed address uses a delimiter other than /, you need to add a \ in front of the first delimiter (here, the first pipe).
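A possible variation on the loops above, as a minimal sketch: because a URL contains characters such as . that sed treats as regex syntax, the processed line can be removed with a fixed-string, whole-line match instead. The function names remove_line and process_urls are hypothetical helpers, not part of any answer here, and the download command is passed in as arguments so the logic can be tried without wget.

```bash
# Remove every line of FILE that is exactly equal to LINE.
# grep -F treats LINE as a literal string; -x requires a whole-line match.
remove_line() {
    local file=$1 line=$2 tmp
    tmp=$(mktemp) || return 1
    # grep exits with 1 when it prints nothing (the file ends up empty);
    # only a status above 1 is treated as a real error.
    grep -vxF -e "$line" -- "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$file"    # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
}

# Run COMMAND... once per URL listed in FILE; drop the URL only on success.
process_urls() {
    local file=$1 url snapshot
    shift
    snapshot=$(mktemp) || return 1
    cat "$file" > "$snapshot"      # read from a copy, so editing FILE in the loop is safe
    while IFS= read -r url; do
        [ -n "$url" ] || continue  # skip blank lines
        # < /dev/null keeps the command from consuming the URL list
        if "$@" "$url" < /dev/null; then
            remove_line "$file" "$url"
        fi
    done < "$snapshot"
    rm -f "$snapshot"
}
```

Assuming the wget options from the question, a call might look like process_urls File wget -r -nd -l 1 -c --no-parent -P /mnt/USB/ with the URL appended as the last argument. URLs whose download fails stay in the file, so the script can simply be re-run.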

I believe you just need to use double quotes after sed -e. Instead of:
'...&& sed -e 's|#||g' File'
you would need
'...&& sed -e '"'s|#||g'"' File'

I see what you are trying to do, but I don't understand the sed command with the pipes. Maybe it is some fancy format that I am not familiar with.
Anyway, I think the sed command should look like this:
sed -e 's/#//g'
This command will remove all # from the stream.
I hope this helps!
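For context on the pipes: sed accepts almost any character as the delimiter of an s command, so s|a|b| and s/a/b/ are equivalent, and | is convenient when the pattern is a URL full of slashes. The sketch below contrasts deleting a line with blanking its content; the function names are hypothetical, and the URL is still interpreted as a regular expression, so a . in it matches any character.

```bash
# Delete whole lines that contain the URL. An address with a custom
# delimiter must start with a backslash: \|...|
delete_url_lines() { sed '\|'"$1"'|d'; }

# Substitute the URL with nothing. The line itself remains, now empty.
blank_url() { sed 's|'"$1"'||g'; }
```

Both functions filter stdin to stdout, for example delete_url_lines 'https://something.net/xxx/data/Folder1/' < File.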

Related

How to pipe multiple commands to bash?

I want to check some files on a remote website.
Here is the bash command that generates the commands to calculate each file's md5:
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\;
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce | md5sum;
Then I pipe the generated commands to bash, but only the first command is executed.
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\; | bash -v
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
3d2f0e76e04444f4ec456ef9f11289ec -
[root]#
Would you please try the following instead:
while read -r _ _ url _; do
wget -q -O - "$url"e | md5sum
done < <(head -n 3 zrcpathAll)
We should not put -i in front of "$url" here.
[Explanation about -i option]
Manpage of wget says:
-i file
--input-file=file
Read URLs from a local or external file. [snip]
If this function is used, no URLs need be present on the command line. [snip]
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html.
Furthermore, the file's location will be implicitly used as base
href if none was specified.
where the file contains one URL per line, such as:
https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce
https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce
https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce
If we instead pass the option as -i url, wget first
downloads that url and treats it as a file containing a list of URLs
like the one above. In our case the url is itself the download target,
not a list of URLs, so wget reports an error: No URLs found in url.
Even if wget fails, why does the command print just one line
instead of three md5sum results?
This seems to be because the head command immediately flushes the remaining
lines when the piped subprocess fails.
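The loop from this answer can be wrapped in a small function, as a sketch. One general pitfall with loops fed from a pipe is that a command inside the loop may read from the same standard input and swallow the remaining records; whether that is what happened in the question is not established here, but redirecting the inner command's stdin from /dev/null rules it out. The name md5_each is hypothetical, and the download command is passed as arguments so it can be swapped for a stand-in.

```bash
# Read "field1 field2 url rest" records from stdin and print one checksum
# per record. FETCH... is the download command; the URL (plus the trailing
# "e" used in the question) is appended as its last argument.
md5_each() {
    local url
    while read -r _ _ url _; do
        [ -n "$url" ] || continue
        # < /dev/null keeps the fetch command from reading the record list
        "$@" "${url}e" < /dev/null | md5sum
    done
}
```

With the file from the question this would be called as head -n 3 zrcpathAll | md5_each wget -q -O - and should print one md5sum line per URL.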

Pass multiline argument with xargs

I want to pass a program's output as a single quoted argument using xargs, so that I can skip creating a temporary file.
With a temp file I can do it like this:
myprogram > /tmp/lld.json
zabbix_sender -z 127.0.0.1 -s testhost -k llditem -o "`cat /tmp/lld.json`"
rm /tmp/lld.json
But I don't want these extra steps with /tmp/lld.json.
So I tried to use xargs like this:
myprogram |
xargs -e -I'{}' zabbix_sender -z 127.0.0.1 -s testhost -k llditem -o "'{}'"
following the xargs manpage:
-I replace-str
-e[eof-str] ... If eof-str is omitted, there is no end of file string..
http://man7.org/linux/man-pages/man1/xargs.1.html
But xargs executes zabbix_sender many times, once for each line.
I guess that -I and -e are mutually exclusive options, but I may also be misinterpreting the xargs manual.
Would this work?
zabbix_sender -z 127.0.0.1 -s testhost -k llditem -o "`myprogram`"
If you insist on using xargs to do exactly that, then use -0:
myprogram | xargs -0 -I{} zabbix_sender -z 127.0.0.1 -s testhost -k llditem -o "'{}'"
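The difference between the two xargs modes can be observed without zabbix_sender. In the sketch below, make_payload stands in for myprogram and a tiny inline sh script stands in for zabbix_sender; both are assumptions made purely for illustration. With -I alone, xargs splits its input on newlines and runs the command once per line; with -0 the separator is the NUL byte, so input that contains no NUL arrives as a single argument.

```bash
# Stand-in for myprogram: emits a small multi-line JSON document.
make_payload() { printf '{\n  "data": []\n}\n'; }

# Stand-in for zabbix_sender: reports how many arguments it received.
# (xargs cannot call shell functions, so this is an inline sh script.)
count_args='echo "called with $# argument(s)"'

# -I alone: one invocation per input line.
per_line() { make_payload | xargs -I{} sh -c "$count_args" sh {}; }

# -0 with -I: the whole input becomes one argument, one invocation.
whole_input() { make_payload | xargs -0 -I{} sh -c "$count_args" sh {}; }
```

One difference from the command-substitution form is that "`myprogram`" strips trailing newlines, while the -0 variant passes them through as part of the argument.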

Pipe grep response to a second command?

Here's the command I'm currently running:
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")'
The response from this command is a URL, like this:
$ curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")'
http://google.com
I want to use whatever that URL is to essentially do this:
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | curl 'http://google.com'
Is there any simple way to do this all in one line?
Use xargs with the -I{} flag, which sets a placeholder for each line read from stdin, as below. The -r flag ensures that curl is not invoked when the preceding grep produces no output.
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | xargs -r -I{} curl {}
A short description of the -I and -r flags from the GNU xargs man page:
-I replace-str
Replace occurrences of replace-str in the initial-arguments with
names read from standard input.
-r, --no-run-if-empty
If the standard input does not contain any nonblanks, do not run
the command. Normally, the command is run once even if there is
no input. This option is a GNU extension
(or) if you are looking for a bash approach without other tools,
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | while IFS= read -r line; do [ -n "$line" ] && curl "$line"; done
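Another way to structure this, sketched under the assumption that the page carries a meta refresh tag with content="2;url=..." as in the question: capture the URL in a variable with command substitution and call curl only if it is non-empty. The extraction here uses plain sed rather than grep -P, which is not available in every grep build; the function names are hypothetical.

```bash
# Print the target of a tag such as
#   <meta http-equiv="refresh" content="2;url=http://google.com">
# read from stdin. Prints nothing if no such attribute is found.
extract_refresh_url() {
    sed -n 's/.*content="2;url=\([^"]*\)".*/\1/p'
}

# Fetch PAGE, then fetch the URL it redirects to, if there is one.
follow_refresh() {
    local url
    url=$(curl -s "$1" | extract_refresh_url | head -n 1)
    [ -n "$url" ] || return 1
    curl -s "$url"
}
```

A call such as follow_refresh 'http://test.com/?id=12345' would then replace the one-line pipeline.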

Bash script - Some commands don't work in sh file

I have some trouble with my bash script. The end of the file doesn't work, although every command works outside the file. I have two strings as arguments, $1 and $2. $acl_line and $usebackend_line are numbers, and their values are correct.
Here is the end of my file:
sed -i "$((acl_line+1))i \ \tacl\t\t is_$2_$1\t\thdr_com(host)\t-i $2.$1" /my_doc/haproxy/haproxy.cfg
sed -i "$((usebackend_line+1))i \ \tuse_backend\t$2_$1\tif is_$2_$1" /my_doc/haproxy/haproxy.cfg
echo -en "\nbackend $2_$1\n\tserver $2_$1 163.172.167.52:$3 maxconn 1024" >> /my_doc/haproxy/haproxy.cfg
cp -r "./model/*" "./script/lp_domains/$1/$2/"
sed -i 's/lp_ports/$ports/g' "./script/lp_domains/$1/$2/my_doc.yml"
sed -i 's/lp_name/$2-$1/g' "./script/lp_domains/$1/$2/my_doc.yml"
Thanks for your answers :)
If $1 and $2 should be interpolated, you cannot use single quotes.
Moreover, copying a file and then running sed -i on it is wasteful and error-prone. Just run sed and perform your substitutions at the same time.
sed -i -e "$((acl_line+1))i \ \tacl\t\t is_$2_$1\t\thdr_com(host)\t-i $2.$1" \
-e "$((usebackend_line+1))i \ \tuse_backend\t$2_$1\tif is_$2_$1" \
-e "\$a\
backend $2_$1\n\tserver $2_$1 163.172.167.52:$3 maxconn 1024" /my_doc/haproxy/haproxy.cfg
# remove ./model/my_doc.yml; instead have a template ./my_doc.yml.in
cp -r ./model/* "./script/lp_domains/$1/$2/"
sed -e "s/lp_ports/$ports/g" -e "s/lp_name/$2-$1/g" \
my_doc.yml.in >"./script/lp_domains/$1/$2/my_doc.yml"
(You should probably do something similar with haproxy.cfg.in actually.)
I have fixed my errors. They were just permission errors: sed creates some temporary files, so I added permissions for my user. Thanks for your help!
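Independently of the permissions issue, the quoting point from the answer can be checked in isolation. This is a minimal sketch with hypothetical function names: the same substitution is written once in double quotes, where the shell expands the variables before sed sees the script, and once in single quotes, where sed receives the literal text $ports.

```bash
# Fill in a template read from stdin. The sed scripts are in double
# quotes, so the shell expands $ports and $name first.
render_template() {
    local ports=$1 name=$2
    sed -e "s/lp_ports/$ports/g" -e "s/lp_name/$name/g"
}

# The same script in single quotes inserts the literal text $ports.
render_literal() {
    sed -e 's/lp_ports/$ports/g'
}
```

For example, render_template 8080 www-example < my_doc.yml.in > my_doc.yml writes the filled-in file in one step. Values containing / or & would need escaping before being placed in the sed replacement.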

Wget page title

Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5 <<code>>
output:
If it’s broke, fix it right - Keeping it Real Estate. Home
This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there are still ways to break it, including a page with no head or title.
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true XML parser is the right solution, but as your question is tagged shell, the above is the best I can come up with.
The paste and the two sed commands can be merged into a single sed, but the result is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
Update:
As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As the above still does not work on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before the $ to avoid variable expansion.)
It seems that :next does not like being prefixed by a $, which could be a problem in some sed versions.
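For trying these ideas without a network connection, the extraction step can be separated from the download. The sketch below is a simplified variant, not a replacement for a real HTML parser: it joins all lines with tr so that multi-line titles work, skips the head match entirely, and accepts attributes on the title tag. The function name html_title is hypothetical. It still fails on uppercase tags, and a second title element in the page would confuse it.

```bash
# Print the contents of the <title> element of an HTML document on stdin.
# Newlines are turned into spaces first so a title may span several lines.
html_title() {
    tr '\n' ' ' | sed -n 's!.*<title[^>]*>\(.*\)</title>.*!\1!p'
}
```

It would be combined with the download as wget --quiet -O - http://bit.ly/rQyhG5 | html_title.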
The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'

Resources