Unescape the ampersand (&amp;) via XMLStarlet - bash

This is a quite annoying but rather simple task. Following this guide, I wrote this:
#!/bin/bash
content=$(wget "https://example.com/" -O -)
ampersand=$(echo '\&')
xmllint --html --xpath '//*[@id="table"]/tbody' - <<<"$content" 2>/dev/null |
xmlstarlet sel -t \
-m "/tbody/tr/td" \
-o "https://example.com" \
-v "a//@href" \
-o "/?A=1" \
-o "$ampersand" \
-o "B=2" -n
I successfully extract each link from the table and everything gets concatenated correctly; however, instead of reproducing the ampersand as &, I get this at the end of each link:
https://example.com/hello-world/?A=1\&amp;B=2
But actually, I was looking for something like:
https://example.com/hello-world/?A=1&B=2
The idea is to escape the character using a backslash (\&) so that it gets ignored. Initially, I tried placing it directly, as -o "\&" \, instead of -o "$ampersand" \, and removing ampersand=$(echo '\&'). Still the same result.
Essentially, by removing the backslash it still outputs:
https://example.com/hello-world/?A=1&amp;B=2
with only the \ before the & removed.
Why?
I'm sure it is something basic that is missing.

&amp; is the correct way to print & in an XML document, but since you just want a plain URL your output should not be XML. Therefore you need to switch to text mode, by passing --text or -T to the sel command.
Your example input doesn't quite work because example.com doesn't have any table elements, but here is a working example building links from p elements instead.
content=$(wget 'https://example.com/' -O -)
xmlstarlet fo --html <<<"$content" |
xmlstarlet sel -T -t \
-m '//p[a]' \
--if 'not(starts-with(a//@href,"http"))' \
-o 'https://example.com/' \
--break \
-v 'a//@href' \
-o '/?A=1' \
-o '&' \
-o 'B=2' -n
The output is
http://www.iana.org/domains/example/?A=1&B=2

As you have already seen, backslash-escaping isn't the solution here. I can think of two possible options:
Extract the hrefs (you probably don't need both xmllint and xmlstarlet for this), then just use a standard text-processing tool such as sed to add the start and the end:
sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,'
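For instance, the whole pipeline might then look something like this (a sketch reusing the question's XPath and URL pattern plus the fo --html trick from the answer above; details depend on the real page):
wget -qO- 'https://example.com/' |
xmlstarlet fo --html 2>/dev/null |
xmlstarlet sel -T -t -m '//*[@id="table"]/tbody/tr/td//a' -v '@href' -n |
sed 's,^,https://example.com,; s,$,/?A=1\&B=2,'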
Alternatively, pipe the output of what you've currently got to xmlstarlet unesc, which will change &amp; into &.
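A quick illustration of what unesc does:
printf '%s\n' 'https://example.com/hello-world/?A=1&amp;B=2' | xmlstarlet unesc
# prints: https://example.com/hello-world/?A=1&B=2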

Sorry, I can't reproduce your result, but why not make substitutions? Just filter your results through
sed 's/\\&/\&/g'
added to your pipe. It should replace every \& with &.
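For example:
printf '%s\n' 'https://example.com/hello-world/?A=1\&B=2' | sed 's/\\&/\&/g'
# prints: https://example.com/hello-world/?A=1&B=2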

Related

GNU parallel read from several files

I am trying to use GNU parallel to convert individual files with a bioinformatic tool called vcf2maf.
My command looks something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
--output-maf ${maf_dir}/${2}.maf \
--tumor-id ${3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
VCF_files, results and tumor_ids contain one entry per line and correspond to one another.
When I try and run the command I get the following error for every file:
ERROR: Both input-vcf and output-maf must be defined!
This confused me, because if I run the command manually, the program works as intended, so I don't think the input/output paths are wrong. To confirm this, I also ran
${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids},
which correctly prints the contents of the VCF files, whose path is listed in VCF_files.
I am really confused about what I did wrong. If anyone could help me out, I'd be very thankful!
For a command this long I would normally define a function:
doit() {
...
}
export -f doit
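Filled in from the question's command, the function might look like this (a sketch; the variables are the ones defined in the question's script, and they must be exported too - or use env_parallel - so the shells GNU Parallel spawns can see them):
doit() {
    perl "$vcf2maf" --input-vcf "$1" \
        --output-maf "$maf_dir/$2.maf" \
        --tumor-id "$3" \
        --tmp-dir "$vcf_dir" \
        --vep-path "$vep_script" \
        --vep-data "$vep_data" \
        --ref-fasta "$fasta" \
        --filter-vcf "$filter_vcf"
}
Inside the function, $1, $2, and $3 are the function's own arguments, which GNU Parallel supplies - so the positional-parameter syntax that silently failed in the quoted string does exactly what you wanted here.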
Then test this on a single input.
When it works:
parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
But if you want to use a single command it will look something like:
${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
--output-maf ${maf_dir}/{2}.maf \
--tumor-id {3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
GNU Parallel's replacement strings are {1}, {2}, and {3} - not ${1}, ${2}, and ${3}. Because the command is wrapped in double quotes, the calling shell expands ${1}, ${2}, and ${3} itself - to whatever your script's own positional parameters happen to be, typically nothing - before GNU Parallel ever sees the command, which is why vcf2maf sees empty input-vcf and output-maf values.
--dryrun is your friend when GNU Parallel does not do what you expect it to do.
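For example, a quick way to inspect the jobs (hypothetical file names):
parallel --dryrun --link "perl vcf2maf.pl --input-vcf {1} --output-maf {2}.maf --tumor-id {3}" \
    :::: vcfs.txt results.txt tumors.txt
This prints the exact command line GNU Parallel would run for each input triple, without running anything.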

Using wget in shell: trouble with a variable that has \

I'm trying to run a script for pulling finance history from yahoo. Boris's answer from this thread
wget can't download yahoo finance data any more
works for me ~2 out of 3 times, but fails if the crumb returned from the cookie has a "\" character in it.
Code that sometimes works looks like this
#!/usr/bin/sh
symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')
wget --no-check-certificate --save-cookies=cookie.txt https://finance.yahoo.com/quote/$symbol/?p=$symbol -O C:/trip/stocks/stocknamelist/crumb.store
crumb=$(grep 'root.*App' crumb.store | sed 's/,/\n/g' | grep CrumbStore | sed 's/"CrumbStore":{"crumb":"\(.*\)"}/\1/')
echo $crumb
fileloc=$"https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo $fileloc
wget --no-check-certificate --load-cookies=cookie.txt $fileloc -O c:/trip/stocks/temphistory/hs$symbol.csv
rm cookie.txt crumb.store
But that doesn't seem to be processed by wget the way I intend either; it seems to be interpreted as described here:
https://askubuntu.com/questions/758080/getting-scheme-missing-error-with-wget
Any suggestions on how to pass the $crumb variable into wget so that wget doesn't error out if $crumb has a "\" character in it?
Edited to show the full script. To clarify: I've got Cygwin installed with the wget package. I call the script from the cmd prompt as follows (an example where the script above is named "stocknamedownload.sh", the stock symbol I'm downloading is "A", and the start date is 19800101):
c:\trip\stocks\StockNameList>bash stocknamedownload.sh A 19800101
This script seems to work fine - unless the crumb returned contains a "\" character in it.
The following implementation appears to work 100% of the time -- I'm unable to reproduce the claimed sporadic failures:
#!/usr/bin/env bash
set -o pipefail
symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')
# store complete webpage text in a variable
page_text=$(curl --fail --cookie-jar cookies \
"https://finance.yahoo.com/quote/$symbol/?p=$symbol") || exit
# extract the JSON used by JavaScript in the page
app_json=$(grep -e 'root.App.main = ' <<<"$page_text" \
| sed -e 's#^root.App.main = ##' \
-e 's#[;]$##') || exit
# use jq to extract the crumb from that JSON
crumb=$(jq -r \
'.context.dispatcher.stores.CrumbStore.crumb' \
<<<"$app_json" | tr -d '\r') || exit
# Perform our actual download
fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
curl --fail --cookie cookies "$fileloc" >"hs$symbol.csv"
Note that the tr -d '\r' is only necessary when using a native-Windows jq mixed with an otherwise native-Cygwin set of tools.
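To see why jq is the right tool here, run the extraction against a stub of the page JSON containing an escaped character in the crumb (structure as assumed above; the real page may differ):
app_json='{"context":{"dispatcher":{"stores":{"CrumbStore":{"crumb":"abc\u002Fdef"}}}}}'
jq -r '.context.dispatcher.stores.CrumbStore.crumb' <<<"$app_json"
# prints: abc/def
jq decodes JSON escape sequences such as \u002F, which is exactly what the question's grep/sed pipeline passes through as a literal backslash sequence.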
You are adding quotes to the value of the variable instead of quoting the expansion. You are also trying to use tools that don't know what JSON is to process JSON; use jq.
wget --no-check-certificate \
--save-cookies=cookie.txt \
"https://finance.yahoo.com/quote/$symbol/?p=$symbol" \
-O C:/trip/stocks/stocknamelist/crumb.store
# Something like this; it's hard to reverse engineer the structure
# of crumb.store from your pipeline.
crumb=$(jq -r '.CrumbStore.crumb' crumb.store)
echo "$crumb"
fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo "$fileloc"
wget --no-check-certificate \
--load-cookies=cookie.txt "$fileloc" \
-O "c:/trip/stocks/temphistory/hs$symbol.csv"
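To make the quoting point concrete (illustrative URL):
fileloc=$"https://example.com/?a=1&b=2"   # $"..." is bash's locale-translation quoting; it adds no protection
fileloc="https://example.com/?a=1&b=2"    # plain double quotes are what was meant
wget "$fileloc"                           # and quote the expansion so the shell never word-splits or globs it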

curl -F line break not interpreted correctly

I'm trying to send a notification via pushover using curl in a bash script.
I cannot get curl -F to interpret the line break correctly though.
curl -s \
-F "token=TOKEN" \
-F "user=USER" \
-F "message=Root Shell Access on HOST \n `date` \n `who` " \
https://api.pushover.net/1/messages.json > NUL
I've tried:
\n
\\\n
%A0
I'd rather push the message out directly, not through a file.
curl doesn't interpret backslash escapes, so you have to insert an actual newline into the argument which curl sees. In other words, you have to get the shell (bash in this case) to interpret the \n, or you need to insert a real newline.
A POSIX standard shell does not interpret C escapes like \n, although the standard utility printf does. However, bash does provide a way to do it: in the quotation form $'...', C-style backslash escapes will be interpreted. Otherwise, $'...' acts just like '...', so parameter and command substitutions do not take place.
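A minimal demonstration of that quoting form:
msg=$'first line\nsecond line'
printf '%s\n' "$msg"
# first line
# second line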
However, any shell -- including bash -- allows newlines to appear inside quotes, and the newline is just passed through as-is. So you could write:
curl -s \
-F "token=$TOKEN" \
-F "user=$USER" \
-F "message=Root Shell Access on $HOST
$(date)
$(who)
" \
https://api.pushover.net/1/messages.json > /dev/null
(Note: I inserted parameter expansions where it seemed like they were missing from the original curl command and changed the deprecated backtick command substitutions to the recommended $(...) form.)
The only problem with including literal newlines, as above, is that it messes up indentation, if you care about appearances. So you might prefer bash's $'...' form:
curl -s \
-F "token=$TOKEN" \
-F "user=$USER" \
-F "message=Root Shell Access on $HOST"$'\n'"$(date)"$'\n'"$(who)" \
https://api.pushover.net/1/messages.json > /dev/null
That's also a little hard to read, but it is completely legal. The shell allows a single argument ("word") to be composed of any number of quoted or unquoted segments, as long as there is no whitespace between the segments. But you can avoid the multiple quote syntax by predefining a variable, which some people find more readable:
NL=$'\n'
curl -s \
-F "token=$TOKEN" \
-F "user=$USER" \
-F "message=Root Shell Access on $HOST$NL$(date)$NL$(who)" \
https://api.pushover.net/1/messages.json > /dev/null
Finally, you could use the standard utility printf, if you are more used to that style:
curl -s \
-F "token=$TOKEN" \
-F "user=$USER" \
-F "$(printf "message=Root Shell Access on %s\n%s\n%s\n" \
"$HOST" "$(date)" "$(who)")" \
https://api.pushover.net/1/messages.json > /dev/null

How to get the file size on Unix in a Makefile?

I would like to implement this as a Makefile task:
# step 1:
curl -u username:password -X POST \
-d '{"name": "new_file.jpg","size": 114034,"description": "Latest release","content_type": "text/plain"}' \
https://api.github.com/repos/:user/:repo/downloads
# step 2:
curl -u username:password \
-F "key=downloads/octocat/Hello-World/new_file.jpg" \
-F "acl=public-read" \
-F "success_action_status=201" \
-F "Filename=new_file.jpg" \
-F "AWSAccessKeyId=1ABCDEF..." \
-F "Policy=ewogIC..." \
-F "Signature=mwnF..." \
-F "Content-Type=image/jpeg" \
-F "file=#new_file.jpg" \
https://github.s3.amazonaws.com/
In the first part however, I need to get the file size (and content type if it's easy, not required though), so some variable:
{"name": "new_file.jpg","size": $(FILE_SIZE),"description": "Latest release","content_type": "text/plain"}
I tried this but it doesn't work (Mac 10.6.7):
$(shell du path/to/file.js | awk '{print $1}')
Any ideas how to accomplish this?
If you have GNU coreutils:
FILE_SIZE=$(stat -L -c %s $filename)
The -L tells it to follow symlinks; without it, if $filename is a symlink it will give you the size of the symlink rather than the size of the target file.
The macOS stat equivalent appears to be:
FILE_SIZE=$(stat -L -f %z $filename)
but I haven't been able to try it. (I've written this as a shell command, not a make command.) You may also find the -s option useful:
Display information in "shell output", suitable for initializing variables.
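Wired into the Makefile from the question, that might look roughly like this (a GNU make sketch; FILE is a hypothetical variable, the recipe line must be tab-indented, and on macOS you would swap in the stat -L -f %z form):
FILE := new_file.jpg
FILE_SIZE = $(shell stat -L -c %s $(FILE))

release:
	curl -u username:password -X POST \
	  -d '{"name": "$(FILE)","size": $(FILE_SIZE),"description": "Latest release","content_type": "text/plain"}' \
	  https://api.github.com/repos/:user/:repo/downloads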
For reference, an alternative method is du with -b (apparent size in bytes) and -s (summary only), then cut to keep only the first field of the output:
FILE_SIZE=$(du -sb $filename | cut -f1)
This should return the same result in bytes as @Keith Thompson's answer, but will also work for full directory sizes.
Extra: I usually use a macro for this.
define sizeof
$$(du -sb \
$(1) \
| cut -f1 )
endef
Which can then be called like:
$(call sizeof,$(filename_or_dirname))
I think this is a case where parsing the output of ls is legitimate:
% FILE_SIZE=`ls -l $filename | awk '{print $5}'`
(no it's not: use stat, as noted by Keith Thompson)
For the type, you can use
% FILE_TYPE=`file --mime-type --brief $filename`

Wget page title

Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5
output:
If it’s broke, fix it right - Keeping it Real Estate. Home
This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there are still ways to break it, including when there is no head/title in the page.
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true XML/HTML parser is the right solution (see the parser-based sketch at the end of this answer), but as your question is tagged shell, the above is the best I can come up with.
The paste and the two sed commands can be merged into a single sed, though it is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
Update:
As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As the above still does not work on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before the $ to avoid variable expansion.)
It seems that the :next label does not like to be prefixed by a $, which could be a problem in some sed versions.
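And, as promised, a parser-based version might look like this (a sketch: xmlstarlet's fo --html --recover coerces the page into well-formed XML, then sel pulls out the title; assumes xmlstarlet is installed):
wget --quiet -O - http://bit.ly/rQyhG5 \
| xmlstarlet fo --html --recover 2>/dev/null \
| xmlstarlet sel -T -t -v '//title' -n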
The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'
