Using wget in shell trouble with variable that has \ - bash

I'm trying to run a script for pulling finance history from yahoo. Boris's answer from this thread
wget can't download yahoo finance data any more
works for me ~2 out of 3 times, but fails if the crumb returned from the cookie has a "\" character in it.
Code that sometimes works looks like this
#!usr/bin/sh
symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')
wget --no-check-certificate --save-cookies=cookie.txt https://finance.yahoo.com/quote/$symbol/?p=$symbol -O C:/trip/stocks/stocknamelist/crumb.store
crumb=$(grep 'root.*App' crumb.store | sed 's/,/\n/g' | grep CrumbStore | sed 's/"CrumbStore":{"crumb":"\(.*\)"}/\1/')
echo $crumb
fileloc=$"https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo $fileloc
wget --no-check-certificate --load-cookies=cookie.txt $fileloc -O c:/trip/stocks/temphistory/hs$symbol.csv
rm cookie.txt crumb.store
But that doesn't seem to process in wget the way I intend either, as it seems to be interpreting as described here:
https://askubuntu.com/questions/758080/getting-scheme-missing-error-with-wget
Any suggestions on how to pass the $crumb variable into wget so that wget doesn't error out if $crumb has a "\" character in it?
Edited to show the full script. To clarify I've got cygwin installed with wget package. I call the script from cmd prompt as (example where the script above is named "stocknamedownload.sh, the stock symbol I'm downloading is "A" from the startdate 19800101)
c:\trip\stocks\StockNameList>bash stocknamedownload.sh A 19800101
This script seems to work fine - unless the crumb returned contains a "\" character in it.

The following implementation appears to work 100% of the time -- I'm unable to reproduce the claimed sporadic failures:
#!/usr/bin/env bash
set -o pipefail
symbol=$1
today=$(date +%Y%m%d)
tomorrow=$(date --date='1 days' +%Y%m%d)
first_date=$(date -d "$2" '+%s')
last_date=$(date -d "$today" '+%s')
# store complete webpage text in a variable
page_text=$(curl --fail --cookie-jar cookies \
"https://finance.yahoo.com/quote/$symbol/?p=$symbol") || exit
# extract the JSON used by JavaScript in the page
app_json=$(grep -e 'root.App.main = ' <<<"$page_text" \
| sed -e 's#^root.App.main = ##' \
-e 's#[;]$##') || exit
# use jq to extract the crumb from that JSON
crumb=$(jq -r \
'.context.dispatcher.stores.CrumbStore.crumb' \
<<<"$app_json" | tr -d '\r') || exit
# Perform our actual download
fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
curl --fail --cookie cookies "$fileloc" >"hs$symbol.csv"
Note that the tr -d '\r' is only necessary when using a native-Windows jq mixed with an otherwise native-Cygwin set of tools.

You are adding quotes to the value of the variable instead of quoting the expansion. You are also trying to use tools that don't know what JSON is to process JSON; use jq.
wget --no-check-certificate \
--save-cookies=cookie.txt \
"https://finance.yahoo.com/quote/$symbol/?p=$symbol" \
-O C:/trip/stocks/stocknamelist/crumb.store
# Something like thist; it's hard to reverse engineer the structure
# of crumb.store from your pipeline.
crumb=$(jq 'CrumbStore.crumb' crumb.store)
echo "$crumb"
fileloc="https://query1.finance.yahoo.com/v7/finance/download/$symbol?period1=$first_date&period2=$last_date&interval=1d&events=history&crumb=$crumb"
echo "$fileloc"
wget --no-check-certificate \
--load-cookies=cookie.txt "$fileloc" \
-O c:/trip/stocks/temphistory/hs$symbol.csv

Related

POST using Bash in Script

I am currently writing a script which gets a cookie and unique URL from a website, then uses that cookie and URL to post login/password information.
I am having trouble in getting this to work, some of the runs I have done gives me the message "switching from POST to GET" and I can't understand why.
Can someone please help me?
for i in $(cat accounts.txt); do
rm out.out
curl -L -b cookies.txt -c cookies.txt https://website.com | grep -Eo "(http|https)://cookies.website.com/idp[a-zA-Z0-9./?=_-]*" >> out.out
url=`head -1 out.out`
content="$(curl -X POST -L -v -b cookies.txt -d "$i" "$url" )"
echo "$url" "$i" "$content"
done

Bash: Parse Urls from file, process them and then remove them from the file

I am trying to automate a procedure where the system will fetch the contents of a file (1 Url per line), use wget to grab the files from the site (https folder) and then remove the line from the file.
I have made several tries but the sed part (at the end) cannot understand the string (I tried escaping characters) and remove it from that file!
cat File
https://something.net/xxx/data/Folder1/
https://something.net/xxx/data/Folder2/
https://something.net/xxx/data/Folder3/
My line of code is:
cat File | xargs -n1 -I # bash -c 'wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "#" -P /mnt/USB/ && sed -e 's|#||g' File'
It works up until the sed -e 's|#||g' File part..
Thanks in advance!
Dont use cat if it's posible. It's bad practice and can be problem with big files... You can change
cat File | xargs -n1 -I # bash -c
to
for siteUrl in $( < "File" ); do
It's be more correct and be simpler to use sed with double quotes... My variant:
scriptDir=$( dirname -- "$0" )
for siteUrl in $( < "$scriptDir/File.txt" )
do
if [[ -z "$siteUrl" ]]; then break; fi # break line if him empty
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf --no-parent --restrict-file-names=nocontrol --user=test --password=pass --no-check-certificate "$siteUrl" -P /mnt/USB/ && sed -i "s|$siteUrl||g" "$scriptDir/File.txt"
done
#beliy answers looks good!
If you want a one-liner, you can do:
while read -r line; do \
wget -r -nd -l 1 -c -A rar,zip,7z,txt,jpg,iso,sfv,md5,pdf \
--no-parent --restrict-file-names=nocontrol --user=test \
--password=pass --no-check-certificate "$line" -P /mnt/USB/ \
&& sed -i -e '\|'"$line"'|d' "File.txt"; \
done < File.txt
EDIT:
You need to add a \ in front of the first pipe
I believe you just need to use double quotes after sed -e. Instead of:
'...&& sed -e 's|#||g' File'
you would need
'...&& sed -e '"'s|#||g'"' File'
I see what you trying to do, but I dont understand the sed command including pipes. Maybe some fancy format that I dont understand.
Anyway, I think the sed command should look like this...
sed -e 's/#//g'
This command will remove all # from the stream.
I hope this helps!

If proxy is down, get a new one

I'm writing my first bash script
LANG="en_US.UTF8" ; export LANG
PROXY=$(shuf -n 1 proxy.txt)
export https_proxy=$PROXY
RUID=$(php -f randuid.php)
curl --data "mydata${RUID}" --user-agent "myuseragent" https://myurl.com/url -o "ticket.txt"
This script also use curl, but if proxy is down it gives me this error:
failed to connect PROXY:PORT
How can I make bash script run again, so it can get another proxy address from proxy.txt
Thanks in advance
Run it in a loop until the curl succeeds, for example:
export LANG="en_US.UTF8"
while true; do
PROXY=$(shuf -n 1 proxy.txt)
export https_proxy=$PROXY
RUID=$(php -f randuid.php)
curl --data "mydata${RUID}" --user-agent "myuseragent" https://myurl.com/url -o "ticket.txt" && break
done
Notice the && break at the end of the curl command.
That is, if the curl succeeds, break out of the infinite loop.
If you have multiple curl commands and you need all of them to succeed,
then chain them all together with &&, and add the break after the last one:
curl url1 && \
curl url2 && \
break
Lastly, as #Inian pointed out,
you could use the --proxy flag to pass a proxy URL to curl without the extra step of setting https_proxy, for example:
curl --proxy "$(shuf -n 1 proxy.txt)" --data "mydata${RUID}" --user-agent "myuseragent"
Lastly, note that due to the randomness, a randomly selected proxy may come up more than once until you find one that works.
Avoid that, you could read iterate over the shuffled proxies instead of an infinite loop:
export LANG="en_US.UTF8"
shuf proxy.txt | while read -r proxy; do
ruid=$(php -f randuid.php)
curl --proxy "$proxy" --data "mydata${ruid}" --user-agent "myuseragent" https://myurl.com/url -o "ticket.txt" && break
done
I also lowercased your user-defined variables,
as capitalization is not recommended for those.
I know i accepted #janos answer but since I can't edit his I'm going to add this
response=$(curl --proxy "$proxy" --silent --write-out "\n%{http_code}\n" https://myurl.com/url)
status_code=$(echo "$response" | sed -n '$p')
html=$(echo "$response" | sed '$d')
case "$status_code" in
200) echo 'Working!'
;;
*)
echo 'Not working, trying again!';
exec "$0" "$#"
esac
This will run my script again if it gives 503 status code which i wanted :)
And with #janos code it will run again if proxy is not working.
Thank you everyone i achieved what i wanted.

Bash script — wget doesn't use a variable with "{}" correctly

I'm using a script that downloads images. I am having issues while running the wget command with a variable that contains "{" and "}" characters. It transforms into "%7B" and "%7D".
Here is part of the script
wget -nd -H -p -A jpg,jpeg,png,gif -e robots=off "$url"
The "$url" contains something like "http://webpage.com/di1/dir2/{1..26}.jpg".
I just did a Loop
for i in {1..50}
do
wget "$url$i.jpg"
done

BASH WHILE loop not hitting CASE switch

Hey guys title pretty much says it, but I'm echoing two variables into a BASH loop to kick it off, and (supposedly) using a case to be able to identify where they are and run a similar but separate (missing -k flag on second go around) wget statement. I hit my git checkout but it doesn't seem like I'm entering my cases. How do I fix this, or is there a better way to do it since I'm just dropping a -k flag?
#!/bin/bash
echo -e "render\nstorage" | while read x; do
git checkout "$x"
case $x in
$1)
wget "${WGDOMAIN}" -r -l INF -k -p \
--no-check-certificate \
--strict-comments \
--warc-header="Operator: Web Archiver" \
--warc-file="$WGDOMAIN" \
--warc-dedup="${WGDOMAIN}.cdx" \
--warc-cdx=on 2> session.log
;;
$2)
wget "${WGDOMAIN}" -r -l INF -p \
--no-check-certificate \
--strict-comments \
--warc-header="Operator: Web Archiver" \
--warc-file="$WGDOMAIN" \
--warc-dedup="${WGDOMAIN}.cdx" \
--warc-cdx=on 2> session.log
;;
$1|$2)
git add . && git ci -m"Archived: ${DATE}"
git push origin "$x"
;;
esac
done
Called with no positional parameters your script will not do anything inside the case statement as $1 and $2 are empty. Besides that, the last case option $1|$2 will never be reached as the prior ones will match. May be you should get that commands out of the case.

Resources