I'm writing a script that takes a list of ~300 URLs as input, which have the following format:
http://long.domain.prefix/folder/subfolder/filename.html
Of that URL, I'd like to save filename.html in ./folder/subfolder/ - if that folder structure doesn't exist, it must be created. This partly works: the folders are being written to disk, but no files are downloaded.
My script looks like this:
#!/bin/bash
for line in `cat list.txt`; do
# strips the URL prefix and trailing slash
name=${line#http://long.domain.prefix\/}
/usr/bin/curl -m 10 -f -o $name --create-dirs $line
done;
For some reason, the $name variable is cut off after exactly 74 characters, which obviously results in HTTP error codes. I can't give out the exact URLs, but rest assured they are correct, as long as the full URL is being used.
How can I prevent this odd cutting-off behavior?
Thanks to Etan Reisner, the solution was to convert the file to Unix-style line endings.
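For reference, a minimal corrected sketch of the loop (assuming the same list.txt and URL prefix) that also strips the stray carriage return from each line, so it works even before the file is converted:
#!/bin/bash
while IFS= read -r line; do
    line=${line%$'\r'}   # drop the trailing CR that DOS line endings leave behind
    # strip the URL prefix, keeping folder/subfolder/filename.html
    name=${line#http://long.domain.prefix/}
    /usr/bin/curl -m 10 -f -o "$name" --create-dirs "$line"
done < list.txt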
I have the variables below in a large and important sh file, and I need to remove some data from a variable and keep only part of the text.
The variable repoTest holds a link to an internal git repo, and I need the variable nameAppTest to receive only the name after the last "/", without the .git suffix.
Example:
I get: repoTest="ssh://git@code.br.repo.local/code/ecsb/name-repo.git"
I tried to do a split:
nameAppTest=$(echo "$repoTest"|cut -d'/' -f5|sed -e 's/.git//g')
The response I get from echo "$nameAppTest" is: ecsb
What I expect to receive: name-repo
Here's a nifty trick:
nameAppTest=$(basename "$repoTest" .git)
This uses basename to get just the last component of the URL and strip the .git extension, all in one step.
You can also use shell parameter expansion to do it in two steps, without any external programs:
# Remove everything up to and including the last /
temp="${repoTest##*/}"
# Remove the trailing .git
nameAppTest="${temp%.git}"
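For example, with the sample URL from the question, both approaches produce the expected name:
repoTest="ssh://git@code.br.repo.local/code/ecsb/name-repo.git"
# one step with basename
nameAppTest=$(basename "$repoTest" .git)
echo "$nameAppTest"   # name-repo
# two steps with parameter expansion
temp="${repoTest##*/}"
nameAppTest="${temp%.git}"
echo "$nameAppTest"   # name-repo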
I have some markdown files to process which contain links to images that I wish to download. e.g. a markdown file:
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
a lot of text
some more text...
[![](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif)
some more text
another URL but not image
[https://github.com]
so on
I am trying to parse through this file and extract the list of image URLs, which I can later pass on to the wget command to download.
So far I have used grep and sed and got these results:
$ sed -nE "/https?:\/\/[^ ]+.(jpg|png|gif)/p" $path
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
[![](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif)
$ grep -Eo "https?://[^ ]+.(jpg|png|gif)" $path
https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png
https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif
The regex is essentially working fine, but because the same URL is present twice on the same line, the text selected spans from the first occurrence of https to the last occurrence of jpg|png|gif. What I want is the first occurrence of https and the first occurrence of jpg|png|gif.
How can I fix this?
P.S. I have also tried lynx -dump -image_links -listonly $path but this prints the entire file.
I am also open to other approaches, as long as I can hook the code into my current shell script.
You may add square brackets into the negated bracket expression:
grep -Eo "https?://[^][ ]+\.(jpg|png|gif)"
Details:
https?:// - http:// or https://
[^][ ]+ - one or more chars other than ], [ and space
\. - a dot
(jpg|png|gif) - either of the three alternative substrings.
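To hook this into a shell script, as asked, one possible sketch (assuming $path is set as in the question) de-duplicates the matches, since each URL appears twice per line, and feeds them to wget:
grep -Eo "https?://[^][ ]+\.(jpg|png|gif)" "$path" | sort -u |
while IFS= read -r img; do
    wget "$img"   # download each unique image URL
done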
I have a plain text file, that is formatted like:
10 https://google.com
22 https://facebook.com
I'd like to parse this file in a bash script, and for each line, use the number before the URL to make that many wget requests to the URL.
I know I can simply read in the file with:
URLS=$(cat ./urls)
But how do I split on the number, the space, and the newlines, and run the wget command for each line?
Use read to read each part into a variable, and while to loop through the lines.
while read prio url
do
...
done < ./urls
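For example, a complete sketch (assuming the file is named urls, as in the question), where the loop body repeats the wget call according to the number in the first column:
while read -r prio url; do
    for ((i = 0; i < prio; i++)); do
        wget "$url"   # fetch the URL as many times as the first column says
    done
done < ./urls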
I am reading a file (with URLs) line by line:
#!/bin/bash
while read line
do
url=$line
wget $url
wget $url_{001..005}.jpg
done < $1
First, I want to download the primary URL, as you see with wget $url. After that I want to append a sequence of numbers to the URL (_001.jpg, _002.jpg, _003.jpg, _004.jpg, _005.jpg):
wget $url_{001..005}.jpg
...but for some reason it's not working.
Sorry, I missed one thing: the URLs are like http://xy.com/052914.jpg. Is there any easy way to add _001 before the extension, giving http://xy.com/052914_001.jpg? Or do I have to remove ".jpg" from the file containing the URLs and then append it to the variable later?
Another way is to escape the underscore character:
wget $url\_{001..005}.jpg
Try enclosing your variable name in braces:
wget ${url}_{001..005}.jpg
Bash is trying to expand the variable $url_ in your command.
As for your jpg within the URL followup, see substring expansion in the bash manual.
wget ${url:0: -4}_{001..005}.jpg
The :0: -4 means: expand the variable from position zero (the first character), up to but excluding the last 4 characters. (Negative lengths in substring expansion require bash 4.2 or newer.)
Or from this answer:
wget ${url%.jpg}_{001..005}.jpg
%.jpg removes .jpg specifically and will work on older versions of bash.
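Putting it together with the loop from the question, a sketch of the whole script using the %.jpg variant (which also covers the follow-up about inserting _001 before the extension):
#!/bin/bash
while IFS= read -r url; do
    wget "$url"                         # the primary URL, e.g. http://xy.com/052914.jpg
    wget "${url%.jpg}"_{001..005}.jpg   # then 052914_001.jpg through 052914_005.jpg
done < "$1"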
I need to read certain data using curl. I'm basically reading keywords from a file:
while read line
do
curl 'https://gdata.youtube.com/feeds/api/users/'"${line}"'/subscriptions?v=2&alt=json' \
> '/home/user/archive/'"$line"
done < textfile.txt
Anyway, I haven't found a way to form the URL for curl so that it works. I've tried just about every possible single- and double-quoted version. I've tried basically:
'...'"$line"'...'
"..."${line}"..."
'...'$line'...'
and so on. Just name it, and I'm pretty sure I've tried it.
When I print out the URL, in the best case it is formed as:
/subscriptions?v=2&alt=jsoneeds/api/users/KEYWORD FROM FILE
or something similar. If you know what could be the cause of this, I would appreciate the information. Thanks!
It's not a quoting issue. The problem is that your keyword file is in DOS format -- that is, each line ends with carriage return & linefeed (\r\n) rather than just linefeed (\n). The carriage return is getting read into the line variable, and included in the URL. The giveaway is that when you echo it, it appears to print:
/subscriptions?v=2&alt=jsoneeds/api/users/KEYWORD FROM FILE
but it's really printing:
https://gdata.youtube.com/feeds/api/users/KEYWORD FROM FILE
/subscriptions?v=2&alt=json
...with just a carriage return between them, so the second overwrites the first.
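If you want to confirm that diagnosis first, a quick check (and an outright fix, if the dos2unix utility is available) would be:
cat -v textfile.txt | head -n 3   # DOS line endings show up as ^M at the end of each line
dos2unix textfile.txt             # or simply convert the file in place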
So what can you do about it? Here's a fairly easy way to trim the CR at the end of each line:
cr=$'\r'
while read -r line
do
line="${line%$cr}"
curl "https://gdata.youtube.com/feeds/api/users/${line}/subscriptions?v=2&alt=json" \
> "/home/user/archive/$line"
done < textfile.txt
Your current version should work, I think. More elegant is to use a single pair of double quotes around the whole URL with the variable in ${}:
"https://gdata.youtube.com/feeds/api/users/${line}/subscriptions?v=2&alt=json"
Just use it like this; it should be sufficient:
curl "https://gdata.youtube.com/feeds/api/users/${line}/subscriptions?v=2&alt=json" > "/home/user/archive/${line}"
If your shell gives you issues with the &, just escape it as \&, but it works fine for me without it.
If the data from the file can contain spaces and you have no objection to spaces in the file name in the /home/user/archive directory, then what you've got should be OK.
Given the contents of the rest of the URL, you could even just write:
while read line
do
curl "https://gdata.youtube.com/feeds/api/users/${line}/subscriptions?v=2&alt=json" \
> "/home/user/archive/${line}"
done < textfile.txt
where strictly the ${line} could be just $line in both places. This works because the strings are fixed and don't contain shell metacharacters.
Since your code is close to this, but you claim you're seeing the keywords from the file in the wrong place, maybe a little rewriting for ease of debugging is in order:
while read line
do
url="https://gdata.youtube.com/feeds/api/users/${line}/subscriptions?v=2&alt=json"
file="/home/user/archive/${line}"
curl "$url" > "$file"
done < textfile.txt
Since the strings may end up containing spaces, it seems (do you need to encode spaces as + in the URL?), the quotes around the variables are strongly recommended. You can now run the script with sh -x (or add a set -x line to the script) and see what the shell thinks it is doing as it does it.