I have a plain text file formatted like this:
10 https://google.com
22 https://facebook.com
I'd like to parse this file in a bash script and, for each line, make as many wget requests to the URL as the number before it specifies.
I know I can simply read in the file with:
URLS=$(cat ./urls)
But how do I split on the number and space and newlines and run the wget command for each line?
Use read to read each part into a variable, and while to loop through the lines.
while read prio url
do
...
done < ./urls
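To make the loop body concrete, here is a minimal sketch of one way to fill it in. The function name fetch_each and the dry-run argument are hypothetical, introduced only for illustration; the file layout ("COUNT URL" per line) is the one from the question.

```shell
#!/usr/bin/env bash
# fetch_each FILE [COMMAND...]   (hypothetical helper, not from the original answer)
# For every "COUNT URL" line in FILE, run COMMAND URL exactly COUNT times.
# COMMAND defaults to "wget -q"; pass "echo" for a dry run.
fetch_each() {
    local file=$1 count url i
    shift
    (( $# > 0 )) || set -- wget -q
    # "|| [[ -n $count ]]" also handles a last line with no trailing newline.
    while read -r count url || [[ -n $count ]]; do
        # Skip blank lines and lines that do not start with a number.
        [[ $count =~ ^[0-9]+$ && -n $url ]] || continue
        # 10# forces base 10, so a count like 08 is not read as octal.
        for (( i = 0; i < 10#$count; i++ )); do
            # </dev/null keeps the command from consuming the loop's input.
            "$@" "$url" < /dev/null
        done
    done < "$file"
}

# Example: fetch_each ./urls          (real requests)
#          fetch_each ./urls echo     (dry run, prints each URL instead)
```

Running it with echo first shows exactly which requests would be made before any real wget call goes out.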
I am currently trying to create a bash script that will change a text document of one line into multiple lines.
Example:
TextFile: Header~someHeaderInfo|Object~someObjectInfo|SubObject~someSubObjectInfo|Object~someObjectInfo|SubObject~someSubObjectInfo|...|Tail~someInfo
Again, the above is only a single line.
This should be called through a bash script and be converted into:
Header~someHeaderInfo
Object~someObjectInfo|SubObject~someSubObjectInfo
Object~someObjectInfo|SubObject~someSubObjectInfo
...
Tail~someInfo
In the real use case, each Object has upwards of 20 subObjects, each of which may have more subObjects themselves.
How can I go about this separation?
If textfile contains:
Header~someHeaderInfo|Object~someObjectInfo|SubObject~someSubObjectInfo|Object~someObjectInfo|SubObject~someSubObjectInfo|...|Tail~someInfo
The following bash command:
sed "s/|/\n/g" textfile
Will produce the following output:
Header~someHeaderInfo
Object~someObjectInfo
SubObject~someSubObjectInfo
Object~someObjectInfo
SubObject~someSubObjectInfo
...
Tail~someInfo
But the OP wants the SubObject on the same line (see comments), so I suggest:
sed "s/|\([^S]\)/\n\1/g" textfile
That will produce the following output:
Header~someHeaderInfo
Object~someObjectInfo|SubObject~someSubObjectInfo
Object~someObjectInfo|SubObject~someSubObjectInfo
...
Tail~someInfo
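The [^S] pattern relies on SubObject being the only field name that starts with S. A slightly different sketch, assuming that new lines should start only at the Object~ and Tail~ fields seen in the sample and that GNU sed is available (for \n in the replacement), names those fields explicitly. The function name split_records is hypothetical.

```shell
#!/usr/bin/env bash
# split_records [FILE...]   (hypothetical helper)
# Break the line before every "|Object~" or "|Tail~" field and leave all
# other "|" separators, such as the one before SubObject~, untouched.
# Assumes GNU sed, which turns \n in the replacement into a newline.
split_records() {
    sed -E 's/\|(Object|Tail)~/\n\1~/g' "$@"
}

# Example: split_records textfile
```

If other top-level field names exist in the real data, they would need to be added to the alternation.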
I have a Python script that pulls URLs from pastebin.com/archive, which links to pastes (their URLs have 8 random characters after pastebin.com). My current output is a .txt file with the data below in it. I only want to keep the links to pastes (example: http://pastebin.com///Y5JhyKQT), not links to other pages such as pastebin.com/tools. This is so I can set wget to pull each individual paste.
The only way I can think of doing this is writing a bash script to count the number of characters in each line and only keep lines with 30 characters exactly (this is the length of the URLs linking to pastes).
I have no idea how I'd go about implementing something like this using grep or awk, perhaps using a while do loop? Any help would be appreciated!
http://pastebin.com///tools
http://pastebin.com//top.location.href
http://pastebin.com///trends
http://pastebin.com///Y5JhyKQT <<< I want to keep this
http://pastebin.com//=
http://pastebin.com///>
From the sample you posted it looks like all you need is:
grep -E '/[[:alnum:]]{8}$' file
or maybe:
grep -E '^.{30}$' file
If that doesn't work for you, explain why and provide a better sample.
The algorithm is:
Read the input one line at a time (or split it on newline characters).
Get the length of each line, for example by storing the line in a variable and taking its character count.
Process only the lines whose length is exactly the one you want.
Python has built-in functions for both reading lines and counting the characters in a string.
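#!/usr/bin/env zsh
# Reads lines from standard input and keeps those exactly 30 characters long.
while IFS= read -r aline
do
    if [[ ${#aline} == 30 ]]; then
        # do something with the matching line; here it is simply printed
        printf '%s\n' "$aline"
    fi
done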
The ${#aline} length expansion is documented in the bash man page under the "Parameter Expansion" section.
EDIT: this script is written for zsh.
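For comparison, the same length check can be sketched in bash. The function name keep_exact_length is hypothetical; the length 30 comes from the question.

```shell
#!/usr/bin/env bash
# keep_exact_length N   (hypothetical helper)
# Copy to standard output only those input lines that are exactly N characters long.
keep_exact_length() {
    local want=$1 line
    # "|| [[ -n $line ]]" also handles a last line with no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        if (( ${#line} == want )); then
            printf '%s\n' "$line"
        fi
    done
}

# Example: keep_exact_length 30 < urls.txt > pastes.txt
```

Note that a line carrying a trailing annotation or a carriage return is longer than the URL itself and would be dropped by this check.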
I'm writing a script that takes a list of ~300 URLs as input, which have the following format:
http://long.domain.prefix/folder/subfolder/filename.html
Of that URL, I'd like to save filename.html in ./folder/subfolder/. If that folder structure doesn't exist, it must be created. This part works: the folders are being written to disk. However, no files are downloaded.
My script looks like this:
#!/bin/bash
for line in `cat list.txt`; do
# strips the URL prefix and trailing slash
name=${line#http://long.domain.prefix\/}
/usr/bin/curl -m 10 -f -o $name --create-dirs $fullname
done;
For some reason, the $name variable is cut off after exactly 74 characters, which obviously results in HTTP error codes. I can't give out the exact URLs, but rest assured they are correct, as long as the full URL is being used.
How can I prevent this odd cutting-off behavior?
Thanks to Etan Reisner, the solution was to convert the file to Unix-style line endings.
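If converting the file is not convenient, a trailing carriage return can also be dropped inside the script. This is a minimal sketch, not the fix from the thread; the function name relative_paths is hypothetical and the prefix is the placeholder from the question.

```shell
#!/usr/bin/env bash
# relative_paths FILE   (hypothetical helper)
# Print the path part of each URL in FILE, tolerating Windows (CRLF) line endings.
relative_paths() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}    # drop one trailing carriage return, if present
        if [[ -n $line ]]; then
            # strip the URL prefix, as in the question's script
            printf '%s\n' "${line#http://long.domain.prefix/}"
        fi
    done < "$1"
}

# Example: relative_paths list.txt
# Converting the whole file once also works: tr -d '\r' < list.txt > list.unix.txt
```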
I am reading a file (with URLs) line by line:
#!/bin/bash
while read line
do
url=$line
wget $url
wget $url_{001..005}.jpg
done < $1
First, I want to download the primary URL, as you can see with wget $url. After that, I want to append a sequence of numbers to the URL (_001.jpg, _002.jpg, _003.jpg, _004.jpg, _005.jpg):
wget $url_{001..005}.jpg
...but for some reason it's not working.
Sorry, I missed one thing: the URLs look like http://xy.com/052914.jpg. Is there an easy way to add _001 before the extension, giving http://xy.com/052914_001.jpg? Or do I have to remove ".jpg" from the file containing the URLs and then add it back to the variable later?
Another way is to escape the underscore character:
wget $url\_{001..005}.jpg
Try encapsulating your variable name:
wget ${url}_{001..005}.jpg
Bash is trying to expand the variable $url_ in your command.
As for your jpg within the URL followup, see substring expansion in the bash manual.
wget ${url:0: -4}_{001..005}.jpg
The :0: -4 means: expand to the substring that starts at position zero (the first character) and stops four characters before the end. A negative length like this needs bash 4.2 or later.
Or from this answer:
wget ${url%.jpg}_{001..005}.jpg
%.jpg removes .jpg specifically and will work on older versions of bash.
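To check what wget would receive, the expansion can be previewed with printf. This is a small sketch; the function name numbered_urls is hypothetical, and the zero-padded {001..005} range assumes bash 4 or later.

```shell
#!/usr/bin/env bash
# numbered_urls URL   (hypothetical helper)
# Print the original URL followed by the _001 ... _005 variants,
# with the suffix inserted before the .jpg extension.
numbered_urls() {
    local url=$1
    local base=${url%.jpg}    # strip a trailing .jpg, if present
    # Brace expansion runs first, then ${base} is expanded in each word.
    printf '%s\n' "$url" "${base}"_{001..005}.jpg
}

# Example: wget $(numbered_urls "http://xy.com/052914.jpg")
```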
I have a bash script that runs and writes its output to a text file. However, the colour codes it uses are also included in the file. What I'd like to know is how to remove them from the file, i.e.
^[[38;1;32mHello^[[39m
^[[38;1;31mUser^[[39m
So I just want to be left with Hello and User. Something like sed -r "special characters", reading from file A and saving to file B.
sed 's/\^\[\[[^m]*m//g'
This removes every part of a line that starts with ^[[ and runs up to the first m.
Something like this:
awk '{sub(/\^\[\[38;1;[0-9][0-9]m/,x);sub(/\^\[\[39m/,x)}1'
Hello
User
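Both commands above match a literal caret followed by two brackets. Many viewers display the escape byte (ESC, octal 033) as ^[, so the file may actually contain real escape bytes instead. A minimal sketch for that case, assuming only colour/style sequences of the form ESC [ ... m need removing; the function name strip_ansi is hypothetical.

```shell
#!/usr/bin/env bash
# strip_ansi [FILE...]   (hypothetical helper)
# Remove colour/style sequences of the form ESC [ digits-and-semicolons m.
strip_ansi() {
    local esc=$'\033'    # the real escape byte, often displayed as ^[
    sed "s/${esc}\[[0-9;]*m//g" "$@"
}

# Example: strip_ansi fileA > fileB
```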