bash pipe limit - bash

I have a text file with a list of URLs I want to download:
n=1
end=`cat done1 | wc -l`
while [ $n -lt $end ]
do
    nextUrls=`sed -n "${n}p" < done1`
    wget -N -nH --random-wait -t 3 -a download.log -A$1 $nextUrls
    let "n++"
done
I want to do it faster with pipes, but if I do this
wget -N -nH --random-wait -t 3 -a download.log -A$1 $nextUrls &
my RAM fills up and my PC locks up completely.
Anyone know how to limit the pipes created to, say, 10 at the same time?

You are not creating pipes (|), you are creating background processes (&). Every time your while loop executes its body, you create a new wget process without waiting for it to exit, which (depending on the value of end) may create a lot of wget processes very fast. Either do it sequentially (remove the &), or run n processes in parallel and wait for them, as in the sketch below.
BTW, useless use of cat: you can simply redirect the file into wc, which also keeps the filename out of the output:
end=`wc -l < done1`
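If you want to keep your loop but cap the number of simultaneous downloads, a minimal sketch (assuming done1 holds one URL per line and $1 is the accept pattern, as in your script) is to start a batch of 10 background wgets and wait for the whole batch before starting the next:
i=0
while IFS= read -r url; do
    wget -N -nH --random-wait -t 3 -a download.log -A"$1" "$url" &
    i=$((i + 1))
    if [ "$i" -ge 10 ]; then
        wait    # block until all 10 background wgets have exited
        i=0
    fi
done < done1
wait    # catch the final, possibly smaller, batch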

I have a text file with a list of URLs I want to download... I want to do it faster...
So here's the shortest way to do that. The following command downloads the URLs from the list contained in the file *txt_list_of_urls*, running 10 wget processes in parallel:
xargs -a txt_list_of_urls -P 10 -r -n 1 wget -nv
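If you want to keep the wget options from the question, the same xargs pattern would look something like this (a sketch, assuming done1 is the URL list and $1 is the accept pattern, as in the original script):
xargs -a done1 -P 10 -r -n 1 wget -N -nH --random-wait -t 3 -a download.log -A"$1"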

Related

Making a Bash script that can open multiple terminals and run wget in each

I have to download batches of over 100,000 docs from a databank using this script:
#!/usr/bin/bash
IFS=$'\n'
set -f
for line in $(cat < "$1")
do
    wget "https://www.uniprot.org/uniprot/${line}.txt"
done
The first time it took over a week to download all the files (all under 8 KB), so I tried opening multiple terminals and running a split of the total.txt (10 equal splits of 10,000 files in 10 terminals), and in just 14 hours I had all the documents downloaded. Is there a way to make a script do that for me?
This is a sample of what the list looks like:
D7E6X7
A0A1L9C3F2
A3K3R8
W0K0I7
gnome-terminal -e command
or
xterm -e command
or
konsole -e command
or
terminal -e command
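For example, a sketch that splits the list into 10 chunks and opens one terminal per chunk (the part_ chunk names come from split and are an assumption here; this also uses GNU split's -n l/10, and newer gnome-terminal versions prefer -- over -e):
split -n l/10 total.txt part_
for chunk in part_*; do
    gnome-terminal -- bash -c "while read -r id; do wget \"https://www.uniprot.org/uniprot/\${id}.txt\"; done < $chunk"
done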
There is another alternative to make it fast.
Right now your downloads are synchronous, i.e. the next download does not start until the current one has finished.
Search for how to make a command asynchronous / run it in the background on Unix.
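The simplest form of that idea is to append & to the wget line and wait at the end; note this sketch starts every download at once, which is why the xargs answer below caps the number of parallel processes:
while IFS= read -r line; do
    wget "https://www.uniprot.org/uniprot/${line}.txt" &    # run in the background
done < "$1"
wait    # block until every background download has finished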
When you were doing this by hand, opening multiple terminals made sense. If you want to script this, you can run multiple processes from one terminal/script. You could use xargs to start multiple processes simultaneously:
xargs -a list.txt -n 1 -P 8 -I # bash -c "wget https://www.uniprot.org/uniprot/#.txt"
Where:
-a list.txt tells xargs to use the list.txt file as input.
-n 1 tells xargs to use a maximum of one argument (from the input) for each command it runs.
-P 8 tells xargs to run 8 commands at a time; you can change this to suit your system/requirements.
-I # tells xargs to use "#" to represent the input (i.e. the line from your file).

Run jobs in sequence rather than consecutively using bash

So I work a lot with Gaussian 09 (the computational chemistry software) on a supercomputer.
To submit a job I use the following command line
g09sub input.com -n 2 -m 4gb -t 200:00:00
Where n is the number of processors used, m is the memory requested, and t is the time requested.
I was wondering if there was a way to write a script that will submit the first 10 .com files in the folder and then submit another .com file as each finishes.
I have a script that will submit all the .com files in a folder at once, but I have a limit to how many jobs I can queue on the supercomputer I use.
The current script looks like
#!/bin/bash
#SBATCH --partition=shared
for i in *.com
do
    g09sub $i -n 2 -m 4gb -t 200:00:00
done
So 1.com, 2.com, 3.com, etc would be submitted all at the same time.
What I want is to have 1.com, 2.com, 3.com, 4.com, 5.com, 6.com, 7.com, 8.com, 9.com, and 10.com all start at the same time and then as each of those finishes have another .com file start. So that no more than 10 jobs from any one folder will be running at the same time.
If it would be useful, each job creates a .log file when it is finished.
Though I am unsure if it is important, the supercomputer uses a PBS queuing system.
Try xargs or GNU parallel
xargs
ls *.com | xargs -P 10 -I {} g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
-I {} tells xargs that {} will represent the input file name
-P 10 sets the maximum number of jobs running at once
parallel
ls *.com | parallel -P 10 g09sub {} -n 2 -m 4gb -t 200:00:00 # GNU parallel supports -P too
ls *.com | parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00
Explanation:
{} represents the input file name
--jobs 10 sets the maximum number of jobs running at once
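One caveat on both variants: parsing ls output breaks on file names containing spaces. With GNU parallel you can pass the glob directly instead, for example:
parallel --jobs 10 g09sub {} -n 2 -m 4gb -t 200:00:00 ::: *.com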
Not sure about the availability on your supercomputer, but the GNU bash manual offers a parallel example under 3.2.6 GNU Parallel, at the bottom.
There are ways to run commands in parallel that are not built into Bash. GNU Parallel is a tool to do just that.
...
Finally, Parallel can be used to run a sequence of shell commands in parallel, similar to ‘cat file | bash’. It is not uncommon to take a list of filenames, create a series of shell commands to operate on them, and feed that list of commands to a shell. Parallel can speed this up. Assuming that file contains a list of shell commands, one per line,
parallel -j 10 < file
will evaluate the commands using the shell (since no explicit command is supplied as an argument), in blocks of ten shell jobs at a time.
Where that option was not available to me, using the jobs builtin worked, rather crudely. E.g.:
for entry in *.com; do
    while [ $(jobs | wc -l) -gt 9 ]; do
        sleep 1 # this is in seconds; your sleep may support 'arbitrary floating point number'
    done
    g09sub ${entry} -n 2 -m 4gb -t 200:00:00 &
done
$(jobs | wc -l) counts the number of jobs spawned in the background by ${cmd} &
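On bash 4.3 or newer you can avoid the sleep polling with wait -n, which blocks until any one background job exits; a rough sketch of the same idea:
for entry in *.com; do
    while [ "$(jobs -rp | wc -l)" -ge 10 ]; do
        wait -n    # block until any one background job exits
    done
    g09sub "$entry" -n 2 -m 4gb -t 200:00:00 &
done
wait    # wait for the remaining jobs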

BASH - transfer large files and process after transfer limiting the number of processes

I have several large files that I need to transfer to a local machine and process. The transfer takes about as long as the processing of the file, and I would like to start processing it immediately after it transfers. But the processing could take longer than the transfer, and I don't want the processes to keep building up, but I would like to limit it to some number, say 4.
Consider the following:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    scp user@host:$FILE ./
    myCommand $FILE &
done
This will transfer each file and start processing it after the transfer while allowing the next file to start transferring. However, if myCommand $FILE takes much longer than the time to transfer one file, these could keep piling up and bogging down the local machine. So I would like to limit myCommand to maybe 2-4 parallel instances. Subsequent attempts to invoke myCommand should be buffered until a "slot" is open. Is there a good way to do this in BASH (using xargs or other utilities is acceptable)?
UPDATE:
Thanks for the help in getting this far. Now I'm trying to implement the following logic:
LIST_OF_LARGE_FILES="file1 file2 file3 file4 ... fileN"
for FILE in $LIST_OF_LARGE_FILES; do
    echo "Starting on $FILE"    # should go to terminal output
    scp user@host:$FILE ./
    echo "Processing $FILE"     # should go to terminal output
    echo $FILE                  # should go through pipe to parallel
done | parallel myCommand
You can use GNU Parallel for that. Just echo the commands you want run into parallel and it will run one job per CPU core your machine has.
for f in ... ; do
    scp ...
    echo ./process "$f"
done | parallel
If you specifically want 4 processes at a time, use parallel -j 4.
If you want a progress bar, use parallel --bar.
Alternatively, echo just the filename with null-termination, and add the processing command into the invocation of parallel:
for f in ... ; do
    scp ...
    printf "%s\0" "$f"
done | parallel -0 -j4 ./process

Maintaining a set number of concurrent jobs w/ args from a file in bash

I found this script on the net. I don't know much about working in bash, and it looks weird to me, but...
Here's my script:
CONTOR=0
for i in `cat targets`
do
    CONTOR=`ps aux | grep -c php`
    while [ $CONTOR -ge 250 ]; do
        CONTOR=`ps aux | grep -c php`
        sleep 0.1
    done
    if [ $CONTOR -le 250 ]; then
        php b $i > /dev/null &
    fi
done
My targets are URLs, and the b PHP file is a crawler which saves some links into a file. The problem is that the maximum number of threads reached is only 50-60, because the crawler finishes very fast and the bash script doesn't have time to open all 250 of my threads. Is there any chance to do something to open all 250 threads? Is it possible to run more than one thread per ps aux check? Right now it seems to open only one thread after each ps aux.
First: Bash has no multithreading support whatsoever. foo & starts a separate process, not a thread.
Second: launching ps to check for children is both prone to false positives (treating unrelated invocations of php as if they were jobs in the current process) and extremely inefficient if done in a loop (since every invocation involves a fork()/exec()/wait() cycle).
Thus, don't do it that way: Use a release of GNU xargs with -P, or (if you must) GNU parallel.
Assuming your targets file is newline-delimited, and has no special quoting or characters, this could be as simple as:
xargs -d $'\n' -n 1 -P 250 php b <targets
...or, for pure POSIX shells:
xargs -d "
" -n 1 -P 250 php b <targets
With GNU Parallel it looks like this (choose the style you like best):
cat targets | parallel -P 250 php b
parallel -a targets -P 250 php b
parallel -P 250 php b :::: targets
There is no risk of false positives if there are other php processes running. And unlike xargs, there is no risk if the file targets contains spaces, " or '.

How to pipe cmdline output of program to run multiple times and halt when a keyword appears?

Say I want to run a C program 1000 times, and this program is basically a test script that tests the functionality of a simple kernel I have written. It outputs a "SUCCESS" every time it fails. Because of various race conditions that are hard to track down, we often have to run the test manually literally a few hundred times before it fails. I have tried searching the net in vain for perl scripts or bash scripts that can help us run this command:
pintos -v -k -T 60 --qemu -j 2 --filesys-size=2 -p tests/vm/page-parallel -a page-parallel -p tests/vm/child-linear -a child-linear --swap-size=4 -- -q -f run page-parallel < /dev/null
and pipe the command to something to check for a keyword so it can halt/continue if that keyword appears.
Can anyone point me in the right direction?
In bash you can just run it in a while loop:
while true; do
    if pintos -v -k -T 60 --qemu -j 2 --filesys-size=2 -p tests/vm/page-parallel -a page-parallel -p tests/vm/child-linear -a child-linear --swap-size=4 -- -q -f run page-parallel < /dev/null | grep -c KEYWORD; then
        break
    fi
done
Note that the command itself should not be wrapped in quotes: quoting the whole thing would make bash look for a single command whose name is the entire string.
grep -c counts the matches, if 0 then the KEYWORD was not found so it runs the loop again. If > 0 then the KEYWORD was found and the loop breaks out.
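If you want to cap this at the 1000 runs mentioned in the question instead of looping forever, a counted variant might look like this (a sketch; grep -q just suppresses the count output and still sets the exit status):
for i in $(seq 1 1000); do
    if pintos -v -k -T 60 --qemu -j 2 --filesys-size=2 -p tests/vm/page-parallel -a page-parallel -p tests/vm/child-linear -a child-linear --swap-size=4 -- -q -f run page-parallel < /dev/null | grep -q KEYWORD; then
        echo "KEYWORD appeared on run $i"
        break
    fi
done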
