The script submits the files, and once they are submitted the API service returns the "task_id" of each sample (captured in task.csv).
#file_submitter.sh
#!/bin/bash
for i in $(find $1 -type f); do
    task_id="$(curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload &)"
    echo "$task_id" >> task.csv
done
Run Method :
$ ./file_submitter.sh /home/files/
Results (here 761 and 762 are the task_ids of the submitted samples, as returned by the API service):
#task.csv
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/761"}
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/762"}
I'm passing the folder path to find ($1 -type f) to locate all the files in the directory and upload them. I'm using the "&" operator so the files are submitted/uploaded in the background, and I want the 'task_id' the API service prints to stdout to be stored in task.csv. But the time taken to upload the files with "&" and without "&" is the same. Is there any other way to make the submission parallel/faster? Any suggestions, please?
anubhava suggests using xargs with the -P option:
find "$1" -type f -print0 |
  xargs -0 -P 5 curl -s -F file=@- http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
However, appending to the same file in parallel is generally a bad idea: you really need to know a lot about how your version of the OS buffers output for that to be safe. This example shows why:
#!/bin/bash
size=3000
myfile=/tmp/myfile$$
rm -f $myfile
echo {a..z} | xargs -P26 -n1 perl -e 'print ((shift)x'$size')' >> $myfile
cat $myfile | perl -ne 'for(split//,$_){
    if($_ eq $l) {
        $c++
    } else {
        /\n/ and next;
        print $l,1+$c," "; $l=$_; $c=0;
    }
}'
echo
With size=10 you will always get (order may differ):
1 d10 i10 c10 n10 h10 x10 l10 b10 u10 w10 t10 o10 y10 z10 p10 j10 q10 s10 v10 r10 k10 e10 m10 f10 g10
Which means that the file contains 10 d's followed by 10 i's followed by 10 c's and so on. I.e. no mixing of the output from the 26 jobs.
But change it to size=30000 and you get something like:
1 c30000 d30000 l8192 g8192 t8192 g8192 t8192 g8192 t8192 g5424 t5424 a8192 i16384 s8192 i8192 s8192 i5424 s13616 f16384 k24576 p24576 n8192 l8192 n8192 l13616 n13616 r16384 u8192 r8192 u8192 r5424 u8192 o16384 b8192 j8192 b8192 j8192 b8192 j8192 b5424 a21808 v8192 o8192 v8192 o5424 v13616 j5424 u5424 h16384 p5424 h13616 x8192 m8192 k5424 m8192 q8192 f8192 m8192 f5424 m5424 q21808 x21808 y30000 e30000 w30000
First 30K c's, then 30K d's, then 8K l's, then 8K g's, 8K t's, then another 8K g's, and so on. In other words, the 26 outputs were mixed together. Not good.
For that reason I advise against appending to the same file in parallel: there is a risk of a race condition, and it can often be avoided.
In your case you can simply use GNU Parallel instead of xargs, because GNU Parallel guards against this race condition:
find "$1" -type f -print0 |
parallel -0 -P 5 curl -s -F file=#{} http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
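If you also want to know which file produced which task_id, GNU Parallel's --tag option prefixes every output line with the argument that produced it (a sketch, using the same URL and job count as above):
find "$1" -type f -print0 |
  parallel -0 -P 5 --tag curl -s -F file=@{} http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
Each line of task.csv will then start with the file name, followed by the JSON response for that upload.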
You can use xargs with the -P option:
find "$1" -type f -print0 |
  xargs -0 -P 5 -I{} curl -s -F file='@{}' http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
This will reduce the total execution time by launching 5 curl processes in parallel.
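If you would rather not have five curl processes appending to task.csv at the same time (see the interleaving discussion above), one sketch is to give every upload its own output file and combine them afterwards. The temporary directory and the per-file naming below are assumptions, and it assumes base file names are unique across the tree:
outdir=$(mktemp -d)
find "$1" -type f -print0 |
  xargs -0 -P 5 -I{} sh -c 'curl -s -F file=@"$1" http://X.X.X.X:8080/api/abc/v1/upload > "$2/$(basename "$1").json"' _ {} "$outdir"
# each response is tiny, so re-assembling them sequentially is cheap
for r in "$outdir"/*.json; do printf '%s\n' "$(cat "$r")"; done > task.csv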
The command inside a command substitution, $(), runs in a subshell; so here you are sending the curl command to the background of that subshell, not the parent shell.
Get rid of the command substitution and just do:
curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload >> task.csv &
You're telling the shell to parallelize inside of a command substitution ($()). That's not going to do what you want. Try this instead:
#!/bin/bash
for i in $(find $1 -type f); do
    curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload &
done > task.csv
# uncomment the next line if you want the script to pause until the last curl is done
# wait
This puts the curl into the background and saves its output into task.csv.
Related
I would like to get an opinion on how best to do this in bash; thank you.
For x number of servers, each has its own list of replication agreements and their statuses. It's easy to run a few commands and get this data, for example:
Get servers; the output (a setting/variable from a local config file):
. ./ldap-config ; echo "$MASTER $REPLICAS"
dc1-server1 dc1-server2 dc2-server1 dc2-server2 dc3...
For dc1-server1, get agreements; the output:
ipa-replica-manage -p $(cat ~/.dspw) list -v $SERVER.$DOMAIN | grep ': replica' | sed 's/: replica//'
dc2-server1
dc3-server1
dc4-server1
For dc1-server1, get agreement status codes; the output:
ipa-replica-manage -p $(cat ~/.dspw) list -v $SERVER.$DOMAIN | grep 'status: Error (' | sed -e 's/.*status: Error (//' -e 's/).*//'
0
0
18
So the output would be several columns based on the 'get servers' list, with each 'replica: status' pair listed under the server it belongs to.
I'm looking to achieve something like:
dc2-server1: 0 dc2-server2: 0 dc1-server1: 0 ...
dc3-server1: 0 dc3-server2: 18 dc3-server1: 13 ...
dc4-server1: 18 dc4-server2: 0 dc4-server1: 0 ...
Generally eval is considered evil. Nevertheless, I'm going to use it.
paste is handy for printing files side-by-side.
Bash process substitutions can be used where you'd use a filename.
So, I'm going to dynamically build up a paste command and then eval it.
I'm going to use get.sh as a placeholder for your mystery commands.
cmd="paste"
while read -ra servers; do
for server in "${servers[#]}"; do
cmd+=" <(./get.sh \"$server\" agreements | sed 's/\$/:/')"
cmd+=" <(./get.sh \"$server\" status)"
done
done < <(./get.sh servers)
eval "$cmd" | column -t
I'm using an Awk script to split a big text document into independent files. That worked, and now I'm working with 14k text files. The problem is that there are a lot of files with just three lines of text, and it's not useful for me to keep them.
I know I can delete lines in a file with awk 'NF>=3' file, but I don't want to delete lines inside the files; rather, I want to delete the files whose content is just two or three lines of text.
Thanks in advance.
Could you please try the following find command (tested with GNU awk).
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;
The above will print to the console the names of files that have 3 lines or fewer. Once you are happy with the results, try the following command to delete them. Run it in a test directory first, and only proceed once you are fully satisfied with the output of the command above. (Remove the echo from the command below; I have kept it to be on the safe side. :) )
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit !f}' {} \; -exec echo rm -f {} \;
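With 14k files, spawning one awk per file adds up; if you have GNU awk you could also batch many files per invocation using the ENDFILE and nextfile features (a sketch relying on these GNU-awk-only extensions):
find /your/path/ -type f -exec gawk 'FNR > 3 {nextfile} ENDFILE {if (FNR <= 3) print FILENAME}' {} +
This only prints the candidate file names; review the list and delete them separately once you are satisfied.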
If the files in the current directory are all text files, this should be efficient and portable:
for f in *; do
    [ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done # | xargs rm
Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.
Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
You may use wc to count the lines and then decide whether to delete the file or not. You should write a shell script rather than just an awk command.
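A minimal sketch of that idea, assuming the files sit directly in /your/path/ and keeping an echo as a dry run:
#!/bin/bash
# delete files that have 3 lines or fewer (remove echo once you trust the output)
for f in /your/path/*; do
    [ -f "$f" ] || continue
    lines=$(wc -l < "$f")
    [ "$lines" -le 3 ] && echo rm -f "$f"
done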
You can try Perl. The solution below is efficient, as the file handle ARGV is closed once the line count exceeds 3 (closing at eof as well resets $. between files when several files are given):
perl -nle 'close(ARGV) if ($. > 3 or eof); $kv{$ARGV}++; END { for (sort keys %kv) { print if $kv{$_} > 3 } }' *
If you want to pipe in the output of some other command (say find), you can use it like this:
$ find . -name "*" -type f -exec perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt
$
The output of wc -l * in the same directory:
$ wc -l *
12 bing.fasta
16 chris_smith.txt
8 dawn.txt
9 drcatfish.txt
3 fileA
3 fileB
13 foo.yaml
3 hubbs.txt
8 ip.txt
19 join_tab.pl
6 manoj1.txt
6 manoj2.txt
5 moose.txt
17 query_ip.txt
3 rororo.txt
5 scottc.txt
22 seats.ksh
1 steveman.txt
4 tane.txt
13 test_input_so.txt
24 ya801.txt
200 total
$
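If you instead want the complementary list from this Perl approach, i.e. the files with 3 lines or fewer that are candidates for deletion, flipping the condition should work (a sketch; note that completely empty files never produce an input line, so they will not appear in %kv):
perl -nle 'close(ARGV) if ($. > 3 or eof); $kv{$ARGV}++; END { for (sort keys %kv) { print if $kv{$_} <= 3 } }' *
The resulting names can then be reviewed and piped to xargs rm once you are satisfied.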
I am currently building a bash script for a class, and I am trying to use the grep command to grab the values from a simple calculator program and store them in the variables I assign, but I keep receiving a syntax error message when I try to run the script. Any advice on how to fix it? My script looks like this:
#!/bin/bash
addanwser=$(grep -o "num1 + num2" Lab9 -a 5 2)
echo "addanwser"
subanwser=$(grep -o "num1 - num2" Lab9 -s 10 15)
echo "subanwser"
multianwser=$(grep -o "num1 * num2" Lab9 -m 3 10)
echo "multianwser"
divanwser=$(grep -o "num1 / num2" Lab9 -d 100 4)
echo "divanwser"
modanwser=$(grep -o "num1 % num2" Lab9 -r 300 7)
echo "modawser"`
You want to grep the output of a command.
grep reads from either a file or standard input, so you can use any of these equivalent forms:
grep X file # 1. from a file
... things ... | grep X # 2. from stdin
grep X <<< "content" # 3. using here-strings
For this case, you want to use the last one, so that you execute the program and its output feeds grep directly:
grep <something> <<< "$(Lab9 -s 10 15)"
Which is the same as saying:
Lab9 -s 10 15 | grep <something>
So grep will act on the output of your program. Since I don't know how Lab9 works, let's use a simple example with seq, which returns the numbers from 5 to 15:
$ grep 5 <<< "$(seq 5 15)"
5
15
grep is usually used for finding matching lines in a text file. To actually grab part of the matched line, other tools such as awk are used.
Assuming the output looks like "num1 + num2 = 54" (i.e. fields are separated by space), this should do your job:
addanwser=$(Lab9 -a 5 2 | awk '{print $NF}')
echo "$addanwser"
Make sure you don't miss the '$' sign before addanwser when echo'ing it.
$NF selects the last field. You may select the nth field using $n.
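For example, assuming the output really does look like the "num1 + num2 = 7" format described above:
$ echo "num1 + num2 = 7" | awk '{print $NF}'
7
$ echo "num1 + num2 = 7" | awk '{print $3}'
num2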
In the sections below, you'll see the shell script I am trying to run on a UNIX machine, along with a transcript.
When I run this program, it gives the expected output but it also gives an error shown in the transcript. What could be the problem and how can I fix it?
First, the script:
#!/usr/bin/bash
while read A B C D E F
do
    E=`echo $E | cut -f 1 -d "%"`
    if test $# -eq 2
    then
        I=`echo $2`
    else
        I=90
    fi
    if test $E -ge $I
    then
        echo $F
    fi
done
And the transcript of running it:
$ df -k | ./filter.sh -c 50
./filter.sh: line 12: test: capacity: integer expression expected
/etc/svc/volatile
/var/run
/home/ug
/home/pg
/home/staff/t
/packages/turnin
$ _
Before the line that says:
if test $E -ge $I
temporarily place the line:
echo "[$E]"
and you'll find something very much non-numeric, and that's because the output of df -k looks like this:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 954316620 212723892 693109608 24% /
udev 10240 0 10240 0% /dev
: :
The offending line there is the first, which will have its fifth field Use% turned into Use, which is definitely not an integer.
A quick fix may be to change your usage to something like:
df -k | sed -n '2,$p' | ./filter -c 50
or:
df -k | tail -n+2 | ./filter -c 50
Either of those extra filters (sed or tail) will print only from line 2 onwards.
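If you would rather keep the check inside the script itself instead of adding a pipeline stage, a sketch that skips the header line with a simple counter (the counter is an addition; the rest mirrors the original script) is:
#!/usr/bin/bash
line=0
while read A B C D E F
do
    line=$((line + 1))
    test $line -eq 1 && continue      # skip df's header line
    E=`echo $E | cut -f 1 -d "%"`
    if test $# -eq 2
    then
        I=$2
    else
        I=90
    fi
    if test "$E" -ge "$I"
    then
        echo $F
    fi
done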
If you're open to not needing a special script at all, you could probably just get away with something like:
df -k | awk -vlimit=40 '$5+0>=limit&&NR>1{print $5" "$6}'
The way it works is to only operate on lines where both:
the fifth field, converted to a number, is at least equal to the limit passed in with -v; and
the record number (line) is two or greater.
Then it simply outputs the relevant information for those matching lines.
This particular example outputs the file system and usage (as a percentage like 42%) but, if you just want the file system as per your script, just change the print to output $6 on its own: {print $6}.
Alternatively, if you do the percentage but without the %, you can use the same method I used in the conditional: {print $5+0" "$6}.
I was working on a project, and I want to contribute the solution I found:
The code is of this form:
while true
do
    while read VAR
    do
        ......
    done < <(find ........ | sort)
    sleep 3
done
The errors in the log were:
/dir/script.bin: redirection error: cannot duplicate fd: Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: line 26: <(find "${DIRVAR}" -type f -name '*.pdf' | sort): Too many open files
find: `/somedirtofind': Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: line 26: <(find "${DIRVAR}" -type f -name '*.pdf' | sort): ambiguous redirect
I noticed with the command ls -l /proc/3657/fd (3657 being the PID here) that the file descriptors were constantly increasing.
Using Debian 7, GNU bash, version 4.2.37(1)-release (i486-pc-linux-gnu)
The solution that worked for me is:
while true
do
    find ........ | sort | while read VAR
    do
    done
    sleep 3
done
That is, avoiding the process substitution at the end; there must be some kind of leak there.
Now I don't see file descriptors piling up in the process directory when doing ls.
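A quick way to keep an eye on this is to count the entries in the process's fd directory every second (a sketch; substitute your script's PID for 3657):
watch -n 1 'ls /proc/3657/fd | wc -l'
If the count keeps climbing while the loop runs, descriptors are leaking; if it stays flat, the fix is holding.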
Mail to bugtracker:
The minimum reproducible code is:
#!/bin/bash
function something() {
    while true
    do
        while read VAR
        do
            dummyvar="a"
        done < <(find "/run/shm/directory" -type f | sort)
        sleep 3
    done
}
something &
This fails with many pipe file descriptors left open.
Changing the while loop's feed to this:
#!/bin/bash
function something() {
    find "/run/shm/directory" -type f | sort | while true
    do
        while read VAR
        do
            dummyvar="a"
        done
        sleep 3
    done
}
something &
This works completely normally.
However, removing the backgrounded function call and running the loop at the top level:
#!/bin/bash
while true
do
    while read VAR
    do
        dummyvar="a"
    done < <(find "/run/shm/debora" -type f | sort)
    sleep 3
done
But executing the script with ./test.sh & (in the background) works without problems too.
I was having the same issue and played around with the things you suggested. You said:
But executing the script with ./test.sh & (in the background) works without problems too.
So what worked for me is to run in the background and just wait for it to finish each time:
while true
do
    while read VAR
    do
        ......
    done < <(find ........ | sort) &
    wait
done
Another thing that worked was to put the code creating the descriptor into a function, without running it in the background:
function fd_code(){
    while read VAR
    do
        ......
    done < <(find ........ | sort)
}

while true
do
    fd_code
done