Bash running out of File Descriptors

I was working on a project, and I want to contribute the solution I found.
The code is of this kind:
while true
do
while read VAR
do
......
done < <(find ........ | sort)
sleep 3
done
The errors in the log were:
/dir/script.bin: redirection error: cannot duplicate fd: Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: line 26: <(find "${DIRVAR}" -type f -name '*.pdf' | sort): Too many open files
find: `/somedirtofind': Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: cannot make pipe for process substitution: Too many open files
/dir/script.bin: line 26: <(find "${DIRVAR}" -type f -name '*.pdf' | sort): ambiguous redirect
I noticed with the command ls -l /proc/3657/fd (3657 being the PID) that the file descriptors were constantly increasing.
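A quick way to watch the count climb (a rough sketch; substitute the actual PID for 3657):
# Print the script's open-descriptor count every second while it runs.
watch -n1 'ls /proc/3657/fd | wc -l'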
Using Debian 7, GNU bash, version 4.2.37(1)-release (i486-pc-linux-gnu)

The solution that worked for me is:
while true
do
find ........ | sort | while read VAR
do
......
done
sleep 3
done
That is, avoiding the process substitution at the end; there must be some kind of leak there.
Now I no longer see file descriptors piling up in the process directory when doing ls.
Mail to bugtracker:
The minimal reproducible code is:
#!/bin/bash
function something() {
while true
do
while read VAR
do
dummyvar="a"
done < <(find "/run/shm/directory" -type f | sort)
sleep 3
done
}
something &
This fails with many pipe file descriptors left open.
Changing the while loop's input to this:
#!/bin/bash
function something() {
find "/run/shm/directory" -type f | sort | while true
do
while read VAR
do
dummyvar="a"
done
sleep 3
done
}
something &
Works completely normally.
However, removing the function call in the background:
#!/bin/bash
while true
do
while read VAR
do
dummyvar="a"
done < <(find "/run/shm/debora" -type f | sort)
sleep 3
done
But executing the script with ./test.sh & (in the background) works without problems too.

I was having the same issue and played around with the things you suggested. You said:
But executing the script with ./test.sh & (in background), works
without problems too.
So what worked for me is to run it in the background and just wait for it to finish each time:
while true
do
while read VAR
do
......
done < <(find ........ | sort) &
wait
done
Another thing that worked was to put the code creating the descriptor into a function, without running it in the background:
function fd_code(){
while read VAR
do
......
done < <(find ........ | sort)
}
while true
do
fd_code
done

Related

How do I get a list of files that have line count below 18

I need to search for files in a directory by month/year, pass them through wc -l (or similar), test something like [ $lines -le 18 ], and get a list of the files that match.
In the past I called this with 'file.sh 2020-06' and used something like this to process the files for that month:
find . -name "* $1-*" -exec grep '(1 |2 |3 )' {}
but I now need to test for a line count.
The above -exec worked, but when I changed over to passing the file to another -exec I got complaints of "too many parameters" because the file names have spaces. I just can't seem to get on track with solving this one.
Any pointers to get me going would be very much appreciated.
Rick
Here's one using find and awk. But first some test files (Notice: it creates files named 16, 17, 18 and 19):
$ for i in 16 17 18 19 ; do seq 1 $i > $i ; done
Then:
$ find . -name 1\[6789\] -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;
./16
./17
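To tie this back to the question, the same awk test can be hung off the asker's original month/year pattern (a sketch; the "* $1-*" pattern is taken from the question, and -exec hands each file name to awk as a single argument, so spaces are not a problem):
find . -type f -name "* $1-*" -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;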

Remove text files with less than three lines

I'm using an Awk script to split a big text document into independent files. I did it, and now I'm working with 14k text files. The problem is that there are a lot of files with just three lines of text, and it's not useful for me to keep them.
I know I can delete lines in a text file with awk 'NF>=3' file, but I don't want to delete lines inside files; rather, I want to delete the files whose content is just two or three text lines.
Thanks in advance.
Could you please try the following find command (tested with GNU awk).
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;
The above will print, on the console, the names of the files that have 3 lines or fewer. Once you are happy with the results, try the following to delete them. I would suggest running it in a test directory first, and only proceed once you are fully satisfied with the output of the first command. (Remove the echo from the command below to actually delete; I have kept it in to be on the safer side :) )
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit (f?1:0)}' {} \; -exec echo rm -f {} \;
If the files in the current directory are all text files, this should be efficient and portable:
for f in *; do
[ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done # | xargs rm
Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.
Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
You may use wc to count the lines and then decide whether to delete the file or not; you would need a shell script for that rather than just an awk command.
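A minimal sketch of that idea (the path and the echo safeguard are just examples):
#!/bin/sh
# List (or, without the echo, delete) regular files with 3 lines or fewer.
for f in /your/path/*; do
    [ -f "$f" ] || continue
    lines=$(wc -l < "$f")
    [ "$lines" -le 3 ] && echo rm -f "$f"   # drop the echo to actually delete
done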
You can try Perl. The solution below is efficient: the file handle ARGV is closed as soon as more than 3 lines have been counted, and it prints the files that have more than 3 lines (i.e. the ones to keep).
perl -nle ' $kv{$ARGV}++; close(ARGV) if $kv{$ARGV}>3; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' *
If you want to use it with the output of some other command (say find), you can do it like this:
$ find . -name "*" -type f -exec perl -nle ' $kv{$ARGV}++; close(ARGV) if $kv{$ARGV}>3; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt
$
The output of wc -l * in the same directory:
$ wc -l *
12 bing.fasta
16 chris_smith.txt
8 dawn.txt
9 drcatfish.txt
3 fileA
3 fileB
13 foo.yaml
3 hubbs.txt
8 ip.txt
19 join_tab.pl
6 manoj1.txt
6 manoj2.txt
5 moose.txt
17 query_ip.txt
3 rororo.txt
5 scottc.txt
22 seats.ksh
1 steveman.txt
4 tane.txt
13 test_input_so.txt
24 ya801.txt
200 total
$

Monitoring Tool for checking files in a directory

I have four files which come into this directory (C:/Desktop/Folder) each day, and they are in the following format:
DD/MM/YYYY/HH/MM/SS/File
Examples:
01/01/2018/01:01:00
01/01/2018/01:02:00
01/01/2018/01:03:00
01/01/2018/01:04:00
I want my script to email me only if, in a 24-hour period, the 4th file (01/01/2018/01:04:00) does not come in. I cannot do an exact match (=), as each day the date increments by one, for example:
01/01/2018/01:01:00 then
02/01/2018/01:01:00
Code:
#!/bin/bash
monitor_dir=/path/to/dir
email=me@me.com
files=$(find "$monitor_dir" -maxdepth 1 | sort)
IFS=$'\n'
while true
do
sleep 5s
newfiles=$(find "$monitor_dir" -maxdepth 1 | sort)
added=$(comm -13 <(echo "$files") <(echo "$newfiles"))
[ "$added" != "" ] && find $added -maxdepth 1 -printf '%Tc\t%s\t%p\n' | mail -s "incoming" "$email"
files="$newfiles"
done
Can I please have some assistance with how I can alter this code to reflect my new requirements?
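One possible direction (a rough sketch only, assuming the files land directly in $monitor_dir and that a daily cron run is acceptable) is to count the files modified in the last 24 hours and mail only when fewer than four arrived:
#!/bin/bash
# Rough sketch: run once a day (e.g. from cron). Counts regular files modified
# in the last 24 hours and mails only when fewer than 4 arrived.
monitor_dir=/path/to/dir
email=me@me.com
count=$(find "$monitor_dir" -maxdepth 1 -type f -mtime -1 | wc -l)
if [ "$count" -lt 4 ]; then
    echo "Only $count file(s) arrived in $monitor_dir in the last 24 hours" |
        mail -s "missing 4th file" "$email"
fi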

Bash Multiple cURL request Issue

The script submits the files and, post-submit, the API service returns a "task_id" for each submitted sample (collected in #task.csv):
#file_submitter.sh
#!/bin/bash
for i in $(find $1 -type f);do
task_id="$(curl -s -F file=#$i http://X.X.X.X:8080/api/abc/v1/upload &)"
echo "$task_id" >> task.csv
done
Run Method :
$./submitter.sh /home/files/
Results : ( Here 761 & 762 is the task_id of the submitted sample from the API service )
#task.csv
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/761"}
{"task_url": "http://X.X.X.X:8080/api/abc/v1/task/762"}
I'm giving the entire folder path (find $1 -type f) to find all the files in the directory and upload them. I'm using the "&" operator to submit/upload the files from the folder in the background, which generates a 'task_id' from the API service (on stdout), and I want that 'task_id' (stdout) stored in 'task.csv'. But the time taken to upload a file with "&" and without "&" is the same. Is there any other method to do the submission in parallel/faster? Any suggestions, please?
anubhava suggests using xargs with -P option:
find "$1" -type f -print0 |
xargs -0 -P 5 -I{} curl -s -F file='@{}' http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
However, appending to the same file in parallel is generally a bad idea: You really need to know a lot about how this version of the OS buffers output for that to be safe. This example shows why:
#!/bin/bash
size=3000
myfile=/tmp/myfile$$
rm $myfile
echo {a..z} | xargs -P26 -n1 perl -e 'print ((shift)x'$size')' >> $myfile
cat $myfile | perl -ne 'for(split//,$_){
if($_ eq $l) {
$c++
} else {
/\n/ and next;
print $l,1+$c," "; $l=$_; $c=0;
}
}'
echo
With size=10 you will always get (order may differ):
1 d10 i10 c10 n10 h10 x10 l10 b10 u10 w10 t10 o10 y10 z10 p10 j10 q10 s10 v10 r10 k10 e10 m10 f10 g10
Which means that the file contains 10 d's followed by 10 i's followed by 10 c's and so on. I.e. no mixing of the output from the 26 jobs.
But change it to size=30000 and you get something like:
1 c30000 d30000 l8192 g8192 t8192 g8192 t8192 g8192 t8192 g5424 t5424 a8192 i16384 s8192 i8192 s8192 i5424 s13616 f16384 k24576 p24576 n8192 l8192 n8192 l13616 n13616 r16384 u8192 r8192 u8192 r5424 u8192 o16384 b8192 j8192 b8192 j8192 b8192 j8192 b5424 a21808 v8192 o8192 v8192 o5424 v13616 j5424 u5424 h16384 p5424 h13616 x8192 m8192 k5424 m8192 q8192 f8192 m8192 f5424 m5424 q21808 x21808 y30000 e30000 w30000
First 30K c's, then 30K d's, then 8k l's, then 8K g's, 8K t's, then another 8k g's, and so on. I.e. the 26 outputs were mixed together. Very non-good.
For that reason I advise against appending to the same file in parallel: there is a risk of a race condition, and it can often be avoided.
In your case you can simply use GNU Parallel instead of xargs, because GNU Parallel guards against this race condition:
find "$1" -type f -print0 |
parallel -0 -P 5 curl -s -F file=@{} http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
You can use xargs with -P option:
find "$1" -type f -print0 |
xargs -0 -P 5 -I{} curl -s -F file='@{}' http://X.X.X.X:8080/api/abc/v1/upload >> task.csv
This will reduce total execution time by launching 5 curl processes in parallel.
The command inside command substitution, $(), runs in a subshell; so here you are sending the curl command in the background of that subshell, not the parent shell.
Get rid of the command substitution and just do:
curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload >> task.csv &
You're telling the shell to parallelize inside of a command substitution ($()). That's not going to do what you want. Try this instead:
#!/bin/bash
for i in $(find $1 -type f);do
curl -s -F file=@$i http://X.X.X.X:8080/api/abc/v1/upload &
done > task.csv
#uncomment next line if you want the script to pause until the last curl is done
#wait
This puts the curl into the background and saves its output into task.csv.
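If you want backgrounding without a tool like xargs or GNU Parallel but still capped, here is a rough sketch (the limit of 5 and the polling interval are arbitrary choices; the URL and task.csv follow the examples above):
#!/bin/bash
# Keep at most 5 uploads in flight using plain bash job control.
max_jobs=5
find "$1" -type f -print0 | {
    while IFS= read -r -d '' f; do
        while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
            sleep 0.2              # crude back-off; bash 4.3+ could use 'wait -n'
        done
        curl -s -F "file=@$f" http://X.X.X.X:8080/api/abc/v1/upload &
    done
    wait                           # let the last uploads finish before the group exits
} >> task.csv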

Unix shell group files extensions by size

I want to group and sort file sizes by extension in the current folder and all subfolders.
for i in `find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq `
do
echo $i
done
This code gets all file extensions in the current folder and all subfolders.
Now I need to sum all file sizes by those extensions and print them.
Any ideas how this could be done?
Example output:
sh (sum of file sizes for the sh extension)
pl (sum of file sizes for the pl extension)
c (sum of file sizes for the c extension)
I would use a loop, so that you can provide a different extension every time and find just the files with that extension:
for extension in c php pl ...
do
find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc
done
The sum is based on the answer to "total size of group of files selected with 'find'".
In case you want the very specific output you mention in the question, you can store the last line and then print it together with the extension name:
for extension in c php pl ...
do
sum=$(find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc | tail -1)
echo "$extension ($sum)"
done
If you don't want to name the file extensions beforehand, the stat(1) program has a format option (-c) that can make tasks like this a bit easier (if you're on a system that includes it), and xargs(1) usually helps performance.
#!/bin/sh
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '
{
sums[$2] += $1
}
END {
for (key in sums) {
printf "%s %d\n", key, sums[key]
}
}'
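The question also asks for the output to be sorted; none of the above does that, but (as a small, assumed addition) the awk summary can simply be piped through sort:
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '{ sums[$2] += $1 } END { for (key in sums) printf "%s %d\n", key, sums[key] }' |
sort -k2,2 -rn        # biggest total (in bytes) first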
