Count and remove extraneous files (bash) - bash

I am getting stuck on finding a succint solution to the following.
In a given directory, I have the following files:
10_MIDAP.nii.gz
12_MIDAP.nii.gz
14_MIDAP.nii.gz
16_restAP.nii.gz
18_restAP.nii.gz
I am only supposed to have two "MIDAP" files and one "restAP" file. The additional files may not contain the full data, so I need to remove them. These are likely going to be smaller in size and/or the earlier sequence number (e.g., 10).
I know how to count / echo the number of files:
MIDAP=`find $DATADIR -name "*MIDAP.nii.gz" | wc -l`
RestAP=`find $DATADIR -name "*restAP.nii.gz" | wc -l`
echo "MIDAP files = $MIDAP"
echo "RestAP files = $RestAP"
Any suggestions on how to succinctly remove the unneeded files, such that I end up with two "MIDAP" files and one "restAP" (in cases where there are extraneous files)? As of now, imagining it would be something like this...
if (( $MIDAP > 2 )); then
...magic happens
fi
Thanks for any advice!

here is an approach
create test files
$ for i in {1..10}; do touch ${i}_restAP; touch ${i}_MIDAP; done
sort based on numbers, and remove the top N-1 (or N-2) files.
$ find . -name '*restAP*' | sort -V | head -n -1 | xargs rm
$ find . -name '*MIDAP*' | sort -V | head -n -2 | xargs rm
$ ls -1
10_MIDAP
10_restAP
9_MIDAP
you may want to change the sort if based on file size.

Related

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder, however, since I have many dozens of samples, I would like to write a bash script to accomplish this across all my samples and rename the concatenated files to name of their parent folder. I am still learning but I am a bit lost, I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
Find all the fq.gz files
Extract list of folders
For each folder
Merge all the containing 'fq' files into new 'all.FOLDER.fq.gz'
p=pp
for suffix in 1 2 ; do
# Find all dirs containing suffix files.
dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
for d in $dirs ; do
# Merge, and store in parent.
(cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
done
done
Notes:
code assume no special characters in folder names.
More compact files will be created if the merge process will uncompressed the original data, and re-compress. (gzcat *.gz

grep files against a list only containing numbers

I have several files (~70000) that have numbers in the name, a couple of examples would be 991000_Metatissue.qsub.file 828000_Metatissue.qsub.file, and then I have another file (files_failed.txt) with a bunch of numbers that I would use to grep. This list looks like this:
4578000
458000
4582000
527000
5288000
5733000
653000
6548000
6663000
I have tried with: ls -1 *.qsub.file | grep -F -f files_failed.txt - and even doing this:
ls -1 *.qsub.file > files_to_submit.txt
grep -F -f files_failed.txt files_to_submit.txt
But always got all the qsub.files...
grep -f isn't well composed (see GNU bug 16305), so I recommend using awk instead:
find . -name '*_*.qsub.file' |awk -F_ '
NR == FNR { failed[$NR] = 1; next }
$1 in failed
' files_failed.txt /dev/stdin
This uses find to locate the files in question, piping them into awk. Before awk processes that, it reads files_failed.txt and stores the values into an associated array (aka dictionary or hash) when the line number (NR, number of records so far) equals the line number of the current file (FNR), meaning it's the first file read. If the first column (the number of the file since we delimited by _) is in that array, it was a failure. AWK's default action on a stanza is to print it, so you will get a list of those failed files.
Note the lack of regular expressions! On a big directory, this is much faster than grep -F -f …, which itself is much faster than grep -f …, even assuming the aforementioned bug is fixed.
You should be using find and you need to modify your "patterns". Here is one way that should work:
# List all files ending in "qsub.file"
find . -name '*.qsub.file' |
# Add ./ and _ to each number to make the match exact
grep -F -f <(sed -e 's:^:./:' -e 's/$/_/' files_failed.txt)
70000 files is too much for a ls, you should use find instead.
And I prefer invert the logic, list just what a want instead of list all and then filter.
Something like
while read line; do find -iname $line_Metatissue.qsub.file; done < files_failed.txt
If you need the exit in another file?
while read line; do find -iname $line_Metatissue.qsub.file; done < files_failed.txt >> files_to_submit.txt
You can use the below script:-
ls -1 *.qsub.file > filelist.txt
while read pattern
do
filefound=$(grep $pattern filelist.txt)
if [ "$filefound" != "" ]; then
echo "File Found : $filefound"
fi
done < files_failed.txt
Second option:-
while read pattern
do
find . -name "$pattern*.qsub.file" >> filefound.txt
done < files_failed.txt
All your files will be stored in file filefound.txt

How to read CSV file stored in variable

I want to read a CSV file using Shell,
But for some reason it doesn't work.
I use this to locate the latest added csv file in my csv folder
lastCSV=$(ls -t csv-output/ | head -1)
and this to count the lines.
wc -l $lastCSV
Output
wc: drupal_site_livinglab.csv: No such file or directory
If I echo the file it says: drupal_site_livinglab.csv
Your issue is that you're one directory up from the path you are trying to read. The quick fix would be wc -l "csv-output/$lastCSV".
Bear in mind that parsing ls -t though convenient, isn't completely robust, so you should consider something like this to protect you from awkward file names:
last_csv=$(find csv-output/ -mindepth 1 -maxdepth 1 -printf '%T#\t%p\0' |
sort -znr | head -zn1 | cut -zf2-)
wc -l "$last_csv"
GNU find lists all files along with their last modification time, separating the output using null bytes to avoid problems with awkward filenames.
if you remove -maxdepth 1, this will become a recursive search
GNU sort arranges the files from newest to oldest, with -z to accept null byte-delimited input.
GNU head -z returns the first record from the sorted list.
GNU cut -z at the end discards the timestamp, leaving you with only the filename.
You can also replace find with stat (again, this assumes that you have GNU coreutils):
last_csv=$(stat csv-output/* --printf '%Y\t%n\0' | sort -znr | head -zn1 | cut -zf2-)

Get total size of a list of files in UNIX

I want to run a find command that will find a certain list of files and then iterate through that list of files to run some operations. I also want to find the total size of all the files in that list.
I'd like to make the list of files FIRST, then do the other operations. Is there an easy way I can report just the total size of all the files in the list?
In essence I am trying to find a one-liner for the 'total_size' variable in the code snippet below:
#!/bin/bash
loc_to_look='/foo/bar/location'
file_list=$(find $loc_to_look -type f -name "*.dat" -size +100M)
total_size=???
echo 'total size of all files is: '$total_size
for file in $file_list; do
# do a bunch of operations
done
You should simply be able to pass $file_list to du:
du -ch $file_list | tail -1 | cut -f 1
du options:
-c display a total
-h human readable (i.e. 17M)
du will print an entry for each file, followed by the total (with -c), so we use tail -1 to trim to only the last line and cut -f 1 to trim that line to only the first column.
Methods explained here have hidden bug. When file list is long, then it exceeds limit of shell comand size. Better use this one using du:
find <some_directories> <filters> -print0 | du <options> --files0-from=- --total -s|tail -1
find produces null ended file list, du takes it from stdin and counts.
this is independent of shell command size limit.
Of course, you can add to du some switches to get logical file size, because by default du told you how physical much space files will take.
But I think it is not question for programmers, but for unix admins :) then for stackoverflow this is out of topic.
This code adds up all the bytes from the trusty ls for all files (it excludes all directories... apparently they're 8kb per folder/directory)
cd /; find -type f -exec ls -s \; | awk '{sum+=$1;} END {print sum/1000;}'
Note: Execute as root. Result in megabytes.
The problem with du is that it adds up the size of the directory nodes as well. It is an issue when you want to sum up only the file sizes. (Btw., I feel strange that du has no option for ignoring the directories.)
In order to add the size of files under the current directory (recursively), I use the following command:
ls -laUR | grep -e "^\-" | tr -s " " | cut -d " " -f5 | awk '{sum+=$1} END {print sum}'
How it works: it lists all the files recursively ("R"), including the hidden files ("a") showing their file size ("l") and without ordering them ("U"). (This can be a thing when you have many files in the directories.) Then, we keep only the lines that start with "-" (these are the regular files, so we ignore directories and other stuffs). Then we merge the subsequent spaces into one so that the lines of the tabular aligned output of ls becomes a single-space-separated list of fields in each line. Then we cut the 5th field of each line, which stores the file size. The awk script sums these values up into the sum variable and prints the results.
ls -l | tr -s ' ' | cut -d ' ' -f <field number> is something I use a lot.
The 5th field is the size. Put that command in a for loop and add the size to an accumulator and you'll get the total size of all the files in a directory. Easier than learning AWK. Plus in the command substitution part, you can grep to limit what you're looking for (^- for files, and so on).
total=0
for size in $(ls -l | tr -s ' ' | cut -d ' ' -f 5) ; do
total=$(( ${total} + ${size} ))
done
echo ${total}
The method provided by #Znik helps with the bug encountered when the file list is too long.
However, on Solaris (which is a Unix), du does not have the -c or --total option, so it seems there is a need for a counter to accumulate file sizes.
In addition, if your file names contain special characters, this will not go too well through the pipe (Properly escaping output from pipe in xargs
).
Based on the initial question, the following works on Solaris (with a small amendment to the way the variable is created):
file_list=($(find $loc_to_look -type f -name "*.dat" -size +100M))
printf '%s\0' "${file_list[#]}" | xargs -0 du -k | awk '{total=total+$1} END {print total}'
The output is in KiB.

How to copy files flattening into one dir using find and xargs (GNU Bash)

I need to find all files (e.g. with extension ABC) and copy it into one directory but creating unique filenames not to overwrite any files with the potential same name.
Something like this:
find /tmp -name \*.ABC | xargs cp '{}' somedir/$(echo {} | md5sum | cut -c1-6){} \;
Creating files like:
b786af1_original_name.ABC
a7af335_original_name_2.ABC
...
The command above obviously cannot work because $( ... ) statement is getting evaluated once. I need to evaluate it for every file name.
How to do that?
Why not read?
find /tmp -name \*.ABC | while read i; do cp $i $(basename $i | md5sum | cut -c1-6)$(basename $i); done;
How about a random int based on the current nanosecond?
date +%N | sed -e 's/000$//' -e 's/^0//'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Strip off leading and trailing zeroes, if present.
# Length of generated integer depends on
# + how many zeroes stripped off.
Probability of getting the same file with the same name is very small with this method.
Source: http://tldp.org/LDP/abs/html/timedate.html
EDIT: actually this will just give you the same problem. Does it need to be a one liner?
For the record, here's a weird-filename-proofed version of #Ken's answer:
find /tmp -name \*.ABC -print0 | while IFS= read -r -d $'\0' i; do cp "$i" "$(basename "$i" | md5sum | cut -c1-6)$(basename "$i")"; done
See BashFAQ #20 for details, variants, etc.
Use mktemp to generate unique names:
find /tmp -name \*.ABC | while read f; do
cp "$f" "$(mktemp /destination/dir/XXXXXXXXXX.ABC)"
done
Important sidenote: watch out how big of a part of a hash you're using as an identifier.
If you're using 6 values of 0..F (16 values), that's less than 17 million combinations. So if you have 5000 files that you're identifying with these, you got a 52% chance of having a collision. 7 hex chars yields 4.5%, and 8 hex chars yields 0.3% of collision (for the 5000 files).

Resources