Need a way to gather total size of JAR files - bash

I am new to more advanced bash commands. I need a way to total up the size of the external libraries in our codeline. There are a few main directories, but I also have a spreadsheet with the actual locations of the libraries that need to be included.
I have fiddled with find and du but it is unclear to me how to specify multiple locations. Can I find the size of several hundred jars listed in the spreadsheet instead of approximating with the main directories?
Edit: I can now find the size of specific files. I had to export the Excel spreadsheet with the locations to a CSV. In PSPad I "joined lines" and copied and pasted the result directly into the list_of_files slot (find list_of_files | xargs du -hc). I could not get find to use a file containing the locations separated by spaces, tabs, or newlines.
Now I can't tell whether replacing list_of_files with list_of_directories will work. It looks like it counts things twice, e.g.
1.0M /folder/dummy/image1.jpg
1.0M /folder/dummy/image2.jpg
2.0M /folder/dummy
3.0M /folder/image3.jpg
7.0M /folder
14.0M total
This output is fake, but if it is counting like this then that is not what I want. I suspect this because the total I'm getting seems really high.

Do you mean...
find list_of_directories | xargs du -hc
Then, if you want to pipe to du exactly the files that are listed in the spreadsheet, you need a way to filter them out. Is it a text file, or what format is it?
find $(cat file) | xargs du -hc
might do it if they are in a text file as a list separated by spaces. You will probably run into issues with spaces in the paths, though; you have to quote the filenames.
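If the locations are one per line in the exported file, a hedged sketch using GNU du's --files0-from option sidesteps the quoting problems entirely (jar_list.txt is a made-up file name):
# Turn the newline-separated list into NUL-separated paths, sum them, and print only the grand total.
tr '\n' '\0' < jar_list.txt | du -ch --files0-from=- | tail -n 1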

for fn in $(find DIR1 DIR2 FILE1 -name '*.jar'); do du "$fn"; done | awk '{TOTAL += $1} END {print TOTAL}'
You can specify your files and directories in place of DIR1, DIR2, FILE1, etc. You can list their individual sizes by removing the piped awk command.
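If directories sneak into that list, du counts their contents a second time, which matches the inflated totals described in the question. A hedged variant that hands only regular .jar files to du and sums them with awk (DIR1 and DIR2 are placeholders as above):
# -type f skips directories so nothing is double-counted; du -k prints sizes in kilobytes, which awk sums.
find DIR1 DIR2 -type f -name '*.jar' -print0 | xargs -0 du -k | awk '{kb += $1} END {print kb " KB total"}'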

Related

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder; however, since I have many dozens of samples, I would like to write a bash script that does this across all my samples and names the concatenated files after their parent folder. I am still learning and a bit lost, so I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
    find all the fq.gz files
    extract the list of folders
    for each folder:
        merge all the contained fq.gz files into a new FOLDER_suffix.fq.gz in the parent folder
p=pp    # parent folder; set this to your path, e.g. /ParentFolder
for suffix in 1 2 ; do
    # Find all dirs containing suffix files.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store in the parent folder.
        (cd "$d" ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes no special characters in the folder names.
More compact files will be created if the merge process uncompresses the original data and re-compresses it (e.g. gzcat *_1.fq.gz | gzip > merged_1.fq.gz).
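A minimal alternative sketch that loops over the sample folders directly, assuming the /ParentFolder/SampleX layout from the question and no special characters in the names:
for d in /ParentFolder/*/ ; do
    sample=$(basename "$d")
    for suffix in 1 2 ; do
        # Write the merged file next to the sample folder so a re-run
        # does not pick it up again as input.
        cat "$d"*_"$suffix".fq.gz > /ParentFolder/"${sample}_${suffix}".fq.gz
    done
done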

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep
with an array as a pattern?
Background story:
I have a pipeline which I have to rerun for datasets that ran into an error.
Which datasets ran into an error is saved in a tab-separated file.
I want to delete the files where the pipeline ran into an error.
To do so I extracted the dataset names from another file containing the finished datasets and saved them in a bash array (ds1 ds2 ...), but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Without excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, this works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/* | xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' $path/$log_pf)
echo "${finished[@]}"
which works as expected, but now I am stuck on filtering the ls output using that array:
*pseudocode
#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess in Python I could do it easily,
but for training purposes I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
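Applied to the array from the question, a hedged sketch would print the array one name per line and hand it to grep as fixed-string patterns:
# -F treats the names as literal strings, -v inverts the match, -f reads the patterns from the process substitution.
ls -d /datasets/*/ | grep -vFf <(printf '%s\n' "${finished[@]}")
Note that -F still does substring matching, so a name like ds1 would also filter out ds10; anchoring the patterns (for example by including the surrounding slashes) may be needed.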
I must admit I'm confused about what you're doing, but if you're just trying to produce a list of files excluding those stored in column 2 of some other file, and your file/directory names can't contain spaces, then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Bash script to filter out files based on size

I have a lot of log files, all with unique file names; however, judging by the size, many have exactly the same content (bot-generated attacks).
I need to filter out duplicate file sizes or include only unique file sizes.
95% are not unique, and I can see the file sizes, so I could manually choose sizes to filter out.
I have worked out that
find . -size 48c | xargs ls -lSr -h
will give me only logs of 48 bytes, and I could continue with this method to build a long list of included files.
uniq does not support file size, as far as I can tell
find does have a 'not' option; maybe that is where I should be looking?
How can I efficiently filter out the known duplicates?
Or is there a different method to filter and display logs based on unique size only.
One solution is:
find . -type f -ls | awk '!x[$7]++ {print $11}'
$7 is the filesize column; $11 is the pathname.
Since you are using find I assume there are subdirectories, which you don't want to list.
The awk part prints the path of the first file with a given size (only).
HTH
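Going the other way, a sketch that lists every file except the first one seen for each size, so the duplicate candidates can be reviewed before deleting (dupes_by_size.txt is a made-up output name):
# Inverted condition: print paths only for the 2nd, 3rd, ... file of each size.
find . -type f -ls | awk 'x[$7]++ {print $11}' > dupes_by_size.txt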
You nearly had it. Does going with this provide a solution?
find . -size 48c | xargs

Count files matching a pattern with GREP

I am on a Windows server and have installed grep for Windows. I need to count the number of file names that match (or do not match) a specific pattern. I don't really need all the filenames listed out; I just need a total count of how many matched. The tree structure that I will be searching is fairly large, so I'd like to conserve as much processing as possible.
I'm not very familiar with grep, but it looks like I can use the -l option to search for file names matching a given pattern. So, for example, I could use
$grep -l -r this *.doc*
to search for all MS Word files in the current folder and all child folders. This would then return a listing of all those files. I don't want the listing, I just want a count of how many it found. Is this possible with grep... or another tool?
Thanks!
On Linux you would use
grep -l -r this .doc | wc -l
to get the number of printed lines.
Note that -r .doc does not search all Word files; you would use --include "*doc" instead.
And if you do not have wc, you can use grep again, to count the number of matches:
grep -l -r --include "*doc" this . | grep -c .
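If the goal is really just to count file names matching a pattern (rather than searching file contents), and a Unix-style find is available, a sketch that avoids reading the files at all:
# Count files whose names match *.doc* anywhere under the current folder.
find . -type f -name '*.doc*' | wc -l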

Remove identical files in UNIX

I'm dealing with a large amount (30,000) files of about 10MB in size. Some of them (I estimate 2%) are actually duplicated, and I need to keep only a copy for every duplicated pair (or triplet).
Would you suggest an efficient way to do that? I'm working on Unix.
You can try this snippet to list all duplicates first, before removing anything:
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
I would write a script to create a hash of every file. You could store the hashes in a set, iterate over the files, and where a file hashes to a value already found in the set, delete the file. This would be trivial to do in Python, for example.
For 30,000 files, at 64 bytes per hash table entry, you're only looking at about 200 megabytes.
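A hedged sketch of that hash-set idea in bash rather than Python, using an associative array keyed by checksum; /path is a placeholder, and it only echoes what it would delete so the list can be reviewed first:
declare -A seen
while IFS= read -r -d '' f; do
    h=$(sha256sum "$f" | awk '{print $1}')
    if [[ -n "${seen[$h]}" ]]; then
        echo "duplicate: $f (same content as ${seen[$h]})"
        # rm -- "$f"   # uncomment only after reviewing the output
    else
        seen[$h]=$f
    fi
done < <(find /path -type f -print0)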
Find possible duplicate files:
find DIR -type f -exec sha1sum "{}" \; | sort | uniq -d -w40
Now you can use cmp to check that the files are really identical.
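For example, to double-check one candidate pair reported above (file names are hypothetical):
cmp -s fileA fileB && echo identical || echo different   # -s keeps cmp silent; only the exit status is used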
There is an existing tool for this: fdupes
Restoring a solution from an old deleted answer.
Write a script that first compares file sizes, then MD5 checksums (caching them, of course) and, if you're very anxious about losing data, bites the bullet and actually compares duplicate candidates byte for byte. If you have no additional knowledge about how the files came to be etc., it can't really be done much more efficiently.
Save all the file names in an array. Then traverse the array. In each iteration, compare the file contents with the other file's contents by using the command md5sum. If the MD5 is the same, then remove the file.
For example, if file b is a duplicate of file a, the md5sum will be the same for both the files.
