BASH counting numbers on all lines - bash

I need help with my bash script.
The task is to compute the total size of the files in a directory. I already did that part (using ls, awk and grep). My output may look like this, for example:
1326
40
598
258
12
$
These numbers are the sizes of the files in the directory.
I need to add them all up, and that is where I'm stuck.
So I would be really grateful if someone could tell me how to add them all up (and so find the total size of the files in the directory).
Thank you

Well, in Unix shell programming, never forget the most basic philosophy:
Keep It Simple, Stupid!
which, in practice, means: use the right tool that does one thing, but does it well. You can achieve what you want with a mix of ls or find, grep, awk, cut, sed and so on, or you can use the tool that has been designed for calculating file sizes.
And that tool is du:
% du -chs /directory
4.3G /directory
4.3G total
It will, though, give the total size of every file within every subdirectory of the given path. If you want to limit it to just the files within the directory itself (and not the ones below), you can do:
% du -chsS /directory
3G /directory
3G total
For more details, refer to the manual page (man du); here are the arguments I'm using in this answer:
-c, --total produce a grand total
-h, --human-readable print sizes in human readable format (e.g., 1K 234M 2G)
-s, --summarize display only a total for each argument
If you remove -s you'll get size details for each subdirectory rather than a single summary (add -a to see every file), if you remove -h you'll get exact sizes in bytes (instead of values rounded to a more readable unit), and if you remove -c you won't get the grand total (i.e. the total line at the end).
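For example, dropping -s lists each subdirectory on its own line (the numbers below are made up, just to show the shape of the output):
% du -ch /directory
1.2G /directory/sub1
3.1G /directory/sub2
4.3G /directory
4.3G total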
HTH

awk to the rescue!
awk '$1+0==$1{sum+=$1; count++} END{print sum, count}'
adds up and counts all the numbers ($1+0==$1 is true for a number but not for a string) and prints the sum and the count when done.
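For example, feeding it the numbers from the question (the printf here just stands in for whatever pipeline produces your list of sizes):
$ printf '1326\n40\n598\n258\n12\n' | awk '$1+0==$1{sum+=$1; count++} END{print sum, count}'
2234 5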

Related

Is there any faster way to grep billions of mismatch patterns in more than one file?

I wrote a script that calculates all possible mismatch patterns (depending on the case), like the two below (please look at the grep command), and writes an output .sh file with billions of lines like this one:
LC_ALL=C grep -ch "AAAAAAAC[A-Z][A-Z][A-Z][A-Z]CGA[A-Z][A-Z]G\|C[A-Z][A-Z]TCG[A-Z][A-Z][A-Z][A-Z]GTTTTTTT" regions_A regions_B
The next step is to execute all these billions of grep lines and write an output.
In order to run it as fast as I can, I restrict matching to plain ASCII (all my characters are ASCII) by setting LC_ALL=C. Moreover, I split the huge grep file into 16 parts and run them separately using 16 threads.
Does anybody know any faster method to grep my patterns?
Any help would be appreciated.
Thank you in advance!
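For reference, a minimal sketch of the "split into 16 parts and run them with 16 threads" step described above (all file names here are placeholders):
# all_greps.sh is the generated file with one grep command per line
split -n l/16 all_greps.sh part_   # GNU split: 16 pieces without breaking lines
for f in part_*; do
  bash "$f" > "$f.out" &           # run each piece in its own background shell
done
wait                               # wait for all 16 jobs to finish
cat part_*.out > results.txt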

Bash script to filter out files based on size

I have a lot of log files which all have unique file names; however, judging by the sizes, many have exactly the same content (bot-generated attacks).
I need to filter out duplicate file sizes, or include only unique file sizes.
95% are not unique, and I can see the file sizes, so I could manually choose sizes to filter out.
I have worked out
find . -size 48c | xargs ls -lSr -h
will give me only the logs of 48 bytes, and I could continue with this method to build up a long list of included files.
uniq does not support file size, as far as I can tell
find does have a "not" option; is this where I should be looking?
How can I efficiently filter out the known duplicates?
Or is there a different method to filter and display logs based on unique size only.
One solution is:
find . -type f -ls | awk '!x[$7]++ {print $11}'
$7 is the filesize column; $11 is the pathname.
Since you are using find I assume there are subdirectories, which you don't want to list.
The awk part prints the path of the first file with a given size (only).
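If you instead want only the files whose size occurs exactly once, a hedged variation on the same idea (GNU find assumed, for -printf):
find . -type f -printf '%s\t%p\n' |
awk -F'\t' '{count[$1]++; path[$1]=$2}
            END {for (s in count) if (count[s] == 1) print path[s]}'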
HTH
You nearly had it. Does going with this provide a solution:
find . -size 48c | xargs

grep - how to output progress bar or status

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial because grep outputs the search results to STDOUT, and my default workflow is to redirect the results to a file; I would like the progress bar/status to be output to STDOUT or STDERR.
Would this require modifying source code of grep?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the filesize) would make the progress report more exact but add an additional cost to process startup.
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, here is a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):
# Requires bash 4 and GNU grep
shopt -s globstar
files=(**)                 # every file and directory below the current directory
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
  echo "$i/$total" >&2     # progress: entries handed to grep so far / total
  grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
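For example, one way to get a single self-updating status line is to redraw it with a carriage return instead of echoing a new line each time (a sketch, to be dropped into the loop above):
printf '\r%d/%d entries' "$i" "$total" >&2   # redraw the same line every iteration
# ... grep for this batch ...
printf '\n' >&2                              # move past the status line after the loop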
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -0 tells parallel to read the NUL-separated names produced by -print0, -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
Try the parallel program
find * -name \*.[ch] | parallel -j5 --bar '(grep grep-string {})' > output-file
Though I found this to be slower than a simple
find * -name \*.[ch] | xargs grep grep-string > output-file
This command shows the progress (speed and offset), but not the total amount. That could be estimated manually, however.
dd if=/input/file bs=1c skip=<offset> | pv | grep -aob "<string>"
I'm pretty sure you would need to alter the grep source code. And those changes would be huge.
Currently grep does not know how many lines a file has until it has finished parsing the whole file. For your requirement it would need to parse the file twice, or at least determine the full line count some other way.
The first time it would determine the line count for the progress bar. The second time it would actually do the work and search for your pattern.
This would not only increase the runtime but violate one of the main UNIX philosophies.
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features". (source)
There might be other tools out there for your need, but afaik grep won't fit here.
I normally use something like this:
grep | tee "FILE.txt" | cat -n | sed 's/^/match: /;s/$/ /' | tr '\n' '\r' 1>&2
It is not perfect, as it only displays the matches, and if they are too long or differ too much in length the display gets garbled, but it should give you the general idea.
Or a simple dots:
grep | tee "FILE.txt" | sed 's/.*//' | tr '\n' '.' 1>&2

How to split a large file into many small files using bash? [duplicate]

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 6 years ago.
I have a file, say all, with 2000 lines, and I hope it can be split into 4 small files containing lines 1~500, 501~1000, 1001~1500 and 1501~2000 respectively.
Perhaps, I can do this using:
cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4
But this way involves calculating line numbers, which may cause errors when the number of lines is not a round number, or when we want to split the file into many small files (e.g.: a file all with 3241 lines that we want to split into 7 files, each with 463 lines).
Is there a better way to do this?
When you want to split a file, use split:
split -l 500 all all
will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:
split -l $(( $( wc -l < all ) / 4 + 1 )) all all
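For illustration, with the 2000-line file all from the question and an output prefix of small, the first command gives:
$ split -l 500 all small
$ wc -l small*
  500 smallaa
  500 smallab
  500 smallac
  500 smallad
 2000 total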
Look into the split command, it should do what you want (and more):
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names.
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files. See below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
SIZE is an integer and optional unit (example: 10M is 10*1024*1024). Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines
l/K/N output Kth of N to stdout without splitting lines
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
Like the others have already mentioned, you could use split. The complicated command substitution that the accepted answer mentions is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when using the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines.
$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
583 small1
528 small2
445 small3
444 small4
2000 total
Alternatively, you could use GNU parallel:
$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
500 small1
500 small2
500 small3
500 small4
2000 total
As you can see, this incantation is quite complex. GNU Parallel is most often used for parallelizing pipelines, and IMHO it is a tool worth looking into.

Need a way to gather total size of JAR files

I am new to more advanced bash commands. I need a way to count the size of external libraries in our codeline. There are a few main directories but I also have a spreadsheet with the actual locations of the libraries that need to be included.
I have fiddled with find and du, but it is unclear to me how to specify multiple locations. Can I find the size of the several hundred jars listed in the spreadsheet instead of approximating with the main directories?
edit: I can now find the size of specific files. I had to export the Excel spreadsheet with the locations to a CSV. In PSPad I "joined lines" and copied and pasted that directly into the list_of_files slot (find list_of_files | xargs du -hc). I could not get find to read a file containing the locations separated by spaces, tabs or newlines.
Now I can't tell if replacing list_of_files with list_of_directories will work. It looks like it counts things twice, e.g.:
1.0M /folder/dummy/image1.jpg
1.0M /folder/dummy/image2.jpg
2.0M /folder/dummy
3.0M /folder/image3.jpg
7.0M /folder
14.0M total
This output is made up, but if it's counting like this then that is not what I want. The reason I suspect this is that the total I'm getting seems really high.
Do you mean...
find list_of_directories | xargs du -hc
Then, if you want to pipe to du exactly the files that are listed in the spreadsheet, you need a way to filter them out. Is it a text file, or what format is it in?
find `cat file` | xargs du -hc
might do it if they are in a text file as a list separated by spaces. You will probably have some issues with spaces in the paths, though; you would have to quote the filenames.
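If the list ends up as one path per line, a hedged alternative (GNU du assumed; jar_list.txt is a placeholder name) that also copes with spaces in the paths:
tr '\n' '\0' < jar_list.txt | du -ch --files0-from=- | tail -n 1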
for fn in `find DIR1 DIR2 FILE1 -name '*.jar'`; do du "$fn"; done | awk '{TOTAL += $1} END {print TOTAL}'
You can specify your files and directories in place of DIR1, DIR2, FILE1, etc. You can list their individual sizes by removing the piped awk command.
