How to create argument variable in bash script - bash

I am trying to write a script such that I can identify number of characters of the n-th largest file in a sub-directory.
I was trying to assign n and the name of sub-directory into arguments like $1, $2.
Current directory: Greetings
Sub-directory: language_files, others
Sub-directory: English, German, French
Files: Goodmorning.csv, Goodafternoon.csv, Goodevening.csv ….
I would be at directory “Greetings”, while I indicating subdirectory (English, German, French), it would show the nth-largest file in the subdirectory indicated and calculate number of characters as well.
For instance, if I am trying to figure out number of characters of 2nd largest file in English, I did:
langs=$1
n=$2
for langs in language_files/;
Do count=$(find language_files/$1 name "*.csv" | wc -m | head -n -1 | sort -n -r | sed -n $2(p))
Done | echo "The file has $count bytes!"
The result I wanted was:
$ ./script1.sh English 2
The file has 1100 bytes!
The main problem of all the issue is the fact that I don't understand how variables and looping work in bash script.

no need for looping
find language_files/"$1" -name "*.csv" | xargs wc -m | sort -nr | sed -n "$2{p;q}"
for byte counting you should use -c, since -m is for char counting (it may be the same for you).
You don't use the loop variable in the script anyway.

Bash loops are interesting. You are encouraged to learn more about them when you have some time. However, this particular problem might not need a loop. Set lang (you can call it langs if you prefer) and n appropriately, and then try this:
count=$(stat -c'%s %n' language_files/$lang/* | sort -nr | head -n$n | tail -n1 | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
That should give you the $count you need. Then you can echo it however you like.
EXPLANATION
If you wish to learn how it works:
The stat command outputs various statistics about the named file (or files), in this case %s the file's size and %n the file's name.
The head and tail output respectively the first and last several lines of a file. Together, they select a specific line from the file
The sed command screens a certain part of the line. (You can use cut, instead, if you prefer.)
If you wish to be cleverer, then you can optimize as #karafka has done.

Related

How to read CSV file stored in variable

I want to read a CSV file using Shell,
But for some reason it doesn't work.
I use this to locate the latest added csv file in my csv folder
lastCSV=$(ls -t csv-output/ | head -1)
and this to count the lines.
wc -l $lastCSV
Output
wc: drupal_site_livinglab.csv: No such file or directory
If I echo the file it says: drupal_site_livinglab.csv
Your issue is that you're one directory up from the path you are trying to read. The quick fix would be wc -l "csv-output/$lastCSV".
Bear in mind that parsing ls -t though convenient, isn't completely robust, so you should consider something like this to protect you from awkward file names:
last_csv=$(find csv-output/ -mindepth 1 -maxdepth 1 -printf '%T#\t%p\0' |
sort -znr | head -zn1 | cut -zf2-)
wc -l "$last_csv"
GNU find lists all files along with their last modification time, separating the output using null bytes to avoid problems with awkward filenames.
if you remove -maxdepth 1, this will become a recursive search
GNU sort arranges the files from newest to oldest, with -z to accept null byte-delimited input.
GNU head -z returns the first record from the sorted list.
GNU cut -z at the end discards the timestamp, leaving you with only the filename.
You can also replace find with stat (again, this assumes that you have GNU coreutils):
last_csv=$(stat csv-output/* --printf '%Y\t%n\0' | sort -znr | head -zn1 | cut -zf2-)

Using 'find' to select unknown patterns in file names with bash

Let's say I have a directory with 4 files in it.
path/to/files/1_A
path/to/files/1_B
path/to/files/2_A
path/to/files/2_B
I want to create a loop, which on each iteration, does something with two files, a matching X_A and X_B. I need to know how to find these files, which sounds simple enough using pattern matching. The problem is, there are too many files, and I do not know the prefixes aka patterns (1_ and 2_ in the example). Is there some way to group files in a directory based on the first few characters in the filename? (Ultimately to store as a variable to be used in a loop)
You could get all the 3-character prefixes by printing out all the file names, trimming them to three characters, and then getting the unique strings.
find -printf '%f\n' | cut -c -3 | sort -u
Then if you wanted to loop over each prefix, you could write a loop like:
find -printf '%f\n' | cut -c -3 | sort -u | while IFS= read -r prefix; do
echo "Looking for $prefix*..."
find -name "$prefix*"
done

Using grep to find and estimate the total # of shell scripts in the current dir

New to UNIX, currently learning UNIX via secureshell in a class. We've been given a few basic assignments such as creating loops and finding files. Our last assignment asked us to
write code that will estimate the number of shell scripts in the current directory and then print out that total number as "Estimated number of shell script files in this directory:"
Unlike in our previous assignments we are now allowed to use conditional loops, we are encouraged to use grep and wc statements.
On a basic level I know I can enter
ls * .sh
to find all shell scripts in the current directory. Unfortunately, this doesn't estimate the total number or use grep. Hence my question, I imagine he wants us to go
grep -f .sh (or something)
but I'm not exactly sure if I am on the right path and would greatly appreciate any help.
Thank You
You can do it like:
echo "Estimated number of shell script files in this directory:" `ls *.sh | wc -l`
I'd do it this way:
find . -executable -execdir file {} + | egrep '\.sh: | Bourne| bash' | wc -l
Find all files in the current directory (.) which are executable.
For each file, run the file(1) command, which tries to guess what type of file it is (not perfect).
Grep for known patterns: filenames ending with .sh, or file types containing "Bourne" or "bash".
Count lines.
Huhu, there's a trap, .sh file are not always shell script as the extension is not mandatory.
What tells you this is a shell script will be the Shebang #!/bin/*sh ( I put a * as it could be bash, csh, tcsh, zsh, which are shells) at top of line, hence the hint to use grep, so the best answer would be:
grep '^#!/bin/.*sh' * | wc -l
This give output:
sensible-pager:#!/bin/sh
service:#!/bin/sh
shelltest:#!/bin/bash
smbtar:#!/bin/sh
grep works with regular expression by default, so the match #!/bin/.*sh will match files with a line starting (the ^) by #!/bin/ followed by 0 or unlimited characters .* followed by sh
You may test regex and get explanation of them on http://regex101.com
Piping the result to wc -l to get the number of files containing this.
To display the result, backticks or $() in an echo line is ok.
grep -l <string> *
will return a list of all files that contain in the current directory. Pipe that output into wc -l and you have your answer.
Easiest way:
ls | grep .sh > tmp
wc tmp
That will print the number of lines, bytes and charcters of 'tmp' file. But in 'tmp' there's a line for each *.sh file in your working directory. So the number of lines will give an estimated number of shell scripts you have.
wc tmp | awk '{print $1}' # Using awk to filter that output like...
wc -l tmp # Which it returns the number of lines follow by the name of file
But as many people say, the only certain way to know a file is a shell script is by taking a look at the first line an see if there is #!/bin/bash. If you wanna develop it that way, keep in mind:
cat possible_script.x | head -n1 # That will give you the first line.

Get total size of a list of files in UNIX

I want to run a find command that will find a certain list of files and then iterate through that list of files to run some operations. I also want to find the total size of all the files in that list.
I'd like to make the list of files FIRST, then do the other operations. Is there an easy way I can report just the total size of all the files in the list?
In essence I am trying to find a one-liner for the 'total_size' variable in the code snippet below:
#!/bin/bash
loc_to_look='/foo/bar/location'
file_list=$(find $loc_to_look -type f -name "*.dat" -size +100M)
total_size=???
echo 'total size of all files is: '$total_size
for file in $file_list; do
# do a bunch of operations
done
You should simply be able to pass $file_list to du:
du -ch $file_list | tail -1 | cut -f 1
du options:
-c display a total
-h human readable (i.e. 17M)
du will print an entry for each file, followed by the total (with -c), so we use tail -1 to trim to only the last line and cut -f 1 to trim that line to only the first column.
Methods explained here have hidden bug. When file list is long, then it exceeds limit of shell comand size. Better use this one using du:
find <some_directories> <filters> -print0 | du <options> --files0-from=- --total -s|tail -1
find produces null ended file list, du takes it from stdin and counts.
this is independent of shell command size limit.
Of course, you can add to du some switches to get logical file size, because by default du told you how physical much space files will take.
But I think it is not question for programmers, but for unix admins :) then for stackoverflow this is out of topic.
This code adds up all the bytes from the trusty ls for all files (it excludes all directories... apparently they're 8kb per folder/directory)
cd /; find -type f -exec ls -s \; | awk '{sum+=$1;} END {print sum/1000;}'
Note: Execute as root. Result in megabytes.
The problem with du is that it adds up the size of the directory nodes as well. It is an issue when you want to sum up only the file sizes. (Btw., I feel strange that du has no option for ignoring the directories.)
In order to add the size of files under the current directory (recursively), I use the following command:
ls -laUR | grep -e "^\-" | tr -s " " | cut -d " " -f5 | awk '{sum+=$1} END {print sum}'
How it works: it lists all the files recursively ("R"), including the hidden files ("a") showing their file size ("l") and without ordering them ("U"). (This can be a thing when you have many files in the directories.) Then, we keep only the lines that start with "-" (these are the regular files, so we ignore directories and other stuffs). Then we merge the subsequent spaces into one so that the lines of the tabular aligned output of ls becomes a single-space-separated list of fields in each line. Then we cut the 5th field of each line, which stores the file size. The awk script sums these values up into the sum variable and prints the results.
ls -l | tr -s ' ' | cut -d ' ' -f <field number> is something I use a lot.
The 5th field is the size. Put that command in a for loop and add the size to an accumulator and you'll get the total size of all the files in a directory. Easier than learning AWK. Plus in the command substitution part, you can grep to limit what you're looking for (^- for files, and so on).
total=0
for size in $(ls -l | tr -s ' ' | cut -d ' ' -f 5) ; do
total=$(( ${total} + ${size} ))
done
echo ${total}
The method provided by #Znik helps with the bug encountered when the file list is too long.
However, on Solaris (which is a Unix), du does not have the -c or --total option, so it seems there is a need for a counter to accumulate file sizes.
In addition, if your file names contain special characters, this will not go too well through the pipe (Properly escaping output from pipe in xargs
).
Based on the initial question, the following works on Solaris (with a small amendment to the way the variable is created):
file_list=($(find $loc_to_look -type f -name "*.dat" -size +100M))
printf '%s\0' "${file_list[#]}" | xargs -0 du -k | awk '{total=total+$1} END {print total}'
The output is in KiB.

bash: shortest way to get n-th column of output

Let's say that during your workday you repeatedly encounter the following form of columnized output from some command in bash (in my case from executing svn st in my Rails working directory):
? changes.patch
M app/models/superman.rb
A app/models/superwoman.rb
in order to work with the output of your command - in this case the filenames - some sort of parsing is required so that the second column can be used as input for the next command.
What I've been doing is to use awk to get at the second column, e.g. when I want to remove all files (not that that's a typical usecase :), I would do:
svn st | awk '{print $2}' | xargs rm
Since I type this a lot, a natural question is: is there a shorter (thus cooler) way of accomplishing this in bash?
NOTE:
What I am asking is essentially a shell command question even though my concrete example is on my svn workflow. If you feel that workflow is silly and suggest an alternative approach, I probably won't vote you down, but others might, since the question here is really how to get the n-th column command output in bash, in the shortest manner possible. Thanks :)
You can use cut to access the second field:
cut -f2
Edit:
Sorry, didn't realise that SVN doesn't use tabs in its output, so that's a bit useless. You can tailor cut to the output but it's a bit fragile - something like cut -c 10- would work, but the exact value will depend on your setup.
Another option is something like: sed 's/.\s\+//'
To accomplish the same thing as:
svn st | awk '{print $2}' | xargs rm
using only bash you can use:
svn st | while read a b; do rm "$b"; done
Granted, it's not shorter, but it's a bit more efficient and it handles whitespace in your filenames correctly.
I found myself in the same situation and ended up adding these aliases to my .profile file:
alias c1="awk '{print \$1}'"
alias c2="awk '{print \$2}'"
alias c3="awk '{print \$3}'"
alias c4="awk '{print \$4}'"
alias c5="awk '{print \$5}'"
alias c6="awk '{print \$6}'"
alias c7="awk '{print \$7}'"
alias c8="awk '{print \$8}'"
alias c9="awk '{print \$9}'"
Which allows me to write things like this:
svn st | c2 | xargs rm
Try the zsh. It supports suffix alias, so you can define X in your .zshrc to be
alias -g X="| cut -d' ' -f2"
then you can do:
cat file X
You can take it one step further and define it for the nth column:
alias -g X2="| cut -d' ' -f2"
alias -g X1="| cut -d' ' -f1"
alias -g X3="| cut -d' ' -f3"
which will output the nth column of file "file". You can do this for grep output or less output, too. This is very handy and a killer feature of the zsh.
You can go one step further and define D to be:
alias -g D="|xargs rm"
Now you can type:
cat file X1 D
to delete all files mentioned in the first column of file "file".
If you know the bash, the zsh is not much of a change except for some new features.
HTH Chris
Because you seem to be unfamiliar with scripts, here is an example.
#!/bin/sh
# usage: svn st | x 2 | xargs rm
col=$1
shift
awk -v col="$col" '{print $col}' "${#--}"
If you save this in ~/bin/x and make sure ~/bin is in your PATH (now that is something you can and should put in your .bashrc) you have the shortest possible command for generally extracting column n; x n.
The script should do proper error checking and bail if invoked with a non-numeric argument or the incorrect number of arguments, etc; but expanding on this bare-bones essential version will be in unit 102.
Maybe you will want to extend the script to allow a different column delimiter. Awk by default parses input into fields on whitespace; to use a different delimiter, use -F ':' where : is the new delimiter. Implementing this as an option to the script makes it slightly longer, so I'm leaving that as an exercise for the reader.
Usage
Given a file file:
1 2 3
4 5 6
You can either pass it via stdin (using a useless cat merely as a placeholder for something more useful);
$ cat file | sh script.sh 2
2
5
Or provide it as an argument to the script:
$ sh script.sh 2 file
2
5
Here, sh script.sh is assuming that the script is saved as script.sh in the current directory; if you save it with a more useful name somewhere in your PATH and mark it executable, as in the instructions above, obviously use the useful name instead (and no sh).
It looks like you already have a solution. To make things easier, why not just put your command in a bash script (with a short name) and just run that instead of typing out that 'long' command every time?
If you are ok with manually selecting the column, you could be very fast using pick:
svn st | pick | xargs rm
Just go to any cell of the 2nd column, press c and then hit enter
Note, that file path does not have to be in second column of svn st output. For example if you modify file, and modify it's property, it will be 3rd column.
See possible output examples in:
svn help st
Example output:
M wc/bar.c
A + wc/qax.c
I suggest to cut first 8 characters by:
svn st | cut -c8- | while read FILE; do echo whatever with "$FILE"; done
If you want to be 100% sure, and deal with fancy filenames with white space at the end for example, you need to parse xml output:
svn st --xml | grep -o 'path=".*"' | sed 's/^path="//; s/"$//'
Of course you may want to use some real XML parser instead of grep/sed.

Resources