Print out a statement before each output of my script - bash

I have a script that checks each file in a folder for the word "Author" and then prints out the number of occurrences, one line per file, in order from highest to lowest. In total I have 825 files. An example output would be
53
22
17
I want to print something before each number on every line, namely hotel_$i, so the above example would become:
hotel_1 53
hotel_2 22
hotel_3 17
I have tried doing this using a for loop in my shell script:
for i in {1..825}
do
echo "hotel_$i"
find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr
done
but this basically prints out hotel_1, then does the search and sort for all 825 files, then prints hotel_2 and repeats the search and sort, and so on. How do I make it print a label before every line of output?

You can use the paste command, which combines lines from different files:
paste <(printf 'hotel_%d\n' {1..825}) \
<(find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr)
(This is split across two lines only for readability; it can be a one-liner without the \.)
This combines paste with process substitution, making the output of a command look like a file (a named pipe) to paste.
The first command prints hotel_1, hotel_2, etc., each on its own line, and the second command is your find pipeline.
For short input files, the output looks like this:
hotel_1 7
hotel_2 6
hotel_3 4
hotel_4 3
hotel_5 3
hotel_6 2
hotel_7 1
hotel_8 0
hotel_9 0
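As a toy illustration of the same idea (the numbers here are made up to mirror the question), paste reads both process substitutions as if they were files and joins them line by line, separated by a tab:
paste <(printf 'hotel_%d\n' {1..3}) <(printf '%s\n' 53 22 17)
hotel_1 53
hotel_2 22
hotel_3 17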

Related

Count the lines from output using pipeline

I am trying to count how many files have words with the pattern [Gg]reen.
#!/bin/bash
for File in `ls ./`
do
cat ./$File | egrep '[Gg]reen' | sed -n '$='
done
When I do this I get this output:
1
1
3
1
1
So I want to count the lines to get 5 in total. I tried using wc -l after the sed but it didn't work; it counted the lines in all the files. I tried to use >file.txt but it didn't write anything to it. And when I use >> instead it writes, but every time I execute the script it appends the lines again.
Since, according to your question, you want to know how many files contain a pattern, you are interested in the number of files, not the number of pattern occurrences.
For instance,
grep -l '[Gg]reen' * | wc -l
would print the number of files that contain green or Green somewhere as a substring.
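If, on the other hand, you really do want the total number of matching lines across all files (5 in your example), a minimal sketch is to let grep print the matching lines without filename prefixes and count them once:
# total matching lines across all files in the current directory
grep -h '[Gg]reen' ./* | wc -l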

How to create argument variable in bash script

I am trying to write a script such that I can identify the number of characters of the n-th largest file in a sub-directory.
I was trying to take n and the name of the sub-directory as arguments, like $1 and $2.
Current directory: Greetings
Sub-directory: language_files, others
Sub-directory: English, German, French
Files: Goodmorning.csv, Goodafternoon.csv, Goodevening.csv ….
I would be in the directory "Greetings"; when I indicate a subdirectory (English, German, French), it should show the n-th largest file in that subdirectory and calculate its number of characters as well.
For instance, to figure out the number of characters of the 2nd largest file in English, I did:
langs=$1
n=$2
for langs in language_files/;
Do count=$(find language_files/$1 name "*.csv" | wc -m | head -n -1 | sort -n -r | sed -n $2(p))
Done | echo "The file has $count bytes!"
The result I wanted was:
$ ./script1.sh English 2
The file has 1100 bytes!
The main problem behind all of this is that I don't understand how variables and looping work in a bash script.
No need for looping:
find language_files/"$1" -name "*.csv" | xargs wc -m | sort -nr | sed -n "$2{p;q}"
For byte counting you should use -c, since -m is for character counting (they may be the same in your case).
You don't use the loop variable in the script anyway.
Bash loops are interesting. You are encouraged to learn more about them when you have some time. However, this particular problem might not need a loop. Set lang (you can call it langs if you prefer) and n appropriately, and then try this:
count=$(stat -c'%s %n' language_files/$lang/* | sort -nr | head -n$n | tail -n1 | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
That should give you the $count you need. Then you can echo it however you like.
EXPLANATION
If you wish to learn how it works:
The stat command outputs various statistics about the named file (or files), in this case %s, the file's size, and %n, the file's name.
The head and tail commands output, respectively, the first and last several lines of their input. Together, they select a specific line.
The sed command extracts just the size from that line. (You can use cut instead, if you prefer.)
If you wish to be cleverer, then you can optimize as @karafka has done.
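Tying the pieces together, a minimal sketch of what script1.sh could look like (this assumes GNU stat, that $1 is the language sub-directory and $2 is n, as in your example invocation, and that only the .csv files matter):
#!/bin/bash
# usage: ./script1.sh English 2
lang=$1   # sub-directory under language_files, e.g. English
n=$2      # rank to pick: 1 = largest file, 2 = second largest, ...

# print "size name" for every csv, sort by size descending,
# keep the n-th line, then keep only the size column
count=$(stat -c '%s %n' language_files/"$lang"/*.csv | sort -nr | head -n "$n" | tail -n 1 | cut -d ' ' -f 1)

echo "The file has $count bytes!"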

Delete lines X to Y using Mac Unix Sed

Command line on a Mac. Have some text files. Want to remove certain lines from a group of files, then cat the remaining text of the file to a new merged file. Currently have the following attempt:
for file in *.txt;
do echo $file >> tempfile.html;
echo ''>>tempfile.html;
cat $file>>tempfile.html;
find . -type f -name 'tempfile.html' -exec sed -i '' '3,10d' {} +;
find . -type f -name 'tempfile.html' -exec sed -i '' '/<ACROSS>/,$d' {} +;
# ----------------
# some other stuff
# ----------------
done;
I am extracting a section of text from a bunch of files and concatenating it all together, but I still need to know from which file each selection originated. First I concatenate the name of the file, then (supposedly) the selected text from that file, then repeat the process.
Plus, I need to leave the original text files in place for other purposes.
So the concatenated file would be:
filename1.txt
text-selection
more_text
filename2.txt
even-more-text
text-text-test-test
The first sed is supposed to delete from line 3 to line 10. The second is supposed to delete from the line containing <ACROSS> to the end of the file.
However, what happens is that the first deletes everything in the tempfile, and the second one does nothing. (Each was tested separately.)
What am I doing wrong?
I must be missing something. Even what appears to be a very simple example does not work. My hope was that the following would delete lines 3-10 but save the rest of the file to test.txt.
sed '3,10d' nxd2019-01-06.txt > test.txt
Your invocation of find will attempt to run sed with as many files as possible per call. But note: addresses in sed do not address lines in each input file, they address the whole input of sed (which can consist of many input files).
Try this:
> a.txt cat <<EOF
1
2
EOF
> b.txt cat <<EOF
3
4
EOF
Now try this:
sed 1d a.txt b.txt
2
3
4
As you can see, sed removed the first line from a.txt, not from b.txt
The problem in your case is the second invocation of find. It will remove everything from the first occurrence of <ACROSS> until the last line in the last file found by find. This effectively removes the content from all but the first tempfile.html.
Assuming the remaining logic in your script works, you should just change the find invocations to:
find . -type f -name 'tempfile.html' -exec sed -i '' '3,10d' {} \;
find . -type f -name 'tempfile.html' -exec sed -i '' '/<ACROSS>/,$d' {} \;
This would call sed once per input file.
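Alternatively, since the loop already knows the name of the temp file it just wrote, you could skip find entirely and run sed on it directly inside the loop (a sketch, assuming the only tempfile.html is the one in the current directory):
# delete lines 3 to 10, then everything from the <ACROSS> line onwards
sed -i '' '3,10d' tempfile.html
sed -i '' '/<ACROSS>/,$d' tempfile.html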

How to make my shell script check every folder in a directory for a word and then rank the outputs?

I have a folder reviews_folder that contains lots of files, such as hotel_217616.dat. I have written a script countreviews.sh to check the number of times the word "Author" appears in each file and then print the number out for each respective file. Here is my script:
grep -r "<Author>" "#1"
I cannot hard-code reviews_folder in the shell script; it must be taken as a command-line argument, hence #1. The number of times my word appears in each file must then be ranked from highest to lowest, for example:
-- run script --
49
23
17
However, when I run my script it says "#1: No such file or directory"; why isn't it replacing #1 with reviews_folder when I type:
./countreviews.sh reviews_folder
My countreviews.sh is sitting in the same directory as my reviews_folder, which contains the files I will be checking, if that matters.
First off, the positional parameter is $1 and not #1.
Secondly, your script doesn't really "count the number of times the word Author appears"; it looks literally for <Author>, including the angle brackets.
I assume you wanted word boundaries, as in \<Author\>.
grep -r just lists all matching lines, each prefixed with its filename. You want only the counts, sorted. To do this, you can use
grep -rwch 'Author'
-w searches for word matches
-c returns a match count per file
-h suppresses writing the file name
And to sort the output, you pipe it to sort:
grep -rwch 'Author' | sort -nr
-n is for "numerical sort", and -r for "reverse", so the largest number is first.
Notice how this still only counts how many lines matched "Author"; if there is a line with five matches, it is counted only as one by grep -c.
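To see the difference on a made-up two-line file (sample.txt is just an illustrative name):
printf 'Author Author\nno match here\n' > sample.txt
grep -wc 'Author' sample.txt           # prints 1: one matching line
grep -wo 'Author' sample.txt | wc -l   # prints 2: two occurrences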
To properly count every single occurrence, you could do this:
find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr
find . -type f recursively finds all files.
-exec executes a command for each file found. Because we use a pipe in that command, we have to spawn a subshell with bash -c.
grep -wo "Author" {} | wc -l finds every match of Author and prints it on a separate line; wc -l then counts the lines.
After this happened for all files, sort -nr again sorts the results.
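Tying this back to your argument question, a minimal countreviews.sh sketch could look like this (it passes the file name to the inner shell as an argument instead of embedding {} in the command string, which is a bit safer):
#!/bin/bash
# usage: ./countreviews.sh reviews_folder
# counts occurrences of "Author" in every file under the folder given as $1
find "$1" -type f -exec bash -c 'grep -wo "Author" "$1" | wc -l' _ {} \; | sort -nr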
ITYM $1, not #1

Get total size of a list of files in UNIX

I want to run a find command that will find a certain list of files and then iterate through that list of files to run some operations. I also want to find the total size of all the files in that list.
I'd like to make the list of files FIRST, then do the other operations. Is there an easy way I can report just the total size of all the files in the list?
In essence I am trying to find a one-liner for the 'total_size' variable in the code snippet below:
#!/bin/bash
loc_to_look='/foo/bar/location'
file_list=$(find $loc_to_look -type f -name "*.dat" -size +100M)
total_size=???
echo 'total size of all files is: '$total_size
for file in $file_list; do
# do a bunch of operations
done
You should simply be able to pass $file_list to du:
du -ch $file_list | tail -1 | cut -f 1
du options:
-c display a total
-h human readable (e.g. 17M)
du will print an entry for each file, followed by the total (with -c), so we use tail -1 to trim to only the last line and cut -f 1 to trim that line to only the first column.
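For example (the sizes here are made up), the intermediate du -ch output might look like this, and tail -1 | cut -f 1 keeps only the 17M from the last line:
2.0M    /foo/bar/location/a.dat
15M     /foo/bar/location/b.dat
17M     total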
The methods explained here have a hidden bug: when the file list is long, it exceeds the limit on shell command-line size. Better to use this one, based on du:
find <some_directories> <filters> -print0 | du <options> --files0-from=- --total -s|tail -1
find produces a null-terminated file list, and du reads it from stdin and counts the sizes.
This is independent of the shell's command-line size limit.
Of course, you can add some switches to du to get the logical file size, because by default du tells you how much physical space the files take.
But I think that is a question for unix admins rather than programmers :) so for stackoverflow it is off topic.
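For a concrete invocation, reusing the find expression from the question above (GNU du assumed; add --apparent-size, or -b, if you want logical file sizes rather than disk usage):
# GNU du; the directory and filters are the ones from the question's snippet
find /foo/bar/location -type f -name "*.dat" -size +100M -print0 | du --files0-from=- --total -s | tail -1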
This code adds up the sizes reported by the trusty ls -s for all files (it excludes all directories... apparently they're 8 kB per folder/directory):
cd /; find -type f -exec ls -s \; | awk '{sum+=$1;} END {print sum/1000;}'
Note: Execute as root. Result in megabytes.
The problem with du is that it adds up the size of the directory nodes as well. This is an issue when you want to sum up only the file sizes. (By the way, I find it strange that du has no option for ignoring directories.)
In order to add the size of files under the current directory (recursively), I use the following command:
ls -laUR | grep -e "^\-" | tr -s " " | cut -d " " -f5 | awk '{sum+=$1} END {print sum}'
How it works: it lists all the files recursively ("R"), including hidden files ("a"), showing their file size ("l") and without sorting them ("U"). (This can matter when you have many files in the directories.) Then we keep only the lines that start with "-" (these are the regular files, so we ignore directories and other entries). Then we squeeze consecutive spaces into one so that the lines of the column-aligned output of ls become single-space-separated lists of fields. Then we cut the 5th field of each line, which holds the file size. The awk script sums these values into the sum variable and prints the result.
ls -l | tr -s ' ' | cut -d ' ' -f <field number> is something I use a lot.
The 5th field is the size. Put that command in a for loop and add the size to an accumulator and you'll get the total size of all the files in a directory. Easier than learning AWK. Plus in the command substitution part, you can grep to limit what you're looking for (^- for files, and so on).
total=0
for size in $(ls -l | tr -s ' ' | cut -d ' ' -f 5) ; do
total=$(( ${total} + ${size} ))
done
echo ${total}
The method provided by @Znik helps with the bug encountered when the file list is too long.
However, on Solaris (which is a Unix), du does not have the -c or --total option, so it seems there is a need for a counter to accumulate file sizes.
In addition, if your file names contain special characters, this will not go well through the pipe (see: Properly escaping output from pipe in xargs).
Based on the initial question, the following works on Solaris (with a small amendment to the way the variable is created):
file_list=($(find $loc_to_look -type f -name "*.dat" -size +100M))
printf '%s\0' "${file_list[@]}" | xargs -0 du -k | awk '{total=total+$1} END {print total}'
The output is in KiB.
