awk issue, summing lines in various files - bash

I have a list of files starting with the word "output", and I want to sum up the total number of rows in all the files.
Here's my strategy:
for f in `find outpu*`;do wc -l $f | awk '{x+=$1}END{print $1}' ; done
Before piping over, if there were a way I could do something like >> to a temporary variable and then run the awk command after, I could accomplish this goal.
Any tips?

use this to see details and sum :
wc -l output*
and this to see only the sum:
wc -l output* | tail -n1 | cut -d' ' -f1

Here is some stuff for fun, check it out:
grep -c . out* | cut -d':' -f2- | paste -sd+ | bc
all lines, including empty ones:
grep -c '' out* | cut -d':' -f2- | paste -sd+ | bc
you can play in grep with conditions on lines in files

Watch out, this find command will only find stuff in your current directory if there is one file matching outpu*.
One way of doing it:
awk 'END{print NR}' $(find 'outpu*')
Provided that there is not an insane amount of matching filenames that overflows the maximum command length limit of your shell.

Related

Shell: Counting lines per column while ignoring empty ones

I am trying to simply count the lines in the .CSV per column, while at the same time ignoring empty lines.
I use below and it works for the 1st column:
cat /path/test.csv | cut -d, -f1 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 8
And below for the 2nd column:
cat /path/test.csv | cut -d, -f2 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 6
But when I try to count 3rd column, it simply Outputs the Total number of lines in the whole .CSV.
cat /path/test.csv | cut -d, -f3 | grep . | wc -l` >> ~/Desktop/Output.csv
#Outputs: 33
#Should be: 19?
I've also tried to use awk instead of cut, but get the same issue.
I have tried creating new file thinking maybe it had some spaces in the lines, still the same.
Can someone clarify what is the difference? Betwen reading 1-2 column and the rest?
20355570_01.tif,,
20355570_02.tif,,
21377804_01.tif,,
21377804_02.tif,,
21404518_01.tif,,
21404518_02.tif,,
21404521_01.tif,,
21404521_02.tif,,
,22043764_01.tif,
,22043764_02.tif,
,22095060_01.tif,
,22095060_02.tif,
,23507574_01.tif,
,23507574_02.tif,
,,23507574_03.tif
,,23507804_01.tif
,,23507804_02.tif
,,23507804_03.tif
,,23509247_01.tif
,,23509247_02.tif
,,23509247_03.tif
,,23527663_01.tif
,,23527663_02.tif
,,23527663_03.tif
,,23527908_01.tif
,,23527908_02.tif
,,23527908_03.tif
,,23535506_01.tif
,,23535506_02.tif
,,23535562_01.tif
,,23535562_02.tif
,,23535636_01.tif
,,23535636_02.tif
That happens when input file has DOS line endings (\r\n). Fix your file using dos2unix and your command will work for 3rd column too.
dos2unix /path/test.csv
Or, you can remove the \r at the end while counting non-empty columns using awk:
awk -F, '{sub(/\r/,"")} $3!=""{n++} END{print n}' /path/test.csv
The problem is in the grep command: the way you wrote it will return 33 lines when you count the 3rd column.
It's better instead to use the following command to count number of lines in .CSV for each column (example below is for the 3rd column):
cat /path/test.csv | cut -d , -f3 | grep -cve '^\s*$'
This will return the exact number of lines for each column and avoid of piping into wc.
See previous post here:
count (non-blank) lines-of-code in bash
edit: I think oguz ismail found the actual reason in their answer. If they are right and your file has windows line endings you can use one of the following commands without having to convert the file.
cut -d, -f3 yourFile.csv cut | tr -d \\r | grep -c .
cut -d, -f3 yourFile.csv | grep -c $'[^\r]' # bash only
old answer: Since I cannot reproduce your problem with the provided input I take a wild guess:
The "empty" fields in the last column contain spaces. A field containing a space is not empty altough it looks like it is empty as you cannot see spaces.
To count only fields that contain something other than a space adapt your regex from . (any symbol) to [^ ] (any symbol other than space).
cut -d, -f3 yourFile.csv | grep -c '[^ ]'

Multiple output in single line Shell commend with pipe only

For example:
ls -l -d */ | wc -l | awk '{print $1}' | tee /dev/tty | ls -l
This shell command print the result of wc and ls -l with single line, but tee is used.
Is it possible to using one Shell commend line to achieve multiple output without using “&&” “||” “>” “>>” “<” “;” “&”,tee and temp file?
When you want the output of date and ls -rtl | head -1 on one line, you can use
echo "$(date): $(ls -rtl | head -1)"
Yes, you can achieve writing to multiple files with awk which is not in the list of things you appear not to like:
echo hi | awk '{print > "a.txt"; print > "b.txt"}'
Then check a.txt and b.txt.

How to write a shell script that reads all the file names in the directory and finds a particular string in file names?

I need a shell script to find a string in file like the following one:
FileName_1.00_r0102.tar.gz
And then pick the highest value from multiple occurrences.
I am interested in "1.00" part of the file name.
I am able to get this part separately in the UNIX shell using the commands:
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
1
2
3
1
find /directory/*.tar.gz | cut -f2 -d'_' | cut -f2 -d'.'
00
02
05
00
The problem is there are multiple files with this string:
FileName_1.01_r0102.tar.gz
FileName_2.02_r0102.tar.gz
FileName_3.05_r0102.tar.gz
FileName_1.00_r0102.tar.gz
I need to pick the file with FileName_("highest value")_r0102.tar.gz
But since I am new to shell scripting I am not able to figure out how to handle these multiple instances in script.
The script which I came up with just for the integer part is as follows:
#!/bin/bash
for file in /directory/*
file_version = find /directory/*.tar.gz | cut -f2 -d'_' | cut -f1 -d'.'
done
OUTPUT: file_version:command not found
Kindly help.
Thanks!
If you just want the latest version number:
cd /path/to/files
printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -n -t. -k1,2 |tail -n1
If you want the file name:
cd /path/to/files
lastest=$(printf '%s\n' *r0102.tar.gz | cut -d_ -f2 | sort -n -t. -k1,2 |tail -n1)
printf '%s\n' *${lastest}_r0102.tar.gz
You could try the following which finds all the matching files, sorts the filenames, takes the last in that list, and then extracts the version from the filename.
#!/bin/bash
file_version=$(find ./directory -name "FileName*r0102.tar.gz" | sort | tail -n1 | sed -r 's/.*_(.+)_.*/\1/g')
echo ${file_version}
I have tried and thats worth working below script line, that You need.
echo `ls ./*.tar.gz | sort | sed -n /[0-9]\.[0-9][0-9]/p|tail -n 1`;
It's unnecessary to parse the filename's version number prior to finding the actual filename. Use GNU ls's -v (natural sort of (version) numbers within text) option:
ls -v FileName_[0-9.]*_r0102.tar.gz | tail -1

How to find most frequent string in file

I have a question about bash script, lets say there is file witch contains lines, each line will have path to a file and a date, the problem is how to find most frequent path.
Thanks in advance.
Here's a suggestion
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
# \_____________________/ \__/ \_____/ \______/ \_______/
# select the file column sort print sort on print top
# files counts count result
Example use:
$ cat file.txt
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
3 /home/admin/fileA
You can strip out 3 from the final result by another cut.
Reverse the lines, cut the begginning (the date), reverse them again, then sort and count unique lines:
cat file.txt | rev | cut -b 22- | rev | sort | uniq -c
If you're absolutely sure you won't have whitespace in your paths, you can avoid rev altogether:
cat file.txt | cut -d " " -f 1 | sort | uniq -c
If the output is too long to inspect visually, aioobe's suggestion of following this with sort -rn | head -n1 will serve you well
It's worth noticing, as aioobe mentioned, that many unix commands optionally take a file argument. By using it, you can avoid the extra cat command in the beginning, by supplying its argument to the next command:
cat file.txt | rev | ... vs rev file.txt | ...
While I personally find the first option both easier to remember and understand, the second is preferred by many (most?) people, as it saves up system resources (specifically, the memory and references used by an additional process) and can have better performance in some specific use cases. Wikipedia's cat article discusses this in detail.

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in a file and output to an output file with part of the corresponding input file as part of the output line. So for example we were looking at file LOG_Yellow and it had 28 lines, the the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for you name extraction, I still leave the full name. However, unlike other answers I'd prefer to use head rather then grep, which not only should be slightly faster, but also avoids the case of filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l *| grep -v " total"
send
28 Yellow
You can reverse it if you want (awk, if you don't have space in file names)
wc -l *| egrep -v " total$" | sed s/[prefix]//
| awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file
'wc -l' for finding out the line count
And dont forget to redirect
('>' or '>>') your results to your
output file

Resources