I have a bunch of files of the form myfile[somenumber] that are in nested directories.
I want to generate a line count on each of the files, and output that count to a file.
These files are binary and so they have to be piped through an additional script open_file before they can be counted by "wc". I do:
ls ~/mydir/*/*/other_dir/myfile* | while read x; do open_file $x | wc -l; done > stats
This works, but the problem is that it outputs the line counts to the file stats without the original filename. For example, it outputs:
100
150
instead of:
/mydir/...pathhere.../myfile1: 100
/mydir/...pathhere.../myfile2: 150
Second question:
What if I wanted to divide the number of wc -l by a constant, e.g. dividing it by 4, before outputting it to the file?
I know that the number of lines is a multiple of 4, so the result should be an integer. I'm not sure how to do that from the above script.
How can I make it put the original filename and the wc -l result in the output file?
Thank you.
You can output the file name before counting the lines:
echo -n "$x: " ; open_file $x | wc -l
The -n parameter to echo omits the trailing newline in the output.
To divide integers, you can use expr, e.g., expr $(open_file $x | wc -l) / 4.
So, the complete while loop will look as follows:
while read x; do echo -n "$x: " ; expr $(open_file $x | wc -l) / 4 ; done
Try this:
while read x; do echo -n "$x: " ; s=$(open_file $x | wc -l); echo $(($s / 4)); done
You've thrown away the filename by the time you get to wc(1) -- all it ever sees is a pipe(7) -- but you can echo the filename yourself before opening the file. If open_file fails, this will leave you with an ugly output file, but it might be a suitable tradeoff.
The $((...)) uses bash(1) arithmetic expansion. It might not work on your shell.
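Putting the pieces together, a minimal sketch of the whole pipeline might look like this (untested; it assumes open_file is on your PATH and that none of the filenames contain whitespace):
ls ~/mydir/*/*/other_dir/myfile* | while read -r x; do
printf '%s: %s\n' "$x" "$(( $(open_file "$x" | wc -l) / 4 ))"
done > stats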
I'm new to bash, so I'm having trouble doing something very basic.
Through playing with various scripts I found out that the following script prints the lines that contain the "word"
for file in *; do
cat $file | grep "word"
done
Doing the following:
for file in *; do
cat $file | grep "word" | wc -l
done
printed, on each iteration, how many times "word" appeared in each file.
How can I implement a counter for all those appearances, and at the end just echo the counter?
I used a counter that way, but it came out as 0:
let x+=cat $filename | grep "word"
You can pipe the entire loop to wc -l.
for file in *; do
cat $file | grep "word"
done | wc -l
This is a useless use of cat. How about:
for file in *; do
grep "word" "$file"
done | wc -l
Actually, the entire loop is unnecessary if you pass all the file names to grep at once.
grep "word" * | wc -l
Note that if word shows up more than once on the same line, these solutions will count that line only once. If you want to count same-line occurrences separately, you can use -o to print each match on a separate line:
grep -o "word" * | wc -l
The oneliner in John's answer is the way to go. Just to satisfy your curiosity:
sum=0
for f in *; do
x="$(grep 'word' "$f" | wc -l)"
echo "x: $x"
(( sum += x ))
done
echo "sum: $sum"
One caveat: if the grep-and-wc line ever yields something other than a number, you are out of luck. That is why you should stick to the other solution, or do a pure-bash implementation with things like read, case patterns such as *word*), or if [[ $line =~ $re_containing_word ]]; then ...
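For the curious, a minimal sketch of such a pure-bash counter (like the grep | wc -l pipeline above, it counts matching lines, not same-line occurrences):
count=0
for file in *; do
while IFS= read -r line; do
case $line in
*word*) count=$((count + 1)) ;;
esac
done < "$file"
done
echo "$count"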
Hi, I have a script that counts the number of records in a file and finds the expected delimiters per record by dividing rs_count by the total record count. It works fine, but it is a little slow on large files. I was wondering if there is a way to improve performance. The RS is a special character, octal \246. I am using a bash shell script.
Some additional info:
A line is a record.
The file will always have the same number of delimiters.
The purpose of the script is to check whether the file has the expected number of fields. After calculating it, the script just echoes it out.
for file in $SOURCE; do
echo "executing File -"$file
if (( $total_record_count != 0 ));then
filename=$(basename "$file")
total_record_count=$(wc -l < $file)
rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
Delimiter_per_record=$((rs_count/total_record_count))
fi
done
Counting the delimiters (not total records) in a file
On a file with 50,000 lines, I see around a 10-fold speedup from collapsing the sed, grep, and wc pipeline into a single awk process:
awk -v RS='Delimiter' 'END{print NR -1}' input_file
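As a quick sanity check of the idea, counting the commas in a made-up two-line input:
printf 'a,b,c\nd,e,f\n' | awk -v RS=',' 'END{print NR - 1}'   # prints 4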
Dealing with wc when there's no trailing line breaks
If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:
grep -co "^" input_file
(Thankfully, even though ^ is a regex, the performance of this is on par with wc)
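For example, with a file whose last line has no trailing newline, wc -l undercounts but grep does not:
printf 'a\nb\nc' > /tmp/nolf
wc -l < /tmp/nolf   # prints 2
grep -co '^' /tmp/nolf   # prints 3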
Incorporating these two modifications into a trivial test based on your supplied code:
#!/usr/bin/env bash
SOURCE="$1"
RS=$'\246'
for file in $SOURCE; do
echo "executing File -"$file
if [[ $total_record_count != 0 ]];then
filename=$(basename "$file")
total_record_count=$(grep -oc "^" $file)
rs_count="$(awk -v RS=$'\246' 'END{print NR -1}' $file)"
Delimiter_per_record=$((rs_count/total_record_count))
fi
done
echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
Running this on a file with 50,000 lines on my MacBook:
time ./recordtest.sh /tmp/randshort
executing File -/tmp/randshort
$rs_count: 186885
$Delimiter_per_record: 3
$total_record_count: 50000
real 0m0.064s
user 0m0.038s
sys 0m0.012s
Unit test one-liner
(creates /tmp/recordtest, chmod +x's it, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script file on the testfile)
echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n echo "executing File -"$file\n if [[ $total_record_count != 0 ]];then\n filename=$(basename "$file")\n total_record_count=$(grep -oc "^" $file)\n rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n Delimiter_per_record=$((rs_count/total_record_count))\n fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile
Which produces this result:
$rs_count: 39
$Delimiter_per_record: 3
$total_record_count: 10
Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246.
awk seems to handle it reliably and quickly.
I am trying to write a bash script in a KSH environment that would iterate through a source text file and process it in blocks of lines.
So far I have come up with this code, although it seems to loop indefinitely, since tail does not return 0 lines when asked to retrieve lines beyond those in the source text file.
i=1
while [[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
do
lc=$((i * 1000))
DA=ProcessingResult_$i.csv
head -$lc /path/to/source.file | tail -1000 > /path/to/block.file
cd /path/to/processing/batch
./process.sh #This will process /path/to/block.file
mv /output/directory/ProcessingResult.csv /output/directory/$DA
i=$((i + 1))
done
Before launching the above script I perform a manual 'first injection': head -$lc /path/to/source.file | tail -1000 > /path/to/temp.source.file
Any idea on how to get the script to stop after processing the last lines from the source file?
Thanks in advance to you all.
If you do not want to create so many temporary files up front before beginning to process each block, you could try the solution below. It can save a lot of space when processing huge files.
#!/usr/bin/ksh
range=$1
file=$2
b=0; e=0; seq=1
while true
do
b=$((e+1)); e=$((range*seq));
sed -n ${b},${e}p $file > ${file}.temp
[ $(wc -l ${file}.temp | cut -d " " -f 1) -eq 0 ] && break
## process the ${file}.temp as per your need ##
((seq++))
done
The above code generates only one temporary file at a time.
You could pass the range (block size) and the filename as command-line args to the script.
example: extractblock.sh 1000 inputfile.txt
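As an aside, the cut in the break test can be avoided by feeding wc from standard input, as elsewhere in this thread:
[ $(wc -l < ${file}.temp) -eq 0 ] && break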
Have a look at man split:
NAME
split - split a file into pieces
SYNOPSIS
split [OPTION]... [INPUT [PREFIX]]
-l, --lines=NUMBER
put NUMBER lines per output file
For example
split -l 1000 source.file
Or, for example, to extract the 3rd chunk (here 1000 is not a number of lines; it is the number of chunks, i.e. a chunk is 1/1000 of source.file):
split -nl/3/1000 source.file
A note on the condition:
[[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
Maybe it should be source.file instead of block.file. It is also quite inefficient on a big file, because it counts the lines of the file on every iteration; the number of lines can be stored in a variable instead. Also, running wc on standard input avoids the need for awk:
nb_lines=$(wc -l </path/to/source.file )
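For instance, a sketch of the loop with the count hoisted out (tail -n +N is used here instead of head | tail so that the final, possibly partial, block comes out right):
nb_lines=$(wc -l </path/to/source.file)
i=1
while (( (i - 1) * 1000 < nb_lines )); do
tail -n +$(( (i - 1) * 1000 + 1 )) /path/to/source.file | head -1000 > /path/to/block.file
# ... process /path/to/block.file as before ...
i=$((i + 1))
done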
With Nahuel's recommendation I was able to build the script like this:
i=1
cd /path/to/sourcefile/
split source.file -l 1000 SF
for sf in /path/to/sourcefile/SF*
do
DA=ProcessingResult_$i.csv
cd /path/to/sourcefile/
cat $sf > /path/to/block.file
rm $sf
cd /path/to/processing/batch
./process.sh #This will process /path/to/block.file
mv /output/directory/ProcessingResult.csv /output/directory/$DA
i=$((i + 1))
done
This worked great.
I'm trying to count the number of entries in a set of log files. Some of these logs have lines that should not be counted (the number of these remains constant). The way I'd like to go about this is a Perl script that iterates over a hash, which maps log names to a one-liner that gets the number of entries for that particular log (I figured this would be easier to maintain than dozens of if-else statements).
Getting the number of lines is simple:
wc -l [logfile] | cut -f1 -d " "
The issue is when I need to subtract, say, 1 or 2 from this value. I tried the following:
expr( wc -l [logfile] | cut -f1 -d " " ) - 1
But this results in an error:
Badly placed ()'s.
: Command not found.
How do I perform arithmetic operations on the output of a shell command? Is there a better way to do this?
To display one less than the number of lines with bash or any Bourne-like shell:
echo $(( $(wc -l <file) - 1 ))
Discussion
To get the number of lines, you used:
wc -l logfile | cut -f1 -d " "
cut is required here because wc includes the file name in its output. To avoid that, and thus avoid the need for cut, supply the input to wc via stdin:
wc -l <logfile
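For example (the exact output format varies slightly between implementations):
wc -l logfile   # prints something like: 42 logfile
wc -l < logfile   # prints just: 42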
In modern (POSIX) shells, arithmetic is done with $((...)). Thus, we can subtract one from the number of lines via:
$(( $(wc -l <file) - 1 ))
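So if, say, a constant two header lines should be ignored per log (delta is a hypothetical name here):
delta=2
entries=$(( $(wc -l < logfile) - delta ))
echo "$entries"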
It's a bit clunky to shell out to wc and cut just to count the number of lines in a file.
Your requirement isn't very clear, but this Perl code creates a hash that relates every log file in the current directory to the number of lines it contains. It works by reading each file into an array of lines, and then evaluating that array in scalar context to give the line count. I hope it's obvious how to subtract a constant delta from each line count.
use strict;
use warnings;
my %lines;
for my $logfile ( glob '*.log' ) {
my $num_lines = do {
open my $fh, '<', $logfile or die qq{Unable to open "$logfile" for input: $!};
my @lines = <$fh>;
};
$lines{$logfile} = $num_lines;
}
Update
After a comment from w.k, I think this version may be rather nicer
use strict;
use warnings;
my %lines;
for my $logfile ( glob '*.log' ) {
open my $fh, '<', $logfile or die qq{Unable to open "$logfile" for input: $!};
1 while <$fh>;
$lines{$logfile} = $.;
}
The existing answers went in the direction of solving your issue in Perl, which you mentioned, but your own experiments were in shell syntax.
You indicated tcsh, but expr is POSIX shell syntax.
Here is an example of a csh script that counts the number of lines in a file whose name it is passed and then does arithmetic on the number of lines.
set lines=`wc -l < $1`
@ oneless = ($lines - 1)
echo "There are $lines in $1 and minus one makes $oneless"
Test:
csh count.csh count.csh
There are 3 lines in count.csh and minus one makes 2
Say I wanted to do this command:
(cat file | wc -l)/2
and store it in a variable such as middle, how would I do it?
I know it's simply not a case of
$middle=$(cat file | wc -l)/2
So how would I do it?
middle=$((`wc -l < file` / 2))
middle=$((`wc -l file | awk '{print $1}'`/2))
This relies on Bash being able to reference the first element of an array using scalar syntax, and on the fact that it does word splitting on whitespace by default.
middle=($(wc -l file)) # create an array which looks like: middle='([0]="57" [1]="file")'
middle=$((middle / 2)) # do the math on ${middle[0]}
The second line can also be:
((middle /= 2))
When assigning variables, you don't use the $.
Here is what I came up with:
mid=$(cat file | wc -l)
middle=$((mid/2))
echo $middle
The double parentheses are important on the second line: $(( ... )) is Bash's arithmetic expansion, which evaluates its contents as an arithmetic expression rather than as a command.
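As an aside, the cat can be dropped by redirecting, which also lets the whole thing collapse to one line:
middle=$(( $(wc -l < file) / 2 ))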
Using awk:
middle=$(awk 'END{print NR/2}' file)
Note that awk does floating-point division, so an odd line count prints e.g. 3.5; use int(NR/2) if you want the truncated integer.
You can also make your own "wc" using just the shell:
linec(){
i=0
while read -r line
do
((i++))
done < "$1"
echo $i
}
middle=$(linec "file")
echo "$middle"