Bash: output line count from wc in human readable format - macos

Is that possible? Doing wc the straightforward way, I have to spend some mental energy to see that the file contains more than 40 million lines:
$ wc -l 20150210.txt
45614736 20150210.txt
I searched around and numfmt showed up, but that is evidently not available on OSX (nor on brew). So is there a simple way to do this on OSX? Thanks.

If you have a printf that supports the ' (grouping) flag, you can use %'d:
printf "%'d\n" $(wc -l < file )
From man printf:
'
For decimal conversion (i, d, u, f, F, g, G) the output is to be
grouped with thousands' grouping characters if the locale information
indicates any. Note that many versions of gcc(1) cannot parse this
option and will issue a warning. SUSv2 does not include %'F
Test
$ seq 100000 > a
$ printf "%'d\n" $(wc -l <a )
100,000
Note also the trick wc -l < file to get the number without the file name.
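One caveat: the ' flag only groups digits if the current locale defines a thousands separator; under the C/POSIX locale the number comes out ungrouped. A minimal sketch, assuming an en_US.UTF-8 locale is installed and using the file from the question (if the one-shot assignment doesn't take effect with your shell's builtin printf, export the variable instead):
$ LC_NUMERIC=en_US.UTF-8 printf "%'d\n" $(wc -l < 20150210.txt)
45,614,736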

Related

How to extract an unknown number of lines from a file and generate a new file for each?

I have a text file (file_1), which will contain an unknown number of lines. I want to extract each line and place it in a new file (except the first line). I have been trying to do this using a for loop, wc, and head/tail, but I can't get it to work. Any suggestions?
Commands I have been using:
wc -l File_1 > File_1.wc
for i in $(seq 1 $(cat File_1.wc)); do head -${i} File_1 | tail -1 > File_1.${i}.txt ; done
Whenever I use this, I get the following error message:
seq: invalid floating point argument: ‘File_1’
Try 'seq --help' for more information.
Example File_1
Aug 1, 2020 7:08 PM Start clustering of 102 queries
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g48.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g32.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g33.t1 GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g11.t1 GCA_001696625
GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g10.t1 GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g11.t1 GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g12.t1 GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g13.t1 GCA_007994515.1_UK000
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g35.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g36.t1
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g47.t1
GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_4380183-4385401(+)_61
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_5936-11161(-)_63
Hypothetical output files:
File_1.1.txt
GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g48.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g32.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g33.t1 GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g11.t1 GCA_001696625
File_1.2.txt
GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g11.t1 GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g12.t1 GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g13.t1 GCA_007994515.1_UK000
File_1.3.txt
GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g35.t1 GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g36.t1
etc.
I'm not sure why this won't work. Is anyone able to suggest why and provide a new method?
Thanks
With GNU awk:
awk 'NR>1{f="File_1." NR-1 ".txt"; print >f; close(f)}' File_1
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
No need to program anything; there is a standard Unix utility named split that does exactly that: splitting a file into chunks of N lines.
Here is what you are looking for, using GNU split:
$ split --lines=1 --numeric-suffixes=1 --suffix-length=5 --additional-suffix=.txt File_1 File_1.
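If you also want to skip the header line, as in the question, one option is to feed split from tail instead (a sketch using the same GNU split options, untested against the sample data):
$ tail -n +2 File_1 | split --lines=1 --numeric-suffixes=1 --suffix-length=5 --additional-suffix=.txt - File_1.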
The error you got is coming from seq, which in my opinion should not be involved in your task, since bash or any POSIX-compliant shell has a builtin (read) that can be used for that particular task.
Also see Read a file or stream line-by-line or field-by-field in bash
Why you should not Read lines with for in bash
That said, if the file/data is not that big (say, fewer than a thousand or so lines), a while read loop can be used.
#!/usr/bin/env bash
file=File_1
count=1
while IFS= read -r lines; do
printf '%s\n' "$lines" > "$file.$((count++)).txt"
done < <(tail -n+2 "$file")
count starts at 1 and is incremented by one on every line via the count++ inside the $(( )); see Shell Arithmetic.
The > is part of Redirection.
For IFS, see Shell Variables.
Run help read in your shell.
Also run help printf.
The <( ) is called Process Substitution.
tail -n+2 removes the first line of the file.
The $(( )) is part of the arithmetic expression/construct in bash; see Arithmetic Expansion.

How to loop a variable range in cut command

I have a file with 2 columns, and i want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range i desire is the character in the position of the value in the second column plus the next 10 characters. I will give an example in a while.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File that i want to extract the variable range of characters(just one very long line with no spaces and no bold font) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
...or, more literally (for copy/paste to test):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress informational logging.)
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
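As a quick sanity check against the sample file2.txt from the question (using the GNU status=none form discussed above), skipping the 10 bytes given by file1.txt's first entry and reading 10 bytes should give the first expected sequence:
$ dd if=file2.txt bs=1 skip=10 count=10 status=none; echo
GATTCTTTTT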
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it into memory and use Bash substrings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read the data without a useless cat, and for the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although a poor fit if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line inside the loop is pretty easy: it just extracts the number from the current line of file1.txt. The second line uses the very versatile tools head and tail. Usually they are used with lines rather than characters; nevertheless, I print the first pos + 10 characters with head, and the result is piped into tail, which prints the last 10 characters.
Thanks to @CharlesDuffy for improvements.

Fastest way to print a certain portion of a file using bash commands

Currently I am using sed to print the required portion of the file. For example, I used the command below:
sed -n 89001,89009p file.xyz
However, it is pretty slow as the file size is increasing (my file is currently 6.8 GB). I have tried to follow this link and used the command
sed -n '89001,89009{p;q}' file.xyz
But this command is only printing the 89001st line. Kindly help me.
The syntax is a little bit different:
sed -n '89001,89009p;89009q' file.xyz
UPDATE:
Since there is also an answer with awk, I made a small comparison and, as I thought, sed is a little bit faster:
$ wc -l large-file
100000000 large-file
$ du -h large-file
954M large-file
$ time sed -n '890000,890010p;890010q' large-file > /dev/null
real 0m0.141s
user 0m0.068s
sys 0m0.000s
$ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null
real 0m0.433s
user 0m0.208s
sys 0m0.008s
UPDATE2:
There is a faster way with awk, as posted by @EdMorton, but still not as fast as sed:
$ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null
real 0m0.252s
user 0m0.172s
sys 0m0.008s
UPDATE3:
This is the fastest way I was able to find (head and tail):
$ time head -890010 large-file| tail -10 > /dev/null
real 0m0.085s
user 0m0.024s
sys 0m0.016s
awk 'NR>=89001{print; if (NR==89009) exit}' file.xyz
Dawid Grabowski's helpful answer is the way to go (with sed[1]; Ed Morton's helpful answer is a viable awk alternative; a tail+head combination will typically be the fastest[2]).
As for why your approach didn't work:
A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end address (line numbers, in this case).
The associated function list, {p;q;}, is then executed for each line in the selected range.
Thus, line # 89001 is the first line that causes the function list to be executed: right after printing (p) the line, function q is executed, which quits execution right away, without processing any further lines.
To prevent premature quitting, Dawid's answer therefore separates the aspect of printing (p) all lines in the range from quitting (q) processing, using two commands separated with ;:
89001,89009p prints all lines in the range
89009q quits processing when the range's end point is reached
[1] A slightly less repetitive reformulation that should perform equally well ($ represents the last line, which is never reached due to the 2nd command):
sed -n '89001,$ p; 89009 q'
[2] A better reformulation of the head + tail solution from Dawid's answer is
tail -n +89001 file | head -n 9, because it caps the number of bytes that are not of interest yet are still sent through the pipe at the pipe-buffer size (a typical pipe-buffer size is 64 KB).
With GNU utilities (Linux), this is the fastest solution, but on OSX with stock utilities (BSD), the sed solution is fastest.
Easier to read in awk; performance should be similar to sed:
awk 'NR>=89001{print} NR==89009{exit}' file.xyz
You can replace {print} with a bare semicolon as well, since a pattern with no action prints the matching line by default.
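Spelled out, that shorter variant would look something like this (a sketch of the same logic):
awk 'NR>=89001; NR==89009{exit}' file.xyz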
Another way to do it is to use a combination of head and tail:
$ time head -890010 large-file| tail -10 > /dev/null
real 0m0.085s
user 0m0.024s
sys 0m0.016s
This is faster than sed and awk.
sed has to search from the beginning of the file to find the Nth line. To make things faster, divide the large file into fixed line-count chunks using an index file, then use dd to skip the early portion of the big file before feeding the rest to sed.
Build the index file using:
#!/bin/bash
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"
LASTSTONE=123
MILESTONE=0
echo $MILESTONE > $INDEX_FILE
while [ $MILESTONE != $LASTSTONE ] ;do
LASTSTONE=$MILESTONE
MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null |head -n$INTERVAL |wc -c)
MILESTONE=$(($LASTSTONE+$MILESTONE))
echo $MILESTONE >> $INDEX_FILE
done
exit
Then search for a line using: ./this_script.sh 89001
#!/bin/bash
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"
LN=$(($1-1))
OFFSET=$(head -n$((1+($LN/$INTERVAL))) $INDEX_FILE |tail -n1)
LN=$(($LN-(($LN/$INTERVAL)*$INTERVAL)))
LN=$(($LN+1))
dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null |sed -n "$LN"p

WC on OSX - Return includes spaces

When I run the word count command in the OSX terminal, like wc -c file.txt, I get the answer below, which has spaces padded before the count. Does anyone know why this happens, or how I can prevent it?
   18000 file.txt
I would expect to get:
18000 file.txt
This occurs using bash or the Bourne shell.
The POSIX standard for wc may be read to imply that there are no leading blanks, but does not say that explicitly. Standards are like that.
This is what it says:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
and does not mention the formats for the single-column options such as -c.
A quick check shows me that AIX, OSX, and Solaris use a format which specifies a field width for the value, to align columns (and they differ in the width used). HPUX and Linux do not.
So it is just an implementation detail.
I suppose it is a way of getting outputs to line up nicely, and as far as I know there is no option to wc that fine-tunes the output format.
You could get rid of them pretty easily by piping through sed 's/^ *//', for example.
There may be an even simpler solution, depending on why you want to get rid of them.
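For instance, the sed approach applied to the 18000-byte file from the question would look like this (a minimal sketch):
$ wc -c < file.txt | sed 's/^ *//'
18000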
At least under macOS/bash, wc exhibits the behavior of padding its output with leading whitespace.
It can be avoided using expr:
echo -n "some words" | expr $(wc -c)
>> 10
echo -n "some words" | expr $(wc -w)
>> 2
Note: the -n prevents echo from appending a newline character, which would otherwise add 1 to the wc -c count.
This bugs me every time I write a script that counts lines or characters. I wish that wc were defined not to emit the extra spaces, but it's not, so we're stuck with them.
When I write a script, instead of
nlines=`wc -l $file`
I always say
nlines=`wc -l < $file`
so that wc's output doesn't include the filename, but that doesn't help with the extra spaces. The trick I use next is to add 0 to the number, like this:
nlines=`expr $nlines + 0` # get rid of the leading padding spaces
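An alternative with the same effect, if you would rather avoid the extra expr process, is to let arithmetic expansion strip the padding (a sketch, not part of the original trick):
nlines=$(( $(wc -l < "$file") ))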

Command to list all file types and their average size in a directory

I am working on a specific project where I need to work out the make-up of a large extract of documents so that we have a baseline for performance testing.
Specifically, I need a command that can recursively go through a directory and, for each file type, inform me of the number of files of that type and their average size.
I've looked at solutions like:
Unix find average file size,
How can I recursively print a list of files with filenames shorter than 25 characters using a one-liner? and https://unix.stackexchange.com/questions/63370/compute-average-file-size, but nothing quite gets me to what I'm after.
This du and awk combination should work for you:
du -a mydir/ | awk -F'[.[:space:]]' '/\.[a-zA-Z0-9]+$/ { a[$NF]+=$1; b[$NF]++ }
END{for (i in a) print i, b[i], (a[i]/b[i])}'
To give you something to start with: with the script below, you will get a list of files and their sizes, line by line.
#!/usr/bin/env bash
DIR=ABC
cd "$DIR" || exit 1
find . -type f | while IFS= read -r line
do
# size=$(stat --format="%s" "$line") # For systems with GNU stat
size=$(perl -e 'print -s $ARGV[0],"\n"' "$line") # @Mark Setchell provided the command, but I have no osx system to test it.
echo "$size $line"
done
Output sample
123 ./a.txt
23 ./fds/afdsf.jpg
Then it is your homework: with the above output, it should be easy for you to work out each file type and its average size.
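A minimal sketch of that last step, assuming the size-and-path lines produced above are piped in (sizes.sh is just a placeholder name for the script above, and paths containing spaces would need more care):
./sizes.sh | awk '{ if (match($2, /\.[A-Za-z0-9]+$/)) ext=substr($2, RSTART+1); else ext="(none)";
sum[ext]+=$1; cnt[ext]++ } END { for (e in sum) print e, cnt[e], sum[e]/cnt[e] }'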
You can use "du" maybe:
du -a -c *.txt
Sample output:
104 M1.txt
8 in.txt
8 keys.txt
8 text.txt
8 wordle.txt
136 total
The output is in 512-byte blocks, but you can change it with "-k" or "-m".

Resources