append output of each iteration of a loop to the same file in bash - bash

I have 44 files (2 for each chromosome) divided in two types: .vcf and .filtered.vcf.
I would like to run wc -l for each of them in a loop and always append the output to the same file. However, I would like to have 3 columns in this file: chr[1-22], the wc -l of the .vcf and the wc -l of the .filtered.vcf.
I've been trying to run an independent wc -l for each file and paste the two outputs together column-wise for each chromosome, but this is obviously not very efficient, because I'm generating a lot of unnecessary files. I'm trying this code for the 22 pairs of files:
wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf
I would like to have just one output file containing three columns:
Chromosome VCFCount FilteredVCFCount
chr1 out1 out1.filtered
chr2 out2 out2.filtered
Any help will be appreciated, thank you very much in advance :)
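For reference, the intermediate out*.vcf files in an attempt like this can be avoided by letting wc -l read from a redirect (so it prints only the number) and printing all three columns in one go. A minimal sketch, assuming the file1.vcf / file1.filtered.vcf naming from the snippet above:
for i in {1..22}; do
    # wc -l < file prints just the count, so no cut or temporary files are needed
    printf 'chr%s\t%s\t%s\n' "$i" "$(wc -l < "file$i.vcf")" "$(wc -l < "file$i.filtered.vcf")"
done > counts.txt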

printf "%s\n" *.filtered.vcf |
cut -d. -f1 |
sort |
xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
Output a newline-separated list of the .filtered.vcf files in the directory
Remove the extension with cut (probably something along the lines of xargs -I{} basename {} .filtered.vcf would be safer; see the sketch after this list)
Sort it (for nice sorted output!) (probably something like sort -tr -k2 -n would sort numerically and would be even better).
xargs -n1 - for each file, execute the small sh -c script
printf "%s\t%s\t%s\n" - output with a custom format string ...
"$1" - the filename and...
"$(wc -l <"${1}.vcf")" - the line count of the .vcf file and...
"$(wc -l <"${1}.filtered.vcf")" - the line count of the .filtered.vcf file
Example:
> touch chr{1..3}{,.filtered}.vcf
> echo > chr1.filtered.vcf ; echo > chr2.vcf ;
> printf "%s\n" *.filtered.vcf |
> cut -d. -f1 |
> sort |
> xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
chr1 0 1
chr2 1 0
chr3 0 0
To have a nice-looking table with headers, use column:
> .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o ' '
Chromosome VCFCount FilteredVCFCount
chr1 0 1
chr2 1 0
chr3 0 0

Maybe try this.
{
  # Remove this printf to not have this pesky header line
  printf 'Chromosome\tVCFCount\tFilteredVCFCount\n'
  for chr in chr*.vcf; do
    # skip the .filtered.vcf names that the chr*.vcf glob also matches
    case $chr in *.filtered.vcf) continue ;; esac
    base=${chr%.vcf}
    awk -v base="$base" 'BEGIN { OFS="\t" }
    FNR==1 && n { p=n }
    { n=FNR }
    END { print base, p, n }' "$chr" "$base.filtered.vcf"
  done
} >counts.txt
The very simple Awk script just collects the highest line number seen in each file (so we basically reimplement wc -l) and prints the collected numbers in the desired format. FNR is the line number in the current input file; we keep saving it in n, and when we switch to the second file (FNR starts over at 1) we copy the saved value into p, so the line count of the first file is preserved in a separate variable.
The shell parameter substitution ${variable%pattern} retrieves the value of variable with the shortest suffix match on pattern removed. (There is also ${variable#pattern} to remove a prefix, and Bash has ## and %% to trim the longest pattern match instead of the shortest.)
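A quick illustration of those expansions (results shown in the comments):
f=chr7.filtered.vcf
echo "${f%.vcf}"      # chr7.filtered   - shortest suffix match removed
echo "${f%%.*}"       # chr7            - longest suffix match removed (%%)
echo "${f#chr}"       # 7.filtered.vcf  - shortest prefix match removed (#)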
If efficiency is important, you could probably refactor all of the script into a single Awk script, but this way, all the pieces are simple and hopefully understandable.
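A rough sketch of that single-invocation idea (untested, and it assumes the chrN.vcf / chrN.filtered.vcf naming used above):
awk '
    BEGIN { OFS = "\t"; print "Chromosome", "VCFCount", "FilteredVCFCount" }
    {
        base = FILENAME
        # sub() returns 1 when the .filtered.vcf suffix was stripped, 0 otherwise
        filtered = sub(/\.filtered\.vcf$/, "", base)
        if (!filtered) sub(/\.vcf$/, "", base)
        count[base, filtered]++
        seen[base] = 1
    }
    END {
        # for-in order is unspecified, and a pair of completely empty files never shows up
        for (b in seen)
            print b, count[b, 0] + 0, count[b, 1] + 0
    }
' chr*.vcf > counts.txt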

Related

Get full path name of file and its size using awk

I want to get the file names followed by their size for all files having size in MB or GB. I have done this much so far :
LIST=$(ls -lh -d -1 $PWD/{*,} | awk '{ print $9":"$5 }')
for i in $LIST
do
if [[ $( echo "$i" | cut -f2 -d: | egrep "M|G" | wc -l) -ne 0 ]]
# egrep not working, only finds M
then
echo "$i" >> bigfiles
fi
done
What I am getting is :
amit#C0deDaedalus:~$ test/findbig
/home/amit/Batch:3.8M
/home/amit/Black:3.6M
What I want is :
amit#C0deDaedalus:~$ test/findbig
/home/amit/Batch File Programming.pdf:3.8M
/home/amit/Black Panther - Legend Has It ( Instrumental ).opus:3.6M
Basically, everything is working fine except filenames that I get are not complete. Only first word is shown. I can't figure out whether there is something wrong with logic or syntax but I think it has something to do with awk.
So, How do I get the full path names of files (having spaces in between) in the output ?
I have tried the loop trick in awk, but don't know how to get both of the columns to fit in.
You can use read and the convenient occurrence of the filename at the right-side of the ls -l listing. read puts all the "extra" fields into the final variable:
function f_getfields
{
local perm lnk uname grp size d1 d2 d3 filename
while read perm lnk uname grp size d1 d2 d3 filename
do
echo "$filename $size"
done < <(ls -l)
}
f_getfields
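One small caveat: ls -l starts its output with a "total N" summary line, so a variant along these lines skips it (a sketch):
while read -r perm lnk uname grp size d1 d2 d3 filename
do
    echo "$filename $size"
done < <(ls -l | tail -n +2)    # tail -n +2 drops the leading "total" line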
The problem is due to the spaces in your file names. The for loop uses spaces as the delimiter. Therefore the first item in your list will be "/home/amit/Batch", the second item "File", and so on.
You can use while loop instead of for, something like :
ls -lh -d -1 $PWD/{*,} | awk '{ print $9":"$5 }' | while read LINE
do
echo "${LINE}"
# do your stuff here
done
As an aside, if your only intention is to find large files, you may want to check out the disk usage command:
$ du -a | sort -rn | head
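Another option that sidesteps ls parsing entirely, if GNU find is available, is -printf; note this prints the size in bytes rather than the human-readable M/G form (a sketch):
# full path and size in bytes for regular files over 1 MiB in the current directory
find "$PWD" -maxdepth 1 -type f -size +1M -printf '%p:%s\n'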

In loop cat file - echo name of file - count

I am trying to make a one-line command that does the following:
In the folder "data" I have 570 files - each file has some lines of text - the files are named 1.txt to 570.txt.
I want to cat each file, grep for a word and count how many times that word occurs.
At the moment I am trying to get this using for:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
but if I do that, it counts correctly but does not display which file each count came from.
I would like it to look like:
----> 1.txt <----
210
---> 2.txt <----
15
etc, etc, etc..
How do I get this?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c then gives only one line per file, so to speak, and prefixes it with the count.
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop will iterate over the files inside the folder "data".
For each of these files, it prints the name and counts the occurrences of "word_to_count" (grep -c directly outputs a count of matching lines).
Be careful: if your search word occurs more than once on the same line, this solution will still count that line only once.
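If every occurrence should be counted rather than just matching lines, one possible variant (a sketch) combines grep -o with wc -l:
for file in /home/my/data/*; do
    echo "----> $(basename "$file") <----"
    # grep -o prints each match on its own line, so wc -l counts occurrences, not lines
    grep -o "word" "$file" | wc -l
done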
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

One line command with variable, word count and zcat

I have many files on a server, each of which contains many lines:
201701010530.contentState.csv.gz
201701020530.contentState.csv.gz
201701030530.contentState.csv.gz
201701040530.contentState.csv.gz
I would like with one line command this result:
170033|20170101
169865|20170102
170010|20170103
170715|20170104
The goal is to get the number of lines of each file, keeping only the date, which is already part of the filename.
I tried this but the result is not in one line but two...
for f in $(ls -1 2017*gz);do zcat $f | wc -l;echo $f | awk '{print substr($0,1,8)}';done
Thanks in advance guys.
Just use zcat file | wc -l to get the number of lines.
For the name, I understand it is enough to extract the first 8 characters:
$ t="201701030530.contentState.csv.gz"
$ echo "${t:0:8}"
20170103
All together:
for file in 2017*gz;
do
lines=$(zcat "$file" | wc -l)
printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv
Note the usage of for file in 2017*gz; to go through the files matching the 2017*gz pattern: this suffices, no need to parse ls!
Use zgrep -c ^ file to count the lines, here encapsulated in awk:
$ awk 'FNR==1{ "zgrep -c ^ " FILENAME | getline s; print s "|" substr(FILENAME,1,8) }' *.gz
12|20170101
The whole "zgrep -c ^ " FILENAME should probably be in a var (s) and then s | getline s.

How to count lines in a file and add an arbitrary number to the result?

I have a file that contains 'x' lines in it.
I need to display the number of lines in such file and add 'y'.
I know that wc -l does the trick and displays 'x' as the output; how can I make the output be 'x+y'?
You could do it like this:
$ wc -l file
13 file
$ y=12
$ wc -l file | awk -v var=$y '{print $1+var}'
25
You cannot change what wc -l prints, but you can write a function that does this, for example:
# with variables to match your x y example:
mylines()
{
x=$(cat $1 | wc -l) # this cat is to avoid the filename in output
y=$2
echo $(( $x + $y ))
}
Example usage: mylines somefile 19
will add 19 to the number of lines in somefile and display the sum
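The same idea also works as a one-liner; wc -l < file reads from a redirect, so no filename appears in its output:
echo $(( $(wc -l < somefile) + 19 ))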

Bash Script to save grep -c results

I am new to programming altogether and am trying to write my first bash script.
I have a file called NUMBERS.txt that has various numbers in it, as such:
1000
1001
1001
1000
1002
1001
etc..
I would like to write a script to count the occurrence of each number, save it as a variable and print it into a new text file as such:
1001= 3
1000= 2
etc..
I am completely stuck.
Here's what I have so far:
#!/bin/bash
for Count in `grep -c '1000' /NUMBERS.txt `
do
echo 'Count = '${Count}
done
for Count in `grep -c '1001' /NUMBERS.txt `
do
echo 'Count = '${Count}
done
Sort the file then count how many times each unique line occurs:
sort NUMBERS.txt | uniq -c
Since your file already has one number on each line, it is simpler:
for i in `sort -u NUMBERS.txt ` ; do count=`grep -c "$i" NUMBERS.txt ` ; echo "$i=$count" ; done > your_result.txt
or in a different format
for i in `sort -u NUMBERS.txt `
do
count=`grep -c "$i" NUMBERS.txt `
echo "$i=$count"
done > your_result.txt
As pointed out, the performance of this is not very good. Here is a much better one:
sort NUMBERS.txt | uniq -c | awk '{print $2"=",$1}'
Basically you go through NUMBERS.txt twice. On the first pass, you get the unique numbers;
The second pass you count the occurrence of each unique number.
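For completeness, the two passes can be collapsed into a single awk pass over the file (a sketch):
# count each distinct number in one pass; note the output order is unspecified
awk '{ count[$1]++ } END { for (n in count) print n"=", count[n] }' NUMBERS.txt > your_result.txt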
I'm not the best at shell scripting, but here is a solution that works, using bash and grep -c:
#!/bin/bash
INPUT="./numbers.txt"
OUTPUT="./result.txt"
rm -f ${OUTPUT}
# you might want to change the values
for i in {1000..2000}; do
for Count in `grep -cx "${i}" "${INPUT}"`; do   # -x matches whole lines only
echo "${i} = ${Count}" >> ${OUTPUT}
done
done
