Finding text files with less than 2000 rows and deleting them - bash

I have A LOT of text files, with just one column.
Some text files have 2000 lines (consisting of numbers), and some others have less than 2000 lines (also consisting only of numbers).
I want to delete all the text files with less than 2000 lines in them.
EXTRA INFO
The files that have less than 2000 lines are not empty: they all have line breaks up to row 2000. Plus my files have somewhat complicated names like: Nameofpop_chr1_window1.txt
I tried using awk to count the lines of my text files first, but because of the trailing line breaks in every file I get the same result, 2000, for every file.
awk 'END { print NR }' Nameofpop_chr1_window1.txt
Thanks in advance.

You can use this awk to count non-empty lines:
awk 'NF{i++} END { print i }' Nameofpop_chr1_window1.txt
OR this awk to count only those lines that have only numbers
awk '/^[[:digit:]]+$/ {i++} END { print i }' Nameofpop_chr1_window1.txt
To delete all files with fewer than 2000 numeric lines, use that awk inside a loop (adjust the glob to match your file names):
for f in *.txt; do
    [[ -n $(awk '/^[[:digit:]]+$/{i++} END {if (i<2000) print FILENAME}' "$f") ]] && rm "$f"
done

You can use expr $(cat filename|sort|uniq|wc -l) - 1 or cat filename|grep -v '^$'|wc -l. Either will give you the number of lines per file, and based on that you can decide what to do.
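For instance, a minimal sketch that plugs that count into the deletion step (assuming the padding rows are genuinely empty lines, and adjusting the glob to your file names):
for f in Nameofpop_*.txt; do
    n=$(grep -vc '^$' "$f")   # count non-empty lines, as in the pipeline above
    [ "$n" -lt 2000 ] && rm -- "$f"
done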

You can use Bash:
for f in *.txt; do
    n=0
    while read -r line; do
        [[ -n $line ]] && ((n++))
    done < "$f"
    [ "$n" -lt 2000 ] && rm "$f"
done

Related

print lines whose first column is not in the list

I have a list of numbers in a file
cat to_delete.txt
2
3
6
9
11
and many txt files in one folder. Each file has tab-delimited lines (there can be more lines than shown here).
3 0.55667 0.66778 0.54321 0.12345
6 0.99999 0.44444 0.55555 0.66666
7 0.33333 0.34567 0.56789 0.34543
I want to remove the lines whose first number ($1 for awk) is in to_delete.txt and print only the lines whose first number is not in to_delete.txt. The change should replace the old file.
Expected output
7 0.33333 0.34567 0.56789 0.34543
This is what I got so far, which doesn't remove anything:
for file in *.txt; do awk '$1 != /2|3|6|9|11/' "$file" > "$tmp" && mv "$tmp" "$file"; done
I've looked through so many similar questions here but still cannot make it work. I also tried grep -v -f to_delete.txt and sed -n -i '/$to_delete/!p'
Any help is appreciated. Thanks!
In awk:
$ awk 'NR==FNR{a[$1];next}!($1 in a)' delete file
Output:
7 0.33333 0.34567 0.56789 0.34543
Explained:
$ awk '
NR==FNR { # first file (the delete list): store $1 as an array key
a[$1]
next
}
!($1 in a) # for the files after the first, print records whose $1 was not stored
' delete files* # mind the file order
My first idea was this:
printf "%s\n" *.txt | xargs -n1 sed -i "$(sed 's!.*!/& /d!' to_delete.txt)"
printf "%s\n" *.txt - outputs the *.txt files each on separate lines
| xargs -n1 - run the following command once per line, passing the line content (the file name) as an argument
sed -i - edit file in place
$( ... ) - command substitution
sed 's!.*!/^& /d!' to_delete.txt - for each line in to_delete.txt, prefix the line with /^ and suffix it with /d. That way, from the list of numbers I get a list of sed delete commands, like:
/^2 /d
/^3 /d
/^6 /d
and so on. Which tells sed to delete lines matching the regex - line starting with the number followed by a space.
But I think awk would be simpler. You could do:
awk '$1 != 2 && $1 != 3 && $1 != 6 ... and so on ...'
but that would be longish, unreadable. It's easier to read the map from the file and then check if the number is in the array:
awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file"
FNR==NR is true only for the first file. So while we read it, we set map[$1] (we "set" it just so that such an element exists). Then FNR!=NR is true for the second file, for which we check whether the first field is a key in the map. If it is not, the expression is true and the line gets printed out.
all together:
for file in *.txt; do tmp=$(mktemp); awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file" > "$tmp" && mv "$tmp" "$file"; done
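Since you mentioned trying grep -v -f to_delete.txt: plain patterns fail there because an unanchored 2 also matches a 2 anywhere on the line, for example inside 0.12345. A hedged sketch that anchors every pattern to the start of the line plus a tab (assumes bash process substitution and tab-delimited data as in your sample):
for file in *.txt; do
    [ "$file" = to_delete.txt ] && continue   # skip the list itself if it sits in the same folder
    grep -v -f <(awk '{print "^" $1 "\t"}' to_delete.txt) "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done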

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the file with ranges and then, for every range line, look for all values that fall in that range in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
cat ~/blahblah/${line[0]}_file.val | while read number;
#get tag name and cat the appropriate file
do
if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
#check if the current value falls into the range
then
echo $number >> ${line[0]}.output
#append values that fall into the interval to another file
elif [[ "$number" -gt "${line[2]}" ]]
then break
fi
done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
<${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version is slightly different in two ways: the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
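If the early bail-out matters (the question says every *_file.val is numerically sorted), a hedged variant along the same lines that stops reading once a value exceeds the upper bound:
while read tag low hi; do
    perl -nle "last if \$_ > ${hi}; print if \$_ >= ${low}" \
        <"${tag}_file.val" >>"${tag}.output"
done <ranges.val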
Another, not so efficient, implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
{for(k in t)
if(t[k]==FILENAME) {
inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
print > inout;
next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names; otherwise you have to add tag extraction from the filenames. The suffix ".in" marks values that fall in a range and ".out" marks values that do not; the result depends on the sorted order of the files. If you have thousands of tag files, adding another layer to filter out the ranges per tag will speed it up, since right now it iterates over all ranges for every value (see the sketch below).
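A hedged sketch of that per-tag filtering idea, keeping the renamed-file layout above; here a value goes to .in if it falls inside any of the ranges for its tag:
for tag in tag_1 tag_2; do
    awk -v tag="$tag" '
        NR==FNR { if ($1 == tag) { s[++n] = $2; e[n] = $3 }; next }   # keep only the ranges for this tag
        { for (k = 1; k <= n; k++)
              if (s[k] <= $1 && $1 <= e[k]) { print > (tag ".in"); next }
          print > (tag ".out") }
    ' ranges "$tag"
done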
I'd write
while read -u3 -r tag start end; do
f="${tag}_file.val"
if [[ -r $f ]]; then
while read -u4 -r num; do
(( start <= num && num <= end )) && echo "$num"
done 4< "$f"
fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
@load "filefuncs"
function read_file(tag, start, end,    file, number, statdata) {
    file = tag "_file.val"
    if (stat(file, statdata) != -1) {
        while ((getline number < file) > 0) {
            if (start <= number && number <= end) print number
        }
        close(file)
    }
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
$file = $F[0] . "_file.val";
next unless -r $file;
open $fh, "<", $file;
while ($num = <$fh>) {
print $num if $F[1] <= $num and $num <= $F[2]
}
close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format called .bed is used for description of ranges on chromosomes, but should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool, which might help you is intersect.
With this installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
#reformatting your positions to .bed format;
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer oneliners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

AWK, average columns of different length from multiple files

I need to calculate averages from columns in multiple files, but the columns have different numbers of lines. I guess awk is the best tool for this, but anything from bash will be OK. A solution for 1 column per file is OK. If the solution works for files with multiple columns, even better.
Example.
file_1:
10
20
30
40
50
file_2:
20
30
40
Expected result:
15
25
35
40
50
awk would be a tool to do it easily:
awk '{a[FNR]+=$0;n[FNR]++;next}END{for(i=1;i<=length(a);i++)print a[i]/n[i]}' file1 file2
And the method also works with more than two input files.
Brief explanation,
FNR would be the input record number in the current input file.
Accumulate the sum of the values on line FNR across the files into a[FNR]
Record in n[FNR] how many files contribute a value on line FNR
Print the average value for each line using print a[i]/n[i] in the for loop
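If the files can hold several columns, a hedged sketch extending the same idea (assumes whitespace-separated numeric columns; only lightly adapted from the one-column version above):
awk '{ for (c = 1; c <= NF; c++) { sum[FNR, c] += $c; cnt[FNR, c]++ }
       if (FNR > rows) rows = FNR
       if (NF > cols) cols = NF }
     END { for (r = 1; r <= rows; r++) {
               line = ""
               for (c = 1; c <= cols; c++)
                   if (cnt[r, c]) line = line (c > 1 ? " " : "") sum[r, c] / cnt[r, c]
               print line } }' file_1 file_2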
I have prepared the following bash script for you.
I hope this helps.
Let me know if you have any questions.
#!/usr/bin/env bash
#check if the files provided as parameters exist
if [ ! -f $1 ] || [ ! -f $2 ]; then
echo "ERROR: file> $1 or file> $2 is missing"
exit 1;
fi
#save the length of both files in variables
file1_length=$(wc -l $1 | awk '{print $1}')
file2_length=$(wc -l $2 | awk '{print $1}')
#if file 1 is longer than file 2, append zeros to the end of file 2
#until both files are the same length
# you can improve the script by creating temp files instead of working directly on the input ones
if [ "$file1_length" -gt "$file2_length" ]; then
n_zero_to_append=$(( file1_length - file2_length ))
echo "append $n_zero_to_append zeros to file $2"
#append n zeros to the end of file
yes 0 | head -n "${n_zero_to_append}" >> $2
#combine both files and compute the average line by line
awk 'FNR==NR { a[FNR""] = $0; next } { print (a[FNR""]+$0)/2 }' $1 $2
#if file 2 is longer than file 1 do the inverse operation
# you can improve the script by creating temp files instead of working on the input ones
elif [ "$file2_length" -gt "$file1_length" ]; then
n_zero_to_append=$(( file2_length - file1_length ))
echo "append $n_zero_to_append zeros to file $1"
yes 0 | head -n "${n_zero_to_append}" >> $1
awk 'FNR==NR { a[FNR""] = $0; next } { print (a[FNR""]+$0)/2 }' $1 $2
#if files have the same size we do not need to append anything
#and we can directly compute the average line by line
else
echo "the files : $1 and $2 have the same size."
awk 'FNR==NR { a[FNR""] = $0; next } { print (a[FNR""]+$0)/2 }' $1 $2
fi

how to ignore a newline character in the compare script below

#!/bin/bash
function compare {
for file1 in /dir1/*.csv
do
file2=/dir2/$(basename "$file1")
if [[ -e "$file2" ]] ### loop only if the file2 with same filename as file1 is present ###
then
awk 'BEGIN {FS=","} NR == FNR{arr[$0];next} ! ($0 in arr)' "$file1" "$file2" > /dirDiff/"$(basename "$file1")_diff"
fi
done
}
function removeNULL {
for i in /dirDiff/*_diff
do
if [[ ! -s "$i" ]] ### if file exists with zero size ###
then
\rm -- "$i"
fi
done
}
compare
removeNULL
file1 and file2 are formatted files from two different sources. Source1 introduces an arbitrary newline character that splits one record into two, causing the script to generate a wrong diff output.
I want my script to compare file1 and file2 while ignoring the newline characters induced by Source1, but I am not sure how the script can distinguish between an actual new record and an induced newline.
file1:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0
,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0
,195035_601511_00_6,0.5,6;
file2:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0,195035_601511_00_6,0.5,6;
Appreciate your support.
You could preprocess your file1 by joining lines that do not end in ; with the next line:
sed -r ":again; /;$/! { N; s/(.+)[\r\n]+(.+)/\1\2/g; b again; }" file1
so that file1 and file2 are comparable.
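A hedged awk alternative for that pre-processing step, assuming every complete record ends in a semicolon (file1_joined.csv is just a placeholder name for the cleaned copy that the compare function would then read):
awk '{ printf "%s", $0; if (/;$/) print "" }' file1 > file1_joined.csv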

In Bash, how to extract a word and a following number from a file?

I've got a list which has many entries of two different formats:
Generated Request {some text} easy level group X
---or---
easy level group X {some text}
where X is a number between 1 and 6 digits long.
I'm trying to go through that file line by line and reduce down everything to just "group X" on each line (so that I can then compare it to another file).
I'll post my attempt below so you can join me in laughing at it, but I'm just picking up the basics of bash, awk and sed, so I apologize now for this assault on good scripting...
for line in $(< abc.txt);do
if [ ${line:0:2} == "Ge" ] then
awk '{print $8,$9}' $line >> allgood.txt
elif [ ${line:0:2} == "ea" ] then
awk '{print $3,$4}' $line >> allgood.txt
fi
done
The attempted logic was: if a line starts with "Ge", extract fields $8 and $9 and append them to a file; if it starts with "ea", extract fields $3 and $4 and append them to the same file. However, this doesn't work at all.
Any thoughts?
The simplest approach for this problem is to use grep:
grep -o 'group [0-9]*' file
The -o option displays only the matching part of the line.
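If the follow-up is the comparison you mentioned, one hedged way to continue (other_groups.txt is a placeholder name for your second file; comm needs both lists sorted):
grep -o 'group [0-9]*' abc.txt | sort -u > allgood.txt
comm -12 allgood.txt <(sort -u other_groups.txt)   # groups present in both lists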
You never have to use bash to loop over every line in a file and then pass each line to awk, because that is exactly how awk works: it iterates over each line and applies the relevant blocks. Here is an approach using your logic in pure awk:
awk '/^Ge/{print $8,$9}/^ea/{print $3,$4}' file
You can do this with "while read" and avoid awk if you prefer:
while read -r a b c d e f g h i; do
    if [ "${a:0:2}" == "Ge" ]; then
        echo "$h $i" >> allgood.txt
    elif [ "${a:0:2}" == "ea" ]; then
        echo "$c $d" >> allgood.txt
    fi
done < abc.txt
The letters represent each column, so you'll need as many as you have columns. After that you just output the letters you need.
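If {some text} can vary in length, the fixed column positions above will drift; a hedged sed alternative that just keeps the "group X" part of every line, whatever comes before or after it:
sed -n 's/.*\(group [0-9]\{1,6\}\).*/\1/p' abc.txt >> allgood.txt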
