how to ignore a newline character in the compare script below - bash

#!/bin/bash
function compare {
for file1 in /dir1/*.csv
do
file2=/dir2/$(basename "$file1")
if [[ -e "$file2" ]] ### proceed only if file2 with the same filename as file1 is present ###
then
awk 'BEGIN {FS=","} NR == FNR {arr[$0]; next} !($0 in arr)' "$file1" "$file2" > "/dirDiff/$(basename "$file1")_diff"
fi
done
}
function removeNULL {
for i in /dirDiff/*_diff
do
if [[ ! -s "$i" ]] ### if file exists with zero size ###
then
\rm -- "$i"
fi
done
}
compare
removeNULL
file1 and file2 are formatted files from two different sources. Source1 induces an arbitrary newline character that splits one record into two, causing the script to fail and generate wrong diff output.
I want my script to compare file1 and file2 while ignoring the newline induced by Source1, but I am not sure how my script can distinguish between an actual new record and the manually induced newline.
file1:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0
,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0
,195035_601511_00_6,0.5,6;
file2:-
11447438218480362,6005560623,6005560623,11447438218480362,5,20160130103044,100,195031,,1,0,00,49256,0,195031_5_00_6,0.1,6;
11447691224860640,6997557634,6997557634,11447691224860640,601511,20160130103457,500,195035,,2,0,00,45394,0,195035_601511_00_6,0.5,6;
Appreciate your support.

You could preprocess file1, joining lines that do not end in ; with the next line:
sed -r ":again; /;$/! { N; s/(.+)[\r\n]+(.+)/\1\2/g; b again; }" file1
so that file1 and file2 become comparable.
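For example, the joined stream can be fed straight into the awk comparison via process substitution, with no temporary file. This is a sketch, not the OP's exact layout: GNU sed is assumed for -r, and file1/file2 stand in for one matching pair of files:

```shell
#!/bin/bash
# Join wrapped records in file1 on the fly (records end in ";"),
# then print the file2 lines that have no counterpart in the joined file1.
awk 'NR == FNR {arr[$0]; next} !($0 in arr)' \
    <(sed -r ':again; /;$/! { N; s/(.+)[\r\n]+(.+)/\1\2/g; b again; }' file1) \
    file2
```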


Bash Loop Through End of File

I'm working on a script that will find a pattern from a keyword and a list of other keywords in a separate file.
File1 has the list, one word per line. File2 has another list--the one that I actually want to search.
while read -r LINE; do
grep -q "$LINE" file2
if [ $? -eq 0 ]; then
echo "Found $LINE in file2."
grep "$LINE" file2 | grep example
if [ $? -eq 0 ]; then
echo "Keeping $LINE"
else
echo "Deleting $LINE"
sed -i "/$LINE/d" file2
fi
else
echo "Did not find $LINE in file2."
fi
done < file1
What I want is to take each word from file1 and search for every instance of it in file2. From those instances, I want to find all the instances that contain the word example. Any instances that don't contain example, I want to delete.
My code takes a word from file1 and searches for an instance of it in file2. Once it finds that instance, the loop moves on to the next word in file1, when it should keep searching file2 for the previous word; it should only move on to the next file1 word after it has finished searching file2 for the current one.
Any help on how to achieve this?
Suggesting an awk script, to scan each file only once.
awk 'NR == FNR {wordsArr[++wordsCount] = $0; next} # read file1 lines into array
/example/ { # file2 lines matching regExp /example/
for (i = 1; i <= wordsCount; i++) { # scan all words in array
if ($0 ~ wordsArr[i]) { # if a word matches the current line
print; # print the current line
next; # skip rest of words, read next line
}
}
}' file1 file2
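As a quick sanity check, with made-up data (words foo and bar in file1, three lines in file2; none of this is from the question), the script prints only the "example" lines that also match a word:

```shell
# Illustrative data, not from the question.
printf 'foo\nbar\n' > file1
printf 'foo example line\nfoo plain line\nbar example too\n' > file2

awk 'NR == FNR {wordsArr[++wordsCount] = $0; next}   # file1: collect words
     /example/ {                                     # file2: only "example" lines
       for (i = 1; i <= wordsCount; i++)
         if ($0 ~ wordsArr[i]) { print; next }       # keep lines matching a word
     }' file1 file2
# prints:
#   foo example line
#   bar example too
```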

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the file with ranges and then, for every line, look for all values that fall in the range in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
cat ~/blahblah/${line[0]}_file.val | while read number;
#get tag name and cat the appropriate file
do
if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
#check if current value fall into range
then
echo $number >> ${line[0]}.output
#toss the value that fall into interval to another file
elif [[ "$number" -gt "${line[2]}" ]]
then break
fi
done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
<${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version is slightly different in two ways: the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
Another, not so efficient, implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
{for(k in t)
if(t[k]==FILENAME) {
inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
print > inout;
next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names; otherwise you have to add tag extraction from the filenames. The suffix ".in" marks values in range and ".out" values that are not; this depends on the sorted order of the files. If you have thousands of tag files, adding another layer to filter out the ranges per tag will speed it up; right now it iterates over all the ranges.
I'd write
while read -u3 -r tag start end; do
f="${tag}_file.val"
if [[ -r $f ]]; then
while read -u4 -r num; do
(( start <= num && num <= end )) && echo "$num"
done 4< "$f"
fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
@load "filefuncs"
function read_file(tag, start, end, file, number, statdata) {
file = tag "_file.val"
if (stat(file, statdata) != -1) {
while ((getline number < file) > 0) {
if (start <= number && number <= end) print number
}
}
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
$file = $F[0] . "_file.val";
next unless -r $file;
open $fh, "<", $file;
while ($num = <$fh>) {
print $num if $F[1] <= $num and $num <= $F[2]
}
close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format called .bed is used for description of ranges on chromosomes, but should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool, which might help you is intersect.
With this installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
#reformatting your positions to .bed format;
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer oneliners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

comparing two files and printing lines with similar strings in one file [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 6 years ago.
I have two files which I need to compare, and if the first column in file1 matches part of the first column in file2, then add them side by side in file3. Below is an example:
File1:
123123,ABC,2016-08-18,18:53:53
456456,ABC,2016-08-18,18:53:53
789789,ABC,2016-08-18,18:53:53
123123,ABC,2016-02-15,12:46:22
File2
789789_TTT,567774,223452
123123_TTT,121212,343434
456456_TTT,323232,223344
output:
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,18:53:53,123123_TTT,121212,343434
Thanks..
Using GNU awk:
$ awk -F, 'NR==FNR{a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0;next} $1 in a{print $0","a[$1]}' file2 file1
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Explanation:
NR==FNR { # for the first file (file2)
a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0 # store to array
next
}
$1 in a { # if the key from second file in array
print $0","a[$1] # output
}
This awk solution matches keys formed from file2 against column 1 of file1; it should also work on Solaris using /usr/xpg4/bin/awk. I took the liberty of assuming the last line of the OP's output has a typo.
file1=$1
file2=$2
AWK=awk
[[ $(uname) == SunOS ]] && AWK=/usr/xpg4/bin/awk
$AWK -F',' '
BEGIN{OFS=","}
# file2 key is part of $1 till underscore
FNR==NR{key=substr($1,1,index($1,"_")-1); f2[key]=$0; next}
$1 in f2 {print $0, f2[$1]}
' $file2 $file1
tested
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Pure bash solution (using an associative array):
file1=$1
file2=$2
declare -A f2
while IFS= read -r line; do
key=${line%%_*}
f2[$key]=$line
done <"$file2"
while IFS= read -r line; do
key=${line%%,*}
[[ -n ${f2[$key]} ]] || continue
echo "$line,${f2[$key]}"
done <"$file1"
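Since the duplicate target covers inner joins, it is worth noting that coreutils join can do this too, once file2 gets an explicit key column. Assumptions here: GNU sort/join, and output ordered by key rather than by file1's original line order:

```shell
# Prefix each file2 line with its key (field 1 up to "_"), sort both sides
# on the key, and let join glue matching lines together with "," as the
# separator. Duplicate keys in file1 each pick up their file2 match.
join -t, \
    <(sort -t, -k1,1 file1) \
    <(awk -F, '{split($1, k, "_"); print k[1] FS $0}' file2 | sort -t, -k1,1)
```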

Finding text files with less than 2000 rows and deleting them

I have A LOT of text files, each with just one column.
Some text files have 2000 lines (consisting of numbers), and some others have fewer than 2000 lines (also consisting only of numbers).
I want to delete all the text files with fewer than 2000 lines in them.
EXTRA INFO
The files that have fewer than 2000 lines are not empty: they all have line breaks up to row 2000. Plus my files have somewhat complicated names like: Nameofpop_chr1_window1.txt
I tried using awk to first count the lines of my text files, but because there are line breaks in every file I get the same result, 2000, for every file.
awk 'END { print NR }' Nameofpop_chr1_window1.txt
Thanks in advance.
You can use this awk to count non-empty lines:
awk 'NF{i++} END { print i }' Nameofpop_chr1_window1.txt
OR this awk to count only those lines that have only numbers
awk '/^[[:digit:]]+$/ {i++} END { print i }' Nameofpop_chr1_window1.txt
To delete all files with less than 2000 lines with numbers use this awk:
for f in f*; do
[[ -n $(awk '/^[[:digit:]]+$/{i++} END {if (i<2000) print FILENAME}' "$f") ]] && rm "$f"
done
you can use expr $(cat filename|sort|uniq|wc -l) - 1 or cat filename|grep -v '^$'|wc -l; either will give you the number of lines per file, and based on that you decide what to do
You can use Bash:
for f in $files; do
n=0
while read -r line; do
[[ -n $line ]] && ((n++))
done < "$f"
[ "$n" -lt 2000 ] && rm "$f"
done
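Equivalently, grep -c can do the counting in one call per file, which avoids the per-line Bash loop entirely. The glob is a guess at the naming scheme from the question:

```shell
# Delete every file whose count of digits-only lines is below 2000.
for f in Nameofpop_*.txt; do
  [ -e "$f" ] || continue                  # glob matched nothing
  [ "$(grep -c '^[0-9][0-9]*$' "$f")" -lt 2000 ] && rm -- "$f"
done
```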

Take two at a time in a bash "for file in $list" construct

I have a list of files where two subsequent ones always belong together. I would like a for loop to extract two files out of this list per iteration, and then work on these two files at a time (for an example, let's say I want to just concatenate, i.e. cat, the two files).
In a simple case, my list of files is this:
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
I could hack around it and say
FILES="file1 file2"
for file in $FILES
do
actual_mateA=${file}_mateA.txt
actual_mateB=${file}_mateB.txt
cat $actual_mateA $actual_mateB
done
But I would like to be able to handle lists where mate A and mate B have arbitrary names, e.g.:
FILES="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
Is there a way to extract two values out of $FILES per iteration?
Use an array for the list:
files=(fileA1 fileA2 fileB1 fileB2)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
echo "${files[i]}" "${files[i+1]}"
done
You could read the values from a while loop and use xargs to restrict each read operation to two tokens.
files="filaA1 fileA2 fileB1 fileB2"
while read -r a b; do
echo $a $b
done < <(echo $files | xargs -n2)
You could use xargs(1), e.g.
ls -1 *.txt | xargs -n2 COMMAND
The -n2 switch lets xargs select 2 consecutive filenames from the pipe output, which are handed down to the COMMAND
To concatenate the 10 files file01.txt ... file10.txt pairwise
one can use
ls *.txt | xargs -n2 sh -c 'cat $# > $1.$2.joined' dummy
to get the 5 result files
file01.txt.file02.txt.joined
file03.txt.file04.txt.joined
file05.txt.file06.txt.joined
file07.txt.file08.txt.joined
file09.txt.file10.txt.joined
Please see 'info xargs' for an explanation.
How about this:
park=''
for file in $files # wherever you get them from, maybe $(ls) or whatever
do
if [ "$park" = '' ]
then
park=$file
else
process "$park" "$file"
park=''
fi
done
In each odd iteration it just stores the value (in park) and in each even iteration it then uses the stored and the current value.
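Filled in with a concrete action (echo stands in for the undefined process, and the filenames are placeholders), the parking pattern runs like this:

```shell
files="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
park=''
for file in $files; do
  if [ -z "$park" ]; then
    park=$file                  # odd iteration: remember the first of the pair
  else
    echo "pair: $park $file"    # even iteration: act on the stored pair
    park=''
  fi
done
# prints:
#   pair: file1_mateA.txt file1_mateB.txt
#   pair: file2_mateA.txt file2_mateB.txt
```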
Seems like one of those things awk is suited for
$ awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES"
file1_mateA.txt file1_mateB.txt
file2_mateA.txt file2_mateB.txt
You could then loop over it by setting IFS=$'\n'
e.g.
#!/bin/bash
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt file3_mateA.txt"
input=$(awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES")
IFS=$'\n'
for set in $input; do
IFS=' ' read -r a b <<< "$set" # split the pair back on the space
cat "$a" "$b" # or something
done
Which will try to do
$ cat file1_mateA.txt file1_mateB.txt
$ cat file2_mateA.txt file2_mateB.txt
And ignore the odd case without the match.
You can transform your string into an array and read this new array by elements:
#!/bin/bash
string="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
array=(${string})
size=${#array[*]}
idx=0
while [ "$idx" -lt "$size" ]
do
echo ${array[$idx]}
echo ${array[$(($idx+1))]}
let "idx=$idx+2"
done
If you have a delimiter in the string different from space (e.g. ";"), you can use the following transformation to an array:
array=(${string//;/ })
You could try something like this:
echo file1 file2 file3 file4 | while read -d ' ' a; do read -d ' ' b; echo $a $b; done
file1 file2
file3 file4
Or this, somewhat cumbersome technique:
echo file1 file2 file3 file4 |tr " " "\n" | while :;do read a || break; read b || break; echo $a $b; done
file1 file2
file3 file4
