How to find the difference between the values of two fields from two files and print only if the difference is >10 using shell

Let's say I have two files, a.txt and b.txt, with the following content:
a.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
Assume these files have fields separated by "|" and can have any number of lines. Also assume that both files are sorted, so that line N of a.txt corresponds to line N of b.txt. For each pair of lines, I want to compute the difference between field 8 in the two files, and likewise for field 9. If either absolute difference is greater than 10, print the lines; otherwise remove them from the files.
In the given example, for the first line the field 8 difference is |10-11| = 1 and the field 9 difference is |0-0| = 0. Both are below 10, so this line is deleted from both files.
For the second line, the field 8 difference is |11-22| = 11, so we print this line. (There is no need to check 19-18: a line is printed as soon as either field 8 or field 9 differs by more than 10.)
So the output is:
a.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00

You can do this with awk:
awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt
Explanation
The -F sets the field separator to the pipe symbol. The stuff in curly braces after FNR==NR applies only to the processing of a.txt. It says to save the whole line in array x[] indexed by line number (FNR) and also to save the eighth field in array eight[] also indexed by line number. Likewise field 9 is saved in array nine[].
The second set of curly braces applies to processing file b. It calculates the differences d1 and d2. If either exceeds 10, the line is printed to each of the files newa and newb.
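To try the same idea against the sample data from the question, here is a minimal sketch that wraps the logic in a shell function. The function name filter_pairs, the max variable and the output-file arguments are assumptions made for illustration; they are not part of the original answer. Unlike the one-liner, it writes with > rather than >>, so each run starts with fresh output files.

```shell
#!/bin/bash
# filter_pairs A B OUT_A OUT_B [MAX]   (hypothetical helper name)
# Keeps line N of both files when field 8 or field 9 differs by more than MAX.
filter_pairs() {
    awk -F'|' -v outa="$3" -v outb="$4" -v max="${5:-10}" '
        function abs(v) { return v < 0 ? -v : v }
        # First file: remember the whole line and fields 8 and 9 by line number.
        FNR == NR { line[FNR] = $0; f8[FNR] = $8; f9[FNR] = $9; next }
        # Second file: compare against the stored values for the same line number.
        abs(f8[FNR] - $8) > max || abs(f9[FNR] - $9) > max {
            print line[FNR] > outa
            print $0        > outb
        }
    ' "$1" "$2"
}
```

For example, filter_pairs a.txt b.txt newa newb would leave the surviving lines in newa and newb.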

You can write a bash script that does it:
while true; do
read -r lineA <&3 || break
read -r lineB <&4 || break
vara_8=$(echo "$lineA" | cut -f8 -d "|")
varb_8=$(echo "$lineB" | cut -f8 -d "|")
vara_9=$(echo "$lineA" | cut -f9 -d "|")
varb_9=$(echo "$lineB" | cut -f9 -d "|")
if (( vara_8-varb_8 > 10 || vara_8-varb_8 < -10
|| vara_9-varb_9 > 10 || vara_9-varb_9 < -10 )); then
echo "$lineA" >> newA.txt
echo "$lineB" >> newB.txt
fi
done 3<a.txt 4<b.txt

For short files
Use the method provided by Mark Setchell. Seen below in an expanded and slightly modified version:
parse.awk
# awk has no built-in abs(), so define one
function abs(v) { return v < 0 ? -v : v }
FNR==NR {
    x[FNR] = $0
    m[FNR] = $8
    n[FNR] = $9
    next
}
{
    if(abs(m[FNR] - $8) > 10 || abs(n[FNR] - $9) > 10) {
        print x[FNR] >> "newa"
        print $0 >> "newb"
    }
}
Run it like this:
awk -F'|' -f parse.awk a.txt b.txt
For huge files
The method above reads a.txt into memory. If the file is very large, this becomes infeasible and streamed parsing is called for.
It can be done in a single pass, but that requires careful handling of the multiplexed lines from a.txt and b.txt. A less error-prone approach is to identify the relevant line numbers, and then extract those lines into new files. An example of the latter approach is shown below.
First you need to identify the matching lines:
# Extract fields 8 and 9 from a.txt and b.txt
paste <(awk -F'|' '{print $8, $9}' OFS='\t' a.txt) \
<(awk -F'|' '{print $8, $9}' OFS='\t' b.txt) |
# Check if the fields match the criteria and print the line number
awk '$1 - $3 > n || $3 - $1 > n || $2 - $4 > n || $4 - $2 > n { print NR }' n=10 > linesfile
Now we are ready to extract the lines from a.txt and b.txt, and as the numbers are sorted, we can use the extract.awk script proposed here (repeated for convenience below):
extract.awk
BEGIN {
getline n < linesfile
if(length(ERRNO)) {
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
NR == n {
print
if(!(getline n < linesfile)) {
if(length(ERRNO))
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
Extract the lines (can be run in parallel):
awk -v linesfile=linesfile -f extract.awk a.txt > newa
awk -v linesfile=linesfile -f extract.awk b.txt > newb
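For comparison, the single-pass variant mentioned above can be sketched with awk's getline, reading a.txt as the main input and pulling the corresponding line of b.txt on demand. This is only a sketch under the assumption that both files have matching line order; the function name stream_pairs and its arguments are hypothetical.

```shell
#!/bin/bash
# stream_pairs A B OUT_A OUT_B [MAX]   (hypothetical helper name)
# Reads A as the main input and fetches the matching line of B with getline,
# so only one line of each file is held in memory at a time.
stream_pairs() {
    awk -F'|' -v fileb="$2" -v outa="$3" -v outb="$4" -v max="${5:-10}" '
        function abs(v) { return v < 0 ? -v : v }
        {
            # Stop if B has fewer lines than A or cannot be read.
            if ((getline lineb < fileb) <= 0) exit
            split(lineb, b, "|")
            if (abs($8 - b[8]) > max || abs($9 - b[9]) > max) {
                print $0    > outa
                print lineb > outb
            }
        }
    ' "$1"
}
```

The trade-off is that both outputs are written by one process, so the two extractions can no longer be run in parallel.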

Related

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
The beauty of awk is that you can do this in a single awk call:
awk '{ print > ("file" (c+0) ".txt") }
(NR % 4 == 0) { ++c }
(c == 10001) { exit }' <file>
This can be slightly more optimized and file handling friendly (cfr. James Brown):
awk 'BEGIN{f="file0.txt" }
{ print > f }
(NR % 4 == 0) { close(f); f="file"++c".txt" }
(c == 10001) { exit }' <file>
Why did your script fail?
The reason why your script is failing is because you used single quotes and tried to pass a shell variable to it. Your lines should read :
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
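The difference between the two quoting styles can be seen in isolation with a small sketch. The value i=3 is an arbitrary example; inside single quotes awk receives the literal string "$i", whose numeric value is 0, which is why every iteration of the original loop selected rows 1 to 4.

```shell
#!/bin/bash
i=3

# Single quotes: the shell does not expand $i, so awk sees the string constant
# "$i". In arithmetic that string counts as 0, giving 4*0+1.
literal=$(awk 'BEGIN { print (4 * "$i") + 1 }')

# -v hands the shell value to awk as a real awk variable, giving 4*3+1.
passed=$(awk -v i="$i" 'BEGIN { print 4 * i + 1 }')

echo "single-quoted: $literal, with -v: $passed"
# prints: single-quoted: 1, with -v: 13
```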
Why is your script slow?
Your script loops 10001 times, and each iteration performs 4 awk calls. Each awk call reads the entire file and writes out a single line, so in the end you read your file 40004 times.
To optimise your script step by step, I would do the following:
Terminate awk so that it stops reading the file after the line is printed:
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR<=4*i) {next} # skip line
(NR> 4*(i+1)) {exit} # exit awk
1' table2.txt > file"$i".txt # print line
done
Remove the shell loop entirely (see the top of this answer).
This is functionally the same as @JamesBrown's answer, just written more awk-ishly, so don't accept this. I only posted it to show the more idiomatic awk syntax, as you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
With just head and split you can do it very simply:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically, read the first 40,000 lines (10,000 chunks of 4) and split them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: @kvantour).
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk ' {
if(NR%4==1) { # use mod to recognize first record of group
if(i==10000) # exit after 10000 files
exit # test with 1
close(f) # close previous file
f="file" i++ ".txt" # make a new filename
}
print > f # output record to file
}' file

Including empty lines using pattern

My problem is the following: I have a text file with no empty lines, and I would like to insert empty lines according to a pattern file, where 1 means print the next line of text and 0 means print an empty line. My text file is:
apple
banana
orange
milk
bread
The pattern file is:
1
1
0
1
0
1
1
The desired output is correspondingly:
apple
banana

orange

milk
bread
What I tried is:
for i in $(cat pattern file);
do
awk -v var=$i '{if var==1 {print $0} else {printf "\n" }}' file;
done.
But it prints all the lines first, and only after that it changes $i
Thanks for any prompts.
Read the pattern file into an array, then use that array when processing the text file.
awk 'NR==FNR { newlines[NR] = $0; next}
{ print $0 (newlines[FNR] ? "" : "\n") }' patternfile textfile
This version allows multiple 0s between 1s.
Self-documented code:
awk '# for file 1 only
NR==FNR {
    # load an array with 0 and 1 (inverted, because a non-existent element defaults to 0)
    n[NR]=!$1
    # move on to the next line (do not go any further in the script for this line)
    next
}
# for each line (of file 2, because of the next in the previous block)
{
    # loop while the next element of the array (advanced by a++) is 1
    for(a++;n[a]==1;a++){
        # print an empty line
        printf( "\n")
    }
    # print the original line
    print
}' pattern YourFile
Notes:
The values are inverted so that a missing array element does not cause an endless run of newlines on the last line when the pattern file has fewer entries than the data file has lines.
Multiple consecutive 0s require a loop and a test.
A mismatch between the number of entries in the pattern file and the number of lines in the data file is a problem when indexing the array directly (unless the array instead stores how many newlines to insert, which is another way of doing it).
This is a bit of a hack, but I present it as an alternative to your traditionally awk-y solutions:
paste -d, file.txt <(cat pattern | tr '\n' ' ' | sed 's,1 0,10,g' | tr ' ' '\n' | tr -d '1') | tr '0' '\n' | tr -d ','
The output looks like this:
apple
banana
orange
milk
bread
Inverse of Barmar's, read the text into an array and then print as you process the pattern:
$ awk 'NR==FNR {fruit[NR]=$0; next} {print $0?fruit[++i]:""}' fruit.txt pattern.txt
apple
banana
orange
milk
For an answer using only bash:
i=0; mapfile file < file
for p in $(<pattern); do
((p)) && printf "%s" "${file[i++]}" || echo
done
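The answers above keep one of the two files in an array. A variation that holds neither file in memory is to read the pattern as the main input and pull text lines on demand with getline. The following is a minimal sketch of that idea; the function name apply_pattern is hypothetical.

```shell
#!/bin/bash
# apply_pattern PATTERN TEXT   (hypothetical helper name)
# For each pattern line: 1 prints the next line of TEXT, anything else prints
# an empty line.
apply_pattern() {
    awk -v text="$2" '
        # Pattern value 1: fetch and print the next text line, if any is left.
        $1 == 1 { if ((getline line < text) > 0) print line; next }
        # Pattern value 0 (or anything else): emit an empty line.
        { print "" }
    ' "$1"
}
```

With the question's files this would be called as apply_pattern pattern file.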

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If an entry of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
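The dictionary technique described above can be packaged as a small function that also produces the tab-separated output the question asks for. This is a sketch only; the function name mark_present is hypothetical, and the case-insensitive comparison follows this answer's use of tolower.

```shell
#!/bin/bash
# mark_present DICT QUERIES   (hypothetical helper name)
# Prints each line of QUERIES followed by a tab and 1 if it occurs in DICT
# (compared case-insensitively), otherwise 0.
mark_present() {
    awk -v OFS='\t' '
        # First file: referencing the element is enough to create the key.
        FNR == NR { seen[tolower($0)]; next }
        # Second file: look each line up in the dictionary.
        { print $0, ((tolower($0) in seen) ? 1 : 0) }
    ' "$1" "$2"
}
```

For the question's data the call would be mark_present file2 file1 > binary.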
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file2 file1 | sed $'s/$/\t1/'
grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
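while IFS= read -r; do
    if [[ $REPLY = $'\t'* ]] ; then
        # leading tab: comm's "common" column, so the name is also in file2
        printf "%s\t1\n" "${REPLY#?}"
    else
        # no leading tab: the name appears only in file1
        printf "%s\t0\n" "${REPLY}"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))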
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed and are familiar with it. You only need a bit of formatting:
#!/usr/bin/env python3
# Read both files without trailing newlines; use a set for fast lookups.
with open('file1') as f1, open('file2') as f2:
    names = f1.read().splitlines()
    known = set(f2.read().splitlines())
for name in names:
    print(name, int(name in known), sep='\t')

get Nth line in file after parsing another file

I have a large file like this:
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
Here the 1st column (foo, bar, qux, etc.) contains file names, and the 2nd column (43, 51, 34, etc.) contains line numbers. I want to print the Nth line (specified by the 2nd column) of each file (specified in the 1st column).
How can I automate this in a Unix shell?
The file above is actually generated during compilation, and I want to print the line of code that each warning refers to.
-Thanks,
while IFS=: read name line rest
do
head -n $line $name | tail -1
done < input.txt
while IFS=: read file line message; do
echo "$file:$line - $message:"
sed -n "${line}p" "$file"
done <yourfilehere
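Since the source files may be long and each lookup needs only one line, sed can be told to quit as soon as that line has been printed. A minimal sketch, assuming a hypothetical helper name print_line:

```shell
#!/bin/bash
# print_line FILE N   (hypothetical helper name)
# Prints line N of FILE and quits immediately instead of reading to the end.
print_line() {
    sed -n "${2}{p;q;}" "$1"
}

# Possible use inside the loop above (input file name is an assumption):
#   while IFS=: read -r file line message; do
#       print_line "$file" "$line"
#   done < input.txt
```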
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above one will work for 4th line in your file.You can change the number as per your requirement.
Just in awk, but probably with worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk.
gawk -F: '
BEGIN {OFS=FS}
{
files[$1] = 1
lines[$1] = lines[$1] " " $2
msgs[$1, $2] = $3
}
END {
for (file in files) {
split(lines[file], l, " ")
n = asort(l)
count = 0
for (i=1; i<=n; i++) {
while (++count <= l[i])
getline line < file
print file, l[i], msgs[file, l[i]]
print line
}
close(file)
}
}
'
This might work for you:
sed 's/^\([^,]*\),\([^,]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n no output by default,
-r regular expressions (simplifies using the parens)
in line 3 do {...;p} (print in the end)
s ubstitute foobarbaz with foo bar
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"

Print last 10 rows of specific columns using awk

I have the below awk command-line argument and it works aside from the fact it performs the print argument on the entire file (as expected). I would like it to just perform the formatting on the last 10 lines of the file (or any arbitrary number). Any suggestions are greatly appreciated, thanks!
I know one solution would be to pipe it with tail, but would like to stick with a pure awk solution.
awk '{print "<category label=\"" $13 " " $14 " " $15 "\"/>"}' foofile
There is no need to be orthodox with a language or tool on the Unix shell.
tail -10 foofile | awk '{print "<category label=\"" $13 " " $14 " " $15 "\"/>"}'
is a good solution. And, you already had it.
Your arbitrary number can still be passed as an argument to tail, so nothing is lost and the solution does not lose any elegance.
Using ring buffers, this one-liner prints last 10 lines;
awk '{a[NR%10]=$0}END{for(i=NR+1;i<=NR+10;i++)print a[i%10]}'
then, you can merge "print last 10 lines" and "print specific columns" like below;
{
arr_line[NR % 10] = $0;
}
END {
for (i = NR + 1; i <= NR + 10; i++) {
split(arr_line[i % 10], arr_field);
print "<category label=\"" arr_field[13] " " \
arr_field[14] " " \
arr_field[15] "\"/>";
}
}
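The ring buffer above assumes the file has at least 10 lines. A sketch that takes the line count as a parameter and also copes with shorter files could look like this; the function name last_lines is hypothetical.

```shell
#!/bin/bash
# last_lines N FILE   (hypothetical helper name)
# Emulates "tail -n N" in awk with a ring buffer of N slots.
last_lines() {
    awk -v n="$1" '
        # Slot NR % n is overwritten once the buffer has wrapped around.
        { buf[NR % n] = $0 }
        END {
            # Start at line 1 when the file has fewer than n lines.
            start = (NR > n) ? NR - n + 1 : 1
            for (i = start; i <= NR; i++) print buf[i % n]
        }
    ' "$2"
}
```

The per-column formatting from the question can then be applied in the END loop by splitting buf[i % n] into fields, as in the script above.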
I don't think this can be tidily done in awk. The only way you can do it is to buffer the last X lines, and then print them in the END block.
I think you'll be better off sticking with tail :-)
Just for last 10 rows
awk 'BEGIN{OFS="\n"}
{
a=b;b=c;c=d;d=e;e=f;f=g;g=h;h=i;i=j;j=$0
}END{
print a,b,c,d,e,f,g,h,i,j
}' file
For a variable number of columns, I have worked out two solutions:
#cutlast [number] [[$1] [$2] [$3]...]
function cutlast {
length=${1-1}; shift
list=( ${@-`cat /proc/${$}/fd/0`} )
output=${list[@]:${#list[@]}-${length-1}}
test -z "$output" && exit 1 || echo $output && exit 0
}
#example: cutlast 2 one two three print print # echo`s print print
#example1: echo one two three four print print | cutlast 2 # echo`s print print
or
function cutlast {
length=${1-1}; shift
list=( ${@-`cat /proc/${$}/fd/0`} )
output=$(echo "${list[@]}" | rev | cut -d ' ' -f-"$length" | rev)
test -z "$output" && exit 1 || echo $output && exit 0
}
#example: cutlast 2 one two three print print # echo`s print print
There are loads of awk one-liners in this text document; not sure if any of those will help.
This specifically might be what you're after (something similar anyway):
# print the last 2 lines of a file (emulates "tail -2")
awk '{y=x "\n" $0; x=$0};END{print y}'
awk '{ y=x "\n" $0; x=$0 }; END { print y }'
This is very inefficient: it reads the whole file line by line only to print the last two lines.
Because there is no seek() statement in awk, it is recommended to use tail to print the last lines of a file.
