Extract common lines from multiple text files and display original line numbers - shell

What I want:
Extract the lines common to n large files.
Show the original line number of each common line in every file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should be a different file
1 3 2 apple
1, 3 and 2 in the output are the original line numbers of File1.txt, File2.txt and File3.txt where the common line apple exists
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:3:apple
File3.txt:2:apple

Associate each unique line with a space separated list of line numbers indicating where it is seen in each file in an array, and print these next to each other at the end if the line is found in all three files.
awk '{
n[$0] = n[$0] FNR OFS
c[$0]++
}
END {
for (r in c)
if (c[r] == 3)
print n[r] r
}' file1 file2 file3
If the number of files is unknown, refer to Ravinder's answer, or just replace the hardcoded 3 in the END block with ARGC-1 as shown there.

GNU awk specific approach that works with any number of files:
#!/usr/bin/gawk -f
BEGINFILE {
nfiles++
}
{
lines[$0][nfiles] = FNR
}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for (line in lines) {
if (length(lines[line]) == nfiles) {
for (file = 1; file <= nfiles; file++)
printf "%d\t", lines[line][file]
print line
}
}
}
Example:
$ ./showlines file[123].txt
1 3 2 apple

Could you please try the following, written and tested with GNU awk. It makes use of ARGC, so that ARGC-1 gives the number of files passed to the awk program.
awk '
{
a[$0]=(a[$0]?a[$0] OFS:"")FNR
count[$0]++
}
END{
for(i in count){
if(count[i]==(ARGC-1)){
print a[i],i
}
}
}
' file1.txt file2.txt file3.txt

A perl solution
perl -ne '
$h{$_} .= "$.\t"; # append current line number and tab character to value in a hash with key current line
$. = 0 if eof; # reset line number when end of file is reached
END{
while ( ($k,$v) = each %h ) { # loop over hash entries
if ( $v =~ y/\t// == 3 ) { # if value contains 3 tabs
print $v.$k # print value concatenated with key
}
}
}' file1.txt file2.txt file3.txt

Related

awk to get first column if the a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I expect to get the first column ($1) only if the ETA= number is greater than 15; for the sample above, that means only the first column of the 2nd and 3rd lines:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line for the line that has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1)+0 > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. The condition then checks whether the 2nd element of arr is greater than 15 and, if so, prints the 1st element of arr as required.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would use GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use string functions: index finds where ETA= starts, then substr gets the 2 characters after it (the offset 4 is used because ETA= is 4 characters long and index gives the start position). I use +0 to convert to a number, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
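The disclaimer matters most for rows that lack ETA= entirely: index then returns 0 and substr reads from the start of the line instead. A minimal sketch of a guarded variant follows; the sample rows are made up for illustration (the third row is hypothetical and deliberately has no ETA= tag), and the variable `p` is an addition not present in the answer above.

```shell
# Two rows shaped like the question's data plus one hypothetical row without ETA=.
over_15=$(printf '%s\n' \
    '123 pro=tegs, ETA=12:00, team=xyz' \
    '345 pro=rbs, team=abc,ETA=23:00' \
    '99999 no eta tag on this row' |
awk '
# p is 0 (false) when ETA= is absent, so such rows are skipped
# before substr is ever evaluated.
(p = index($0, "ETA=")) && substr($0, p + 4, 2) + 0 > 15 { print $1 }')

printf '%s\n' "$over_15"
```

This prints only 345. Without the guard, the third row would also be printed, because characters 4 and 5 of that row are 99, which is greater than 15.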
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below) and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. So split the line separately on whitespace to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). Adding 0 forces awk to convert it to a number, which simply ignores any non-numeric text after the number at the beginning of the field; without the +0, a field such as 12:00, team=... does not look like a number, so awk would compare it with 15 as a string. So for example, on the first line, $2 is literally 12:00, team=xyz,user1=tom,dom=dby.com, but $2+0 is 12, so we effectively check if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
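To see how awk treats a field that starts with digits but has trailing text, the following small check compares the same field with and without a forced numeric conversion. The input row is a made-up example (ETA=9:00 is not in the question's data), chosen because string and numeric comparison disagree for it.

```shell
# Hypothetical row: the text after ETA= starts with a single digit.
row='123 pro=tegs, ETA=9:00, team=xyz'

# Field used as-is: it does not look like a number, so awk compares strings,
# and "9:00, team=xyz" sorts after "15".
as_string=$(printf '%s\n' "$row" | awk -F 'ETA=' '{ print ($2 > 15) }')

# Field forced to a number with +0: the leading 9 is used and the rest ignored.
as_number=$(printf '%s\n' "$row" | awk -F 'ETA=' '{ print ($2+0 > 15) }')

printf 'string comparison: %s, numeric comparison: %s\n' "$as_string" "$as_number"
```

The string comparison yields 1 (true) while the numeric one yields 0 (false), which is why adding 0 to the field is the safer habit when the field may carry trailing text.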
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Use awk to create index of words from file

I'm learning UNIX for school and I'm supposed to create a command line that takes a text file and generates a dictionary index showing the words (excluding articles and prepositions) and the lines where each appears in the file.
I found a similar problem as mine in: https://unix.stackexchange.com/questions/169159/how-do-i-use-awk-to-create-an-index-of-words-in-file?newreg=a75eebee28fb4a3eadeef5a53c74b9a8 The problem is that when I run the solution
$ awk '
{
gsub(/[^[:alpha:] ]/,"");
for(i=1;i<=NF;i++) {
a[$i] = a[$i] ? a[$i]", "FNR : FNR;
}
}
END {
for (i in a) {
print i": "a[i];
}
}' file | sort
The output contains special characters (which I don't want) like:
-Quiero: 21
Sancho,: 2, 4, 8
How can I remove all the special characters and exclude articles and prepositions?
$ echo This is this test. | # some test text
awk '
BEGIN{
x["a"];x["an"];x["the"];x["on"] # the stop words
OFS=", " # list separator to a
}
{
for(i=1;i<=NF;i++) { # list words in a line
w=tolower($i) # lowercase the word first
gsub(/^[^a-z]+|[^a-z]+$/,"",w) # remove leading and trailing non-alphabets
if(w!="" && !(w in x)) # skip empty strings and stop words
a[w]=a[w] (a[w]==""?"":OFS) NR # add record number to list
}
}
END { # after file is processed
for(i in a) # in no particular order
print i ": " a[i] # ... print elements in a
}'
this: 1, 1
test: 1
is: 1
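For a multi-line input, the same idea can be sketched as below. The two input sentences and the short stop-word list are invented for illustration; a real list of articles and prepositions would be much longer. The names `stop`, `idx` and `word_index` are placeholders. Words are lowercased and stripped of surrounding punctuation before the stop-word test, so that capitalised stop words such as The are excluded too.

```shell
word_index=$(printf '%s\n' 'The cat sat on the mat.' 'A dog, the cat!' |
awk '
BEGIN {
    # Tiny illustrative stop-word list (assumption: extend as needed).
    n = split("a an the on", s, " ")
    for (i = 1; i <= n; i++) stop[s[i]]
}
{
    for (i = 1; i <= NF; i++) {
        w = tolower($i)
        gsub(/^[^a-z]+|[^a-z]+$/, "", w)      # strip surrounding punctuation
        if (w == "" || (w in stop)) continue  # skip empties and stop words
        idx[w] = idx[w] (idx[w] == "" ? "" : ", ") NR
    }
}
END { for (w in idx) print w ": " idx[w] }' | LC_ALL=C sort)

printf '%s\n' "$word_index"
```

The output is sorted by word, with cat listed for lines 1 and 2. A word that occurs twice on one line still gets that line number twice, as in the answer above.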

awk - Compare columns from two files and replace text in first file

I have two files. The first has 1 column and the second has 3 columns. I want to compare the first columns of both files. If there is a match, replace columns 2 and 3 with specific values; if not, print the line unchanged.
File 1:
$ cat file1
26
28
30
File 2:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,r,1510139756
27,a,0
28,r,1510244156
29,a,0
30,r,1510157364
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Desired output:
$ cat file2
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
I am using gawk to do this (it's inside a shell script and I am using Solaris) but I can't get the output right. It only prints the lines that match:
fuente="file2"
gawk -v fuente="$fuente" 'FNR==NR{a[FNR]=$1; next}{print $1,$2="a",$3="0" }' $fuente file1 > file3
The output I got:
$ cat file3
26 a 0
28 a 0
30 a 0
awk one-liner:
awk 'NR==FNR{ a[$1]; next }$1 in a{ $2="a"; $3=0 }1' file1 FS=',' OFS=',' file2
The output:
1,a,0
2,a,0
22,a,0
23,a,0
24,a,0
25,a,0
26,a,0
27,a,0
28,a,0
29,a,0
30,a,0
31,a,0
32,a,0
33,r,1510276164
34,a,0
40,a,0
Really spread out for clarity; called (fuente.awk) like so:
awk -F \, -v fuente=file1 -f fuente.awk file2 # -F sets FS, the input field separator
BEGIN {
OFS="," # set OFS to make printing easier
while ((getline x < fuente) > 0) # safe way; read file into array
{
a[++i]=x # stuff indexed array
}
}
{ # For each line in file2
for (k=1 ; k<=i ; k++) # Loop over array (elements in file1)
{
if (($1==a[k]) && (! flag))
{
print($1,"a",0) # Found print new line
flag=1 # print only once
}
}
if (! flag) # Not found
{
print($0) # print original
}
flag=0 # reset flag
}
END { }

Compare headers of two delimited text files

File1.txt(base file)
header1|header2|header3|header4
1|2|3|4
File2.txt
header1|header10|header3|header4
5|6|7
Desired output:
header2 is missing in file 2 at position 2
header10 is addition in file 2 at position 2
I need to compare the headers of the two files and display the missing or additional columns with respect to the base file's header list.
I would try it with the diff command like this:
diff <(head -n1 fh1.txt | tr "|" "\n") <( head -n1 fh2.txt | tr "|" "\n")
where fh1.txt and fh2.txt are your files. The output gives the information that you want but it is not so verbose.
You can use awk, like this:
check.awk
# In the first line of every input file save the headers
FNR==1{
headers[f++]=$0
}
# Once all lines of input have been processed ...
END{
# split() returns the number of items. The resulting
# arrays 'a|b_headers' will be indexed starting from 1
lena = split(headers[0],a_headers,"|")
lenb = split(headers[1],b_headers,"|")
for(h=1;h<=lena;h++) {
if(a_headers[h] != b_headers[h]) {
print a_headers[h] " missing from file2 at column " h
}
}
for(h=1;h<=lenb;h++) {
if(b_headers[h] != a_headers[h]) {
print b_headers[h] " missing from file1 at column " h
}
}
}
Call it like this:
awk -f check.awk File1.txt File2.txt
Output:
header2 missing from file2 at column 2
header10 missing from file1 at column 2
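A single-pass variant that prints the wording asked for in the question could look roughly like the sketch below. It recreates the two sample files in a temporary directory (an illustrative setup, not part of the answer), compares the headers position by position as check.awk does, and additionally copes with headers of different lengths. The array names `a` and `b` and the variable `report` are placeholders.

```shell
# Sample files recreated from the question.
dir=$(mktemp -d)
printf '%s\n' 'header1|header2|header3|header4' '1|2|3|4' > "$dir/File1.txt"
printf '%s\n' 'header1|header10|header3|header4' '5|6|7' > "$dir/File2.txt"

report=$(awk '
NR == 1 { na = split($0, a, "|"); next }      # header of the base file
FNR == 1 {                                    # header of the second file
    nb = split($0, b, "|")
    max = (na > nb) ? na : nb
    for (h = 1; h <= max; h++) {
        if (a[h] == b[h]) continue            # same name at this position
        if (h <= na) print a[h] " is missing in file 2 at position " h
        if (h <= nb) print b[h] " is addition in file 2 at position " h
    }
    exit                                      # data rows are not needed
}' "$dir/File1.txt" "$dir/File2.txt")

printf '%s\n' "$report"
rm -rf "$dir"
```

Like check.awk, this is a positional comparison: a column that merely moved to another position is reported as both missing and added.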

Sequence length of FASTA file

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to get the total number of sequences and the total length. Any suggestion will be welcome... In bash or awk, please. I know that it is easy to do in Perl/BioPerl, and I actually have a script that does it that way.
An awk / gawk solution can be composed of three stages:
1. Every time a header is found, perform these actions:
   - Print the previous seqlen, if any.
   - Print the tag.
   - Reset seqlen.
2. For the sequence lines, just accumulate the length.
3. Finally, in the END block, print the remaining seqlen.
Commented code:
awk '/^>/ { # header pattern detected
if (seqlen){
# print previous seqlen if exists
print seqlen
}
# print the tag
print
# initialize sequence
seqlen = 0
# skip further processing
next
}
# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa
A oneliner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ { if (seqlen) {
print seqlen
}
print
seqtotal+=seqlen
seqlen=0
seq+=1
next
}
{
seqlen += length($0)
}
END{print seqlen
print seq" sequences, total length " seqtotal+seqlen
}' file.fa
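To get the summary line in the exact "# N sequences, total length M." form shown in the question, a variant along these lines may help. The FASTA content below is a shortened, made-up stand-in for the question's file, and the names `len`, `nseq`, `total` and `summary` are placeholders. Unlike the version above, it prints a length for every record after the first header, so an empty sequence is reported as 0 rather than skipped; it assumes the file starts with a header line.

```shell
# Hypothetical, shortened FASTA input written to a temporary file.
fa=$(mktemp)
printf '%s\n' '>header1' 'ACGTAC' 'GGT' '>header2' 'GGT' '>header3' 'TTATGAT' > "$fa"

summary=$(awk '
/^>/ {
    if (NR > 1) print len      # length of the record that just ended
    print                      # the header itself
    total += len               # add the finished record to the grand total
    len = 0
    nseq++
    next
}
{ len += length($0) }          # sequence line: accumulate its length
END {
    if (nseq) print len        # length of the last record
    printf "# %d sequences, total length %d.\n", nseq, total + len
}' "$fa")

printf '%s\n' "$summary"
rm -f "$fa"
```

For this input the per-record lengths are 9, 3 and 7, followed by the summary line for 3 sequences with a total length of 19.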
A quick way with any awk, would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is that you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
if (seqlen) {
# Only print info for this sequence if no target was given
# or its id matches the target.
if (! target || id == target) {
print id, seqlen;
}
}
# Get sequence id:
# 1. Split header on whitespace (fields[1] is now ">id")
split($0, fields);
# 2. Get portion of first field after the starting ">"
id = substr(fields[1], 2);
seqlen = 0;
next;
}
{
seqlen = seqlen + length($0);
}
END {
if (! target || id == target) {
print id, seqlen;
}
}
"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa

Resources