awk if line contains - bash

Currently I am using
$ awk 'NR==FNR{a[$1];next} ($3 in a)' find.txt path_to_100_files/*
to search a directory containing multiple files for strings listed in a .txt file (find.txt).
find.txt contains
example1
example 2
example#eampol.com
exa exa exa123
...
Example of the .txt files within the directory:
example example example.com
example 2 example example lol
Currently it searches for the string within column 3, using ($3 in a) (where $3 means column #3), but sometimes the string can be in $1 or $5 and so on. How can I get it to search every column instead of just the 3rd?

awk '
NR==FNR{a[$1];next}
{ for (i=1; i<=NF; i++) if ($i in a) { print; next } }
' find.txt path_to_100_files/*
The above assumes your existing script behaves as desired given exa exa exa123.
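As a quick sanity check, the answer's field loop can be exercised with throwaway files (the file names and contents below are invented for the demo):

```shell
# Recreate a tiny find.txt and one stand-in data file for the demo.
printf 'example1\nexa exa exa123\n' > find.txt
printf 'foo example1 bar\nno match here\nexa x y\n' > data.txt

# First file: store the first field of every line as a search key.
# Other files: print a line (once) if any of its fields is a key.
awk '
NR==FNR { a[$1]; next }
{ for (i=1; i<=NF; i++) if ($i in a) { print; next } }
' find.txt data.txt
# -> foo example1 bar
#    exa x y
```

Note that only the first field of each find.txt line becomes a key, which is why exa exa exa123 contributes just exa.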

Add location to duplicate names in a CSV file using Bash

Using Bash, create user logins. Add the location if the name is duplicated: the location should be added to the original name as well as to the duplicates.
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,CY,Korrie,
4,NI,Korrie,
5,BT,Queenie,
6,AW,Donnie,
7,GP,Pamella,
8,KP,Pamella,
9,LC,Pamella,
10,GM,Ericka,
The result should look like this:
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,uspamella#mail.com
3,CY,Korrie,cykorrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
I used AWK to process the csv file.
cat data.csv | awk 'BEGIN {FS=OFS=","};
NR > 1 {
split($3, name)
$4 = tolower($3)
split($4, login)
for (k in login) {
!a[login[k]]++ ? sub(login[k], login[k]"#mail.com", $4) : sub(login[k], tolower($2)login[k]"#mail.com", $4)
}
}; 1' > data_new.csv
The script adds location values only to the subsequent duplicates:
id,location,name,login
1,KP,Lacie,lacie#mail.com
2,US,Pamella,pamella#mail.com
3,CY,Korrie,korrie#mail.com
4,NI,Korrie,nikorrie#mail.com
5,BT,Queenie,queenie#mail.com
6,AW,Donnie,donnie#mail.com
7,GP,Pamella,gppamella#mail.com
8,KP,Pamella,kppamella#mail.com
9,LC,Pamella,lcpamella#mail.com
10,GM,Ericka,ericka#mail.com
How do I add location to the initial one?
A common solution is to have awk process the same file twice when you need to know whether there are duplicates further down the line.
Notice also that this does away with the useless use of cat.
awk 'BEGIN {FS=OFS=","};
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "#mail.com" }
1' data.csv data.csv >data_new.csv
NR==FNR is true when you read the file the first time. We simply count the number of occurrences of $3 in seen for the second pass.
Then in the second pass, we can just look at the current entry in seen to figure out whether or not we need to add the prefix.
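A minimal, self-contained run of the two-pass approach (using a shortened stand-in for data.csv):

```shell
# Shortened stand-in for the question's data.csv.
cat > data.csv <<'EOF'
id,location,name,login
1,KP,Lacie,
2,US,Pamella,
3,GP,Pamella,
EOF

# Pass 1 counts every name; pass 2 prefixes the location
# whenever that name occurred more than once overall.
awk 'BEGIN {FS=OFS=","}
NR == FNR { ++seen[$3]; next }
FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "#mail.com" }
1' data.csv data.csv
# -> id,location,name,login
#    1,KP,Lacie,lacie#mail.com
#    2,US,Pamella,uspamella#mail.com
#    3,GP,Pamella,gppamella#mail.com
```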

awk match substring in column from 2 files

I have the following two files (real data is tab-delimited instead of semicolon):
input.txt
Astring|2042;MAR0303;foo1;B
Dstring|2929;MAR0283;foo2;C
db.txt updated
TG9284;Astring|2042|morefoohere_foo_foo
TG9281;Cstring|2742|foofoofoofoofoo Dstring|2929|foofoofoo
So, column 1 of input.txt is a substring of column 2 of db.txt. Only the first two "fields" separated by | are important here.
I want to use awk to match these two columns and print the following (again in tab-delimited form):
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
This is my code:
awk -F'[\t]' 'NR==FNR{a[$1]=$1}$1 in a {print $0"\t"$1}' input.txt db.txt
EDIT
Column 2 of db.txt contains the strings of column 1 of input.txt, delimited by spaces. There are many more strings in the real data than shown in this short excerpt.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
EDIT:
As per your comment you can use:
awk 'BEGIN{FS=OFS="\t"} NR==FNR {
a[$2]=$1; next} {for (i in a) if (index(i, $1)) print $0, a[i]}' db.txt input.txt
Astring|2042 MAR0303 foo1 B TG9284
Dstring|2929 MAR0283 foo2 C TG9281
Going with the semicolons (you can replace them with tabs):
$ awk -F\; '
NR==FNR { # hash the db file
a[$2]=$1
next
}
{
for(i in a) # for each record in input file
if($1~i) { # see if $1 matches a key in a
print $0 ";" a[i] # output
# delete a[i] # delete entry from a for speed (if possible?)
break # on match, break from for loop for speed
}
}' db input # mind the file order
Astring|2042;MAR0303;foo1;B;TG9284
Dstring|2929;MAR0283;foo2;C;TG9281
For each record in input, the script matches $1 against every entry in db, so it's slow. You can speed it up by breaking out of the loop on a match and by deleting the matched entry from a (if your data allows it), as in the commented line above.
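For reference, the first (exact-key) answer can be reproduced end to end with two throwaway tab-separated files:

```shell
# Throwaway tab-separated stand-ins for db.txt and input.txt.
printf 'TG9284\tAstring|2042|morefoohere_foo_foo\n' > db.txt
printf 'Astring|2042\tMAR0303\tfoo1\tB\n' > input.txt

# db.txt: key "first|second" of column 2 maps to column 1;
# input.txt: look up column 1 directly and append the match.
awk 'BEGIN{FS=OFS="\t"} NR==FNR{
split($2, b, "|"); a[b[1] "|" b[2]]=$1; next}
$1 in a {print $0, a[$1]}' db.txt input.txt
# -> Astring|2042  MAR0303  foo1  B  TG9284  (tab-separated)
```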

Iterate through list in bash and run multiple grep commands

I would like to iterate through a list and grep for the items, then use awk to pull out important information from each grep result. (This is the way I thought to do it, but awk and grep aren't necessary if there is a better way).
The input file contains a number of lines that looks similar to this:
chr1 12345 . A G 3e-12 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;
I have a number of locations that should match some part of the second column.
locList="123, 789"
And for each matching location I would like to get the information from columns 4 and 5 and write them to an output file with the corresponding location.
So the output for the above list should be:
123 A G
Something like this is what I'm thinking:
for i in locList; do
grep i inputFile.txt | awk '{print $2,$4,$5}'
done
Invoking grep/awk once per location will be highly inefficient. You want to invoke a single command that will do your parsing. For example, awk:
awk -v locList="12345 789" '
BEGIN {
# parse the location list, and create an array where
# the locations are the array indexes
n = split(locList, a)
for (i=1; i<=n; i++) locations[a[i]] = 1
}
$2 in locations {print $2, $4, $5}
' file
For the revised requirements (locations are prefixes of $2):
awk -v locList="123 789" '
BEGIN { n = split(locList, patterns) }
{
for (i=1; i<=n; i++) {
if ($2 ~ "^" patterns[i]) {
print $2, $4, $5
break
}
}
}
' file
The ~ operator is the regular expression matching operator.
That will output 12345 A G from your sample input. If you just want to output 123 A G then print patterns[i] instead of $2.
awk -v locList='123|789' '$2~"^("locList")" {print $2,$4,$5}' file
or if you prefer:
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' file
or whatever other permutation you like. The point is you don't need a loop at all - just create a regexp from the list of numbers in locList and test that regexp once.
What I would do:
locList="123 789"
for i in $locList; do awk -v var="$i" '$2 ~ var {print $4, $5}' file; done
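The regexp-from-list idea is easy to verify with a small invented input file:

```shell
# Invented sample rows; only column 2 matters here.
printf 'chr1\t12345\t.\tA\tG\nchr1\t67890\t.\tC\tT\nchr2\t78901\t.\tG\tA\n' > inputFile.txt

# Turn "123, 789" into the anchored alternation ^(123|789)
# and test it once per line against column 2.
locList='123, 789'
awk -v locList="^(${locList//, /|})" '$2~locList {print $2,$4,$5}' inputFile.txt
# -> 12345 A G
#    78901 G A
```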

Compare strings using awk command

I am at the initial stage of learning shell scripting, so please explain the steps for better understanding.
Consider I have two files.
The content of the two files is as below:
File1.txt
ABC=10
DEF=20
XYZ=30
File2.txt
DEF=15
XYZ=20
I want to write a simple shell script that checks both files, adds the values, and prints the final output as below:
ABC=10
DEF=35
XYZ=50
You can use awk:
awk 'BEGIN{FS=OFS="="} FNR==NR{a[$1]=$2;next} {a[$1]+=$2}
END{for (i in a) print i, a[i]}' file1 file2
ABC=10
XYZ=50
DEF=35
Breakdown:
NR == FNR { # while processing the first file
a[$1] = $2 # store the second field, keyed by the first, in an array
next # move to the next record
}
{ # while processing the second file
a[$1] += $2 # add the 2nd field of the 2nd file to the stored value
}
END{..} # iterate over the array and print the values
If you want to keep the original ordering intact, use:
awk 'BEGIN{FS=OFS="="} FNR==NR{if (!($1 in a)) b[++n]=$1; a[$1]=$2;next} {a[$1]+=$2}
END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' file1 file2
ABC=10
DEF=35
XYZ=50
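Both variants are easy to check against the question's files:

```shell
# Recreate the question's two files.
printf 'ABC=10\nDEF=20\nXYZ=30\n' > File1.txt
printf 'DEF=15\nXYZ=20\n' > File2.txt

# Order-preserving variant: b[] remembers the first-seen order of
# keys from File1; a[] accumulates the values from both files.
awk 'BEGIN{FS=OFS="="} FNR==NR{if (!($1 in a)) b[++n]=$1; a[$1]=$2;next} {a[$1]+=$2}
END{for (i=1; i<=n; i++) print b[i], a[b[i]]}' File1.txt File2.txt
# -> ABC=10
#    DEF=35
#    XYZ=50
```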

Compare multiple Columns and Append the result into another file

I have two files, file1 and file2; both files have 5 columns.
I want to compare the first 4 columns of file1 with file2.
If they are equal, I need to compare the 5th column. If the 5th column values are different, I need to print file1's 5th column as file2's 6th column.
I have used the awk below to compare two columns in two different files, but how can I compare multiple columns and append the particular column to another file if a match is found?
awk -F, 'NR==FNR{_1[$1]++;next}!_1[$1]'
file1:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,topcat
132,item2,grade3,wing7,middlecat
134,item2,grade3,wing7,middlecat
177,item8,gradeA,wing11,lowcat
file2:
111,item1,garde1,wing1,maingroup
123,item3,grade5,wing10,lowcat
132,item3,grade3,wing7,middlecat
126,item2,grade3,wing7,maingroup
177,item8,gradeA,wing11,lowcat
Desired output:
123,item3,grade5,wing10,lowcat,topcat
Awk can simulate multidimensional arrays by accepting a comma-separated list of indices; underneath, the indices are concatenated using the built-in SUBSEP variable as a separator:
$ awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next } a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
123,item3,grade5,wing10,lowcat,topcat
awk -F, -v OFS=,
Set both input and output separators to ,
NR==FNR { a[$1,$2,$3,$4]=$5; next }
Create an associative array from the first file relating the first four fields of each line to the
fifth. When using a comma-separated list of values as an index, awk actually concatenates them
using the value of the built-in SUBSEP variable as a separator. This is awk's way of
simulating multidimensional arrays with a single subscript. You can set SUBSEP to any value you like
but the default, which is a non-printing character unlikely to appear in the data, is usually
fine. (You can also just do the trick yourself, something like a[$1 "|" $2 "|" $3 "|" $4],
assuming you know that your data contains no vertical bars.)
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }
Arriving here, we know we are looking at the second file. If the first four fields were found in the
first file, and the $5 from the first file is different than the $5 in the second, print the line
from the second file followed by the $5 from the first. (I am assuming here that no $5 from the first file will have a value that evaluates to false, such as 0 or empty.)
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $0; sub("(,[^,]*){"NF-4"}$","",key) }
NR==FNR { file1[key] = $5; next }
(key in file1) && ($5 != file1[key]) {
print $0, file1[key]
}
$ awk -f tst.awk file1 file2
123,item3,grade5,wing10,lowcat,topcat
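A quick check of the SUBSEP-key answer, with the files recreated from the question but trimmed to the interesting rows:

```shell
# Trimmed copies of the question's files: one row whose first four
# fields match but whose fifth differs, and one unrelated row.
printf '123,item3,grade5,wing10,topcat\n' > file1.txt
printf '123,item3,grade5,wing10,lowcat\n126,item2,grade3,wing7,maingroup\n' > file2.txt

# file1: key on fields 1-4 (joined with SUBSEP), store field 5.
# file2: print the row plus file1's field 5 when fields 1-4 match
# but field 5 differs.
awk -F, -v OFS=, 'NR==FNR { a[$1,$2,$3,$4]=$5; next }
a[$1,$2,$3,$4] && a[$1,$2,$3,$4] != $5 { print $0,a[$1,$2,$3,$4] }' file1.txt file2.txt
# -> 123,item3,grade5,wing10,lowcat,topcat
```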
