Basically, I want to check whether the strings present in lookup_1 & lookup_2 exist in my xyz.txt file, then perform the substitution & redirect the output to an output file. Also, my code currently substitutes every occurrence of the strings in lookup_1, even as substrings, but I need it to substitute only on a whole-word match.
Can you please help in tweaking the code to achieve the same?
code
awk '
FNR==NR { if ($0 in lookups)
next
lookups[$0]=$0
for (i=1;i<=NF;i++) {
oldstr=$i
newstr=""
while (oldstr) {
len=length(oldstr)
newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
oldstr=substr(oldstr,4)
}
ndx=index(lookups[$0],$i)
lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
}
next
}
{ for (i in lookups) {
ndx=index($0,i)
while (ndx > 0) {
$0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
ndx=index($0,i)
}
}
print
}
' lookup_1 xyz.txt > output.txt
lookup_1
ha
achine
skhatw
at
ree
ter
man
dun
lookup_2
United States
CDEXX123X
Institution
xyz.txt
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
current output
[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States
desired output
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
We can make a couple of changes to the current code:
feed the results of cat lookup_1 lookup_2 into awk such that it looks like a single file to awk (see last line of new code)
use word boundary flags (\< and \>) to build regexes with which to perform the replacements (see 2nd half of new code)
The new code:
awk '
# the FNR==NR block of code remains the same
FNR==NR { if ($0 in lookups)
next
lookups[$0]=$0
for (i=1;i<=NF;i++) {
oldstr=$i
newstr=""
while (oldstr) {
len=length(oldstr)
newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
oldstr=substr(oldstr,4)
}
ndx=index(lookups[$0],$i)
lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
}
next
}
# complete rewrite of the following block to perform replacements based on a regex using word boundaries
{ for (i in lookups) {
regex= "\\<" i "\\>" # build regex
gsub(regex,lookups[i]) # replace strings that match regex
}
print
}
' <(cat lookup_1 lookup_2) xyz.txt # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code
This generates:
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
NOTES:
the 'boundary' operators (\< and \>) match the empty string at the start/end of a word; in (GNU) awk a word is defined as a sequence of letters, digits and underscores; see GNU awk - regex operators for more details
all of the sample lookup values fall within the definition of an awk word so this new code works as desired
your previous question includes lookup values that cannot be considered as an awk 'word' (eg, #vanti Finserv Co., 11:11 - Capital, MS&CO(NY)) in which case this new code may fail to replace these new lookup values
for lookup values that contain non-word characters it's not clear how you would define 'whole word match' as you would also need to determine when a non-word character (eg, #) is to be treated as part of a lookup string vs being treated as a word boundary
If you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary characters with \W, though this then causes problems for the lookup values that are (awk) 'words'.
One possible workaround may be to run a dual set of regex matches for each lookup value, eg:
awk '
FNR==NR { ... no changes to this block of code ... }
{ for (i in lookups) {
regex= "\\<" i "\\>"
gsub(regex,lookups[i])
regex= "\\W" i "\\W"
gsub(regex,lookups[i])
}
print
}
' <(cat lookup_1 lookup_2) xyz.txt
You'll need to determine if the 2nd regex breaks your 'whole word match' requirement.
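One thing to be aware of: gsub() replaces the entire match, so the \W variant also consumes the boundary characters themselves (and since \W must match an actual character, occurrences at the very start or end of a line are missed). A quick illustration of the problem, plus a GNU awk gensub() variant that restores the boundaries with backreferences (the sample word "ter" is taken from lookup_1):
$ echo "see ter here" | gawk '{ gsub(/\Wter\W/, "t##"); print }'
seet##here
$ echo "see ter here" | gawk '{ print gensub(/(\W)ter(\W)/, "\\1t##\\2", "g", $0) }'
see t## here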
I need to convert the file input.csv:
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith#example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,
(Done!) Need to update column name. Name format: first letter of name/surname uppercase and all other letters lowercase.
(Done!) Need to update column email with domain #abc. Email format: first letter from name and full surname, lowercase
(Not done) Emails with the same ID should contain numbers. Example: Name Max Houston, email mhouston1#examples.com etc.
#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
surnametemp="${rec_column5:0:1}$(echo $rec_column5 | awk '{print $2}')"
namesurname=$(echo $rec_column5 | sed 's! .!\U&!g')
echo $rec_column1","$rec_column2","$rec_column3","$rec_column4","$namesurname","$rec_column6",""${surnametemp,,}#abc.com"","$rec_column8 >>accounts_new.csv
done < <(tail -n +2 $inputfile)
How can I do that?
Output file:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy2#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
Task specification
This task would be much easier if specified otherwise:
add email iterator to every email
or
add email iterator to the second, third... occurrence
But it was specified:
add email iterator to every email if email is used multiple times.
This specification requires double iteration through lines, thus making this task more difficult.
The right tool
My rule of thumb is: use basic shell tools (grep, sed, etc.) for simple tasks, awk for moderate tasks and python for complicated tasks. In this case (double iteration over lines) I would use python. However, there was no python tag on the question, so I used awk.
Solution
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
OFS = ","
};
{
if ($7 == "") {
split($5,name," ");
firstname = substr(tolower(name[1]),1,1);
lastname = tolower(name[2]);
domain="#abc.com";
$7=firstname "." lastname domain;
};
emailcounts[$7]++;
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8;
emails[iter]=$7;
}
END {
for (iter = 1; iter in immutables; iter++) {
if (emailcounts[emails[iter]] > 1) {
emailiter[emails[iter]]++;
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]);
} else {
email=emails[iter]
};
print immutables[iter], email
}
}'
Results
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith#example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1#abc.com
14,9,,,David Genesy,Library Director,,d.genesy2#abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza#abc.com
Explanation
-vFPAT='[^,]*|"[^"]*"' read csv
$7=firstname "." lastname domain; substitute email field
emailcounts[$7]++ count email occurrences
iter iterator to preserve order
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8 save non-email fields for the second loop
emails[iter]=$7 save email for second loop
for (iter = 1; iter in immutables; iter++) iterate over the saved lines in input order (a plain for-in loop does not guarantee order in awk)
{if (emailcounts[emails[iter]] > 1) change email if more than 1 occurrence
emailiter[emails[iter]]++ increase email iterator
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]) add iterator to email
print immutables[iter], email print
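Note that gensub() and FPAT are GNU awk (gawk) features; gensub() returns the modified string instead of editing in place. A one-line illustration of the gensub() call used above (the address is made up for the example):
$ gawk 'BEGIN { print gensub(/#/, "2#", "g", "mhouston#example.com") }'
mhouston2#example.com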
With the input (mailscsv) file as:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
You can use awk and so:
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mailscsv
Set the field delimiter to , and then create an array keyed by email address, incrementing the count every time the address is encountered. If there is more than one occurrence of the address, split the address into another array mail1 based on "#". Set $7 to the first element of mail1 (the address before #), followed by the value of the mails count for that address, then "#" and the second element of mail1 (the section after #). If there is only one occurrence of the email address, simply leave the line as is. Use 1 to print the line.
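For readability, here is the same one-liner laid out as a commented script (a straight reformatting, no behavior change intended):
awk -F, '
NR > 1 {                      # leave the header line untouched
    mails[$7]++               # count how many times this address has been seen
    if (mails[$7] > 1) {      # a repeat: splice the counter into the address
        OFS = ","
        split($7, mail1, "#") # mail1[1] = local part, mail1[2] = domain
        $7 = mail1[1] mails[$7] "#" mail1[2]
    }
}
1                             # print every record
' mailscsv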
I have a file suppose xyz.dat which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files(aka m.dat and o.dat) from original xyz.dat.
M.dat contains columns 2|4|6 like below after running some logic on it -
b11|d11|f11
b22|d22|f22
b33|d33|f33
O.dat contains all the columns except 2|4|6 like below without any change in it -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge both M and O file to create back the original format xyz.dat file.
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note that the column positions can change for another file. I will be given the positions (in the example above they are 2, 4 and 6), so I need either a generic command to run in a loop to merge the new M and O files, or a single command to which I can pass the column positions and which will copy the columns from the M.dat file and paste them into the O.dat file.
I tried paste, sed and cut but was not able to put together a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, Awk, Perl or even bash). Tools like paste, sed and cut do not have enough flexibility for those tasks (join may come close, but requires extra work).
Consider the following awk based script
awk -vOFS='|' '-F|' '
{
getline s < "o.dat"
n = split(s. a)
# Print output, Add a[n], or $n, ... as needed based on actual number of fields.
print $1, a[1], $2, a[2], $3, a[3], a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, it looks like the goal is: given an input of two files, and a list of columns where data should be merged in from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, whereas the other columns will contain data from a.dat.
Implementation, using awk (create a file named mergeCols):
#! /usr/bin/awk -f
BEGIN {
FS=OFS="|"
}
NR==1 {
# Set the column map
nc=split(COLS, c, ",")
for (i=1 ; i<=nc ; i++ ) {
cmap[c[i]] = i
}
}
{
# Read one line from merged file, split into tokens in 'a'
getline s < M
n = split(s, a)
# Merge columns using pre-set 'cmap'
k=0
for (i=1 ; i<=NF+nc ; i++ ) {
# Pick up a column
v = cmap[i] ? a[cmap[i]] : $(++k)
sep = (i<NF+nc) ? "|" : "\n"
printf "%s%s", v, sep
}
}
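Applying this to the m.dat and o.dat samples from the question (COLS=2,4,6 says the three columns of m.dat land in positions 2, 4 and 6) should reproduce the original layout:
$ awk -f mergeCols COLS=2,4,6 M=m.dat o.dat
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3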
I am trying to understand the following code used to pull out overlapping lines over multiple files using BASH.
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
sprintf("\t%-20s -->\t%s", rec[R], R)
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}' file[a-d]
After understanding what each sub-block of code is doing, I would like to extend this code to find specific fields with overlap, rather than the entire line. For example, I have tried changing the line:
n = split(rec[R], t, "/")
to
n = split(rec[R$1], t, "/")
to find the lines where the first field is the same across all files but this did not work. Eventually I would like to extend this to check that a line has fields 1, 2, and 4 the same, and then print the line.
Specifically, for the files mentioned in the example in the link:
if file 1 is:
chr1 31237964 NP_055491.1 PUM1 M340L
chr1 33251518 NP_037543.1 AK2 H191D
and file 2 is:
chr1 116944164 NP_001533.2 IGSF3 R671W
chr1 33251518 NP_001616.1 AK2 H191D
chr1 57027345 NP_001004303.2 C1orf168 P270S
I would like to pull out:
file1/file2 --> chr1 33251518 AK2 H191D
I found this code at the following link:
http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent from the files themselves. It is unclear from the comments provided, and the printf statements I've added within the sub-blocks haven't helped.
Thank you very much for any insight on this!
The script works by building an auxiliary array, the indices of which are the lines in the input files (denoted by $0 in rec[$0]), and the values are filename1/filename3/... for those filenames in which the given line $0 is present. You can hack it up to just work with $1,$2 and $4 like so:
awk 'END {
# the END block is executed after
# all the input has been read
# loop over the rec array
# and build the dup array indexed by the number of
# filenames containing a given record
for (R in rec) {
n = split(rec[R], t, "/")
if (n > 1) {
split(R,R1R2R4,SUBSEP)
dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
}
}
# loop over the dup array
# and report the number and the names of the files
# containing the record
for (D in dup) {
printf "records found in %d files:\n\n", D
printf "%s\n\n", dup[D]
}
}
{
# build an array named rec (short for record), indexed by
# the partial content of the current record
# (special concatenation of $1, $2 and $4)
# concatenating the filenames separated by / as values
rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
}' file[a-d]
this solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special syntax of awk concatenates the indices with the SUBSEP character, which is by default non-printable ("\034" to be precise), and so it is unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script, and finish with the END block.
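A tiny self-contained demonstration of the SUBSEP mechanics:
$ awk 'BEGIN { a["x","y"] = 1; for (k in a) { split(k, p, SUBSEP); print p[1], p[2] } }'
x y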
The first part of the code also has to be changed: now for (R in rec) loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is good while indexing, but you need to split R at the SUBSEP characters to recover the printable fields $1, $2, $4. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we're doing sprintf ...%s,...,$1,$2,$4 with the pre-saved fields $1, $2, $4. For your input example this will print
records found in 2 files:
foo11.inp1/foo11.inp2 --> chr1 33251518 AK2
Note that the output is missing H191D, but rightly so: that is not in field 1, 2 or 4 (but rather in field 5), so there's no guarantee that it is the same in the matched files! You probably don't want to print that, or at any rate you would have to specify how to treat the columns that are not checked between files (and so may differ).
A bit of explanation for the original code:
rec is an array, the indices of which are full lines of input, and the values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then rec["foo bar"]=="file1" initially. If then file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there are no checks for multiplicity, so if file1 contains this line twice, then eventually you'll get rec["foo bar"]=file1/file1/file2 and obtain 3 for the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually assume each unique line of every input file, allowing us to loop over rec[R], containing the filenames in which that specific line R was present.
n is a return value from split, which splits the value of rec[R] --- that is the filename list corresponding to line R --- at each slash. Eventually the array t is filled with the list of files, but we don't make use of this, we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1, we don't do anything, only if there are multiplicities.
the loop over n creates classes according to the multiplicity of a given line. n==2 applies to lines that are present in exactly 2 files. n==3 to those which appear thrice, and so on. What this loop does is that it builds an array dup, which for every multiplicity class (i.e. for every n) creates the output string "filename1/filename2/... --> R", with each of these strings separated by RS (the record separator) for each value of R that appears n times total in the files. So eventually dup[n] for a given n will contain a given number of strings in the form of "filename1/filename2/... --> R", concatenated with the RS character (by default a newline).
The loop over D in dup will then go through multiplicity classes (i.e. valid values of n larger than 1), and print the gathered output lines which are in dup[D] for each D. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there aren't any, then dup is empty, and the loop over D will not do anything).
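As a concrete sketch, suppose the original script is saved as common.awk (a made-up name) and run on two throwaway files. After both files are read, rec["x"]=="filea/fileb", rec["y"]=="filea" and rec["z"]=="fileb"; only rec["x"] splits into n==2 parts, so dup[2] holds a single report line:
$ printf 'x\ny\n' > filea ; printf 'x\nz\n' > fileb
$ awk -f common.awk filea fileb
records found in 2 files:

	filea/fileb          -->	x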
First, you'll need to understand the 3 blocks of an AWK script:
BEGIN{
# A code that is executed once before the data processing start
}
{
# block without a name (default/main block)
# executed per line of input
# $0 contains all line data/columns
# $1 first column
# $2 second column, and so on..
}
END{
# A code that is executed once after all data processing finished
}
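A toy run showing when each block fires:
$ printf 'a\nb\n' | awk 'BEGIN { print "start" } { print NR ": " $0 } END { print "end" }'
start
1: a
2: b
end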
so you'll probably need to edit this part of the script:
{
# build an array named rec (short for record), indexed by
# the content of the current record ($0), concatenating
# the filenames separated by / as values
rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}
I began learning awk yesterday in an attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool for accessing/manipulating lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick
things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
valid=1; #boolean variable to keep track if x is valid and should be printed
if ($x ~ /foo/){ #x is valid unless it contains foo
valid=0; #invalidate x so that it doesn't get printed at the end
next;
}
if ($0 ~ /bar/){ #if the current line contains bar
valid = 0; #x is invalid (don't print the previous line)
while (NF == 0){ #don't print until we reach an empty line
next;
}
}
if (valid == 1){ #x was a valid line
print x;
}
x=$0; #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem but I'm interested in learning how this would be done):
Ability to remove n lines before a pattern match
Option to include/exclude the blank line in the output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either ignored or output, depending on other events, such as the occurrence of foo and bar. The empty variable keeps track of whether the last variable is really a blank line, or simply empty from inception (e.g., at BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N lines as desired; a rough sketch follows.
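Here is a rough, untested sketch of that idea (assumes N >= 1 and whole-array delete, a common extension supported by gawk; drop.awk is a made-up file name; run as awk -v N=2 -f drop.awk input):
/foo/ { next }
/bar/ { delete buf; n = 0; skip = 1; next }     # discard the N buffered lines
skip  { if (/^ *$/) skip = 0; next }            # swallow up to and including the blank line
{
    if (n == N) print buf[1]                    # the oldest buffered line is now safe to print
    for (i = 1; i < N; i++) buf[i] = buf[i+1]   # slide the window up
    buf[N] = $0
    if (n < N) n++
}
END { for (i = N - n + 1; i <= N; i++) print buf[i] }  # flush whatever is still buffered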
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but, lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
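For example, with N=2 the second alternative in the RE would become (a sketch based on the note above, not tested against real data):
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n([^\n]*\n){2}[^\n]*bar.*/,"") }1' file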
This awk should work without storing the full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[NR-1]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If line contains foo skip it.
If flag is enabled and line is not blank skip it.
If flag is enabled and line is blank disable the flag.
If line contains bar delete the previous line, reset the counter, enable the flag and skip it
Store all lines that make it through in an array indexed by an incrementing number
In the END block print the lines.
Side Notes:
To remove n lines before a pattern match, you can create a loop: starting from the current index and using a reverse for loop, remove lines from your temporary cache (array), then subtract n from your self-defined counter variable (a sketch follows these notes).
To include or exclude blank lines you can use the NF variable. For a typical line, NF is set to the number of fields based on your field separator; for blank lines it is 0. For example, if you modify the line above the END block to NF { line[++idx] = $0 } in the answer above, you will see that all blank lines are bypassed in the output.
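A sketch of the first note, patching the /bar/ rule from the answer above so it discards the previous n buffered lines (n would be supplied with -v n=...; this is illustrative, not drop-in code):
/bar/ {
    for (i = 0; i < n && idx > 0; i++)   # walk backwards through the cached lines
        delete line[idx--]
    flag = 1
    next
}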