Replace in one file with value from another file not working properly - bash

I have two files. A mapping file and an input file.
cat map.txt
test:replace
cat input.txt
The word test should be replaced.But the word testbook should not be
replaced just because it has "_test" in it.
I'm using the command below to find the text in the input file and replace it with the value from the mapping file.
awk 'FNR==NR{ array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' FS=":" map.txt FS=" " input.txt
What it does is search for the text mentioned before the ":" in map.txt and replace each occurrence in the input file with the word that follows the ":".
In the above example, "test" is replaced with "replace".
Current result:
The word replace should be replaced.But the word replacebook should not be replaced just because it has _replace in it.
Expected Result:
The word replace should be replaced.But the word testbook should not be replaced just because it has "_test" in it.
So what I need is for the word to be replaced only when it appears on its own. If the word has any other characters attached to it, it should be ignored.
Any help is appreciated.
Thanks in advance.

With GNU awk for word boundaries:
awk -F':' '
NR==FNR { map[$1] = $2; next }
{
    for (old in map) {
        new = map[old]
        gsub("\\<"old"\\>",new)
    }
    print
}
' map input
The above will fail if old contains regexp metacharacters or escape characters, or if new contains &, but as long as both use only word-constituent characters it'll be fine.
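For example, with the sample map.txt and input.txt from the question (a minimal one-liner sketch of the same approach):
$ awk -F':' 'NR==FNR{map[$1]=$2; next} {for (old in map) gsub("\\<" old "\\>", map[old])} 1' map.txt input.txt
The word replace should be replaced.But the word testbook should not be replaced just because it has "_test" in it.
Note that GNU word boundaries treat the underscore as a word-constituent character, which is why "_test" is left alone.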

Loop over all the words and replace where needed:
$ awk '
NR==FNR {                  # hash the map file
    a[$1]=$2
    next
}
{
    for(i=1;i<=NF;i++)     # loop over every word...
        if($i in a)        # ... and if it's hashed...
            $i=a[$i]       # ... replace it
}1
' FS=":" map FS=" " input
The word replace should be replaced.But the word testbook should not be replaced just because it has "_test" in it.
Edit: Using match to extract words from strings in order to preserve punctuation:
$ cat input2
Replace would Yoda test.
$ awk '
NR==FNR {                            # hash the map file
    a[$1]=$2
    next
}
{
    for(i=1;i<=NF;i++) {
        # here should be an if to weed out obvious non-word-punctuation pairs
        # if($i ~ /^[a-zA-Z]+[,\.!?]/)
        match($i,/^[a-zA-Z]+/)       # match from beginning of word. ¿correct?
        w=substr($i,RSTART,RLENGTH)  # extract word
        if(w in a)                   # if it is found in a...
            sub(w,a[w],$i)           # ... replace it
    }
}1' FS=":" map FS=" " input2
Replace would Yoda replace.

Related

combining numbers from multiple text files using bash

I'm struggling to combine some data from my txt files generated by my Jenkins job.
Each file contains a single line; this is what each file looks like:
<testsuite name="mytest" cars="201" users="0" bus="0" bike="0" time="116.103016">
What I've managed to do so far is extract the numbers from each txt file:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.txt
Results are:
cars="193" users="2" bus="0" bike="0"
cars="23" users="2" bus="10" bike="7"
cars="124" users="2" bus="5" bike="0"
cars="124" users="2" bus="0" bike="123"
Now I have an arbitrary number of files like this:
my-output1.txt
my-output2.txt
my-output7.txt
my-output*.txt
I would like to create a single command, just like the one above, that sums the values across all of the files to produce the following result:
cars=464 users=8 bus=15 bike=130
Is there a way to do that with a single command?
Using awk
$ cat script.awk
BEGIN {
    FS="[= ]"
}
{
    gsub(/"/,"")
    for (i=1;i<NF;i++)
        if ($i=="cars") cars+=$(i+1)
        else if ($i=="users") users+=$(i+1)
        else if ($i=="bus") bus+=$(i+1)
        else if ($i=="bike") bike+=$(i+1)
}
END {
    print "cars="cars,"users="users,"bus="bus,"bike="bike
}
To run the script, you can use:
$ awk -f script.awk my-output*.txt
Or, as an ugly one-liner:
$ awk -F"[= ]" '{gsub(/"/,"");for (i=1;i<NF;i++) if ($i=="cars") cars+=$(i+1); else if($i=="users") users+=$(i+1); else if($i=="bus") bus+=$(i+1); else if ($i=="bike")bike+=$(i+1)}END{print"cars="cars,"users="users,"bus="bus,"bike="bike}' my-output*.txt
1st solution: With your shown samples, please try the following awk code, which uses the match function. Since awk can read multiple files in a single program, you can pass all the .txt files to the awk program directly.
It is written and tested in GNU awk, whose match function has a capturing-group capability that creates/stores values in an array to be used later in the program.
awk -v s1="\"" '
match($0,/[[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)"/,tempArr){
    temp=""
    for(i=2;i<=8;i+=2){
        temp=tempArr[i-1]
        values[i]+=tempArr[i]
        indexes[i-1]=temp
    }
}
END{
    for(i in values){
        val=(val?val OFS:"") (indexes[i-1]"=" s1 values[i] s1)
    }
    print val
}
' *.txt
Explanation:
At the start of the GNU awk program, create a variable named s1 set to " to be used later in the program.
Use the match function in the main body of the awk program.
The regex [[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)" (explained at the end of this post) creates 8 capture groups to be used later on.
Once the condition matches, run a for loop over only the even indexes (to get the required values).
Create an array named values, indexed by i, and keep adding the tempArr values to it, where tempArr is created by the match function.
Similarly, create an indexes array to store only the key names.
Then, in the END block of the program, traverse the values array and print the values from the indexes and values arrays as required.
Explanation of regex:
[[:space:]]+        ##Match one or more spaces here.
(cars)="([^"]*)"    ##Match cars=" till the next occurrence of " here.
(users)="([^"]*)"   ##Match a space followed by users=" till the next occurrence of " here.
(bus)="([^"]*)"     ##Match a space followed by bus=" till the next occurrence of " here.
(bike)="([^"]*)"    ##Match a space followed by bike=" till the next occurrence of " here.
2nd solution: Using GNU awk only, with the power of its RT and RS variables. This also ensures the values appear in the output in the same order in which they appear in the input.
awk -v s1="\"" -v RS='[[:space:]][^=]*="[^"]*"' '
RT{
    gsub(/^ +|"/,"",RT)
    num=split(RT,arr,"=")
    if(arr[1]!="time" && arr[1]!="name"){
        if(!(arr[1] in values)){
            indexes[++count]=arr[1]
        }
        values[arr[1]]+=arr[2]
    }
}
END{
    for(i=1;i<=count;i++){
        val=(val?val OFS:"") (indexes[i]"=" s1 values[indexes[i]] s1)
    }
    print val
}
' *.txt
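To see what RS and RT are doing here: when RS is a regex (a gawk feature), RT holds the exact text that matched the record separator for each record, so this program reads the key="value" tokens out of RT rather than out of the records themselves. A minimal standalone sketch with a simplified RS:
$ printf 'x="1" y="2"\n' | gawk -v RS='[a-z]+="[^"]*"' 'RT{print "token:", RT}'
token: x="1"
token: y="2"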
You may use this awk solution:
awk '{
    for (i=1; i<=NF; ++i)
        if (split($i, a, /=/) == 2) {
            gsub(/"/, "", a[2])
            sums[a[1]] += a[2]
        }
}
END {
    for (i in sums) print i "=" sums[i]
}' file*
bus=15
cars=464
users=8
bike=130
I found a way to do it, though it's a bit long:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.xml | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | awk '{bus+=$1;users+=$2;cars+=$3;bike+=$4 }END{print "bus=" bus " users="users " cars=" cars " bike=" bike}'
M. Nejat Aydin's answer was a good fit:
awk -F '[ "=]+' '/testsuite name=/{ cars+=$5; users+=$7; buses+=$9; bikes+=$11 } END{ print "cars="cars, "users="users, "buses="buses, "bikes="bikes }' my-output*.xml

AWK Script: Matching Two Files on a Unique Identifier and Appending to the Record on Match

I'm trying to compare two files, using a field as the unique identifier to match on.
File 1 has the account number, which is compared against the second file.
If the account number is in both files, the next step is a condition that matches the value and appends it to the original file or record.
Sample file 1:
ACCT1,PHONE1,TEST1
ACCT2,PHONE2,TEST3
Sample file 2:
ACCT1,SOMETHING1
ACCT1,SOMETHING3
ACCT1,SOMETHING1
ACCT1,SOMETHING3
ACCT2,SOMETHING1
ACCT2,SOMETHING3
ACCT2,SOMETHING1
ACCT2,SOMETHING1
But awk always keeps the last occurrence in the file, even when there is already a match before the end of the records.
Actual output, based on the condition below:
ACCT1,PHONE1,TEST1,000
ACCT2,PHONE2,TEST3,001
Expected Output:
ACCT1,PHONE1,TEST1,001
ACCT2,PHONE2,TEST3,001
The code I'm trying:
awk -f test.awk pass=0 samplefile2.txt pass=1 samplefile1.txt > output.txt
BEGIN{
}
pass==0{
    FS=","
    ACT=$1
    RES1[ACT]=$2
}
pass==1{
    ACCTNO=$1
    PHNO=$2
    FIELD3=$3
    LVCODE=RES1[ACCTNO]
    if(LVCODE=="SOMETHING1"){ OTHERFLAG="001" }
    else if(LVCODE=="SOMETHING4"){ OTHERFLAG="002" }
    else{ OTHERFLAG="000" }
    printf("%s,", ACCTNO)
    printf("%s,", PHNO)
    printf("%s,", FIELD3)
    printf("%s", OTHERFLAG)
    printf "\n"
}
I tried looping over the variable that holds the array, but unfortunately it turned into an infinite loop when I ran it.
You may use this awk command:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
    map[$1] = $0
    next
}
$1 in map {
    print map[$1], ($2 == "SOMETHING1" ? "001" : ($2 == "SOMETHING4" ? "002" : "000"))
    delete map[$1]
}' file1 file2
ACCT1,PHONE1,TEST1,001
ACCT2,PHONE2,TEST3,001
Once we print a matching record from file2, we delete that record from the associative array map to ensure that only the first matching record is evaluated.
It sounds like you want to know the first occurrence of ACCTx in samplefile2.txt if SOMETHING1 or SOMETHING4 is present. I think you should read samplefile1.txt into a data structure first and then iterate line by line over samplefile2.txt looking for your criteria:
BEGIN {
    FS=","
    # note the > 0 check: a bare getline returns -1 on error, which is
    # truthy and would loop forever if ACCOUNTFILE is missing
    while ((getline < ACCOUNTFILE) > 0) accounts[$1]=$0
}
{ OTHERFLAG = "" }
$2 == "SOMETHING1" { OTHERFLAG="001" }
$2 == "SOMETHING4" { OTHERFLAG="002" }
($1 in accounts) && OTHERFLAG!="" {
    print(accounts[$1] "," OTHERFLAG)
    # delete the account so that it does not print again;
    # only the first occurrence in samplefile2.txt will matter
    delete accounts[$1]
}
END {
    # print remaining accounts that did not match above
    for (acct in accounts) print(accounts[acct] ",000")
}
Run above with:
awk -v ACCOUNTFILE=samplefile1.txt -f test.awk samplefile2.txt
I am not sure what you want to do if both SOMETHING1 and SOMETHING4 are in samplefile2.txt for the same ACCT1. If you want 'precedence', so that SOMETHING4 overrules SOMETHING1 even if it comes later, you will need additional logic, as sketched below. In that case you probably want to avoid the delete and keep updating the flag for accounts[$1] until you reach the end of the file, then print all the accounts at the end.
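A hedged sketch of that variant, under the assumption that "002" (SOMETHING4) should outrank "001" (SOMETHING1) regardless of order: keep the best flag per account in a flag[] array and print everything in END instead of printing on first match.
BEGIN {
    FS=","
    while ((getline < ACCOUNTFILE) > 0) accounts[$1]=$0
}
# upgrade the per-account flag; "001" never overwrites an existing "002"
($1 in accounts) && $2 == "SOMETHING1" && flag[$1] == "" { flag[$1]="001" }
($1 in accounts) && $2 == "SOMETHING4"                   { flag[$1]="002" }
END {
    # print every account with the best flag seen, "000" if none matched
    for (acct in accounts)
        print accounts[acct] "," (flag[acct] == "" ? "000" : flag[acct])
}
Run it the same way: awk -v ACCOUNTFILE=samplefile1.txt -f test.awk samplefile2.txt. Note the output order of for (acct in accounts) is unspecified.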

Use awk to create index of words from file

I'm learning UNIX for school and I'm supposed to create a command line that takes a text file and generates a dictionary index showing the words (excluding articles and prepositions) and the lines where each appears in the file.
I found a similar problem to mine at https://unix.stackexchange.com/questions/169159/how-do-i-use-awk-to-create-an-index-of-words-in-file?newreg=a75eebee28fb4a3eadeef5a53c74b9a8. The problem is that when I run the solution
$ awk '
{
    gsub(/[^[:alpha:] ]/,"");
    for(i=1;i<=NF;i++) {
        a[$i] = a[$i] ? a[$i]", "FNR : FNR;
    }
}
END {
    for (i in a) {
        print i": "a[i];
    }
}' file | sort
The output contains special characters (which I don't want) like:
-Quiero: 21
Sancho,: 2, 4, 8
How can I remove all the special characters and exclude articles and prepositions?
$ echo This is this test. | # some test text
awk '
BEGIN{
    x["a"];x["an"];x["the"];x["on"]        # the stop words
    OFS=", "                               # list separator to a
}
{
    for(i=1;i<=NF;i++)                     # list words in a line
        if($i in x==0) {                   # if word is not a stop word
            $i=tolower($i)                 # lowercase it
            gsub(/^[^a-z]|[^a-z]$/,"",$i)  # remove leading and trailing non-alphabets
            a[$i]=a[$i] (a[$i]==""?"":OFS) NR  # add record number to list
        }
}
END {                                      # after file is processed
    for(i in a)                            # in no particular order
        print i ": " a[i]                  # ... print elements in a
}'
this: 1, 1
test: 1
is: 1
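awk's for (i in a) traverses the array in no particular order; to get an alphabetized index, pipe the output through sort as in the question. For instance, with the script above saved to a (hypothetical) index.awk:
$ echo This is this test. | awk -f index.awk | sort
is: 1
test: 1
this: 1, 1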

Ignore delimiters in quotes and excluding columns dynamically in csv file

I have an awk command to read a csv file with | as the separator. I am using this command as part of my shell script, where the columns to exclude are removed from the output. The list of columns is input as 1 2 3.
Command Reference: http://wiki.bash-hackers.org/snipplets/awkcsv
awk -v FS='"| "|^"|"$' '{for i in $test; do $(echo $i=""); done print }' test.csv
$test is 1 2 3
I want to put $1="" $2="" $3="" in front of the print-all-columns statement. I am getting this error:
awk: {for i in $test; do $(echo $i=""); done {print }
awk: ^ syntax error
This command works properly and prints all the columns:
awk -v FS='"| "|^"|"$' '{print }' test.csv
File 1
"first"| "second"| "last"
"fir|st"| "second"| "last"
"firtst one"| "sec|ond field"| "final|ly"
Expected output if I want to exclude columns 2 and 3 dynamically:
first
fir|st
firtst one
I need help writing the for loop properly.
With GNU awk for FPAT:
$ awk -v FPAT='"[^"]+"' '{print $1}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='2 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"second" "last"
"second" "last"
"sec|ond field" "final|ly"
$ awk -v flds='3 1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"last" "first"
"last" "fir|st"
"final|ly" "firtst one"
If you don't want your output fields separated by a blank char then set OFS to whatever you do want with -v OFS='whatever'. If you want to get rid of the surrounding quotes you can use gensub() (since we're using gawk anyway) or substr() on every field, e.g.:
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", substr($(f[i]),2,length($(f[i]))-2), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", gensub(/"/,"","g",$(f[i])), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
In GNU awk (for FPAT):
$ test="2 3" # fields to exclude in bash var $test
$ awk -v t="$test" ' # taken to awk var t
BEGIN { # first
FPAT="([^|]+)|( *\"[^\"]+\")" # instead of FS, use FPAT
split(t,a," ") # process t to e:
for(i in a) # a[1]=2 -> e[2], etc.
e[a[i]]
}
{
for(i=1;i<=NF;i++) # for each field
if((i in e)==0) { # if field # not in e
gsub(/^\"|\"$/,"",$i) # remove leading and trailing "
b=b (b==""?"":OFS) $i # put to buffer b
}
print b; b="" # putput and reset buffer
}' file
first
fir|st
firtst one
FPAT is used because FS can't handle a separator inside quotes, as the demonstration below shows.
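A quick demonstration of that limitation, using the question's data: a plain single-character FS splits blindly, even inside the quotes:
$ echo '"fir|st"| "second"' | awk -F'|' '{print $1}'
"fir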
Vikram, if your actual Input_file is exactly the same as the sample Input_file shown, then the following may help. An explanation is added below in EDIT1 (tested with GNU awk 3.1.7, a somewhat old version of awk).
awk -v num="2,3" 'BEGIN{
    len=split(num, val,",")
}
{
    while($0){
        match($0,/.[^"]*/);
        if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){
            array[++i]=substr($0,RSTART,RLENGTH+1)
        };
        $0=substr($0,RLENGTH+1);
    };
    for(l=1;l<=len;l++){
        delete array[val[l]]
    };
    for(j=1;j<=length(array);j++){
        if(array[j]){
            gsub(/^\"|\"$/,"",array[j]);
            printf("%s%s",array[j],j==length(array)?"":" ")
        }
    };
    print "";
    i="";
    delete array
}' Input_file
EDIT1: Adding the code with an explanation here.
awk -v num="2,3" 'BEGIN{ ##Create a variable named num whose value is the comma-separated list of fields you want to nullify, and start the BEGIN section.
    len=split(num, val,",") ##Create an array named val, split on commas, and a variable len holding the length of array val.
}
{
    while($0){ ##Start a while loop that runs on a single line until that line becomes null.
        match($0,/.[^"]*/); ##Use the match function, which matches from the start until a " comes into the match.
        if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){ ##RSTART and RLENGTH are set whenever a regex matches in the line/variable passed to the match function. This if checks: 1st, the substring from RSTART to RLENGTH+1 is not NULL; 2nd, the substring does not contain " pipe space "; 3rd, the substring is not just a single "; 4th, the substring is not ^" space "$. If all conditions are TRUE, perform the following action.
            array[++i]=substr($0,RSTART,RLENGTH+1) ##Create an array named array, indexed by an increasing i, whose value is the substring from RSTART to RLENGTH+1.
        };
        $0=substr($0,RLENGTH+1); ##Remove the matched part from the current line, which shrinks the line and prevents the while loop from becoming infinite.
    };
    for(l=1;l<=len;l++){ ##Once the while loop above is done, start a loop from l=1 to the value of len.
        delete array[val[l]] ##Delete the values which we want to REMOVE, as requested.
    };
    for(j=1;j<=length(array);j++){ ##Start a for loop from j=1 to the length of array.
        if(array[j]){ ##Make sure the array value at index j is NOT NULL; if so, perform the following statements.
            gsub(/^\"|\"$/,"",array[j]); ##Globally substitute the leading " and trailing " with NULL in the array value.
            printf("%s%s",array[j],j==length(array)?"":" ") ##Print the array value, then either a space or nothing: if j equals the array length print nothing, else print a space, because we do not want a trailing space at the end of the line.
        }
    };
    print ""; ##The printf above does NOT print a newline, so print one.
    i=""; ##Nullify variable i here.
    delete array ##Delete the array here.
}' Input_file ##Mention the Input_file here.

AWK find if line is newline or #

I have the following; it ignores lines containing just #, but not empty lines (those containing only a newline).
Do you know of a way I can hit two birds with one stone?
I.e. if a line doesn't contain more than one character, delete it.
function check_duplicates {
    awk '
    FNR==1 { files[FILENAME] }
    {
        if ((FILENAME, $0) in a) dupsInFile[FILENAME]
        else {
            a[FILENAME, $0]
            dups[$0] = $0 in dups ? (dups[$0] RS FILENAME) : FILENAME
            count[$0]++
        }
    }
    {
        if ($0 ~ /#/) {
            delete dups[$0]
        }
    }
    # Print duplicates in more than one file
    END {
        for (k in dups) {
            if (count[k] > 1) {
                print ("\n\nDuplicate line found: " k) " - In the following file(s)"
                print dups[k]
            }
        }
        printf "\n"
    }' $SITEFILES

    awk '
    NR {
        b[$0]++
    }
    $0 in b {
        if ($0 ~ /#/) {
            delete b[$0]
        }
        if (b[$0] > 1) {
            print ("\n\nRepeated line found: " $0) " - In the following file"
            print FILENAME
            delete b[$0]
        }
    }' $SITEFILES
}
The expected input is usually as follows.
#File Path's
/path/to/file1
/path/to/file2
/path/to/file3
/path/to/file4
#
/more/paths/to/file1
/more/paths/to/file2
/more/paths/to/file3
/more/paths/to/file4
/more/paths/to/file5
/more/paths/to/file5
In this case, /more/paths/to/file5 occurs twice and should be flagged as such.
However, there are also many newlines, which I'd rather ignore.
Er, it also has to be awk; I'm doing a tonne of post-processing and don't want to vary from awk for this bit, if that's okay :)
It really seems to be a bit tougher than I would have expected.
Cheers,
Ben
You can combine both if checks into a single regex.
if ($0 ~ /#|\n/) {
    delete dups[$0]
}
Or, to be more precise, you can write the following (note that $0 never contains a newline, since awk strips the record separator, and /#/ alone would match a # anywhere in the line):
if ($0 ~ /^#?$/) {
    delete dups[$0]
}
What it does:
^ Matches the start of the line.
#? Matches zero or one #.
$ Matches the end of the line.
So ^$ matches empty lines, and ^#$ matches lines containing only #.
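So a pattern like the following skips both kinds of line in one test (a minimal sketch using the sample input above):
$ printf '#\n\n/path/to/file1\n' | awk '$0 ~ /^#?$/ {next} {print "kept: " $0}'
kept: /path/to/file1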
