Remove duplicate lines in braces - bash

I have a file that contains:
l1_lololo {
abcdef
vgjhklfgkchbnvu
gfuhjfythkjbgftyhkjgyftuihgt6
deefgik
abcdef
}
l2_blabla {
123456
vgghyfthjfgtrdygfhhbnvu
gfuhjgvftdyfgvjgyftuihgt6
deiulouk
123456
}
I need to process the text inside the braces with sed/awk/bash/etc. and remove duplicate lines, keeping only the first occurrence of each recurring line within each pair of braces. I need to get this:
l1_lololo {
abcdef
vgjhklfgkchbnvu
gfuhjfythkjbgftyhkjgyftuihgt6
deefgik
}
l2_blabla {
123456
vgghyfthjfgtrdygfhhbnvu
gfuhjgvftdyfgvjgyftuihgt6
deiulouk
}
How can I do this?

If you can guarantee that the blocks end with a line containing only }, it could be done as simply as:
awk '/^}$/ {delete a} !a[$0]++' input
If you need a more robust solution, perhaps just add some whitespace to the pattern to match the end of a block. But if you want a full parser and want to match braces carefully, awk is probably not suited for the task.
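As a concrete sketch of that whitespace tweak (assuming an awk where `delete a` clears a whole array, which GNU awk, mawk and BWK awk all support; use split("", a) otherwise), with the sample input recreated inline so the snippet runs standalone:

```shell
# Recreate a reduced version of the sample input.
cat > input <<'EOF'
l1_lololo {
abcdef
vgjhklfgkchbnvu
deefgik
abcdef
}
l2_blabla {
123456
deiulouk
123456
}
EOF
# Clear the seen-lines array at a closing brace that may carry whitespace,
# and print only the first occurrence of each line within a block.
awk '/^[[:space:]]*}[[:space:]]*$/ { delete a } !a[$0]++' input
```

This prints the input with the second abcdef and second 123456 removed.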

If you're open to other languages, this is really easy to do in tcl thanks to the input being in tcl list format, allowing you to use it to do all the parsing without any potentially fragile regular expressions:
#!/usr/bin/env tclsh
package require Tcl 8.5
foreach {key lst} [read stdin] {
foreach item $lst { dict set seen $item 1 }
puts "$key {\n\t[join [dict keys $seen] \n\t]\n}\n"
unset seen
}
Example:
$ ./dedup < input.txt
l1_lololo {
abcdef
vgjhklfgkchbnvu
gfuhjfythkjbgftyhkjgyftuihgt6
deefgik
}
l2_blabla {
123456
vgghyfthjfgtrdygfhhbnvu
gfuhjgvftdyfgvjgyftuihgt6
deiulouk
}

The desired result can be achieved with the following Perl code (data stored in a hash):
use strict;
use warnings;
use feature 'say';
my $data = do{ local $/; <DATA> }; # read whole data
my %records = $data =~ /(\w+)\s+\{\s*(.*?)\s*\}/sg; # split into records
while( my($k,$v) = each %records ) { # for each record split into array
my %seen; # reset per record so duplicates are only suppressed within a block
my @array = grep { !$seen{$_}++ } split /\s+/, $records{$k}; # keep first occurrence of each element
$records{$k} = \@array; # store array reference in hash
}
while( my($k,$v) = each %records ) { # each record
say "$k = {"; # output hash key
say "\t$_" for @{$v}; # output each element of array
say "}\n"; # done
}
__DATA__
l1_lololo {
abcdef
vgjhklfgkchbnvu
gfuhjfythkjbgftyhkjgyftuihgt6
deefgik
abcdef
}
l2_blabla {
123456
vgghyfthjfgtrdygfhhbnvu
gfuhjgvftdyfgvjgyftuihgt6
deiulouk
123456
}
Output
l1_lololo = {
abcdef
vgjhklfgkchbnvu
gfuhjfythkjbgftyhkjgyftuihgt6
deefgik
}
l2_blabla = {
123456
vgghyfthjfgtrdygfhhbnvu
gfuhjgvftdyfgvjgyftuihgt6
deiulouk
}

This might work for you (GNU sed):
sed -E '/^\S+ \{/{:a;N;s/((\n[^\n]*)(\n.*)*)\2$/\1/;/\n\}$/!ba}' file
If a line begins with some text followed by a {, append the next line and remove the last line if it matches a preceding line. Repeat until a line containing only a } is appended, then print the result.
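A self-contained run of that sed command on a reduced sample (GNU sed assumed; -E together with \S is a GNU extension):

```shell
# Recreate a small block with one duplicate line.
cat > file <<'EOF'
l1_lololo {
abcdef
deefgik
abcdef
}
EOF
# Append lines one at a time; whenever the newly appended line duplicates an
# earlier line in the block, delete it; stop at the closing brace.
sed -E '/^\S+ \{/{:a;N;s/((\n[^\n]*)(\n.*)*)\2$/\1/;/\n\}$/!ba}' file
```

The duplicate abcdef is dropped, leaving a four-line block.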

Bash script to compare and generate csv datafile

I have two CSV files, DATA1.csv and DATA2.csv; the content is something like this (with headers):
DATA1.csv
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
DATA2.csv
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
Issues:
The values of the 1st and 2nd columns of DATA1.csv appear, embedded at arbitrary positions, in the first column of DATA2.csv.
For example, MAD01;HDGF exists in the first column of DATA2 as ***MAD01***HDGF** (* can be alphanumeric and/or symbol characters), and MAD01 and HDGF might not appear in the same order in the USER column of DATA2.
The value of strnu in DATA1 is equal to the value of the column BINin in DATA2.
The column fav in DATA1 corresponds to TYPE in DATA2, because V T = M and V PO = N (some other values may exist but we won't need them; for example, line 3 of DATA1 should be ignored).
N.B.: some data may exist in one file but not the other.
My bash script needs to generate a new CSV file that should contain:
The column USER from DATA2
Client Name and strnu from DATA1
BINin from DATA2, only if it's equal to the corresponding line's strnu value in DATA1
TYPE in M/N format, making sure to respect the condition that V T = M and V PO = N
The first thing I tried was using grep to search for lines that exist in both files:
#!/bin/sh
DATA1="${1}"
DATA2="${2}"
for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
grep "$i" $DATA2
done
Result :
$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO
Using grep and awk I could find lines that are present in both DATA1 and DATA2, but it doesn't work for all the lines. I guess that's because of the - and other special characters present in column 2 of DATA1, but those can be ignored.
I don't know how I can generate a new CSV that combines the lines present in both files, but the expected generated CSV should look like this:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M
This can be done in a single awk program. This is join.awk
BEGIN {
FS = OFS = ";"
print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 {next}
NR == FNR {
strnu[$1] = $2
next
}
{
for (client in strnu) {
strnu_pattern = strnu[client]
gsub(/-/, "", strnu_pattern)
if ($1 ~ client && $1 ~ strnu_pattern) {
print $1, client, strnu[client], $2, $3
break
}
}
}
and then
awk -f join.awk DATA1.csv DATA2.csv
outputs
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
Assumptions/understandings:
ignore lines from DATA1.csv where the fav field is not one of V T or V PO
when matching fields we need to ignore any hyphens in the DATA1.csv fields
when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
the last line of the expected output should end with 635;N
One awk idea:
awk '
BEGIN { FS=OFS=";"
print "USER","Client Name","strnu","BINin","TYPE" # print new header
}
FNR==1 { next } # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") { # only process if fav is one of "V PO" or "V T"
cnames[FNR]=$1 # save client name
strnus[FNR]=$2 # save strnu
}
next
}
{ for (i in cnames) { # loop through array indices
cname=cnames[i] # make copy of client name ...
strnu=strnus[i] # and strnu so that we can ...
gsub(/-/,"",cname) # strip hyphens from both ...
gsub(/-/,"",strnu) # in order to perform the comparisons ...
if (index($1,cname) && index($1,strnu)) { # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
print $1,cnames[i],strnus[i],$2,$3 # print to stdout
next # we found a match so break from loop and go to next line of input
}
}
}
' DATA1.csv DATA2.csv
This generates:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
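Both answers above copy TYPE straight from DATA2. If instead TYPE should be derived from DATA1's fav field via the stated rule V T = M and V PO = N (which would make the last line end in M, as in the question's expected output), here is a hedged tweak of the same join; join2.awk is a made-up name, and the sample data is recreated inline:

```shell
cat > DATA1.csv <<'EOF'
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
EOF
cat > DATA2.csv <<'EOF'
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
EOF
cat > join2.awk <<'EOF'
BEGIN { FS = OFS = ";"
        map["V PO"] = "N"; map["V T"] = "M"     # fav -> TYPE mapping from the question
        print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 { next }                               # skip both headers
FNR == NR { if ($4 in map) {                    # keep only rows with a mappable fav
                cnames[FNR] = $1; strnus[FNR] = $2; favs[FNR] = $4
            }
            next }
{ for (i in cnames) {
      cname = cnames[i]; strnu = strnus[i]
      gsub(/-/, "", cname); gsub(/-/, "", strnu)  # hyphens are ignored when matching
      if (index($1, cname) && index($1, strnu)) {
          print $1, cnames[i], strnus[i], $2, map[favs[i]]
          next
      }
  }
}
EOF
awk -f join2.awk DATA1.csv DATA2.csv
```

With this variant the last output line ends in M, taken from DATA1's V T.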

AWK print block that does NOT contain specific text

I have the following data file:
variable "ARM_CLIENT_ID" {
description = "Client ID for Service Principal"
}
variable "ARM_CLIENT_SECRET" {
description = "Client Secret for Service Principal"
}
# [.....loads of code]
variable "logging_settings" {
description = "Logging settings from TFVARs"
}
variable "azure_firewall_nat_rule_collections" {
default = {}
}
variable "azure_firewall_network_rule_collections" {
default = {}
}
variable "azure_firewall_application_rule_collections" {
default = {}
}
variable "build_route_tables" {
description = "List of Route Table keys that need direct internet prior to Egress FW build"
default = [
"shared_services",
"sub_to_afw"
]
}
There are two things I wish to do:
print the variable names without the inverted commas
ONLY print the variables names if the code block does NOT contain default
I know I can print the variables names like so: awk '{ gsub("\"", "") }; (/variable/ && $2 !~ /^ARM_/) { print $2}'
I know I can print the code blocks with: awk '/variable/,/^}/', which results in:
# [.....loads of code output before this]
variable "logging_settings" {
description = "Logging settings from TFVARs"
}
variable "azure_firewall_nat_rule_collections" {
default = {}
}
variable "azure_firewall_network_rule_collections" {
default = {}
}
variable "azure_firewall_application_rule_collections" {
default = {}
}
variable "build_route_tables" {
description = "List of Route Table keys that need direct internet prior to Egress FW build"
default = [
"shared_services",
"sub_to_afw"
]
}
However, I cannot find out how to print the code blocks "if" they don't contain default. I know I will need an if statement, and perhaps some variables, but I am unsure how.
For example, this code block should NOT appear in the output from which I grab the variable names:
variable "build_route_tables" {
description = "List of Route Table keys that need direct internet prior to Egress FW build"
default = [
"shared_services",
"sub_to_afw"
]
}
The end output should NOT contain those that had default:
# [.....loads of code output before this]
expressroute_settings
firewall_settings
global_settings
peering_settings
vnet_transit_object
vnet_shared_services_object
route_tables
logging_settings
Preferably I would like to keep this to a single awk command or file, with no piping. I have uses for this that require no piping.
EDIT: update the ideal outputs (missed some examples of those with default)
Assumptions and collection of notes from OP's question and comments:
all variable definition blocks end with a right brace (}) in the first column of a new line
we only display variable names (sans the double quotes)
we do not display the variable names if the body of the variable definition contains the string default
we do not display the variable name if it starts with the string ARM_
One (somewhat verbose) awk solution:
NOTE: I've copied the sample input data into my local file variables.dat
awk -F'"' '                   # use double quotes as the input field separator
/^variable / && $2 !~ "^ARM_" { varname = $2 # if the line starts with "variable ", and field #2 is not like "^ARM_", save field #2 for later display
printme = 1                   # enable our print flag
}
/default/ { printme = 0 }     # if we find the string "default" anywhere in the block, disable the print flag
/^}/ && printme { print varname # at the closing brace, if the print flag is still enabled, print the variable name and then ...
printme = 0                   # disable the print flag
}
' variables.dat
This generates:
logging_settings
$ awk -v RS= '!/default =/{gsub(/"/,"",$2); print $2}' file
ARM_CLIENT_ID
ARM_CLIENT_SECRET
[.....loads
logging_settings
Of course the output doesn't match yours, since your expected output is inconsistent with the posted input data.
Using GNU awk:
awk -v RS="}" '/variable/ && !/default/ && !/ARM/ { var=gensub(/(^.*variable ")(.*)(".*{.*)/,"\\2",1); print var }' file
Set the record separator to "}" and then check for records that contain "variable", don't contain default and don't contain "ARM". Use gensub to split the string into three sections based on regular expressions and set the variable var to the second section. Print the var variable.
Output:
logging_settings
Another variation on awk, using a skip variable to control the array index holding the variable names:
awk '
/^[[:blank:]]*#/ { next }
$1=="variable" { gsub(/["]/,"",$2); vars[skip?n:++n]=$2; skip=0 }
$1=="default" { skip=1 }
END { if (skip) n--; for(i=1; i<=n; i++) print vars[i] }
' code
The first rule just skips comment lines. If you want to skip "ARM_" variables, then you can add a test on $2.
Example Use/Output
With your example code in code, all variables without default are:
$ awk '
> /^[[:blank:]]*#/ { next }
> $1=="variable" { gsub(/["]/,"",$2); vars[skip?n:++n]=$2; skip=0 }
> $1=="default" { skip=1 }
> END { if (skip) n--; for(i=1; i<=n; i++) print vars[i] }
> ' code
ARM_CLIENT_ID
ARM_CLIENT_SECRET
logging_settings
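The suggested test on $2 could look like this (a hedged sketch: an ARM_ variable is recorded and then marked for overwrite, the same way a block containing default is, so the END cleanup still works; a reduced sample is recreated inline):

```shell
cat > code <<'EOF'
variable "ARM_CLIENT_ID" {
description = "Client ID for Service Principal"
}
variable "logging_settings" {
description = "Logging settings from TFVARs"
}
variable "build_route_tables" {
default = [
"shared_services"
]
}
EOF
awk '
/^[[:blank:]]*#/ { next }                                       # skip comment lines
$1=="variable" { gsub(/["]/,"",$2)                              # strip the quotes
                 vars[skip?n:++n]=$2                            # reuse slot if previous var was skipped
                 skip=($2 ~ /^ARM_/) }                          # treat ARM_ vars like default blocks
$1=="default"  { skip=1 }
END { if (skip) n--; for(i=1; i<=n; i++) print vars[i] }
' code
```

On this sample only logging_settings survives: the ARM_ variable and the default block are both skipped.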
Here's another maybe shorter solution.
$ awk -F'"' '/^variable/&&$2!~/^ARM_/{v=$2} /default =/{v=0} /}/&&v{print v; v=0}' file
logging_settings

How to match string and print lines within curly braces { } from config file

I want to print the block within { and } that contains assign where "mango" in Hostgroups:
object Host "os.google.com" {
import "windows"
address = "linux.google.com"
groups = ["linux"]
}
object Host "mango.google.com" {
import "windows"
address = "mango.google.com"
groups = ["linux"]
assign where "mango" in Hostgroups
}
Desired output:
object Host "mango.google.com" {
import "windows"
address = "mango.google.com"
groups = ["linux"]
assign where "mango" in Hostgroups
}
Try this awk script
script.awk
/{/,/}/ { # define record range from { to }
if ($0 ~ "{") rec = $0; # at the block-opening line, reset the rec variable with the current line
else rec = rec "\n" $0; # else accumulate the current line in rec
if ($0 ~ /assign where "mango" in Hostgroups/) found = 1; # remember that this block contains the pattern
if ($0 ~ "}" && found) { # at the closing brace of a matching block
print rec; # print the rec, including the closing brace
exit; # terminate
}
}
execution:
awk -f script.awk input.txt
output:
object Host "mango.google.com" {
import "windows"
address = "mango.google.com"
groups = ["linux"]
assign where "mango" in Hostgroups
}
This might work for you (GNU sed):
sed -n '/{/h;//!H;/}/{g;/assign where "mango" in Hostgroups/p}' file
Turn off sed's automatic printing using the -n option and gather up the lines between curly braces in the hold space. Following the closing curly brace, swap in the contents of the hold space and, if there is a match for assign where "mango" in Hostgroups, print it.
Assuming } doesn't appear in any other context in your input:
$ awk -v RS='}' '
/assign where "mango" in Hostgroups/ {
sub(/^[[:space:]]+\n/,"")
print $0 RS
}
' file
object Host "mango.google.com" {
import "windows"
address = "mango.google.com"
groups = ["linux"]
assign where "mango" in Hostgroups
}

awk - Merge two files by matching columns and append the columns values of second file to the line

I have an unusual merge request in awk. Hoping you could help.
File1
pl1,prop1,20
pl1,prop2,30
pl1,prop3,40
pl2,prop1,70
pl2,prop2,80
pl2,prop3,90
pl3,prop1,120
pl3,prop2,130
pl3,prop3,140
File2
store1,pl1
store2,pl1
store3,pl2
store4,pl3
store5,pl2
store6,pl1
Output:
prop1, store1-20, store2-20, store3-70, store4-120, store5-70, store6-20
prop2, store1-30, store2-30, store3-80, store4-130, store5-80, store6-30
prop3, store1-40, store2-40, store3-90, store4-140, store5-90, store6-40
Rules
file1.column1 should match file2.column2
for all matching lines, file2.column1 concatenated with file1.currentLine.column3 should be appended to that property's output line
Many thanks,
I'm assuming those blank lines are not actually in your input files.
Using GNU awk which has true arrays of arrays:
gawk -F, '
NR==FNR { prop[$2][$1] = $3; next }
{ pl[$2][$1] = 1 }
END {
for (key in prop) {
printf "%s", key;
for (subkey in prop[key]) {
for (store in pl[subkey]) {
printf ", %s-%d", store, prop[key][subkey]
}
}
print ""
}
}
' File1 File2
prop1, store1-20, store2-20, store6-20, store3-70, store5-70, store4-120
prop2, store1-30, store2-30, store6-30, store3-80, store5-80, store4-130
prop3, store1-40, store2-40, store6-40, store3-90, store5-90, store4-140
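If the store columns must follow File2's line order exactly (the output above groups stores by pl instead), here's a hedged POSIX-awk sketch that records both the property order and the store order in indexed arrays; merge.awk is a made-up name, and reduced sample data is recreated inline:

```shell
cat > File1 <<'EOF'
pl1,prop1,20
pl1,prop2,30
pl2,prop1,70
pl2,prop2,80
EOF
cat > File2 <<'EOF'
store1,pl1
store2,pl2
store3,pl1
EOF
cat > merge.awk <<'EOF'
BEGIN { FS = "," }
NR == FNR { val[$1, $2] = $3                      # File1: value keyed by (pl, prop)
            if (!($2 in props)) { props[$2] = 1; porder[++np] = $2 }  # remember prop order
            next }
{ stores[++ns] = $1; pl[ns] = $2 }                # File2: remember store order
END {
    for (p = 1; p <= np; p++) {                   # props in File1 order
        line = porder[p]
        for (i = 1; i <= ns; i++)                 # stores in File2 order
            line = line ", " stores[i] "-" val[pl[i], porder[p]]
        print line
    }
}
EOF
awk -f merge.awk File1 File2
```

Because iteration uses the numeric indices rather than `for (x in arr)`, the output order is deterministic.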

How to get specific data from block of data based on condition

I have a file like this:
[group]
enable = 0
name = green
test = more
[group]
name = blue
test = home
[group]
value = 48
name = orange
test = out
There may be one or more spaces/tabs between the label, the =, and the value.
The number of lines may vary in every block.
I'd like to have the name, but only if enable = 0 is not set.
So output should be:
blue
orange
Here is what I have managed to create:
awk -v RS="group" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
There are several faults with this:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail. This approach will also fail if name or other labels contain group.
I prefer not to use RS with multiple characters, since this is GNU awk only.
Does anyone have another suggestion? sed or awk, and not a long chain of commands.
If you know that groups are always separated by empty lines, set RS to the empty string:
$ awk -v RS="" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
@devnull explained in his answer that GNU awk also accepts regular expressions in RS, so you could split only at [group] if it is on its own line:
gawk -v RS='(^|\n)[[]group]($|\n)' '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
This makes sure we're not splitting at evil names like
[group]
enable = 0
name = [group]
name = evil
test = more
Your problem seems to be:
I am not able to set RS to [group], both this fails RS="[group]" and
RS="\[group\]".
Saying:
RS="[[]group[]]"
should yield the desired result.
In these situations where there are clearly name = value statements within a record, I like to first populate an array with those mappings, e.g.:
map["<name>"] = <value>
and then just use the names to reference the values I want. In this case:
$ awk -v RS= -F'\n' '
{
delete map
for (i=1;i<=NF;i++) {
split($i,tmp,/ *= */)
map[tmp[1]] = tmp[2]
}
}
map["enable"] !~ /^0$/ {
print map["name"]
}
' file
blue
orange
If your version of awk doesn't support deleting a whole array then change delete map to split("",map).
Compared to using REs and/or sub()s, etc., this makes the solution much more robust and extensible in case you want to compare and/or print the values of other fields in the future.
Since you have blank-line-separated records, you should consider putting awk in paragraph mode. If you must test for the [group] identifier, simply add code to handle that. Here's some example code that should fulfill your requirements. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
RS=""
}
{
for (i=2; i<=NF; i+=3) {
if ($i == "enable" && $(i+2) == 0) {
f = 1
}
if ($i == "name") {
r = $(i+2)
}
}
}
!(f) && r {
print r
}
{
f = 0
r = ""
}
Results:
blue
orange
This might work for you (GNU sed):
sed -n '/\[group\]/{:a;$!{N;/\n$/!ba};/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p;d}' file
Read the [group] block into the pattern space then substitute out the colour if the enable variable is not set to 0.
sed -n '...' sets sed to run in silent mode: no output unless specified, i.e. by a p or P command
/\[group\]/{...} when we have a line which contains [group] do what is found inside the curly braces.
:a;$!{N;/\n$/!ba} to do a loop we need a place to loop to; :a is that place. $ is the end-of-file address and $! means not the end of file, so $!{...} means do what is found inside the curly braces when we are not at the end of the file. N appends a newline and the next line to the current pattern space, and /\n$/!ba means: when the pattern space does not yet end with an empty line, branch (b) back to a. So this collects all lines from a line that contains [group] to an empty line (or the end of file).
/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p if the collected lines contain enable = 0 then do not substitute out the colour. Or, to put it another way: if the lines collected so far do not contain enable = 0, do substitute out the colour and print (p) it.
If you don't want to use the record separator, you could use a dummy variable like this:
#!/usr/bin/awk -f
function endgroup() {
if (e == 1) {
print n
}
}
$1 == "name" {
n = $3
}
$1 == "enable" && $3 == 0 {
e = 0;
}
$0 == "[group]" {
endgroup();
e = 1;
}
END {
endgroup();
}
You could actually use Bash for this.
n=0
while read line; do
if [[ $line == "enable = 0" ]]; then
n=1
elif [ $n -eq 0 ] && [[ $line =~ name[[:space:]]+=[[:space:]]([a-z]+) ]]; then
echo ${BASH_REMATCH[1]}
else
n=0
fi
done < file
This will only work however if enable = 0 is always only one line above the line with name.
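A hedged, slightly more robust Bash variant lifts that restriction by resetting the flag at each [group] header instead, so enable = 0 can sit anywhere in its block (names.sh is a made-up script name; the sample input is recreated inline, and values are assumed to be single words):

```shell
cat > file <<'EOF'
[group]
enable = 0
name = green
test = more
[group]
name = blue
test = home
[group]
value = 48
name = orange
test = out
EOF
cat > names.sh <<'EOF'
#!/bin/bash
flag=0 name=""
emit() { [ "$flag" -eq 0 ] && [ -n "$name" ] && echo "$name"; }
while IFS= read -r line; do
  case $line in
    "[group]") emit; flag=0 name="" ;;                 # block boundary: flush previous block
    enable*)   set -- $line; [ "$3" = 0 ] && flag=1 ;; # word-splitting tolerates spaces/tabs
    name*)     set -- $line; name=$3 ;;
  esac
done < file
emit                                                   # flush the final block
EOF
bash names.sh
```

Each block is flushed at the next [group] header (or at end of input), so the relative position of the enable and name lines within a block no longer matters.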
