Bash script to compare and generate csv datafile

I have two CSV files, data1.csv and data2.csv. Their content looks something like this (with headers):
DATA1.csv
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
DATA2.csv
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
Issues :
The values of the 1st and 2nd columns of DATA1.csv appear somewhere inside the first column of DATA2.csv.
For example, MAD01;HDGF appears in the first column of DATA2 as ***MAD01***HDGF** (* can be any alphanumeric and/or symbol character), and MAD01 and HDGF might not be in the same order in the USER column of DATA2.
The value of strnu in DATA1 is equal to the value of the column BINin in DATA2.
The column fav in DATA1 corresponds to TYPE in DATA2, because V T = M and V PO = N (other values may exist but are not needed; for example, line 3 of DATA1 should be ignored).
N.B.: some data may exist in one file but not the other.
my bash script needs to generate a new CSV file that should contain:
The column USER from DATA2
Client Name and strnu from DATA1
BINin from DATA2, only if it is equal to the strnu value on the corresponding line of DATA1
TYPE using M N Format and making sure to respect the condition that V T = M and V PO = N
The first thing I tried was using grep to search for lines that exist in both files:
#!/bin/sh
DATA1="${1}"
DATA2="${2}"
for i in $(cat $DATA1 | awk -F";" '{print $1".*"$2}' | sed 1d) ; do
grep "$i" $DATA2
done
Result :
$ ./script.sh DATA1.csv DATA2.csv
MAD01;HDGF;11;V PO
XXMAD01XXXHDGFXX;11;N
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V PO
Using grep and awk I could find lines that are present in both DATA1 and DATA2, but it doesn't work for all the lines. I guess that is because of the - and other special characters present in column 2 of DATA1, which can be ignored.
I don't know how to generate a new CSV that combines the lines present in both files, but the expected generated CSV should look like this:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;M

This can be done in a single awk program. This is join.awk
BEGIN {
FS = OFS = ";"
print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 {next}
NR == FNR {
strnu[$1] = $2
next
}
{
for (client in strnu) {
strnu_pattern = strnu[client]
gsub(/-/, "", strnu_pattern)
if ($1 ~ client && $1 ~ strnu_pattern) {
print $1, client, strnu[client], $2, $3
break
}
}
}
and then
awk -f join.awk DATA1.csv DATA2.csv
outputs
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N

Assumptions/understandings:
ignore lines from DATA1.csv where the fav field is not one of V T or V PO
when matching fields we need to ignore any hyphens in the DATA1.csv fields
when matching fields the strings from DATA1.csv can show up in either order in DATA2.csv
the last line of the expected output should end with 635;N
One awk idea:
awk '
BEGIN { FS=OFS=";"
print "USER","Client Name","strnu","BINin","TYPE" # print new header
}
FNR==1 { next } # skip input headers
FNR==NR { if ($4 == "V PO" || $4 == "V T") { # only process if fav is one of "V PO" or "V T"
cnames[FNR]=$1 # save client name
strnus[FNR]=$2 # save strnu
}
next
}
{ for (i in cnames) { # loop through array indices
cname=cnames[i] # make copy of client name ...
strnu=strnus[i] # and strnu so that we can ...
gsub(/-/,"",cname) # strip hyphens from both ...
gsub(/-/,"",strnu) # in order to perform the comparisons ...
if (index($1,cname) && index($1,strnu)) { # if cname and strnu both exist in $1 then index()>=1 in both cases so ...
print $1,cnames[i],strnus[i],$2,$3 # print to stdout
next # we found a match so break from loop and go to next line of input
}
}
}
' DATA1.csv DATA2.csv
This generates:
USER;Client Name;strnu;BINin;TYPE
XXMAD01XXXHDGFXX;MAD01;HDGF;11;N
CVOJF01XXHHD;CVOJF01;HHD-;635;N
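Both answers above copy TYPE straight from DATA2.csv, which is why their last line ends in N. If TYPE is instead meant to be derived from the fav column of DATA1.csv (V T = M, V PO = N), and BINin should only be accepted when it equals the third column of DATA1.csv, something like the following might reproduce the output the question asked for. This is a sketch under those two assumptions, not a confirmed reading of the requirements; the array names (tmap, name, strnu, addr, fav) and the temporary files are made up for the illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/DATA1.csv" <<'EOF'
Client Name;strnu;addr;fav
MAD01;HDGF;11;V PO
CVOJF01;HHD-;635;V T
LINKO10;DH--JDH;98;V ZZ
EOF
cat > "$tmp/DATA2.csv" <<'EOF'
USER;BINin;TYPE
XXMAD01XXXHDGFXX;11;N
KJDGD;635;M
CVOJF01XXHHD;635;N
EOF

result=$(awk '
BEGIN {
    FS = OFS = ";"
    tmap["V T"] = "M"; tmap["V PO"] = "N"        # fav -> TYPE mapping from the question
    print "USER", "Client Name", "strnu", "BINin", "TYPE"
}
FNR == 1 { next }                                # skip both headers
FNR == NR {                                      # first file: DATA1.csv
    if ($4 in tmap) { name[FNR] = $1; strnu[FNR] = $2; addr[FNR] = $3; fav[FNR] = $4 }
    next
}
{                                                # second file: DATA2.csv
    for (i in name) {
        n = name[i]; s = strnu[i]
        gsub(/-/, "", n); gsub(/-/, "", s)       # hyphens are ignored when matching
        if ($2 == addr[i] && index($1, n) && index($1, s)) {
            print $1, name[i], strnu[i], $2, tmap[fav[i]]
            next
        }
    }
}' "$tmp/DATA1.csv" "$tmp/DATA2.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data the second output line ends in M, because the type is taken from V T in DATA1.csv rather than from the N stored in DATA2.csv.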

Related

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.
The file let's call it my_file_1.txt has a structure that looks something like this
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct as the N will gobble up some "start value" lines leading to wrong result. It feels like it could be done easier with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified my version of awk, as gensub is a GNU awk extension and is not available in all versions.
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
I accepted this answer because it is the most understandable.
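For an awk without gensub, the same extraction can probably be done with the standard match() and substr() functions. The sketch below is one possible way, not the accepted answer's method; num_after is a hypothetical helper name, and the input is a shortened copy of the sample file from the question.

```shell
tmp=$(mktemp -d)
cat > "$tmp/my_file_1.txt" <<'EOF'
lines I don't need
start value 123
epoch 1
some epoch 18 words
stop value 234
words start value 345 more words
epoch 1
epoch 72
stop value 456
EOF

result=$(cd "$tmp" && awk '
# num_after(s, key): hypothetical helper, returns the digits that follow "key " in s
function num_after(s, key) {
    if (!match(s, key " [0-9]+")) return ""
    # skip the key and one space, keep only the digits
    return substr(s, RSTART + length(key) + 1, RLENGTH - length(key) - 1)
}
BEGIN { OFS = ","; print "file", "start", "stop", "epoch", "run" }
/start value/ { start = num_after($0, "start value") }
/epoch/       { epoch = num_after($0, "epoch") }   # the last epoch before stop wins
/stop value/  {
    print FILENAME, start, num_after($0, "stop value"), epoch, ++run
    start = epoch = ""
}' my_file_1.txt)

printf '%s\n' "$result"
rm -rf "$tmp"
```

match() sets RSTART and RLENGTH to the position and length of the matched text, so the number can be cut out without any GNU-only function.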
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
print FILENAME, f["start"], f["stop"], f["epoch"], ++run
delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
NF is a built-in variable holding the number of fields in the current record, so $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section is executed before any input is read, and it can be used to initialize variables (uninitialized variables otherwise act as the empty string, or 0 in a numeric context).
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = $NF # when a line contains "start value" store the last field as startValue
}
/epoch/ {
epoch = $NF
}
/stop value/ {
stopValue = $NF
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
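The field variables described above can be checked on a single line taken from the question's sample input. This is only an illustration; the shell variable names (line, first, count, last, fourth) are made up for it.

```shell
line='words start value 345 more words'

first=$(printf '%s\n' "$line" | awk '{ print $1 }')        # first field
count=$(printf '%s\n' "$line" | awk '{ print NF }')        # number of fields
last=$(printf '%s\n' "$line" | awk '{ print $NF }')        # last field
fourth=$(printf '%s\n' "$line" | awk '{ print $(NF-2) }')  # third field from the end

printf '%s %s %s %s\n' "$first" "$count" "$last" "$fourth"
```

Here $NF is the word "words" rather than 345, which is why taking $NF only works while the number is the last word on the line.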
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed, the first to format all but the file name and second to embed the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.

awk or other shell to convert delimited list into a table

So what I have is a huge csv akin to this:
Pool1,Shard1,Event1,10
Pool1,Shard1,Event2,20
Pool1,Shard2,Event1,30
Pool1,Shard2,Event4,40
Pool2,Shard1,Event3,50
etc
Which is not easily readable. With there being only 4 types of events, I'm using spreadsheets to convert this into the following:
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,
Only events are limited to 4, pools and shards can be indefinite really. But the events may be missing from the lines - not all pools/shards have all 4 events every day.
So I tried doing this with awk in the shell script that gathers the CSV in the first place, but I'm failing spectacularly; no working code can even be shown since it's producing zero results.
Basically I tried sorting the CSV, reading the first two fields of a row and comparing them to the previous row; if they match, comparing the third field to a set array of event strings and storing the fourth field in a variable for that event; and once the first two fields no longer match, finally printing the whole line including the variables.
Sorry for the one-liner; I was testing and experimenting directly on the command line. It's embarrassing, it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'
You could simply use an associative array (the nested res[...][...] syntax needs GNU awk 4.0 or later): awk -F, -f parse.awk input.csv with parse.awk being:
{
sub(/Event/, "", $3);
res[$1","$2][$3]=$4;
}
END {
for (name in res) {
printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
}
}
awk does not guarantee the order in which for (name in res) visits the keys, so the lines may come out in a different order; my test output is:
Pool2,Shard1,,,50,
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
PS: Please use an editor to write awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try to get it "right"... ;)
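If GNU awk is not available, or the rows should come out in the order the pool/shard pairs first appear, a variant with composite keys might work in any POSIX awk. This is a sketch rather than a tested drop-in replacement; the array names seen, order and val are made up for the illustration.

```shell
result=$(printf '%s\n' \
    'Pool1,Shard1,Event1,10' \
    'Pool1,Shard1,Event2,20' \
    'Pool1,Shard2,Event1,30' \
    'Pool1,Shard2,Event4,40' \
    'Pool2,Shard1,Event3,50' |
awk -F, '
{
    key = $1 "," $2
    if (!(key in seen)) { seen[key] = 1; order[++n] = key }   # remember first-seen order
    val[key, $3] = $4                                         # (pool/shard, event) -> value
}
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        printf "%s,%s,%s,%s,%s\n", k, val[k, "Event1"], val[k, "Event2"], val[k, "Event3"], val[k, "Event4"]
    }
}')

printf '%s\n' "$result"
```

Because the order array is walked with a numeric index, the output does not depend on how awk iterates over array keys, and no sort is needed beforehand.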
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
if ( NR>1 ) {
print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
delete f
}
prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
$ sort file | awk -f tst.awk
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,

Split CSV into two files based on column matching values in an array in bash / posh

I have an input CSV that I would like to split into two CSV files. If the value of column 4 matches any value in WLTarray it should go in output file 1, if it doesn't it should go in output file 2.
WLTarray:
"22532" "79994" "18809" "21032"
input CSV file:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file1:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file2:
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
I've been looking at awk to filter this (python & perl not an option in my environment) but I think there is probably a much smarter way:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?
done
With regular awk you can make use of split and substr (to handle double-quote removal for comparison) and split the csv file as you indicate. For example you can use:
awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
split (s,a," ") # split s into array a
for (i in a) # loop over each index in a
b[a[i]]=1 # use value in a as index for b
}
FNR == 1 { # first record, write header to both output files
print $0 > "output1.csv"
print $0 > "output2.csv"
next
}
substr($4,2,length($4)-2) in b { # 4th field w/o quotes in b?
print $0 > "output1.csv" # write to output1.csv
next
}
{ print $0 > "output2.csv" } # otherwise write to output2.csv
' input.csv
Where:
in the BEGIN {...} rule you set the field separator (FS) to break on comma, split the string containing your desired output1.csv field-4 matches into the array a, then loop over the values in a, using them as the indexes in array b (to allow a simple in check against b);
the first rule is applied to the first record in the file (the header line), which is simply written out to both output files;
the next rule strips the double-quotes surrounding field 4 and then checks whether the number in field 4 matches an index in array b. If so, the record is written to output1.csv; otherwise the final rule writes it to output2.csv.
Example Input File
$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
Resulting Output Files
$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
You can use gawk like this:
test.awk
#!/usr/bin/gawk -f
BEGIN {
split("22532 79994 18809 21032", a)
for(i in a) {
WLTarray[a[i]]
}
FPAT="[^\",]+"
}
NR > 1 {
if ($4 in WLTarray) {
print >> "output1.csv"
} else {
print >> "output2.csv"
}
}
Make it executable and run it like this:
chmod +x test.awk
./test.awk input.csv
using grep with a filter file as input was the simplest answer.
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[@]}"
do
awkstring="'\$4 == "\"\\\"$WLTvalue\\\"\"" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv
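A variation on the awk answers is to hand the list to awk with -v instead of hard-coding it, which splits the file in a single pass and needs no eval. The sketch below assumes no quoted field contains a comma; the names wlt, list, want, out1 and out2 are made up for the illustration. With the bash array from the question, "${WLTarray[*]}" would give the same space-separated string as wlt.

```shell
tmp=$(mktemp -d)
cat > "$tmp/input.csv" <<'EOF'
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
EOF

wlt='22532 79994 18809 21032'

awk -F, -v list="$wlt" -v out1="$tmp/output1.csv" -v out2="$tmp/output2.csv" '
BEGIN {
    n = split(list, a, " ")
    for (i = 1; i <= n; i++) want["\"" a[i] "\""] = 1   # keys keep the CSV quotes
}
FNR == 1 { print > out1; print > out2; next }           # header goes to both files
{ print > (($4 in want) ? out1 : out2) }                # one pass, one decision per row
' "$tmp/input.csv"

matched=$(grep -c . "$tmp/output1.csv")     # line counts include the header
unmatched=$(grep -c . "$tmp/output2.csv")
rest=$(sed -n 2p "$tmp/output2.csv")        # the only non-matching data row
printf '%s %s\n' "$matched" "$unmatched"
rm -rf "$tmp"
```

Storing the values with their surrounding quotes as array keys avoids stripping the quotes from field 4 on every row.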

Append data to another column in a CSV if duplicate is found in first column

I have a CSV with data such as:
somename1,value1
somename1,value2
somename1,value3
anothername1,anothervalue1
anothername1,anothervalue2
anothername1,anothervalue3
I would like to rewrite the CSV so that when a duplicate in column 1 is found, the data is appended to a new column on the first entry.
For instance, the desired output would be :
somename1,value1,value2,value3
anothername1,anothervalue1,anothervalue2,anothervalue3
How can i do this in a shell script ?
TIA
You need much more than just removing duplicated lines when using Awk, you need a logic as below to create an array of elements for each unique entry in $1.
The solution creates a hash-map with unique values in $1 working as indices of the array and elements as the value appended with a , separator.
awk 'BEGIN{FS=OFS=","; prev="";}{ if (prev != $1) {unique[$1]=$2;} else {unique[$1]=(unique[$1]","$2)} prev=$1; }END{for (i in unique) print i,unique[i]}' file
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
A more readable version would be to have something like,
BEGIN {
# set input and output field separator to ',' and initialize
# variable holding last instance of $1 to empty
FS=OFS=","
prev=""
}
{
# Update the value of $2 directly in the hash array only when new
# unique elements are found in $1
if (prev != $1){
unique[$1]=$2
}
else {
unique[$1]=(unique[$1]","$2)
}
# Update the current $1
prev=$1
}
END {
for (i in unique) {
print i,unique[i]
}
}
FILE=$1
NAMES=`cut -d',' -f 1 $FILE | sort -u`
for NAME in $NAMES; do
echo -n "$NAME"
VALUES=`grep "$NAME" $FILE | cut -d',' -f2`
for VAL in $VALUES; do
echo -n ",$VAL"
done
echo ""
done
running with your data generates:
>bash script.sh data1.txt
anothername1,anothervalue1,anothervalue2,anothervalue3
somename1,value1,value2,value3
the filename of your data has to be passed as parameter. output can be written to a new file by redirecting.
>bash script.sh data1.txt > data_new.txt

How to get specific data from block of data based on condition

I have a file like this:
[group]
enable = 0
name = green
test = more
[group]
name = blue
test = home
[group]
value = 48
name = orange
test = out
There may be one or more spaces/tabs between the label, the = and the value.
The number of lines may vary in every block.
I would like to print the name, but only if the block does not contain enable = 0.
So output should be:
blue
orange
Here is what I have managed to create:
awk -v RS="group" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
There are several faults with this:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail. As written, it will also fail if name or other labels contain group.
I would prefer not to use RS with multiple characters, since this is GNU awk only.
Does anyone have another suggestion? sed or awk, and not a long chain of commands.
If you know that groups are always separated by empty lines, set RS to the empty string:
$ awk -v RS="" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
@devnull explained in his answer that GNU awk also accepts regular expressions in RS, so you could split only at [group] when it is on its own line:
gawk -v RS='(^|\n)[[]group]($|\n)' '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
This makes sure we're not splitting at evil names like
[group]
enable = 0
name = [group]
name = evil
test = more
Your problem seems to be:
I am not able to set RS to [group], both this fails RS="[group]" and
RS="\[group\]".
Saying:
RS="[[]group[]]"
should yield the desired result.
In these situations where there's clearly name = value statements within a record, I like to first populate an array with those mappings, e.g.:
map["<name>"] = <value>
and then just use the names to reference the values I want. In this case:
$ awk -v RS= -F'\n' '
{
delete map
for (i=1;i<=NF;i++) {
split($i,tmp,/ *= */)
map[tmp[1]] = tmp[2]
}
}
map["enable"] !~ /^0$/ {
print map["name"]
}
' file
blue
orange
If your version of awk doesn't support deleting a whole array then change delete map to split("",map).
Compared to using REs and/or sub()s., etc., it makes the solution much more robust and extensible in case you want to compare and/or print the values of other fields in future.
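As an illustration of that extensibility, the same mapping can print a second value from each enabled block. This is a sketch assuming the blocks are separated by empty lines, as the paragraph-mode answers do; it uses split("", map) so it does not depend on delete map, and it skips lines without an = such as [group].

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
[group]
enable = 0
name = green
test = more

[group]
name = blue
test = home

[group]
value = 48
name = orange
test = out
EOF

result=$(awk -v RS= -F'\n' '
{
    split("", map)                               # portable way to empty the array
    for (i = 1; i <= NF; i++) {
        n = split($i, tmp, /[ \t]*=[ \t]*/)
        if (n == 2) map[tmp[1]] = tmp[2]         # label -> value
    }
}
map["enable"] != "0" {                           # blocks without enable = 0
    print map["name"], map["test"]
}' "$tmp/file")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Adding another column to the output only means adding another map[...] lookup to the print statement.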
Since you have line-separated records, you should consider putting awk in paragraph mode. If you must test for the [group] identifier, simply add code to handle that. Here's some example code that should fulfill your requirements. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
RS=""
}
{
for (i=2; i<=NF; i+=3) {
if ($i == "enable" && $(i+2) == 0) {
f = 1
}
if ($i == "name") {
r = $(i+2)
}
}
}
!(f) && r {
print r
}
{
f = 0
r = ""
}
Results:
blue
orange
This might work for you (GNU sed):
sed -n '/\[group\]/{:a;$!{N;/\n$/!ba};/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p;d}' file
Read the [group] block into the pattern space then substitute out the colour if the enable variable is not set to 0.
sed -n '...' sets sed to run in silent mode: no output unless specified, i.e. by a p or P command.
/\[group\]/{...} when we have a line which contains [group] do what is found inside the curly braces.
:a;$!{N;/\n$/!ba} to do a loop we need a place to loop to, and :a is that place. $ is the end-of-file address and $! means not the end of file, so $!{...} means do what is inside the curly braces when it is not the end of file. N appends a newline and the next line to the pattern space, and /\n$/!ba branches (b) back to a unless the pattern space now ends with an empty line. So this collects all lines from a line that contains [group] up to an empty line (or the end of file).
/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p if the lines collected contain enable = 0 then do not substitute out the colour. Or to put it another way, if the lines collected so far do not contain enable = 0 do substitute out the colour.
If you don't want to use the record separator, you could use a dummy variable like this:
#!/usr/bin/awk -f
function endgroup() {
if (e == 1) {
print n
}
}
$1 == "name" {
n = $3
}
$1 == "enable" && $3 == 0 {
e = 0;
}
$0 == "[group]" {
endgroup();
e = 1;
}
END {
endgroup();
}
You could actually use Bash for this.
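skip=0
while IFS= read -r line; do
if [[ $line == "[group]" ]]; then
skip=0 # a new block starts, so forget the previous enable = 0
elif [[ $line =~ ^[[:space:]]*enable[[:space:]]*=[[:space:]]*0[[:space:]]*$ ]]; then
skip=1 # this block is disabled
elif [[ $skip -eq 0 && $line =~ ^[[:space:]]*name[[:space:]]*=[[:space:]]*([a-z]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done < file
This only works if enable = 0 appears before the name line within its block.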

Resources