output the rows with one non-empty column in csv using bash

Given a csv file, I want to output only the rows with exactly one non-empty column.
input file
"a","b","c"
"d","",""
output:
"d","",""
Can this be done in bash?

A simpler awk solution can be
$ awk '/^("",)*"."(,"")*$/' inputFile
"d","",""
What it does
The pattern /^("",)*"."(,"")*$/ matches as follows:
("",)* any number of empty columns
"." followed by ONE non-empty column holding a single character
(,"")* followed by any number of empty columns
No action is specified, so awk takes the default action and prints the entire record.
EDIT
If a column can contain more than one character
$ awk '/^("",)*"[^"]+"(,"")*$/' input
"d","",""
Thanks to Jotne
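To check that the pattern is not tied to the first column, it can be run against a few extra rows. This is only a sketch: the sample rows and the variable names are made up for illustration, and only the awk pattern comes from the answer above.

```shell
# Hypothetical sample rows: the non-empty column sits in different positions.
sample='"a","b","c"
"d","",""
"","ee",""
"","","f"
"","",""'

# Keep rows made of empty columns around exactly one non-empty column.
one_nonempty=$(printf '%s\n' "$sample" | awk '/^("",)*"[^"]+"(,"")*$/')
printf '%s\n' "$one_nonempty"
```

The row with three values and the row with no values are both dropped; the three rows with a single value are kept regardless of where that value sits.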

You can use sed for this:
sed -n '/^[",]*[^",]*[",]*$/p' file
To make sure it does not match blank lines (or rows whose columns are all empty), we can require at least one character with \+:
sed -n '/^[",]*"[^",]\+"[",]*$/p' file
It returns:
"d","",""
It is a matter of checking that there is one, and just one, block of characters other than " or , in between these characters. -n suppresses automatic printing, whereas p prints the lines that satisfy the condition.
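The difference between the two sed patterns shows up on a row whose columns are all empty. The sketch below uses made-up sample rows and hypothetical variable names; it also spells \+ as a repeated bracket expression, because \+ in a basic regular expression is a GNU extension and this form should work in other sed implementations too.

```shell
# Hypothetical sample rows, including one where every column is empty.
rows='"a","b","c"
"d","",""
"","",""'

# Loose pattern: the middle block may be empty, so the all-empty row matches too.
loose=$(printf '%s\n' "$rows" | sed -n '/^[",]*[^",]*[",]*$/p')

# Strict pattern: one quoted block with at least one character is required.
strict=$(printf '%s\n' "$rows" | sed -n '/^[",]*"[^",][^",]*"[",]*$/p')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```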

You could use gsub() to count the number of times an empty field is found, then subtract from NF and test equal to one. Here's one way using GNU AWK and the FPAT variable:
awk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } NF - gsub(/""/, "&") == 1' file
If you don't have embedded commas, you could simply write:
awk -F, 'NF - gsub(/""/, "&") == 1' file

A simplistic approach which assumes that no fields in the CSV file contain commas:
awk -F '[",]+' '{n=0;for(i=2;i<NF;++i)$i~/^$/||++n}n==1' file.txt
Set the input field separator to one or more double quotes and commas. Loop through all of the fields, incrementing n for every non-empty field. If the total number is exactly 1, print the line.
The reason that the loop goes from field 2 to NF-1 is that the first and last field are before and after the parts that you are interested in.
Very similar but ever-so-slightly shorter:
awk -F ',' '{n=0;for(i=1;i<=NF;++i)$i~/""/||++n}n==1' file.txt
Use the comma as the field separator and increment n for every field that does not contain "". In this case, the loop goes through each field.
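The comma-split idea can be wrapped in a small function so that the wanted count becomes a parameter. This is a sketch under the same assumption as above (no commas inside fields); the function name count_nonempty and the sample rows are hypothetical. It compares each field with "" exactly instead of using a regex match.

```shell
# Hypothetical helper: $1 is the wanted number of non-empty columns.
count_nonempty() {
  awk -F, -v want="$1" '{
    n = 0
    for (i = 1; i <= NF; i++)
      if ($i != "\"\"") n++      # a field is empty when it is exactly ""
  } n == want'
}

rows='"a","b","c"
"d","",""
"e","","f"
"","",""'

exactly_one=$(printf '%s\n' "$rows" | count_nonempty 1)
printf '%s\n' "$exactly_one"
```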

Through sed.
$ sed -rn '/^(".[^"]*"(,"")*|""(,"")*,".[^"]*"(,"")*)$/p' file
"d","",""
The first part, ".[^"]*"(,"")*, matches strings such as "A","","", whereas the second part, ""(,"")*,".[^"]*"(,"")*, matches strings such as "","","A".
Example:
$ cat file
"a","b","c"
"d","",""
"","","A"
"A","","A"
"","A",""
"","A","A"
"A","A",""
$ sed -rn '/^(".[^"]*"(,"")*|""(,"")*,".[^"]*"(,"")*)$/p' file
"d","",""
"","","A"
"","A",""

This grep should be able to handle this:
grep -E '^("",)*"[^"]+"(,"")*$' file
"d","",""

Just split the line into fields and count how many are non-empty:
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==1' file
"d","",""
The loop starts at 2 and ends at NF-1 because there's no point checking the empty fields that will always exist before the first and after the last "real" fields (i.e. before the ^" and after the "$) when the line is split using an FS that includes the start-of-string (^) and end-of-string ($) RE metacharacters.
If you ever wanted to check different counts of non-empty fields, just change the number you compare c to:
$ cat file
"a","b","c"
"d","",""
"e","","f"
"","",""
.
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==0' file
"","",""
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==1' file
"d","",""
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==2' file
"e","","f"
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==3' file
"a","b","c"
$ awk -F'^"|","|"$' '{c=0; for (i=2; i<NF; i++) if ($i != "") ++c} c==4' file
$

Related

change numerical value in file to characters via awk

I'm looking to replace the numerical values in a file with a new value that I provide. A numerical value can appear anywhere in the text; in some cases it is the third field, but that is not always the case. I would also like to save the result as a new version of the file.
original format
A:fdg:user#server:r
A:g:1234:xtcy
A:d:1111:xtcy
modified format
A:fdg:user#server:rxtTncC
A:g:replaced_value:xtcy
A:d:replaced_value:xtcy
bash line command with awk:
awk -v newValue="newVALUE" 'BEGIN{FS=OFS=":"} /:.:.*:/ && $3~/^[0-9]+$/{$3=newValue} 1' original_file.txt > replaced_file.txt
You can simply use sed instead of awk:
sed -E 's/\b[0-9]+\b/replaced_value/g' /path/to/infile > /path/to/outfile
Here is an awk that asks you for replacement values for each numerical value it meets:
$ awk '
BEGIN {
    FS=OFS=":"                                  # delimiters
}
{
    for(i=1;i<=NF;i++)                          # loop all fields
        if($i~/^[0-9]+$/) {                     # if numerical value found
            printf "Provide replacement value for %d: ",$i > "/dev/stderr"
            getline $i < "/dev/stdin"           # ask for a replacement
        }
}1' file_in > file_out                          # write output to a new file
I would use GNU AWK for this task in the following way. Let the content of file.txt be
A:fdg:user#server:rxtTncC
A:g:1234:xtcy
A:d:1111:xtcy
then
awk 'BEGIN{newvalue="replacement"}{gsub(/[[:digit:]]+/,newvalue);print}' file.txt
output
A:fdg:user#server:rxtTncC
A:g:replacement:xtcy
A:d:replacement:xtcy
Explanation: replace every run of one or more digits with newvalue. Disclaimer: I assumed that a numeric value is something consisting solely of digits.
(tested in gawk 4.2.1)
How about
awk -F : -v OFS=: '$3 ~ /^[0-9]+$/ { $3 = "new value"} {print}' original_file >replaced_file
?
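Since the question says the number may sit in any position, a field-by-field check may be closer to what is wanted than a test on $3 alone or a gsub over the whole line. The following is a sketch, not the asker's or any answerer's code: the function name replace_numeric_fields is hypothetical, the third sample line is made up, and fields are assumed to be separated by ":" only.

```shell
# Hypothetical helper: $1 is the replacement text for all-digit fields.
replace_numeric_fields() {
  awk -v newValue="$1" '
    BEGIN { FS = OFS = ":" }
    # Replace only fields made up entirely of digits; "e9" is left alone.
    { for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) $i = newValue }
    1'
}

sample='A:fdg:user#server:rxtTncC
A:g:1234:xtcy
5678:d:e9:xtcy'

replaced=$(printf '%s\n' "$sample" | replace_numeric_fields replaced_value)
printf '%s\n' "$replaced"
```

Setting OFS alongside FS matters here: assigning to a field makes awk rebuild the record with OFS, which defaults to a single space.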

How to print keys from all key-value pairs

Text file looks like this:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
How can I extract keys so that:
key11|key12|key13
key21|key22|key23
I have tried unsuccessfully :
awk '{ gsub(/[^[|]=]+=/,"") }1' file.txt
gives back the actual data:
key11=val1|key12=val2|key13=val3
key21=val1|key22=val2|key23=val3
Since you tagged bash
while IFS='=|' read -ra words; do
    n=${#words[@]}
    for ((i=1; i<n; i+=2)); do
        unset 'words[i]'
    done
    ( IFS='|'; echo "${words[*]}" )
done < file
gawk
This can be done by awk, by setting FS and OFS :
kent$ awk -F'=[^|]*' -v OFS="" '$1=$1' file
key11|key12|key13
key21|key22|key23
or safer: awk -F.... '{$1=$1}1' file
substitution (by sed for example):
kent$ sed 's/=[^|]*//g' file
key11|key12|key13
key21|key22|key23
Here's one solution
echo "key11=val1|key12=val2|key13=val3" \
| awk -F'[=|]' '{
for (i=1;i<=NF;i+=2){
printf("%s%s", $i, (i<(NF-1))?"|":"")
}
print""
}'
output
key11|key12|key13
It should also work by passing in the filename as an argument to awk, i.e.
awk -F'[=|]' '{for (i=1;i<=NF;i+=2){printf("%s%s", $i, (i<(NF-1))?"|":"") }print""}' file1 [file_more_as_will_fit]
Discussion
We use a character class as the value of FS (the field separator), so each = and | char marks the beginning of a new field.
-F'[=|]'
Because we know we want to start with field1 for output and skip every other field, we use
for (i=1;i<=NF;i+=2)
printf formats the output as defined by the format string '%s%s'. There are a zillion options available for printf format strings, but you only need the value of $i (the field that holds the key) and whether to print a | char or not.
printf("%s%s", $i ...)
And we use awk's ternary operator, which evaluates what element number is being processed (i<..). As long as it is not the 2nd to last field, the | char is emitted.
(i<(NF-1))?"|":""
IHTH
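To see why stepping through the odd-numbered fields yields the keys, it may help to print every field with its number. This is a small sketch; the variable names are hypothetical and the input line is the one from the question.

```shell
line='key11=val1|key12=val2|key13=val3'

# Print each field prefixed with its index: keys land on odd indexes, values on even ones.
numbered=$(printf '%s\n' "$line" |
  awk -F'[=|]' '{ for (i = 1; i <= NF; i++) printf "%d:%s%s", i, $i, (i < NF ? " " : "\n") }')
printf '%s\n' "$numbered"
```

With NF equal to 6 here, the last key is field 5, which is why the ternary compares i with NF-1 rather than with NF.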
sed
I did this with sed:
sed -r 's/([[:alnum:]]*)=[[:alnum:]]*/\1/g' < file.txt
tested here and got:
key11|key12|key13
key21|key22|key23
s/<pattern>/<subst>/ means "replace <pattern> by <subst>", and with the g in the end it will do it for every pattern found in the line.
The [[:alnum:]]* is equivalent to [0-9a-zA-Z]*, and means any number of letters or digits.
The first pattern between parentheses corresponds to \1 in the substitution, the second to \2, and so on.
So, it will match every "key=value" and replace it by "key".
awk -F'[=|]' '{print $1,$3,$5}' OFS="|" file
key11|key12|key13
key21|key22|key23

Ignore delimiters in quotes and excluding columns dynamically in csv file

I have an awk command to read a csv file with a | separator. I am using this command as part of my shell script, where the columns to exclude are removed from the output. The list of columns is given as input, e.g. 1 2 3
Command Reference: http://wiki.bash-hackers.org/snipplets/awkcsv
awk -v FS='"| "|^"|"$' '{for i in $test; do $(echo $i=""); done print }' test.csv
$test is 1 2 3
I want to print $1="" $2="" $3="" in front of print all columns. I am getting this error
awk: {for i in $test; do $(echo $i=""); done {print }
awk: ^ syntax error
This command is working properly which prints all the columns
awk -v FS='"| "|^"|"$' '{print }' test.csv
File 1
"first"| "second"| "last"
"fir|st"| "second"| "last"
"firtst one"| "sec|ond field"| "final|ly"
Expected output if I want to exclude the column 2 and 3 dynamically
first
fir|st
firtst one
I need help to keep the for loop properly.
With GNU awk for FPAT:
$ awk -v FPAT='"[^"]+"' '{print $1}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"first"
"fir|st"
"firtst one"
$ awk -v flds='2 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"second" "last"
"second" "last"
"sec|ond field" "final|ly"
$ awk -v flds='3 1' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' file
"last" "first"
"last" "fir|st"
"final|ly" "firtst one"
If you don't want your output fields separated by a blank char then set OFS to whatever you do want with -v OFS='whatever'. If you want to get rid of the surrounding quotes you can use gensub() (since we're using gawk anyway) or substr() on every field, e.g.:
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", substr($(f[i]),2,length($(f[i]))-2), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
$ awk -v OFS=';' -v flds='1 3' -v FPAT='"[^"]+"' 'BEGIN{n=split(flds,f,/ /)} {for (i=1;i<=n;i++) printf "%s%s", gensub(/"/,"","g",$(f[i])), (i<n?OFS:ORS)}' file
first;last
fir|st;last
firtst one;final|ly
In GNU awk (for FPAT):
$ test="2 3" # fields to exclude in bash var $test
$ awk -v t="$test" '               # taken to awk var t
BEGIN {                            # first
    FPAT="([^|]+)|( *\"[^\"]+\")"  # instead of FS, use FPAT
    split(t,a," ")                 # process t to e:
    for(i in a)                    # a[1]=2 -> e[2], etc.
        e[a[i]]
}
{
    for(i=1;i<=NF;i++)             # for each field
        if((i in e)==0) {          # if field # not in e
            gsub(/^\"|\"$/,"",$i)  # remove leading and trailing "
            b=b (b==""?"":OFS) $i  # put to buffer b
        }
    print b; b=""                  # output and reset buffer
}' file
first
fir|st
firtst one
FPAT is used as FS can't handle separator in quotes.
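Where GNU awk (and therefore FPAT) is not available, the same result might be reached by splitting on the whole delimiter sequence instead of on the bare pipe. This is only a sketch under a strong assumption: every field is quoted and fields are separated by exactly quote, pipe, space, quote, as in the sample file. The function name exclude_fields is hypothetical.

```shell
# Hypothetical helper: $1 is a space-separated list of field numbers to drop.
exclude_fields() {
  awk -v drop="$1" '
    BEGIN { n = split(drop, d, " "); for (k = 1; k <= n; k++) skip[d[k]] = 1 }
    {
      line = $0
      sub(/^"/, "", line)                 # strip the opening quote
      sub(/"$/, "", line)                 # strip the closing quote
      m = split(line, f, /"[|] "/)        # split on quote, pipe, space, quote
      out = ""; sep = ""
      for (i = 1; i <= m; i++)
        if (!(i in skip)) { out = out sep f[i]; sep = OFS }
      print out
    }'
}

csv='"first"| "second"| "last"
"fir|st"| "second"| "last"
"firtst one"| "sec|ond field"| "final|ly"'

kept=$(printf '%s\n' "$csv" | exclude_fields "2 3")
printf '%s\n' "$kept"
```

A pipe inside a field is not followed by the space-quote pair, so it is not treated as a separator. A field that itself contained the full delimiter sequence would break this sketch.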
Vikram, if your actual Input_file is exactly the same as the sample Input_file shown, then the following may help you. I will add an explanation shortly too (tested with GNU awk 3.1.7, a rather old version of awk).
awk -v num="2,3" 'BEGIN{
len=split(num, val,",")
}
{while($0){
match($0,/.[^"]*/);
if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){
array[++i]=substr($0,RSTART,RLENGTH+1)
};
$0=substr($0,RLENGTH+1);
};
for(l=1;l<=len;l++){
delete array[val[l]]
};
for(j=1;j<=length(array);j++){
if(array[j]){
gsub(/^\"|\"$/,"",array[j]);
printf("%s%s",array[j],j==length(array)?"":" ")
}
};
print "";
i="";
delete array
}' Input_file
EDIT1: Adding a code with explanation too here.
awk -v num="2,3" 'BEGIN{ ##creating a variable named num whose value is the comma-separated list of fields which you want to nullify, starting BEGIN section here.
len=split(num, val,",") ##creating an array named val here whose delimiter is comma and creating len variable whose value is length of array val here.
}
{while($0){ ##Starting a while loop here which will run for a single line till that line is NOT getting null.
match($0,/.[^"]*/);##using match functionality which will look for matches from starting to till a " comes into match.
if(substr($0,RSTART,RLENGTH+1) && substr($0,RSTART,RLENGTH+1) !~ /\"\| \"/ && substr($0,RSTART,RLENGTH+1) !~ /^\"$/ && substr($0,RSTART,RLENGTH+1) !~ /^\" \"$/){##RSTART and RLENGTH are the variables which are set when the regex passed to the match function finds a match in the line/variable. In this if condition I am checking 1st: the substring from RSTART of length RLENGTH+1 should not be NULL. 2nd: this substring should not contain " pipe space ". 3rd: this substring should NOT be a lone " character. 4th: this substring should NOT be equal to " space ". If all conditions are TRUE then do the following actions.
array[++i]=substr($0,RSTART,RLENGTH+1) ##creating an array named array whose index is variable i with increasing value of i and its value is substring of RSTART to till RLENGTH+1.
};
$0=substr($0,RLENGTH+1);##Now removing the matched part from current line which will decrease the length of line and avoid the while loop to become as infinite.
};
for(l=1;l<=len;l++){##Starting a loop here once while above loop is done which runs from starting of variable l=1 to value of len.
delete array[val[l]] ##Deleting here those values which we want to REMOVE from OPs request, so removing here.
};
for(j=1;j<=length(array);j++){##Start a for loop from the value of j=1 till the value of length of array.
if(array[j]){ ##Now making sure array value whose index is j is NOT NULL, if yes then perform following statements.
gsub(/^\"|\"$/,"",array[j]); ##Globally substituting starting " and ending " with NULL in value of array value.
printf("%s%s",array[j],j==length(array)?"":" ") ##Now printing the value of array and secondly printing space or null depending upon if j value is equal to array length then print NULL else print space. It is because we do not want a space at the end of the line.
}
};
print ""; ##Because above printf will NOT print a new line, so printing a new line.
i=""; ##Nullifying variable i here.
delete array ##Deleting array here.
}' Input_file ##Mentioning Input_file here.

How to search and replace multiple number sequences in a file using bash

I have a file with lots of lines like this:
dog:7066469:182:0:0:7050964:7087402:7058396:7079290:7087537
cat:7066469:182:0:0:7050964:7087402:7058396
dog:7066469:182:0:0:7050964:7087402:7058396:7079290
Using bash programming (sed or awk or something), how can I add a 6 in front of every number after the 5th ":", but only on lines that begin with "cat:"?
The correct result would be this:
dog:7066469:182:0:0:7050964:7087402:7058396:7079290:7087537
cat:7066469:182:0:0:67050964:67087402:67058396
dog:7066469:182:0:0:7050964:7087402:7058396:7079290
Using awk:
awk 'BEGIN{FS=OFS=":"} $1=="cat"{for (i=6; i<=NF; i++) $i = "6" $i} 1' file
dog:7066469:182:0:0:7050964:7087402:7058396:7079290:7087537
cat:7066469:182:0:0:67050964:67087402:67058396
dog:7066469:182:0:0:7050964:7087402:7058396:7079290

creating a ":" delimited list in bash script using awk

I have following lines
380:<CHECKSUM_VALIDATION>
393:</CHECKSUM_VALIDATION>
437:<CHECKSUM_VALIDATION>
441:</CHECKSUM_VALIDATION>
I need to format it as below
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Is it possible to achieve above output using "awk"? [I'm using bash]
Thank you!
Here you go:
awk -F '[:<>/]+' '{ n = $1; getline; print $2 ":" n ":" $1 }'
Explanation:
Set the field separator with -F to be a sequence of a mix of :<>/ characters, this way the first field will be the number, and the second will be CHECKSUM_VALIDATION
Save the first field in variable n and read the next line (which would overwrite $1)
Print the line: a combination of the number from the previous line, and the fields on the current line
Another approach without using getline:
awk -F '[:<>/]+' 'NR % 2 { n = $1 } NR % 2 == 0 { print $2 ":" n ":" $1 }'
This one uses the record counter NR to determine whether it's time to print: if NR is odd, save the first field in n, if NR is even, then print.
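Both variants above assume that opening and closing lines strictly alternate. If that cannot be guaranteed, one option is to key on the tag name instead of the line position. This is a sketch under assumptions: the function name pair_tags and the OTHER tag are made up, and every closing tag is assumed to have an earlier opening tag of the same name.

```shell
# Hypothetical helper: pair each closing tag with the line number of its opening tag.
pair_tags() {
  awk -F'[:<>/]+' '
    index($0, "</") { print $2 ":" open_line[$2] ":" $1; next }  # closing tag: emit the pair
                    { open_line[$2] = $1 }                        # opening tag: remember its number
  '
}

input='380:<CHECKSUM_VALIDATION>
385:<OTHER>
393:</CHECKSUM_VALIDATION>
399:</OTHER>'

pairs=$(printf '%s\n' "$input" | pair_tags)
printf '%s\n' "$pairs"
```

The field separator is the same as in the answer above, so $1 is the line number and $2 the tag name on both kinds of line.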
You can try this sed,
sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
Test:
sat:~$ sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Another way:
awk -F: '/<C/ {printf "CHECKSUM_VALIDATION:%d:",$1; next} {print $1}'
Here is one gnu awk
awk -F"[:\n<>]" 'NR==1{print $3,$1,$5;f=$3;next} $3{print f,$3,$7}' OFS=":" RS="</CH" file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Based on Jonas post and avoiding getline, this awk should do:
awk -F '[:<>/]+' '/<C/ {f=$1;next} { print $2,f,$1}' OFS=\: file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
