How to replace a string like "[1.0 - 4.0]" with a numeric value using awk or sed? - bash

I have a CSV file that I am piping through a set of awk/sed commands.
Some lines in the CSV file look like this:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
where the 8th and 9th columns are a string representing a numeric range.
How can I use awk or sed to replace those fields with a numeric value? Either the beginning of the range, or the end of the range?
So this line would end up as
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
or
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,3.0,0.768
I got as far as removing the brackets but past that I'm stuck. I considered splitting on the " - ", but many lines in my file have a regular numeric value, not a range, in those last two columns, and that makes things messy (I don't want to end up with some lines having a different number of columns).

Here is a sed command that will take each range and break it up into two fields. It looks for strings like "[A - B]" and converts them to A,B. It can easily be modified to just use one of the values if needed by changing the \1,\2 portion. The regular expression assumes that all numbers have at least one digit on either side of a required decimal place. So, 1, .5, and 3. would not be valid. If you need that, the regex can be made to be more accommodating.
$ cat file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
$ sed -Ee 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\1,\2|g' file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,3.0,0.384,0.768

Since your data is field-based, awk is the logical choice.
Note that while awk generally isn't aware of double-quoted fields, that is not a problem here, because the double-quoted fields do not have embedded , instances.
#!/usr/bin/env bash
useStart1=1 # set to `0` to use the *end* of the *penultimate* fields' range instead.
useStart2=1 # set to `0` to use the *end* of the *last* field's range instead.
awk -v useStart1=$useStart1 -v useStart2=$useStart2 '
BEGIN { FS=OFS="," }
{
split($(NF-1), tokens1, /[][" -]+/)
split($NF, tokens2, /[][" -]+/)
$(NF-1) = useStart1 ? tokens1[2] : tokens1[3]
$NF = useStart2 ? tokens2[2] : tokens2[3]
print
}
' <<'EOF'
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
EOF
The code above yields:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
Modifying the values of $useStart1 and $useStart2 yields the appropriate variations.

Related

Prepending letter to field value

I have a file 0.txt containing the following value fields contents in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brandc,),
I'm looking in OS and elsewhere for how to prepend the letter x to the beginning of each value between commas so that my output file becomes (using bash unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrand C,),
the only thing I've really tried but not enough is:
awk '{gsub(/,/,",x");print}' 0.txt
for all purposes the prefix should not be applied to the last commas at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiter
BEGIN{
FS=OFS=","
}
# The first field is special, the x has to be inserted
# after the opening (
$1="(x"substr($1,2)
# Prepend 'x' from field 2 until the previous to last field
for(i=2;i<=NF-2;i++){
$i="x"$i
}
# 1 is always true. awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma (and avoids other “special” characters in the syntax).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt

awk sed backreference csv file

A question to extend previous one here. (I prefer asking new question rather editing first one. I may be wrong)
EDIT : ok, I was wrong, I should edit my first question. My bad (SO question is an art, difficult to master)
I have csv file, with semi-column as field delimiter. Here is an extract of csv file :
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I search a solution to swap 2 patterns in each field.
pattern 1 : any digit, with optional decimal mark (.) and optional decimal digit
e.g : 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2 : any string that begin with left parenthesis, follow by one or more character, ending with right parenthesis
e.g : (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
Solutions provided in first question are right for simple file that contain only one column. If I try the solution within csv file, I face multiple failures.
So I try awk similar solution, which is (I think) more "column-oriented".
I have try
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I though by fixing field delimiter (;), "my regex swap" will succes in every field. It was a mistake.
Here is an exemple of failure
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally) : why awk fail when it success with one-column file. what is the best tool to face this challenge ?
sed with very long regex ?
awk with very long regex ?
for loop ?
other tools ?
PS : I know I am not clear. I have 2 problems (English language, technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimetered files without any quoted values, usually awk comes to the rescue:
awk -vFS=';' -vOFS=';' '{
for (i = 1; i < NF; i++) {
split($i, t, "(")
if (length(t[1]) != 0 && length(t[2]) != 0) {
$i="("t[2]" "t[1]
}
}
print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However this will fail if fields are quoted, ie. the separator ; comes inside the values...
First we set input and output seapartor as ;
We iterate through all the fields in the line for (i = 1; i < NF; i++)
We split the line over ( character
If the first field splitted over ( is nonzero length and the second field has also nonzero length
We swap the firelds for this fields and add a space (we also remember about the removed ( on the beginning).
And then the line get's printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; i do a newline
For each line i substitute the string with at least on character before ( and a string inside ).
I then merge 7 lines using ; as separator with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for group of numbers (possibly with a decimal point) followed by a pair of parens and rearrange them in the desired fashion, globally through out each line.

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimeters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN you can simply match values where the tens place is 1-9, e.g.:
grep ';AF=0.[1-9][0-9];' your_file.csv
You could add a + after the second character group to support additional digits (i.e. 0.NNNNN) but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports alphanumerical comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields by ;. $(NF-1) address the second last field in the line. (NF is the number of fields)

remove special character in a csv unix and fix the new line

Below is my sample data in the csv .
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In above in a field I have good data along with junk data and line splited to new line .
I want to remove this special character (due to this special char and space,the line was moved to the next line) as well as merge this split line to a single line.
currently I am using something like below which is taking lots of time :
tr -cd '\11\12\15\40-\176' | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' MY_FILE.csv > MY_FILE.csv.tmp
attached a screenshot of original data in the file .
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, ' # Set comma as the field separator...
NF<10 { # For any lines that have fewer than 10 fields...
$0=last $0 # Insert the last "saved" line here,
last=$0 # and save the newly joined line for the next round.
}
NF<10 { # If we still have fewer than 10 lines,
next # repeat.
}
{
sub(/[^[:print:]]/,"") # finally, substitute an empty string
} # for all non-printables,
1' inputfile # And print the current line.

bash sort quoted csv files by numeric key

I have the following input csv file:
"aaa","1","xxx"
"ccc, Inc.","6100","yyy"
"bbb","609","zzz"
I wish to sort by the second column as numbers,
I tried
sort --field-separator=',' --key=2n
the problem is that since all values are quoted, they don't get sorted correctly by -n (numeric) option. is there a solution?
A little trick, which uses a double quote as the separator:
sort --field-separator='"' --key=4 -n
For a quoted csv use a language that has a proper csv parser. Here is an example using perl.
perl -MText::ParseWords -lne '
chomp;
push #line, [ parse_line(",", 0, $_) ];
}{
#line = sort { $a->[1] <=> $b->[1] } #line;
for (#line) {
local $" = qw(",");
print qq("#$_");
}
' file
Output:
"aaa","1","xxx"
"bbb","609","zzz"
"ccc, Inc.","6100","yyy"
Explanation:
Remove the new line from input using chomp function.
Using a code module Text::Parsewords parse the quoted line and store it in an array of array without the quotes.
In the END block, sort the array of array on second column and assign it to the original array of array.
For every item in our array of array, we set the output list separator to "," and we print it with preceding and trailing " to create the lines in original format.
Dropping your example into a file called sort2.txt I found the following to work well.
sort -t'"' -k4n sort2.txt
Using sort with the following commands (thank you for the refinements Jonathan)
-t[optional single character separator other than tab. Defined within the single quotes]'"'.
-k4 choose the value in the fourth key.(k)delimited by ", and on the 4th key value
-n numeric sort
file name avoid the use of chaining as unnecessary
Hope this helps!
There isn't going to be a really simple solution. If you make some reasonable assumptions, then you could consider:
sed 's/","/^A/g' input.csv |
sort -t'^A' -k 2n |
sed 's/^A/","/g'
This replaces the "," sequence with Control-A (shown as ^A in the code), then uses that as the field delimiter in sort (the numeric sort on column 2), and then replace the Control-A characters with "," again.
If you use bash, you can use the ANSI C quoting mechanism $'\1' to embed the control characters visibly into the script; you just have to finish the single-quoted string before the escape, and restart it afterwards:
sed 's/","/'$'\1''/g' input.csv |
sort -t$'\1' -k 2n |
sed 's/'$'\1''/","/g'
Or play with double quotes instead of single quotes, but that gets messy because of the double quotes that you are replacing. But you can simply type the characters verbatim and editors like vim will be happy to show them to you.
Sometimes the values in the CSV file are optionally quoted, only when necessary. In this case, using " as a separator is not reliable.
Example:
"Forest fruits",198
Apples,456
bananas,67
Using awk, sort and cut, you can sort the original file, here by the first column :
awk -F',' '{
a = $1; # or the column index you want
gsub(/(^"|"$)/, "", a);
print a","$0
}' file.csv | sort -k1 | cut -d',' -f1 --complement
This will bring the column you want to sort on in front without quotes, then sort it the way you want, and remove this column at the end.

Resources