Trying to remove certain characters from inside double quotes but leave them intact as delimiters - bash

I want to remove characters (semicolons) from inside quoted strings but keep them intact as delimiters.
How do I get sed etc. to do this?
My input file:
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
My output file needs to be cleaned up to look like:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
As seen above, the semicolons need to be retained as delimiters but removed from the quoted parts. Would anyone be able to let me know how? Thanks.
I am actually doing some Hive programming, and my Hive scripts are ready and running successfully (as tested on smaller sample data sets). Those same scripts are now giving me errors because the new, larger data sets are not clean, so I am trying to clean them up (and learning sed etc. along the way).
regards,
Rahul

I think you would be better off with a CSV parser.
If you have gawk, you can use the FPAT variable. Try:
gawk 'BEGIN { FPAT="([^; ]+)|(\"[^\"]+\")"; OFS=";" } { for (i=1;i<=NF;i++) gsub(/;/, "", $i) }1' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
If, for whatever reason, you cannot easily upgrade your distro, here's a solution using Perl and the CPAN module Text::CSV:
perl -MText::CSV -nle '
BEGIN { $csv = Text::CSV->new({ sep_char => ";", allow_whitespace => 1 }) }
$csv->parse($_) or die;
print join(";", map { s/;//g; s/^|$/"/g; $_ } $csv->fields())
' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"

Assuming that you don't have a record like this:
";";";";";"
You can break your task into these steps:
cat input
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
sed -r 's#"\s*;\s*"#|#g'
"1234|ABCDE;|9999"
"2344;|PQRST|3456;"
sed -r 's#[";]##g'
1234|ABCDE|9999
2344|PQRST|3456
sed -r 's#[^|]+#"&"#g'
"1234"|"ABCDE"|"9999"
"2344"|"PQRST"|"3456"
sed -r 's#\|#;#g'
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
Put all commands into one:
sed -r 's#"\s*;\s*"#|#g;s#[";]##g;s#[^|]+#"&"#g;s#\|#;#g' input
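As a rough sketch, the combined command can be wrapped in a small function and run against the sample input. The function name clean_semicolons is made up for this example, [[:space:]] stands in for \s for portability, and the approach assumes that | never occurs in the data, since it is used as a temporary delimiter.

```shell
# Hypothetical helper: reads records on stdin, writes cleaned records to stdout.
# Assumption: the data never contains a literal | character.
clean_semicolons() {
  sed -E \
    -e 's#"[[:space:]]*;[[:space:]]*"#|#g' \
    -e 's#[";]##g' \
    -e 's#[^|]+#"&"#g' \
    -e 's#\|#;#g'
}

# Step 1 marks the real delimiters, step 2 drops all quotes and stray
# semicolons, step 3 re-quotes each value, step 4 restores the delimiter.
printf '%s\n' '"1234";"ABCDE;";"9999"' '"2344;";"PQRST"; "3456;"' | clean_semicolons
```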

sed
kent$ sed -r 's/;"(;"|$)/"\1/g' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
awk
One-liner, longer version:
kent$ awk -F'"' -v OFS='"' '{for(i=1;i<=NF;i++)if($i~/\S+;$/){sub(/;$/,"",$i)}}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
One-liner, shorter, but with a stray " on the last line:
kent$ awk -v RS='"' -v ORS='"' '/\S+;$/{sub(/;$/,"")}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
"

Here is one way to do it with awk
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
printf $i}
} {print ""}
' FS="" file
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
This tests whether a ; is inside a pair of " characters and, if so, removes it.
To also remove the space between fields, use this:
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
if ($i==" " && !f) $i=x
printf $i}
} {print ""}
' FS="" file
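The same quote-tracking idea can also be sketched without a per-character loop, by splitting each line on the double quote so that even-numbered fields are the quoted parts. This is only an illustration: the function name strip_inner is invented here, and it assumes quotes are balanced and never escaped.

```shell
# Hypothetical helper: reads records on stdin, writes cleaned records to stdout.
# Assumption: every quote is part of a balanced, unescaped pair.
strip_inner() {
  awk 'BEGIN { FS = OFS = "\"" }
  {
    for (i = 2; i <= NF; i += 2) gsub(/;/, "", $i)  # inside quotes: drop semicolons
    for (i = 1; i <= NF; i += 2) gsub(/ /, "", $i)  # between quotes: drop spaces
    print
  }'
}

printf '%s\n' '"2344;";"PQRST"; "3456;"' | strip_inner
```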

Related

change numerical value in file to characters via awk

I'm looking to replace the numerical values in a file with a new value that I provide. The value can be present in any part of the text; in some cases it is in the third position, but that is not always the case. I also want to save the result as a new version of the file.
original format
A:fdg:user#server:r
A:g:1234:xtcy
A:d:1111:xtcy
modified format
A:fdg:user#server:rxtTncC
A:g:replaced_value:xtcy
A:d:replaced_value:xtcy
bash line command with awk:
awk -v newValue="newVALUE" 'BEGIN{FS=OFS=":"} /:.:.*:/ && $3~/^[0-9]+$/{$3=newValue} 1' original_file.txt > replaced_file.txt
You can simply use sed instead of awk:
sed -E 's/\b[0-9]+\b/replaced_value/g' /path/to/infile > /path/to/outfile
Here is an awk that asks you for replacement values for each numerical value it meets:
$ awk '
BEGIN {
FS=OFS=":" # delimiters
}
{
for(i=1;i<=NF;i++) # loop all fields
if($i~/^[0-9]+$/) { # if numerical value found
printf "Provide replacement value for %d: ",$i > "/dev/stderr"
getline $i < "/dev/stdin" # ask for a replacement
}
}1' file_in > file_out # write output to a new file
I would use GNU AWK for this task in the following way. Let the content of file.txt be
A:fdg:user#server:rxtTncC
A:g:1234:xtcy
A:d:1111:xtcy
then
awk 'BEGIN{newvalue="replacement"}{gsub(/[[:digit:]]+/,newvalue);print}' file.txt
output
A:fdg:user#server:rxtTncC
A:g:replacement:xtcy
A:d:replacement:xtcy
Explanation: replace each run of one or more digits with newvalue. Disclaimer: I assumed a numeric value is something consisting solely of digits.
(tested in gawk 4.2.1)
How about
awk -F : -v OFS=: '$3 ~ /^[0-9]+$/ { $3 = "new value" } { print }' original_file >replaced_file
?
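If the number is not always in the third field, looping over all fields is one way to cover that case. The sketch below (the function name and sample value are assumptions for illustration) replaces only fields that consist entirely of digits, and sets OFS so the colons survive the field assignment.

```shell
# Hypothetical helper: usage: replace_numeric_fields NEWVALUE < infile > outfile
replace_numeric_fields() {
  awk -v newValue="$1" 'BEGIN { FS = OFS = ":" }
  {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+$/) $i = newValue   # whole field must be digits
    print
  }'
}

printf '%s\n' 'A:g:1234:xtcy' 'A:d:1111:xtcy' | replace_numeric_fields replaced_value
```

Unlike a global digit substitution, this leaves digits inside mixed fields such as user1 untouched.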

Updating a specific field with sed

I'm trying to update a specific field on a specific line with the sed command in Bourne shell.
Let's say I have a file TopScorer.txt:
Player:Games:Goals:Assists
Salah:9:9:3
Kane:10:8:4
I need to update the 3rd column (Goals) of a player. I tried this command, and it works unless Games and Goals have the same value, in which case it updates the first one:
player="Salah"
NewGoals="10"
OldGoals=$(awk -F':' '$1=="'$player'"' TopScorer.txt | cut -d':' -f3)
sed -i '/^'$player'/ s/'$OldGoals'/'$NewGoals'/' TopScorer.txt
Output> Salah:10:9:3 instead of Salah:9:10:3
Is there any solution? Should I use delimiters and $3==... to specify that field?
I also tried the option /2 for second occurrence but it's not very convenient in my case.
You can just do this with awk alone and not with sed. Also note that awk has an internal syntax to import variables from the shell. So your code just becomes
awk -F: -v pl="$player" -v goals="$NewGoals" \
'BEGIN { OFS = FS } $1 == pl { $3= goals }1' TopScorer.txt
The -F: sets the input delimiter to : and the -v options import your shell variables into the awk context. The BEGIN { OFS = FS } sets the output field separator to the same value as the input separator. Then we match on the imported variables and update $3 to the required value.
To make the modifications in-place, use a temporary file
awk -F: -v pl="$player" -v goals="$NewGoals" \
'BEGIN { OFS = FS } $1 == pl { $3= goals }1' TopScorer.txt > tmpfile && mv tmpfile TopScorer.txt
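A minimal sketch of the temporary-file pattern as a reusable function, assuming mktemp is available; the name set_goals and its argument order are invented for this example.

```shell
# Hypothetical helper: usage: set_goals FILE PLAYER GOALS
# Rewrites FILE in place by way of a temporary file.
set_goals() {
  tmp=$(mktemp) || return 1
  awk -F: -v pl="$2" -v goals="$3" \
    'BEGIN { OFS = FS } $1 == pl { $3 = goals } 1' "$1" > "$tmp" &&
    mv "$tmp" "$1"   # only replace the original if awk succeeded
}

# Example call: set_goals TopScorer.txt Salah 10
```

Because the comparison is on $1 and the assignment is to $3, it does not matter whether Games and Goals hold the same value.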
This might work for you (GNU sed):
(player=Salah;newGoals=10;sed -i "/^$player/s/[^:]*/$newGoals/3" /tmp/file)
Use a sub shell so as not to pollute the current shell (...). Use sed and pattern matching to match the first field of each record to the variable player and replace the third field of the matching record with the contents of newGoals.
P.S. If the variables are needed in further processes the sub shell is not necessary i.e. remove the ( and )
You can try it with Perl
$ player="Salah"
$ NewGoals="10"
$ perl -F: -lane "\$F[2]=$NewGoals if ( \$F[0] eq $player ) ; print join(':',@F) " TopScorer.txt
Player:Games:Goals:Assists
Salah:9:10:3
Kane:10:8:4
$
or export them and call Perl one-liner within single quotes
$ export NewGoals="10"
$ export player="Salah"
$ perl -F: -lane '$F[2]=$ENV{NewGoals} if $F[0] eq $ENV{player} ; print join(":",@F) ' TopScorer.txt
Player:Games:Goals:Assists
Salah:9:10:3
Kane:10:8:4
$
Note that Perl has -i switch and you can do the replacement in-place, so
$ perl -i.bak -F: -lane '$F[2]=$ENV{NewGoals} if $F[0] eq $ENV{player} ; print join(":",@F) ' TopScorer.txt
$ cat TopScorer.txt
Player:Games:Goals:Assists
Salah:9:10:3
Kane:10:8:4
$
This will work.
In the first part of the sed command, I match the full line for the player and capture the fields I want to keep using \( and \).
In the second part, I rebuild the line from some constants plus the values of \1 and \2.
player="Salah"
NewGoals="10"
sed "s/^$player:\([^:]*\):[^:]*:\([^:]*\)\$/$player:\1:$NewGoals:\2/" TopScorer.txt
Could you please try the following. The advantage of this approach is that I am not hard-coding the field number for Goals. The program looks up the header field named Goals, wherever it is (e.g. the 4th or 5th field), and changes only that column.
1st Solution: When you need to change every occurrence of the player name, use the following.
NewGoals=10
awk -v newgoals="$NewGoals" 'BEGIN{FS=OFS=":"} FNR==1{for(i=1;i<=NF;i++){if($i=="Goals"){field=i}}} FNR>1{if($1=="Salah"){$field=newgoals}} 1' Input_file
2nd Solution: In case you want to change a specific player's goals value in a specific row only, try the following.
NewGoals=10
awk -v newgoals="$NewGoals" 'BEGIN{FS=OFS=":"} FNR==1{for(i=1;i<=NF;i++){if($i=="Goals"){field=i}}} FNR>1{if($1=="Salah" && FNR==2){$field=newgoals}} 1' Input_file
The above changes only row 2; you can change that by editing FNR==2 in the second condition, where FNR is the row number in awk. To save the output into Input_file itself, append > temp_file && mv temp_file Input_file to the commands above.
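The header lookup can be sketched with the column name and player passed in as variables rather than hard-coded. The function name set_column and its argument order are assumptions made for this illustration.

```shell
# Hypothetical helper: usage: set_column COLUMN PLAYER VALUE < infile > outfile
set_column() {
  awk -v col="$1" -v pl="$2" -v val="$3" 'BEGIN { FS = OFS = ":" }
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) field = i }  # find the column
  NR > 1 && field && $1 == pl { $field = val }                    # update matching rows
  { print }'
}

printf '%s\n' 'Player:Games:Goals:Assists' 'Salah:9:9:3' 'Kane:10:8:4' |
  set_column Goals Salah 10
```

If the named column is not present in the header, field stays unset and the input is printed unchanged.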

Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I have a "map" file which contains the records to be searched and replaced, one per line, separated by a space, and an "input" file where I need the changes to be made. The example files and the script I wrote are below.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
itsa(old_0)single(old_2)string(old_1)with(old_5)ocurrences(old_4)ofthe(old_3)records
Script
#!/bin/bash
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replaces are made correctly, but I can't find a way to save the output of each iteration and work over it in the next. So, my output looks like this:
itsa(new_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(new_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(new_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(new_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(new_4)ofthe(old_3)records
Yet, the desired output would look like this:
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
Can anyone shed some light on these dark waters? Thanks in advance!
Improving the existing script
Improvements:
Use "$()" instead of ``. It supports whitespace and is easier to read.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text contains the complete input file and is modified in each iteration. The output of this script is the file after all replacements were done.
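As a possible further simplification, read can split each map line into two variables directly, which avoids the two awk calls per line. This is a sketch under the assumption that the map entries contain no characters that are special to sed (such as / or regex metacharacters); the function name replace_all is invented here.

```shell
# Hypothetical helper: usage: replace_all MAPFILE INPUTFILE
replace_all() {
  text=$(cat "$2")
  # Each map line is "new old"; read assigns the two words in that order.
  while read -r new old; do
    text=$(printf '%s\n' "$text" | sed "s/$old/$new/g")
  done < "$1"
  printf '%s\n' "$text"
}

# Example call: replace_all map input
```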
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
...
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
...
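The two steps can be sketched as one function. It assumes the map entries contain neither / nor regex metacharacters, and note that a key such as old_1 would also match the start of old_10; the function name apply_map is hypothetical.

```shell
# Hypothetical helper: usage: apply_map MAPFILE INPUTFILE
apply_map() {
  # Turn each "new old" line into an "s/old/new/g" command ...
  script=$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' "$1")
  # ... and run all generated commands in a single sed invocation.
  sed "$script" "$2"
}

# Example call: apply_map map input
```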
It is possible in GNU Awk as follows,
awk 'FNR==NR{hash[$2]=$1; next} \
{for (i=1; i<=NF; i++)\
{for(key in hash) \
{if (match ($i,key)) {$i=sprintf("(%s)",hash[key]); break}}}print}' \
map-file FS='[()]' OFS= input-file
produces an output as,
itsa(new_0)single(new_2)string(new_1)withold_5ocurrences(new_4)ofthe(new_3)records
Another in GNU awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
n=split($0,b,"[()]")
for(i=1;i<=n;i++)
printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
First, the map is read into a hash. When processing the file, each record is split on ( and ). Every other element (i%2==0) may be in the map. While printing, the ternary operator tests whether the element is found in a; when there is a match, the replacement is output in parentheses.

Using awk to search for a line that starts with but also contains a string

I have a file that has multiple lines that start with a keyword. I only want to modify one of them, and it's easy to distinguish the two: I want the one that is under the [dbinfo] section. The domain name is static, so I know that won't change.
awk -F '=' '$1 ~ /^dbhost/ {print $NF};' myfile.txt
myfile.txt
[ual]
path=/web/
dbhost=ez098sf
[dbinfo]
dbhost=ec0001.us-east-1.localdomain
dbname=ez098sf_default
dbpass=XXXXXX
You can use this awk command to first check for presence of [dbinfo] section and then modify dbhost parameter:
awk -v h='newhost' 'BEGIN{FS=OFS="="}
$0 == "[dbinfo]" {sec=1} sec && $1 == "dbhost"{$2 = h; sec=0} 1' file
[ual]
path=/web/
dbhost=ez098sf
[dbinfo]
dbhost=newhost
dbname=ez098sf_default
dbpass=XXXXXX
You want to utilize a little bit of a state machine here:
awk -F '=' '
$0 ~ /^\[.*\]/ {in_db_info=($0=="[dbinfo]")}
$0 ~ /^dbhost/{if (in_db_info) print $2;}' myfile.txt
You can also do it with sed:
sed '/\[dbinfo\]/,/\[/s/\(^dbhost=\).*/\1domain.com/' myfile.txt
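A combined sketch of the state-machine idea that modifies the value instead of printing it. The function name set_ini_value and its parameters are assumptions for illustration, and the file is assumed to use plain [section] headers and key=value lines.

```shell
# Hypothetical helper: usage: set_ini_value SECTION KEY VALUE < infile > outfile
set_ini_value() {
  awk -v sec="[$1]" -v key="$2" -v val="$3" 'BEGIN { FS = "=" }
  /^\[.*\]$/          { in_sec = ($0 == sec) }   # track which section we are in
  in_sec && $1 == key { $0 = key "=" val }       # rewrite the line only inside it
  { print }'
}

# Example call: set_ini_value dbinfo dbhost newhost < myfile.txt
```

The flag is recomputed at every section header, so a dbhost line in a later section is left alone as well.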

Modify content inside quotation marks, BASH

Good day to all,
I was wondering how to modify the content inside quotation marks and leave the outside unmodified.
Input line:
,,,"Investigacion,,, desarrollo",,,
Output line:
,,,"Investigacion, desarrollo",,,
Initial try:
sed 's/\"",,,""*/,/g'
But nothing happens. Thanks in advance for any clue.
The idiomatic awk way to do this is simply:
$ awk 'BEGIN{FS=OFS="\""} {sub(/,+/,",",$2)} 1' file
,,,"Investigacion, desarrollo",,,
or if you can have more than one set of quoted strings on each line:
$ cat file
,,,"Investigacion,,, desarrollo",,,"foo,,,,bar",,,
$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) sub(/,+/,",",$i)} 1' file
,,,"Investigacion, desarrollo",,,"foo,bar",,,
This approach works because everything up to the first " is field 1, everything from there to the second " is field 2, and so on, so the text between pairs of " always lands in the even-numbered fields. It can only fail if you have newlines or escaped double quotes inside your fields, but that would affect every other possible solution too, so you'd need to add cases like that to your sample input if you want a solution that handles them.
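To see the field numbering for yourself, a small sketch like the following prints each field with its index (the function name show_quote_fields is made up for this example):

```shell
# Hypothetical helper: prints index:[field] for every "-separated field on stdin.
show_quote_fields() {
  awk 'BEGIN { FS = "\"" } { for (i = 1; i <= NF; i++) printf "%d:[%s]\n", i, $i }'
}

printf '%s\n' ',,,"Investigacion,,, desarrollo",,,' | show_quote_fields
# 1:[,,,]
# 2:[Investigacion,,, desarrollo]
# 3:[,,,]
```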
Using a language that has built-in CSV parsing capabilities like perl will help.
perl -MText::ParseWords -ne '
print join ",", map { $_ =~ s/,,,/,/; $_ } parse_line(",", 1, $_)
' file
,,,"Investigacion, desarrollo",,,
Text::ParseWords is a core module so you don't need to download it from CPAN. Using the parse_line method we set the delimiter and a flag to keep the quotes. Then just do simple substitution and join the line to make your CSV again.
Using egrep, sed and tr:
s=',,,"Investigacion,,, desarrollo",,,'
r=$(egrep -o '"[^"]*"|,' <<< "$s"|sed '/^"/s/,\{2,\}/,/g'|tr -d "\n")
echo "$r"
,,,"Investigacion, desarrollo",,,
Using awk:
awk '{ p = ""; while (match($0, /"[^"]*,{2,}[^"]*"/)) { t = substr($0, RSTART, RLENGTH); gsub(/,+/, ",", t); p = p substr($0, 1, RSTART - 1) t; $0 = substr($0, RSTART + RLENGTH); }; $0 = p $0 } 1'
Test:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk ...
,,,"Investigacion, desarrollo",,,
$ echo ',,,"Investigacion,,, desarrollo",,,",,, "' | awk ...
,,,"Investigacion, desarrollo",,,", "
