Modify content inside quotation marks (bash)

Good day to all,
I was wondering how to modify the content inside quotation marks while leaving the outside unmodified.
Input line:
,,,"Investigacion,,, desarrollo",,,
Output line:
,,,"Investigacion, desarrollo",,,
Initial try:
sed 's/\"",,,""*/,/g'
But nothing happens. Thanks in advance for any clue.

The idiomatic awk way to do this is simply:
$ awk 'BEGIN{FS=OFS="\""} {sub(/,+/,",",$2)} 1' file
,,,"Investigacion, desarrollo",,,
or if you can have more than one set of quoted strings on each line:
$ cat file
,,,"Investigacion,,, desarrollo",,,"foo,,,,bar",,,
$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) sub(/,+/,",",$i)} 1' file
,,,"Investigacion, desarrollo",,,"foo,bar",,,
This approach works because everything up to the first " is field 1, everything from there to the second " is field 2, and so on, so everything between "s is in the even-numbered fields. It can only fail if you have newlines or escaped double quotes inside your fields, but that would affect every other possible solution too, so you'd need to add cases like that to your sample input if you want a solution that handles them.
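To see the field layout this relies on, here is a quick check (a sketch) that prints each "-separated field with its index; the quoted text lands in the even-numbered field:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk -F'"' '{for (i=1; i<=NF; i++) print i ": " $i}'
1: ,,,
2: Investigacion,,, desarrollo
3: ,,,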

Using a language that has built-in CSV parsing capabilities like perl will help.
perl -MText::ParseWords -ne '
print join ",", map { $_ =~ s/,,,/,/; $_ } parse_line(",", 1, $_)
' file
,,,"Investigacion, desarrollo",,,
Text::ParseWords is a core module, so you don't need to download it from CPAN. Using the parse_line function we set the delimiter and a flag to keep the quotes. Then we just do a simple substitution and join the fields to rebuild the CSV line.

Using egrep, sed and tr:
s=',,,"Investigacion,,, desarrollo",,,'
r=$(egrep -o '"[^"]*"|,' <<< "$s"|sed '/^"/s/,\{2,\}/,/g'|tr -d "\n")
echo "$r"
,,,"Investigacion, desarrollo",,,

Using awk:
awk '{
  p = ""
  # repeatedly locate a quoted field containing a run of 2+ commas
  while (match($0, /"[^"]*,{2,}[^"]*"/)) {
    t = substr($0, RSTART, RLENGTH)      # the matched quoted field
    gsub(/,+/, ",", t)                   # collapse comma runs inside it
    p = p substr($0, 1, RSTART - 1) t    # accumulate the processed part
    $0 = substr($0, RSTART + RLENGTH)    # continue with the remainder
  }
  $0 = p $0
} 1'
Test:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk ...
,,,"Investigacion, desarrollo",,,
$ echo ',,,"Investigacion,,, desarrollo",,,",,, "' | awk ...
,,,"Investigacion, desarrollo",,,", "

Related

Bash - how to convert lines into a single line of quoted values

Given a list of values returned by a previous command:
ABC-55080
ABC-55060
ABC-55040
ABC-55035
ABC-55030
ABC-55025
ABC-55020
I want to get a single-lined list of quoted values:
("ABC-55060", "ABC-55040", "ABC-55035", "ABC-55030", "ABC-55025", "ABC-55020")
I've tried to do this using awk:
cat input_list.csv | awk '{print}' ORS='", "'
But what I get is the list without the opening and closing quotes:
ABC-55060", "ABC-55040", "ABC-55035", "ABC-55030", "ABC-55025", "ABC-55020
How can I achieve this?
$ awk '{o=o s "\""$0"\""; s=", "} END{print "(" o ")"}' file
("ABC-55080", "ABC-55060", "ABC-55040", "ABC-55035", "ABC-55030", "ABC-55025", "ABC-55020")
Could you please try the following, written and tested with the shown samples in GNU awk (also tested at https://ideone.com/zO1eYf):
awk '
BEGIN{
s1="\""
OFS=", "
}
FNR>1{
val=(val?val OFS:"")s1 $0 s1
}
END{
print "(" val ")"
}
' Input_file
Explanation: in the BEGIN section we set the variable s1 to " and the output field separator to a comma followed by a space. Then, in the main program block, for every line after the first (FNR>1) we append that line's value to the variable val, wrapping it in s1. In the END block we print val with ( and ) added before and after it.
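For reference, with the sample list this prints the line below; note that the FNR>1 condition skips the first input line, which matches the desired output in the question (it omits ABC-55080):
("ABC-55060", "ABC-55040", "ABC-55035", "ABC-55030", "ABC-55025", "ABC-55020")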
If you want to use sed (the extra space in the replacement gives the ", " separators from the desired output):
sed 's/.*/"&", /;1s/^/(/;$s/, $/)/' input_list.csv | tr -d "\n"
If your input file is just an example and you actually want to process the output of a stream, you can use
cat input_list.csv | xargs -I"{}" printf '"%s", ' "{}" | sed 's/^/(/; s/, $/)\n/'
If ed is available/acceptable.
#!/bin/sh
ed -s input_list.csv <<-'EOF'
g/./s/^/"/\
s/$/", /
1,$j
s/^/(/
s/,[[:space:]]$/)/
,p
Q
EOF
Change Q to w if you want to edit the file in place.
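Given the sample input_list.csv, the script should print:
("ABC-55080", "ABC-55060", "ABC-55040", "ABC-55035", "ABC-55030", "ABC-55025", "ABC-55020")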

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one-line CSV containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and use the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with ",":
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
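The same substitution applies to the question's data with every 3rd comma replaced; a sketch, assuming the one-line CSV (here just the two sample records) is in file.csv and prepending the header by hand:
$ { printf '"id","lon","lat"\n'; perl -pe 's{,}{++$n % 3 ? $& : "\n"}ge' file.csv; }
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"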
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here; the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
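That is, keeping the awk program itself unchanged:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' <(tr -d '\n' < file.csv)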
Assuming your input file is named input:
echo '"id","lon","lat"'; awk '{ORS=NR%3?",":"\n"}1' RS=, input

trying to remove certain characters from inside of double quotes but leave them intact as delimiters

Remove characters (semicolons) from inside a quoted string but keep them intact as delimiters.
How do I get sed etc. to do this?
My input file
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
My output file needs to be cleaned up to look like
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
As seen above the semicolons need to be retained as delimiters but need to be removed from the quoted parts. Would anyone be able to let me know? Thanks.
I am actually doing some Hive programming and my Hive scripts are ready and running successfully (as tested on smaller sample data sets). Now those same scripts are giving me errors since the new big data sets are not clean, so I am trying to clean them up (and learning sed etc. along the way).
regards,
Rahul
I think you would be better off with a CSV parser.
If you have gawk, you can use the FPAT variable. Try:
gawk 'BEGIN { FPAT="([^; ]+)|(\"[^\"]+\")"; OFS=";" } { for (i=1;i<=NF;i++) gsub(/;/, "", $i) }1' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
If, for whatever reason, you cannot easily upgrade your distro, here's a solution using Perl and the CPAN module Text::CSV:
perl -MText::CSV -nle '
BEGIN { $csv = Text::CSV->new({ sep_char => ";", allow_whitespace => 1 }) }
$csv->parse($_) or die;
print join(";", map { s/;//g; s/^|$/"/g; $_ } $csv->fields())
' file
Results:
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
Suppose that you don't have a record like this:
";";";";";"
You can break your task into these steps:
cat input
"1234";"ABCDE;";"9999"
"2344;";"PQRST"; "3456;"
sed -r 's#"\s*;\s*"#|#g'
"1234|ABCDE;|9999"
"2344;|PQRST|3456;"
sed -r 's#[";]##g'
1234|ABCDE|9999
2344|PQRST|3456
sed -r 's#[^|]+#"&"#g'
"1234"|"ABCDE"|"9999"
"2344"|"PQRST"|"3456"
sed -r 's#\|#;#g'
"1234";"ABCDE";"9999"
"2344";"PQRST";"3456"
Put all commands into one:
sed -r 's#"\s*;\s*"#|#g;s#[";]##g;s#[^|]+#"&"#g;s#\|#;#g' input
sed
kent$ sed -r 's/;"(;"|$)/"\1/g' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
awk
One-liner, longer version:
kent$ awk -F'"' -v OFS='"' '{for(i=1;i<=NF;i++)if($i~/\S+;$/){sub(/;$/,"",$i)}}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
Shorter one-liner, but with trash (") on the last line:
kent$ awk -v RS='"' -v ORS='"' '/\S+;$/{sub(/;$/,"")}7' f
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
"
Here is one way to do it with awk
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
printf $i}
} {print ""}
' FS="" file
"1234";"ABCDE";"9999"
"2344";"PQRST"; "3456"
This tests whether a ; is within a block of two "; if so, it is removed.
To also remove space between fields, use this:
awk '
{for (i=1;i<=NF;i++) {
if ($i=="\"") f=!f
if ($i==";" && f) $i=x
if ($i==" " && !f) $i=x
printf $i}
} {print ""}
' FS="" file

Awk: consider a double-quoted string as one token and ignore the spaces in between

Data file - data.txt:
ABC "I am ABC" 35 DESC
DEF "I am not ABC" 42 DESC
cat data.txt | awk '{print $2}'
will print "I" instead of the quoted string.
How can I make awk ignore the spaces within the quotes and treat the quoted string as one single token?
Another alternative would be to use the FPAT variable, which defines a regular expression describing the contents of each field.
Save this AWK script as parse.awk:
#!/bin/awk -f
BEGIN {
FPAT = "([^ ]+)|(\"[^\"]+\")"
}
{
print $2
}
Make it executable with chmod +x ./parse.awk and parse your data file as ./parse.awk data.txt:
"I am ABC"
"I am not ABC"
Yes, this can be done nicely in awk. It's easy to get all the fields without any serious hacks.
(This example works in both The One True Awk and in gawk.)
{
split($0, a, "\"")
$2 = a[2]
$3 = $(NF - 1)
$4 = $NF
print "and the fields are ", $1, "+", $2, "+", $3, "+", $4
}
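Running this on the sample data.txt should print (note the two spaces after "are", since the literal string already ends with one):
and the fields are  ABC + I am ABC + 35 + DESC
and the fields are  DEF + I am not ABC + 42 + DESC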
Try this:
$ cat data.txt | awk -F\" '{print $2}'
I am ABC
I am not ABC
The top answer for this question only works for lines with a single quoted field. When I found this question I needed something that could work for an arbitrary number of quoted fields.
Eventually I came upon an answer by Wintermute in another thread, and he provided a good generalized solution to this problem. I've just modified it to remove the quotes. Note that you need to invoke awk with -F\" when running the below program.
BEGIN { OFS = "" } {
for (i = 1; i <= NF; i += 2) {
gsub(/[ \t]+/, ",", $i)
}
print
}
This works by observing that, when you separate on the " character, every other field is the inside of a quoted string, so the program replaces the whitespace in the fields that are not in quotes with a comma.
You can then easily chain another instance of awk to do whatever processing you need (just use the field separator switch again, -F,).
Note that this also works if the first field is quoted: when a line starts with ", awk makes $1 an empty string, so the quoted text still lands in the even-numbered fields.
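For instance (saving the program as dequote.awk, a hypothetical name):
$ echo 'ABC "I am ABC" 35 DESC' | awk -F\" -f dequote.awk
ABC,I am ABC,35,DESC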
I've put together a function that re-splits $0 into an array called B. Spaces between double quotes do not act as field separators. It works with any number of fields, and a mix of quoted and unquoted ones. Here goes:
#!/usr/bin/gawk -f
# Resplit $0 into array B. Spaces between double quotes are not separators.
# Single quotes not handled. No escaping of double quotes.
function resplit( a, l, i, j, b, k, BNF)   # all are local variables
{
    l = split($0, a, "\"")   # split the line on double quotes
    BNF = 0
    delete B
    for (i = 1; i <= l; ++i) {
        if (i % 2) {
            # odd chunks are outside quotes: split them on whitespace
            k = split(a[i], b)
            for (j = 1; j <= k; ++j)
                B[++BNF] = b[j]
        }
        else {
            # even chunks were quoted: keep them whole and restore the quotes
            B[++BNF] = "\"" a[i] "\""
        }
    }
}

{
    resplit()
    for (i = 1; i <= length(B); ++i)
        print i ": " B[i]
}
Hope it helps.
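For the sample data.txt it should print each token on its own line (resplit.awk is a hypothetical name for the script above):
$ gawk -f resplit.awk data.txt
1: ABC
2: "I am ABC"
3: 35
4: DESC
1: DEF
2: "I am not ABC"
3: 42
4: DESC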
Okay, if you really want all three fields, you can get them, but it takes a lot of piping:
$ cat data.txt | awk -F\" '{print $1 "," $2 "," $3}' | awk -F' ,' '{print $1 "," $2}' | awk -F', ' '{print $1 "," $2}' | awk -F, '{print $1 "," $2 "," $3}'
ABC,I am ABC,35
DEF,I am not ABC,42
By the last pipe you've got all three fields to do whatever you'd like with.
Here is something like what I finally got working that is more generic for my project.
Note it doesn't use awk.
someText="ABC \"I am ABC\" 35 DESC '1 23' testing 456"
putItemsInLines() {
local items=""
local firstItem="true"
while test $# -gt 0; do
if [ "$firstItem" == "true" ]; then
items="$1"
firstItem="false"
else
items="$items
$1"
fi
shift
done
echo "$items"
}
count=0
while read -r valueLine; do
echo "$count: $valueLine"
count=$(( $count + 1 ))
done <<< "$(eval putItemsInLines $someText)"
Which outputs:
0: ABC
1: I am ABC
2: 35
3: DESC
4: 1 23
5: testing
6: 456

rearrange data

If I have a list of data in a text file separated by newlines, is there a way to append something to the start, then the data, then append something else, then the data again?
E.g. a field X would become new X = X;
Can you do this with bash or sed or just Unix tools like cut?
EDIT:
I am trying to get "ITEM_SITE_ID :{$row['ITEM_SITE_ID']} " .
I am using this line: awk '{ print "\""$1 " {:$row['$1']} " }'
And I get this "ITEM_SITE_ID {:$row[]}
What have I missed?
I think the problem is that your single quotes are not properly escaped, which is actually impossible to do inside a single-quoted shell string.
With sed:
sed "s/\(.*\)/\1 = \1;/"
Or in your case:
sed "s/\(.*\)/\"\1 :{\$row['\1']}\"/"
And with bash:
while read -r line
do
echo "\"$line :{\$row['$line']}\""
done
And you can actually do it in awk using bash's $'' strings:
awk $'{ print "\\"" $1 " :{$row[\'" $1 "\']}\\"" }'
Awk is often the perfect tool for tasks like this. For your specific example:
awk '{ print "new " $1 " = " $1 ";" }'
