I have the following string and I want to split it into 3 parts:
Text:
<http://rdf.freebase.com/ns/american_football.football_player.footballdb_id> <http://www.w3.org/2000/01/rdf-schema#label> "footballdb ID"#en
Output should be
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
Basically, I am splitting an RDF-ish tuple into its parts.
I want to do this via a UNIX script, but I do not know sed or awk.
Please help.
If your input fields are tab-separated, this will produce your posted desired output:
$ awk -F'\t' '{ for (i=1;i<=NF;i++) printf "$%d = %s\n", i, $i }' file
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
Alternatively this might be what you want if your fields are not tab-separated:
$ cat tst.awk
{
gsub(/<[^>]+>/,"&\n")
split($0,a,/[[:space:]]*\n[[:space:]]*/)
for (i=1; i in a; i++)
printf "$%d = %s\n", i, a[i]
}
$
$ awk -f tst.awk file
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
If that's not how your input fields are separated and/or not what you want output, update your question to clarify.
read -r A B C <<< "$string"
printf '$1 = %s\n$2 = %s\n$3 = %s\n' "$A" "$B" "$C"
Output:
$1 = <http://rdf.freebase.com/ns/american_football.football_player.footballdb_id>
$2 = <http://www.w3.org/2000/01/rdf-schema#label>
$3 = "footballdb ID"#en
Whatever you use to split the string needs to recognize not only the white space but also the convention that the double quote "protects" the blank space before ID and prevents it from splitting the fields. I fear this computation may be beyond what is possible with sed. You could do it in awk, but awk provides little special advantage here.
You show a space-separated format with quotes. A similar problem is to parse comma-separated format with quotes. Related questions:
Parse CSV with double quote in some cases
How to split csv whose columns may contain ,
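The quote-aware splitting this answer describes could be sketched in portable awk roughly as follows. This is only a sketch under assumptions: split_quoted is a hypothetical helper name, tokens are separated by spaces or tabs, and double quotes are never escaped or nested.

```shell
# split_quoted (hypothetical helper): print each token of "$1" on its own line.
# A token is either a double-quoted run plus anything glued to it (such as a
# language tag), or a run of non-blank characters.
split_quoted() {
    printf '%s\n' "$1" | awk '{
        s = $0
        n = 0
        # Peel one token at a time off the front of the remaining text.
        while (match(s, /^[ \t]*("[^"]*"[^ \t]*|[^ \t]+)/)) {
            tok = substr(s, RSTART, RLENGTH)
            sub(/^[ \t]+/, "", tok)          # drop the leading blanks
            printf "$%d = %s\n", ++n, tok
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

split_quoted '<http://example.org/a> <http://example.org/b#label> "footballdb ID"#en'
```

The quoted alternative is listed first in the pattern so that a token starting with a double quote is consumed up to its closing quote rather than being cut at the first blank.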
echo "your string" |awk -F" " '{ print $1 $2 $3 $4}'
awk '{ s = $1; p = $2; $1 = $2 = ""; sub(/^ +/, ""); print "$1 = " s "\n$2 = " p "\n$3 = " $0 }' filename
I've started to learn bash and am totally stuck on a task. I have a comma-separated csv file with records like:
id,location_id,organization_id,service_id,name,title,email,department
1,1,,,Name surname,department1 department2 department3,,
2,1,,,name Surname,department1,,
3,2,,,Name Surname,"department1 department2, department3",,
etc.
I need to format it this way:
name and surname must start with a capital letter
add an email record that consists of the first letter of the name and the full surname, in lowercase
create a new csv with records from the old csv with corrected fields.
I split the csv into records using awk (because some fields contain a comma between quotes, e.g. "department1 department2, department3").
#!/bin/bash
input="$HOME/test.csv"
exec 0<$input
while read line; do
awk -v FPAT='"[^"]*"|[^,]*' '{
...
}' $input)
done
inside awk {...} (NF=8 for each record), I tried to use certain field values ($1 $2 $3 $4 $5 $6 $7 $8):
#it doesn't work
IFS=' ' read -a name_surname<<<$5 # Field 5 match to *name* in heading of csv
# Could I use inner awk with field values of outer awk ($5) to separate the field value of outer awk $5 ?
# as an example:
# $5="${awk '{${1^}${2^}}' $5}"
# where ${1^} and ${2^} fields of inner awk
name_surname[0]=${name_surname[0]^}
name_surname[1]=${name_surname[1]^}
$5="${name_surname[0]}' '${name_surname[1]}"
email_name=${name_surname[0]:0:1}
email_surname=${name_surname[1]}
domain='@domain'
$7="${email_name,}${email_surname,,}$domain" # match to field 7 *email* in heading of csv
How can I add the field values ($1 $2 $3 $4 $5 $6 $7 $8) to an array and call the join function on each loop iteration to add the record to the new csv file?
function join { local IFS="$1"; shift; echo "$*"; }
result=$(join , "${arr[@]}")
echo "$result" >> new.csv
This may be what you're trying to do (using gawk for FPAT as you already were doing) but without more representative sample input and the expected output it's a guess:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
NR > 1 {
n = split($5,name,/\s*/)
$7 = tolower(substr(name[1],1,1) name[n]) "@example.com"
print
}
' "${@:--}"
$ ./tst.sh test.csv
1,1,,,Name surname,department1 department2 department3,nsurname@example.com,
2,1,,,name Surname,department1,nsurname@example.com,
3,2,,,Name Surname,"department1 department2, department3",nsurname@example.com,
I put the awk script inside a shell script since that looks like what you want. You don't need to do that, of course: you could just save the awk script in a file and invoke it with awk -f.
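The question also asks for the name and surname to start with a capital letter, which the script above leaves untouched. A minimal sketch of that step on its own, in portable awk, might look like this. The names fix_name and cap are hypothetical, example.com is a placeholder domain, and the helper takes only the name field rather than a whole CSV record.

```shell
# fix_name (hypothetical helper): capitalize every word of a name and derive
# an email from the first letter of the first word plus the last word.
fix_name() {
    printf '%s\n' "$1" | awk '
        # cap: upper-case the first character of a word, keep the rest as-is
        function cap(w) { return toupper(substr(w, 1, 1)) substr(w, 2) }
        {
            n = split($0, part, " ")
            name = cap(part[1])
            for (i = 2; i <= n; i++)
                name = name " " cap(part[i])
            # placeholder domain, not taken from the original data
            email = tolower(substr(part[1], 1, 1) part[n]) "@example.com"
            print name "," email
        }'
}

fix_name 'name surname'
```

Inside the FPAT-based script, the same cap function could be applied to the words of $5 before the record is printed.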
Ed Morton's answer works completely.
In case it is helpful to someone, I added one more check: if the CSV file contains more than one email address with the same name, an index number is appended to the email's local part. The output is sent to a file.
#!/usr/bin/env bash
input="$HOME/test.csv"
exec 0<$input
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
(NR == 1) {print} #header of csv
(NR > 1) {
if (length($0) > 1) { #exclude empty lines
n = split($5,name,/\s*/)
email_local_part = tolower(substr(name[1],1,1) name[n])
#seen[] counts how often each local part has already been used;
#an exact-key lookup avoids treating "jdoe" as a duplicate of "jdoerr"
count = seen[email_local_part]++
#append the occurrence number to every repeat of an email address
if (count == 0) { $7 = email_local_part "@abc.com" }
else { $7 = email_local_part count "@abc.com" }
print
}
}
' "${@:--}" > new.csv
I have a variable in format "world,asia,india" the length of variable is not constant and changes. What I want is I want to format the variable as below.
"world" , "asia" , "india"
echo $variable | awk 'BEGIN { OFS = "," } { $1 = $1; print }'
no change in the variable
input:
"world,asia,india"
"earth,world,europe,france"
output:
"world" , "asia" , "india"
"earth", "world" , "europe" , "france"
I wouldn't do it by tuning OFS, because you would still have to handle the first and last fields. Just go through all the fields and wrap each one in quotes. Straightforward:
awk -v q='"' 'BEGIN{FS=OFS=","}{for(x=1;x<=NF;x++)$x=q $x q}7'
The above line will do the trick.
This looks more like a search-and-replace problem than a field separator problem:
awk '{gsub(/,/, "\",\""); print}'
# or
sed 's/,/","/g'
I'm assuming your input actually has leading and trailing quotes already, and you only need to change the "inner" commas.
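Whether the variable already carries the outer quotes is exactly the uncertainty raised above. One way to sidestep it, sketched here under the assumption that no field contains a double quote of its own, is to strip any existing quotes first and then wrap every field. quote_fields is a hypothetical helper name, and the " , " joiner is copied from the desired output in the question.

```shell
# quote_fields (hypothetical helper): turn  world,asia,india  or
# "world,asia,india"  into  "world" , "asia" , "india"
quote_fields() {
    printf '%s\n' "$1" | awk -F',' '{
        gsub(/"/, "")                 # drop existing quotes; $0 is re-split
        out = ""
        for (i = 1; i <= NF; i++)
            out = out (i > 1 ? " , " : "") "\"" $i "\""
        print out
    }'
}

quote_fields '"world,asia,india"'
quote_fields 'earth,world,europe,france'
```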
I really like @Kent's answer.
I suggest we let awk do the parsing by rebuilding the record with $1=$1:
echo "$variable" | awk 'BEGIN{FS=",";OFS="\",\""}{$1=$1}1'
I have to print out some values in a txt file.
they are of the following format
input="Sno;Name;Field1;Field2"
However the output must be:
Sno-Name
FIELDS ALLOCATED:
Field1
Field2
I do it like so:
echo "$input" | awk -F';' '{print $1"-"$2}' >>"$txtfile"
echo "FIELDS ALLOCATED:" >>"$txtfile"
echo "$input" | cut -d';' -f 3,4 | tr ';' '\n' >>"$txtfile"
This is easy. However, the problem is that Field1 or Field2 can contain new lines. Whenever this happens, the cut or awk doesn't read the field number 4 and treats it as a new line. Do help how can I print the two fields (with new lines preserved) from the given input format.
If the input is well-formed, you can collect input lines until you have four fields.
awk -F ';' 'r { $0 = r ORS $0 }    # glue the saved partial record to this line
NF<4 { r = $0; next }              # still incomplete: remember it, read on
{ print $1 "-" $2
print "FIELDS ALLOCATED:"
print $3; print $4
print ""; r="" }' file
Single gnu-awk can do the job with FPAT and empty RS:
input=$'Sno;Name;Field1\nFoo;Field2'
awk -v RS= -v FPAT='[^;]+' '{
printf "%s-%s\nFIELDS ALLOCATED:\n%s\n%s\n", $1, $2, $3, $4}' <<< "$input"
Sno-Name
FIELDS ALLOCATED:
Field1
Foo
Field2
Just change the input record separator in awk, RS. The < and > are added around each field for clarity.
EDIT: removed the extra trailing newline by adding ';' at the end of the here-string data, plus another condition.
input="Sno;Name;Fie
ld1;Fi
eld2"
awk 'BEGIN{RS=";"} NR==1{f1=$0};
NR==2{print f1 "-" $0; print "FIELDS ALLOCATED:"}
$0=="\n"{next}
NR>2{print "<" $0 ">"}' <<< "$input;"
Gives:
Sno-Name
FIELDS ALLOCATED:
<Fie
ld1>
<Fi
eld2>
input=$'Sno;Name;Field1\nFoo;Field2'
awk 'BEGIN{ RS = "\n\n+" ; FS = ";" } { print $1"-"$2; for(i=3;i<=NF;i++) {print $i}}' <<<"$input"
Since the number of fields is not known in advance, I added a for loop that runs up to NF and changed RS to a blank line instead of a newline.
Good day to all,
I was wondering how to modify the content inside quotation marks while leaving everything outside them unmodified.
Input line:
,,,"Investigacion,,, desarrollo",,,
Output line:
,,,"Investigacion, desarrollo",,,
Initial try:
sed 's/\"",,,""*/,/g'
But nothing happens. Thanks in advance for any clue.
The idiomatic awk way to do this is simply:
$ awk 'BEGIN{FS=OFS="\""} {sub(/,+/,",",$2)} 1' file
,,,"Investigacion, desarrollo",,,
or if you can have more than one set of quoted strings on each line:
$ cat file
,,,"Investigacion,,, desarrollo",,,"foo,,,,bar",,,
$ awk 'BEGIN{FS=OFS="\""} {for (i=2;i<=NF;i+=2) sub(/,+/,",",$i)} 1' file
,,,"Investigacion, desarrollo",,,"foo,bar",,,
This approach works because everything up to the first " is field 1, everything from there to the second " is field 2, and so on, so the text between quotes always lands in the even-numbered fields. It can only fail if you have newlines or escaped double quotes inside your fields, but that would affect every other possible solution too, so add cases like that to your sample input if you want a solution that handles them.
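To see the odd/even numbering described above, the fields can simply be listed. This is an illustrative sketch only; show_fields is a hypothetical helper name and the sample line is a shortened version of the one in the question.

```shell
# show_fields (hypothetical helper): split "$1" on double quotes and label
# every field as lying outside or inside a quoted string.
show_fields() {
    printf '%s\n' "$1" | awk -F'"' '{
        for (i = 1; i <= NF; i++)
            printf "%d %s: [%s]\n", i, (i % 2 ? "outside" : "inside"), $i
    }'
}

show_fields ',,,"a,,, b",,,'
```

For this sample the only even-numbered field is 2, which holds the quoted text a,,, b, while fields 1 and 3 hold the commas outside the quotes.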
Using a language that has built-in CSV parsing capabilities like perl will help.
perl -MText::ParseWords -ne '
print join ",", map { $_ =~ s/,,,/,/; $_ } parse_line(",", 1, $_)
' file
,,,"Investigacion, desarrollo",,,
Text::ParseWords is a core module so you don't need to download it from CPAN. Using the parse_line function we set the delimiter and a flag to keep the quotes. Then we just do a simple substitution and join the fields to make the CSV line again.
Using grep -E, sed and tr:
s=',,,"Investigacion,,, desarrollo",,,'
r=$(grep -Eo '"[^"]*"|,' <<< "$s" | sed '/^"/s/,\{2,\}/,/g' | tr -d "\n")
echo "$r"
,,,"Investigacion, desarrollo",,,
Using awk:
awk '{ p = ""; while (match($0, /"[^"]*,{2,}[^"]*"/)) { t = substr($0, RSTART, RLENGTH); gsub(/,+/, ",", t); p = p substr($0, 1, RSTART - 1) t; $0 = substr($0, RSTART + RLENGTH); }; $0 = p $0 } 1'
Test:
$ echo ',,,"Investigacion,,, desarrollo",,,' | awk ...
,,,"Investigacion, desarrollo",,,
$ echo ',,,"Investigacion,,, desarrollo",,,",,, "' | awk ...
,,,"Investigacion, desarrollo",,,", "
Data file - data.txt:
ABC "I am ABC" 35 DESC
DEF "I am not ABC" 42 DESC
cat data.txt | awk '{print $2}'
prints "I instead of the whole quoted string.
How can I make awk ignore the spaces within the quotes and treat the quoted string as one single token?
Another alternative would be to use the FPAT variable (a gawk extension), which defines a regular expression describing the contents of each field.
Save this AWK script as parse.awk:
#!/bin/awk -f
BEGIN {
FPAT = "([^ ]+)|(\"[^\"]+\")"
}
{
print $2
}
Make it executable with chmod +x ./parse.awk and parse your data file as ./parse.awk data.txt:
"I am ABC"
"I am not ABC"
Yes, this can be done nicely in awk. It's easy to get all the fields without any serious hacks.
(This example works in both The One True Awk and in gawk.)
{
split($0, a, "\"")
$2 = a[2]
$3 = $(NF - 1)
$4 = $NF
print "and the fields are ", $1, "+", $2, "+", $3, "+", $4
}
Try this:
$ cat data.txt | awk -F\" '{print $2}'
I am ABC
I am not ABC
The top answer for this question only works for lines with a single quoted field. When I found this question I needed something that could work for an arbitrary number of quoted fields.
Eventually I came upon an answer by Wintermute in another thread, and he provided a good generalized solution to this problem. I've just modified it to remove the quotes. Note that you need to invoke awk with -F\" when running the below program.
BEGIN { OFS = "" } {
for (i = 1; i <= NF; i += 2) {
gsub(/[ \t]+/, ",", $i)
}
print
}
This works by observing that every other element in the array will be inside of the quotes when you separate by the "-character, and so it replaces the whitespace dividing the ones not in quotes with a comma.
You can then easily chain another instance of awk to do whatever processing you need (just use the field separator switch again, -F,).
Note that a quoted first field should not break this: when the line starts with a ", field 1 is simply empty and the quoted text still lands in the even-numbered fields, so the loop can keep starting at 1.
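The case of a quoted first field can be checked directly. The sketch below wraps the program above in a hypothetical helper, quoted_to_csv, and runs it on one line from the question and on a variant that starts with a quote. It assumes there is at least one blank outside the quotes, since the record is only rebuilt without quotes when a field is actually modified.

```shell
# quoted_to_csv (hypothetical helper): split on double quotes, turn blanks in
# the odd-numbered (unquoted) fields into commas, and rejoin without quotes.
quoted_to_csv() {
    printf '%s\n' "$1" | awk -F'"' 'BEGIN { OFS = "" } {
        for (i = 1; i <= NF; i += 2)
            gsub(/[ \t]+/, ",", $i)
        print
    }'
}

quoted_to_csv 'ABC "I am ABC" 35 DESC'
quoted_to_csv '"I am ABC" 35 DESC'
```

When the line starts with a quote, field 1 is the empty string before it, so the quoted text is still field 2 and is left alone.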
I've scrunched up together a function that re-splits $0 into an array called B. Spaces between double quotes are not acting as field separators. Works with any number of fields, a mix of quoted and unquoted ones. Here goes:
#!/usr/bin/gawk -f
# Resplit $0 into array B. Spaces between double quotes are not separators.
# Single quotes not handled. No escaping of double quotes.
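function resplit(    a, l, i, j, b, k, BNF) # all are local variables
{
    l = split($0, a, "\"")
    BNF = 0
    delete B
    for (i = 1; i <= l; ++i)
    {
        if (i % 2)
        {
            # odd pieces lie outside quotes: split them on whitespace
            k = split(a[i], b)
            for (j = 1; j <= k; ++j)
                B[++BNF] = b[j]
        }
        else
        {
            # even pieces lie inside quotes: keep them whole, quotes restored
            B[++BNF] = "\"" a[i] "\""
        }
    }
}
{
    resplit()
    for (i = 1; i <= length(B); ++i)
        print i ": " B[i]
}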
Hope it helps.
Okay, if you really want all three fields, you can get them, but it takes a lot of piping:
$ cat data.txt | awk -F\" '{print $1 "," $2 "," $3}' | awk -F' ,' '{print $1 "," $2}' | awk -F', ' '{print $1 "," $2}' | awk -F, '{print $1 "," $2 "," $3}'
ABC,I am ABC,35
DEF,I am not ABC,42
By the last pipe you've got all three fields to do whatever you'd like with.
Here is something like what I finally got working that is more generic for my project.
Note it doesn't use awk.
someText="ABC \"I am ABC\" 35 DESC '1 23' testing 456"
putItemsInLines() {
    local items=""
    local firstItem="true"
    while test $# -gt 0; do
        if [ "$firstItem" == "true" ]; then
            items="$1"
            firstItem="false"
        else
            items="$items"$'\n'"$1"
        fi
        shift
    done
    echo "$items"
}
count=0
# eval re-parses $someText so the shell honours its quotes; only use it on trusted input
while read -r valueLine; do
    echo "$count: $valueLine"
    count=$(( count + 1 ))
done <<< "$(eval putItemsInLines $someText)"
Which outputs:
0: ABC
1: I am ABC
2: 35
3: DESC
4: 1 23
5: testing
6: 456