I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the output of fields ignoring the escaped commas. Here those will be fields 2 or 3 like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. The second one is performed only on the second field (i.e. $2), so it does not affect the other fields. See: Changing Fields.
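The recomputation can be seen directly by printing NF before and after the record-level gsub. This is only a small sketch; the count_fields name is made up for illustration.

```shell
# count_fields: hypothetical helper, prints NF before and after the
# record-level substitution for each input line.
count_fields() {
  awk -F',' '{
    before = NF            # fields as split on every comma
    gsub(/\\,/, "\n")      # modifies $0, so awk re-splits the record
    print before, NF
  }'
}

printf '%s\n' 'john,science\,social,football,florence_school' | count_fields
```

For the sample line this prints 5 4: five fields while the escaped comma still counts as a separator, four once it has been replaced.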
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
You could replace the \, sequences with another character that won't appear in your text, split the text around the remaining commas, then replace the chosen character with commas:
sed $'s/\\\,/\037/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\037' ','
In this case I'm using the ASCII control char "Unit Separator" (\037 in octal), which I'm pretty sure your input won't contain.
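If you would rather not rely on "pretty sure", the placeholder can be checked against the input first. A minimal sketch, assuming the Unit Separator (octal 037) as placeholder and a made-up helper name show_fields; it prints only fields 2 and 3:

```shell
# Placeholder: ASCII Unit Separator (octal 037). Any byte that never
# occurs in the data would do; this choice is an assumption.
sep=$(printf '\037')

# show_fields: hypothetical helper, reads records on stdin.
show_fields() {
  data=$(cat)
  # Refuse to continue if the placeholder already occurs in the input.
  if printf '%s' "$data" | grep -q "$sep"; then
    echo "input already contains the placeholder" >&2
    return 1
  fi
  printf '%s\n' "$data" |
    sed "s/\\\\,/$sep/g" |               # hide the escaped commas
    awk -F, '{ print $2 " | " $3 }' |    # split on the real commas
    tr "$sep" ','                        # restore the hidden commas
}

printf '%s\n' 'james,painting,tennis\,ping_pong\,chess,highmount_school' | show_fields
```

The sample line gives painting | tennis,ping_pong,chess.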
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[@]}"
echo "list_of_sports : ${list_of_sports[@]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than a solution using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character and then use that character to iterate over the elements of the second and third fields.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
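Replace escaped commas with newlines, then swap newlines and commas so that each field sits on its own line and the escaped commas are plain commas again. Empty every line that does not contain a comma, remove the newlines, and delete the result if nothing is left.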
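To see what the first two commands do on their own, they can be run without the cleanup steps. A sketch assuming GNU sed (for \n in the replacement and in y///); split_fields is just an illustrative name:

```shell
# split_fields: hypothetical helper running only the first two commands.
# Escaped commas become newlines, then y/// swaps commas and newlines,
# which leaves one field per line.
split_fields() {
  sed -E 's/\\,/\n/g;y/,\n/\n,/'
}

printf '%s\n' 'john,science\,social,football,florence_school' | split_fields
```

This prints john, science,social, football and florence_school on four separate lines. Only the second line contains a comma, so it is the only one that survives the remaining commands.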
Using Perl: change each \, to some control character, say \x01, and then replace it with , again.
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
$col=$col FS $(col+1)
for (i=col+1; i<NF; i++) {
$i=$(i+1)
}
NF--
}
This might get used thusly:
{
for (col=1; col<=NF; col++) {
while (col < NF && $col ~ /\\$/) {
joincol(col)
}
}
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to remove the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess
I have a semicolon-separated file with 10 fields on each line. I need to extract only the first 5 fields.
Input:
A.txt
1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;
Output file:
B.txt
1;abc ;xyz ;0.0000;3.0;
You can cut from field1-5:
cut -d';' -f1-5 file
If the trailing ; is needed, you can append it with another tool, or use grep (assuming your grep has the -P option):
kent$ grep -oP '^(.*?;){5}' file
1;abc ;xyz ;0.0000;3.0;
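Appending the ; with another tool could look like this. A small sketch using sed for the append; first_five is a made-up name:

```shell
# first_five: hypothetical helper, keeps fields 1-5 and re-adds the
# trailing semicolon that cut drops.
first_five() {
  cut -d';' -f1-5 | sed 's/$/;/'
}

printf '%s\n' '1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;' | first_five
```

This prints 1;abc ;xyz ;0.0000;3.0; for the sample line. Note that sed appends a ; to every line, including lines that had fewer than five fields to begin with.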
In sed you can match the pattern string; 5 times:
sed 's/\(\([^;]*;\)\{5\}\).*/\1/' A.txt
or, when your sed supports -r:
sed -r 's/(([^;]*;){5}).*/\1/' A.txt
cut -f-5 -d";" A.txt > B.txt
Where:
- -f selects the fields (-5 from start to 5)
- -d provides a delimiter, (here the semicolon)
Given that the input is field-based, using awk is another option:
awk 'BEGIN { FS=OFS=";"; ORS=OFS"\n" } { NF=5; print }' A.txt > B.txt
If you're using BSD/macOS, insert $1=$1; after NF=5; to make this work.
FS=OFS=";" sets both the input field separator, FS, and the output field separator, OFS, to a semicolon.
The input field separator is used to break each input record (line) into fields.
The output field separator is used to rebuild the record when individual fields are modified or the number of fields is modified.
ORS=OFS"\n" sets the output record separator to a semicolon followed by a newline, given that a trailing ; should be output.
Simply omit this statement if the trailing ; is undesired.
{ NF=5; print } truncates the input record to 5 fields, by setting NF, the number (count) of fields to 5 and then prints the modified record.
It is at this point that OFS comes into play: the first 5 fields are concatenated to form the output record, using OFS as the separator.
Note: BSD/macOS Awk doesn't modify the record just by setting NF; you must additionally modify a field explicitly for the changed field count to take effect: a dummy operation such as $1=$1 (assigning field 1 to itself) is sufficient.
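If you want to avoid assigning to NF altogether, the first fields can be concatenated in a loop. This is a sketch of that alternative, not the approach above; first_n_fields and its optional count argument are made up for illustration:

```shell
# first_n_fields: hypothetical helper; prints the first n fields
# (default 5) of each semicolon-separated line, each followed by ";".
first_n_fields() {
  awk -F';' -v n="${1:-5}" '{
    out = ""
    for (i = 1; i <= n && i <= NF; i++)
      out = out $i ";"
    print out
  }'
}

printf '%s\n' '1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;' | first_n_fields
```

For the sample line this prints 1;abc ;xyz ;0.0000;3.0; and it does not depend on how a particular awk handles changes to NF.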
awk '{print $1,$2,$3}' A.txt >B.txt
1;abc ;xyz ;0.0000;3.0;
I have multiple tab delimited files with the same column headers. However, the headers (1st row of the files) are delimited by white spaces instead of tabs. How can I convert the white space to tab on first line of a tab delimited file?
You can use sed for one line only:
sed -i.bak $'1s/ /\t/g' file.csv
Sounds like you can use awk:
awk -v OFS='\t' 'NR == 1 { $1 = $1 } 1' file
Assigning the first field of the first line $1 to itself causes awk to reformat the line, inserting the output field separator OFS (defined as a tab character). 1 is the shortest true condition, so awk does the default: { print } for every line.
To overwrite "in-place", use a temp file:
awk -v OFS='\t' 'NR == 1 { $1 = $1 } 1' file > tmp && mv tmp file
Note that this will interpret any number of spaces as a single field separator.
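A quick way to check the behaviour on made-up data (the column names and the fix_header name are only for illustration):

```shell
# fix_header: hypothetical wrapper around the awk command above.
fix_header() {
  awk -v OFS='\t' 'NR == 1 { $1 = $1 } 1'
}

# Header separated by runs of spaces, data row separated by tabs.
printf 'id  name   score\n1\tann lee\t9\n' | fix_header
```

The header comes out as id, name and score separated by single tabs, while the data row, including the space inside "ann lee", is printed unchanged. A header name that itself contains a space would be split in two, so this only fits headers without embedded spaces.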
I have the requirement to convert row string data to column format and pre/postfix specific strings. The data string in file has 4 major fixed columns (separated by ";") and each column is further divided in two sections (separated by ":").
E.g.
Source data file:
A100:T100;B100:T200;A200:T300;B200:T400
Output from file should be:
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
Meanwhile I am trying with following code:
String="A100:T100;B100:T200;A200:T300;B200:T400"
> File.txt
for deploy in $(echo $String | tr ";" "\n")
do
echo $deploy >> File.txt
done
cat File.txt | awk 'BEGIN { FS=":"; OFS=":" } NR==1{ print "TABa:BatchID="$1,$2 } NR==2{ print "TABb:BatchID="$1,$2 }'
printf handles this:
$ awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';' File.txt
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
How it works
-F:
This sets the field separator to a colon: :.
sub(/\n/,"")
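This removes a newline from the record. With RS=';' the newline at the end of the input line would otherwise stay attached to the last record.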
printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR
This does all the work. It makes use of the record number, NR, and the first and second fields and prints the output that you want.
RS=';'
This tells awk to use a semicolon, ;, as the record separator.
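The TAB%c part relies on %c printing the character for a numeric code: 97 is a, so NR+96 yields a, b, c, ... for records 1, 2, 3, ... A tiny sketch to try this on its own (tab_labels is a made-up name):

```shell
# tab_labels: hypothetical helper showing how NR+96 maps to letters.
tab_labels() {
  awk 'BEGIN {
    for (n = 1; n <= 4; n++)
      printf "%d -> TAB%c\n", n, n + 96   # 97 = "a", 98 = "b", ...
  }'
}

tab_labels
```

This prints 1 -> TABa through 4 -> TABd, one per line.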
I am very new to bash programming and need to convert a single text column to a single row and then separate the entries in the row based on a pattern.
I have a text document with a column that has one letter followed by six digits on each line:
a111111
b222222
c333333
d444444
e555555
I need to transform the column above into the following row:
'a111111','b222222','c333333','d444444','e555555'
Could someone please advise how this can be achieved?
You can use awk with printf:
awk -v ORS=, 'NR>1{printf "%s", ORS} {printf "\x27%s\x27", $0}' file
\x27 prints a single quote.
From the 2nd record onwards it first prints ORS (which is set to a comma), and then the quoted line is printed.
Output:
'a111111','b222222','c333333','d444444','e555555'
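A variant of the same idea, sketched with a separator variable that is empty for the first record, a quote passed in with -v instead of the \x27 escape, and a final newline. The names quote_join, q and sep are made up for illustration:

```shell
# quote_join: hypothetical helper; joins stdin lines as 'a','b','c'.
quote_join() {
  awk -v q="'" '
    { printf "%s%s%s%s", sep, q, $0, q; sep = "," }  # sep is empty on line 1
    END { if (NR) print "" }                         # terminate the row
  '
}

printf '%s\n' a111111 b222222 c333333 | quote_join
```

This prints 'a111111','b222222','c333333' followed by a newline.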
Another approach:
sed -r 's/^|$/\x27/g' file | paste -sd,
sed adds the single quotes at the beginning and end of each line, and paste joins the lines together with commas.
Or, print a comma for each line, and when you're done back up 1 character and overwrite the last comma with a space:
awk '{printf "'\''%s'\'',", $0} END {printf "\b \n"}' file