Ignore comma after backslash in a line in a text file using awk or sed - bash

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the fields while treating the escaped commas as part of the field, not as delimiters. Here those will be fields 2 and 3, like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?

Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since the first gsub is performed on the whole record (i.e. $0), awk is forced to recompute the fields. But the second one is performed only on the second field (i.e. $2), so it does not affect the other fields. See: Changing Fields.
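A minimal illustration of that recomputation (my own example, not from the answer): because gsub() modifies $0, awk re-splits the record with FS, so once no separator remains, $1 holds the whole modified line:
$ echo 'a-b-c' | awk -F'-' '{ gsub(/-/, ":"); print $1 }'
a:b:c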
To be able to extract multiple fields with properly escaped commas, you need to gsub the \ns back in all fields with a for loop, as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.

You could replace the \, sequences with another character that won't appear in your text, split the text around the remaining commas, then replace the chosen character with commas:
sed $'s/\\\,/\31/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\31' ','
In this case an ASCII control character ($'\31', an octal escape in bash) is used, which I'm pretty sure your input won't contain.

Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[#]}"
echo "list_of_sports : ${list_of_sports[#]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than the awk solutions.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character, then use that character to split the second and third fields into their elements.

This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace the escaped commas with newlines, then swap commas and newlines (so the escaped commas come back and the real delimiters become newlines). Empty every line in the pattern space that does not contain a comma, remove the remaining newlines, and delete the result if it is empty.
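Tracing the second sample record through the five commands may help (my own walk-through; \n marks an embedded newline in the pattern space, and the Eg2: prefix is dropped for readability):
james,painting,tennis\,ping_pong\,chess,highmount_school   # input
james,painting,tennis\nping_pong\nchess,highmount_school   # after s/\\,/\n/g
james\npainting\ntennis,ping_pong,chess\nhighmount_school  # after y/,\n/\n,/
\n\ntennis,ping_pong,chess\n                               # after s/^[^,]*$//Mg
tennis,ping_pong,chess                                     # after s/\n//g; survives /^$/d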

Using Perl. Change the \, to some control character, say \x01, and then replace it with , again:
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
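The same idea fits in a single perl process by splitting manually after the substitution (my variation, not part of the answer above):
$ perl -lne ' s/\\,/\x01/g; for (split /,/) { if( /\x01/ ) { s/\x01/,/g ; print } } ' laxman.txt
science,social
tennis,ping_pong,chess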

You can perhaps join columns with a function.
function joincol(col, i) {
    $col = $col FS $(col+1)
    for (i = col+1; i < NF; i++) {
        $i = $(i+1)
    }
    NF--
}
This might get used thusly:
{
    for (col = 1; col <= NF; col++) {
        while ($col ~ /\\$/) {
            joincol(col)
        }
    }
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSD awk and GNU awk. YMMV. May contain nuts.
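Put together, an invocation might look like this (printing field 2, guarding against the header with NR>1, and stripping the leftover backslashes are my additions, to match the question's desired output):
$ awk -F',' '
function joincol(col, i) {
    $col = $col FS $(col+1)
    for (i = col+1; i < NF; i++) {
        $i = $(i+1)
    }
    NF--
}
NR > 1 {
    for (col = 1; col <= NF; col++) {
        while ($col ~ /\\$/) {
            joincol(col)
        }
    }
    gsub(/\\/, "", $2)   # drop the escaping backslashes
    print $2
}' file
science,social
painting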

Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub() to strip the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

Related

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have an input.txt file which needs to be formatted by a shell script with the following conditions:
remove the first two lines and the last two lines
remove all spaces in each line (each line has two spaces at the beginning and one space at the end)
each line should be within single quotes ('')
at last, replace each newline ($) with a comma
input.txt (original):
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using the following commands:
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But I get a blank output.txt.
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs the lines starting with the 3rd line, i.e. it skips the first two.
head -n -2 outputs lines excluding the last 2 lines.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespaces and prepends
a single quote.
Similarly the sed command "s/[[:blank:]]*$/'/" removes trailing
whitespaces and appends a single quote.
The syntax <(command ..) is a process substitution and the
output of the commands within the parentheses is fed to the mapfile
via the redirect.
mapfile -t ary reads lines from the standard input into the array
variable named ary.
echo "${ary[*]}" expands to a single string with the contents of
the array ary separated by the value of IFS, which is just assigned
to a comma.
The assignment of IFS and the array expansion are enclosed in parentheses so they are executed in a subshell. This prevents IFS from being modified in the current shell.
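To land the result in output.txt as the question asks, just redirect the subshell:
$ (IFS=,; echo "${ary[*]}") > output.txt
$ cat output.txt
'Abce','Bca','Efr'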
With your shown samples, please try the following awk program. Written and tested in GNU awk; it should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if its 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 1))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
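A sketch of such a refactor (the same two-variable delay as the p1/p2 answer shown earlier): print each line two records late, trimmed and quoted, so the two trailing junk lines are never emitted and at most two lines are held in memory.
awk 'FNR > 4 { sub(/^ +/, "", p2); sub(/ +$/, "", p2)
               printf "%s\047%s\047", sep, p2; sep = "," }
     { p2 = p1; p1 = $0 }
     END { print "" }' input.txt > output.txt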
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations, to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side, if the ---- pattern is at a different line number, it still works. The downside is that it breaks if that pattern (a line containing only dashes) occurs an odd number of times (i.e. not in pairs that wrap the lines you want).
Then sub line start and end (including white space) with single quotes.
Finally pipe to paste to sub the new lines with commas, excluding a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate to a single, comma delimited strings.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^ /'/p" input.txt | paste -sd ',' > output.txt
-n suppress printing lines unless told to
s/\s*$/'/ replace trailing whitespace with single quotes
s/^  /'/p replace two leading spaces and print lines that match
paste to concat
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one-liner, but it works:
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" input.txt | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$//' ---> Loads entire file and merge all lines and remove last comma

AWK Finding a way to print lines containing a word from a comma separated string

I want to write a bash script that only prints lines that, in their second column, contain a word from a semicolon separated string. Example:
words="abc;def;ghi;jkl"
>cat log1.txt
hello;abc;1234
house;ab;987
mouse;abcdef;654
What I want is to print only lines that contain a whole word from the "words" variable. That means that "ab" won't match, and neither will "abcdef". It seems so simple, yet after trying for many, many hours, I was unable to find a solution.
For example, I tried this as my awk command, but it matched any substring.
-F \; -v b="TSLA;NVDA" 'b ~ $2 { print $0 }'
I will appreciate any help. Thank you.
EDIT:
A sample input would look like this
1;UNH;buy;344.74
2;PG;sell;138.60
3;MSFT;sell;237.64
4;TSLA;sell;707.03
A variable like this would be set
filter="PG;TSLA"
And according to this filter, I want to echo these lines
2;PG;sell;138.60
4;TSLA;sell;707.03
Grep is a good choice here:
grep -Fw -f <(tr ';' '\n' <<<"$words") log1.txt
With awk I'd do
awk -F ';' -v w="$words" '
BEGIN {
n = split(w, a, /;/)
# next line moves the words into the _index_ of an array,
# to make the file processing much easier and more efficient
for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' log1.txt
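For example, with the data from the edit (the file name trades.txt is an assumption):
$ filter="PG;TSLA"
$ awk -F ';' -v w="$filter" '
BEGIN {
    n = split(w, a, /;/)
    for (i=1; i<=n; i++) words[a[i]]=1
}
$2 in words
' trades.txt
2;PG;sell;138.60
4;TSLA;sell;707.03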
You may use this awk:
words="abc;def;ghi;jkl"
awk -F';' -v s=";$words;" 'index(s, FS $2 FS)' log1.txt
hello;abc;1234

Unix row to column format with string prefix and post fix

I have the requirement to convert row string data to column format and pre/postfix specific strings. The data string in file has 4 major fixed columns (separated by ";") and each column is further divided in two sections (separated by ":").
E.g.
Source data file:
A100:T100;B100:T200;A200:T300;B200:T400
Output from file should be:
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
Meanwhile I am trying the following code:
String="A100:T100;B100:T200;A200:T300;B200:T400"
> File.txt
for deploy in $(echo $String | tr ";" "\n")
do
echo $deploy >> File.txt
done
cat File.txt | awk 'BEGIN { FS=":"; OFS=":" } NR==1{ print "TABa:BatchID="$1,$2 } NR==2{ print "TABb:BatchID="$1,$2 }'
printf handles this:
$ awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';' File.txt
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4
How it works
-F:
This sets the field separator to a colon (:).
sub(/\n/,"")
This removes newline characters.
printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR
This does all the work. It makes use of the record number, NR, and the first and second fields and prints the output that you want.
RS=';'
This tells awk to use a semicolon, ;, as the record separator.
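Since the data already lives in a shell variable, the temporary File.txt can be skipped by feeding the string straight to awk (a small variation on the answer above):
$ String="A100:T100;B100:T200;A200:T300;B200:T400"
$ printf '%s\n' "$String" |
  awk -F: '{sub(/\n/,""); printf "TAB%c:BatchID=%s:TagId=%s:ProcId=%i\n",(NR+96),$1,$2,NR }' RS=';'
TABa:BatchID=A100:TagId=T100:ProcId=1
TABb:BatchID=B100:TagId=T200:ProcId=2
TABc:BatchID=A200:TagId=T300:ProcId=3
TABd:BatchID=B200:TagId=T400:ProcId=4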

How do I convert a tab-separated values (TSV) file to a comma-separated values (CSV) file in BASH?

I have some TSV files that I need to convert to CSV files. Is there any solution in BASH, e.g. using awk, to convert these? I could use sed, like this, but am worried it will make some mistakes:
sed 's/\t/,/g' file.tsv > file.csv
Quotes needn't be added.
How can I convert a TSV to a CSV?
Update: The following solutions are not generally robust, although they do work in the OP's specific use case; see the bottom section for a robust, awk-based solution.
To summarize the options (interestingly, they all perform about the same):
tr:
devnull's solution (provided in a comment on the question) is the simplest:
tr '\t' ',' < file.tsv > file.csv
sed:
The OP's own sed solution is perfectly fine, given that the input contains no quoted strings (with potentially embedded \t chars.):
sed 's/\t/,/g' file.tsv > file.csv
The only caveat is that on some platforms (e.g., macOS) the escape sequence \t is not supported, so a literal tab char. must be spliced into the command string using ANSI quoting ($'\t'):
sed 's/'$'\t''/,/g' file.tsv > file.csv
awk:
The caveat with awk is that FS - the input field separator - must be set to \t explicitly - the default behavior would otherwise strip leading and trailing tabs and replace interior spans of multiple tabs with only a single ,:
awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
Note that simply assigning $1 to itself causes awk to rebuild the input line using OFS - the output field separator; this effectively replaces all \t chars. with , chars. print then simply prints the rebuilt line.
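A quick side-by-side illustration of that rebuild effect (my own example):
$ printf 'a\tb\tc\n' | awk 'BEGIN { FS="\t"; OFS="," } { print }'        # $0 untouched, tabs remain
a	b	c
$ printf 'a\tb\tc\n' | awk 'BEGIN { FS="\t"; OFS="," } { $1=$1; print }' # rebuilt with OFS
a,b,c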
Robust awk solution:
As A. Rabus points out, the above solutions do not handle unquoted input fields that themselves contain , characters correctly - you'll end up with extra CSV fields.
The following awk solution fixes this, by enclosing such fields in "..." on demand (see the non-robust awk solution above for a partial explanation of the approach).
If such fields also have embedded " chars., these are escaped as "", in line with RFC 4180. Thanks, Wyatt Israel.
awk 'BEGIN { FS="\t"; OFS="," } {
rebuilt=0
for(i=1; i<=NF; ++i) {
if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
gsub("\"", "\"\"", $i)
$i = "\"" $i "\""
rebuilt=1
}
}
if (!rebuilt) { $1=$1 }
print
}' file.tsv > file.csv
$i ~ /[,"]/ && $i !~ /^".*"$/ detects any field that contains , and/or " and isn't already enclosed in double quotes
gsub("\"", "\"\"", $i) escapes embedded " chars. by doubling them
$i = "\"" $i "\"" updates the result by enclosing it in double quotes
As stated before, updating any field causes awk to rebuild the line from the fields with the OFS value, i.e., , in this case, which amounts to the effective TSV -> CSV conversion; flag rebuilt is used to ensure that each input record is rebuilt at least once.
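A quick check of the robust script on a line containing both an embedded comma and embedded quotes (my own test data):
$ printf '1\tsays "hi"\tred, green\n' |
  awk 'BEGIN { FS="\t"; OFS="," } {
    rebuilt=0
    for(i=1; i<=NF; ++i) {
      if ($i ~ /[,"]/ && $i !~ /^".*"$/) {
        gsub("\"", "\"\"", $i)
        $i = "\"" $i "\""
        rebuilt=1
      }
    }
    if (!rebuilt) { $1=$1 }
    print
  }'
1,"says ""hi""","red, green"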
This can also be achieved with Perl:
In order to pipe the results to a new output file you can use the following:
perl -wnlp -e 's/\t/,/g;' input_file.tsv > output_file.csv
If you'd like to edit the file in place, you can invoke the -i option:
perl -wnlpi -e 's/\t/,/g;' input_file.txt
If by some chance you find that what you are dealing with is not actually tabs, but instead multiple spaces, you can use the following to replace each occurrence of two or more spaces with a comma:
perl -wnlpi -e 's/\s+/,/g;' input_file
Keep in mind that \s represents any whitespace character, including spaces, tabs, or newlines, and cannot be used in the replacement string.
Using awk works for me
converting tsv to csv
awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' file.tsv > file.csv
or converting csv to tsv
awk 'BEGIN { FS=","; OFS="\t" } {$1=$1; print}' file.csv > file.tsv
The tr command :
tr '\t' ',' < file.tsv > file.csv
is simple and gave absolutely correct and very quick results for me even on a really large file (approx 10 GB).
You can simply use the power of sed in shell:
sed -r 's/\t/","/g' file.tsv|sed -r 's/(^|$)/"/g' > file.csv
In general, the above command turns your TSV file into CSV. However, the TSV file may contain numerical fields, which shouldn't be surrounded by " like "123456". So we need another phase in which such double quotes are removed. The final solution:
sed -r 's/\t/","/g' file.tsv|sed -r 's/(^|$)/"/g'|sed -r 's/"([0-9]+)"/\1/g' > file.csv

Trim leading and trailing spaces from a string in awk

I'm trying to remove leading and trailing space in 2nd column of the below input.txt:
Name, Order  
Trim, working
cat,cat1
I have used the below awk to remove leading and trailing space in 2nd column but it is not working. What am I missing?
awk -F, '{$2=$2};1' input.txt
This gives the output as:
Name, Order  
Trim, working
cat,cat1
Leading and trailing spaces are not removed.
If you want to trim all spaces, only in lines that have a comma, and use awk, then the following will work for you:
awk -F, '/,/{gsub(/ /, "", $0); print} ' input.txt
If you only want to remove spaces in the second column, change the expression to
awk -F, '/,/{gsub(/ /, "", $2); print$1","$2} ' input.txt
Note that gsub substitutes whatever matches the regular expression in // with the second expression, in the variable that is the third parameter - and does so in place - in other words, when it's done, $0 (or $2) has been modified.
Full explanation:
-F, use comma as field separator
(so the thing before the first comma is $1, etc)
/,/ operate only on lines with a comma
(this means empty lines are skipped)
gsub(a,b,c) match the regular expression a, replace it with b,
and do all this with the contents of c
print$1","$2 print the contents of field 1, a comma, then field 2
input.txt use input.txt as the source of lines to process
EDIT I want to point out that @BMW's solution is better, as it actually trims only leading and trailing spaces with two successive gsub commands. Whilst giving credit I will give an explanation of how it works.
gsub(/^[ \t]+/,"",$2); - starting at the beginning (^) replace all (+ = zero or more, greedy)
consecutive tabs and spaces with an empty string
gsub(/[ \t]+$/,"",$2)} - do the same, but now for all space up to the end of string ($)
1 - ="true". Shorthand for "use default action", which is print $0
- that is, print the entire (modified) line
remove leading and trailing white space in 2nd column
awk 'BEGIN{FS=OFS=","}{gsub(/^[ \t]+/,"",$2);gsub(/[ \t]+$/,"",$2)}1' input.txt
another way by one gsub:
awk 'BEGIN{FS=OFS=","} {gsub(/^[ \t]+|[ \t]+$/, "", $2)}1' infile
Warning by @Geoff: see my note below, only one of the suggestions in this answer works (though on both columns).
I would use sed:
sed 's/, /,/' input.txt
This will remove one leading space after the ,.
Output:
Name,Order
Trim,working
cat,cat1
More general might be the following, it will remove possibly multiple spaces and/or tabs after the ,:
sed 's/,[ \t]*/,/g' input.txt
It will also work with more than two columns because of the global modifier /g
@Floris asked in the discussion for a solution that removes leading and trailing whitespace in each column (even the first and last) while not removing whitespace in the middle of a column:
EDIT by @Geoff: I've appended the input file name to this one, and now it only removes all leading & trailing spaces (though from both columns). The other suggestions within this answer don't work. But try: " Multiple spaces , and 2 spaces before here "
IMO sed is the optimal tool for this job. However, here comes a solution with awk because you've asked for that:
awk -F', ' '{printf "%s,%s\n", $1, $2}' input.txt
Another simple solution that comes in mind to remove all whitespaces is tr -d:
cat input.txt | tr -d ' '
I just came across this. The correct answer is:
awk 'BEGIN{FS=OFS=","} {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2)} 1'
just use a regex as a separator:
', *' - for leading spaces
' *,' - for trailing spaces
for both leading and trailing:
awk -F' *,? *' '{print $1","$2}' input.txt
Simplest solution is probably to use tr
$ cat -A input
^I Name, ^IOrder $
Trim, working $
cat,cat1^I
$ tr -d '[:blank:]' < input | cat -A
Name,Order$
Trim,working$
cat,cat1
The following seems to work:
awk -F',[[:blank:]]*' '{$2=$2}1' OFS="," input.txt
If it is safe to assume only one set of spaces in column two (which is the original example):
awk '{print $1$2}' /tmp/input.txt
Adding another field, e.g. awk '{print $1$2$3}' /tmp/input.txt will catch two sets of spaces (up to three words in column two), and won't break if there are fewer.
If you have an indeterminate (large) number of space delimited words, I'd use one of the previous suggestions, otherwise this solution is the easiest you'll find using awk.
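With the question's sample file this prints (a quick check under that assumption):
$ awk '{print $1$2}' /tmp/input.txt
Name,Order
Trim,working
cat,cat1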
