Replace pipe with comma except between curly braces in CSV in bash - bash

I need a way to replace pipes with commas in one specific column of a CSV file. That column also holds a key whose value is a pipe-separated list of strings (one or more) wrapped in curly braces.
Only pipes that are not within curly braces should be replaced, i.e. {subStringX441|subStringX442|subStringX443|subStringX444} should remain untouched.
I can't use a simple sed -i -e 's/|/,/g' filename, as it would replace all pipes.
Input:
column1,column2,column3,column4,column5,column6,column7
stringX1,stringX2,stringX3,stringX41|stringX42|stringX43|stringX44={subStringX441|subStringX442|subStringX443|subStringX444}|stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41|stringY42|stringY43|stringY44={subStringY441|subStringY442|subStringY443}|stringY45,stringY5,stringY6,stringY7
Desired Output:
column1,column2,column3,column4a,column4b,column4c,column4d,column4e,column5,column6,column7
stringX1,stringX2,stringX3,stringX41,stringX42,stringX43,stringX44={subStringX441|subStringX442|subStringX443|subStringX444},stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44={subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7

Using sed
$ sed 's/\({[^}]*\)\||/,\1/g;s/,{/{/;1s/column4/&a,&b,&c,&d,&e/' input_file
column1,column2,column3,column4a,column4b,column4c,column4d,column4e,column5,column6,column7
stringX1,stringX2,stringX3,stringX41,stringX42,stringX43,stringX44={subStringX441|subStringX442|subStringX443|subStringX444},stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44={subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7

Regular expressions (in the strict sense) are not enough for dealing with balanced brackets (those require at least a Chomsky Type-2, i.e. context-free, grammar). I would use GNU AWK for this task in the following way. Let the content of file.txt be
stringY1,stringY2,stringY3,stringY41|stringY42|stringY43|stringY44
{subStringY441|subStringY442|subStringY443}|stringY45,stringY5,stringY6,stringY7
then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=1;i<=NF;i+=1){if($i=="{"){inside=1};if($i=="}"){inside=0};if(!inside && $i=="|"){$i=","}};print}' file.txt
output
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44
{subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7
Explanation: I inform GNU AWK that any single character is to be treated as a field using the FPAT variable, and that the output field separator is the empty string using the OFS variable. For every line I go through the fields (i.e. characters) with a for loop: if the character is { I set the variable inside to 1, if the character is } I set it to 0, and if we are not (!) inside and (&&) the character is |, I change it to ,. After processing all characters in the line I print it.
DISCLAIMER: this solution assumes that curly brackets are never nested and that every { has a matching } on the same line.
(tested in gawk 4.2.1)
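For awks without FPAT, the same character-by-character idea can be sketched with substr(). This is only a sketch: the function name pipes_to_commas is made up for illustration, and it resets the brace state on every line, so an unmatched { cannot leak into the next record.

```shell
# Hypothetical helper: reads text on stdin, writes the converted text to stdout.
pipes_to_commas() {
  awk '{
    out = ""; inside = 0              # reset the brace state for each line
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)            # current character
      if (c == "{") inside = 1
      else if (c == "}") inside = 0
      else if (c == "|" && !inside) c = ","
      out = out c
    }
    print out
  }'
}

printf '%s\n' 'a|b={x|y}|c' | pipes_to_commas   # prints a,b={x|y},c
```

Like the gawk version, it assumes braces are never nested.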

This might work for you (GNU sed):
sed ':a;s/\({[^|}]*\)|\([^}]*}\)/\1\n\2/g;ta;y/\n|/|,/' file
Replace |'s between {...}'s with newlines, then translate newlines to |'s and |'s to ,'s.

Related

AWK match exact string inside square brackets

I have a file similar to the below-illustrated data.
https://www.test.example.com [503]
https://www.tst.example.com [403]
https://www.tt.example.com [302]
I want to fetch lines that match with the second column. For example, lines matching [403] should print only https://www.tst.example.com.
I tried escaping the square brackets with the below command, which gave me a warning.
$ awk -F "$2 == '\[403]\'" file.txt
awk: warning: escape sequence `\[' treated as plain `['
awk: warning: escape sequence `\'' treated as plain `''
You are mixing regular expressions and plain strings. [ is a regex special character, but you are not using a regex here, just a literal string comparison. You don't need any escaping at all (though you might want to reverse the usage of single and double quotes for simplicity, unless you are actually using Windows).
awk '$2 == "[403]"' file.txt
In basically all the Unix shells, the double quotes you used don't protect dollar signs, so $2 would be substituted by the shell, probably with nothing, or else with some unrelated string (whatever got passed in as the second command-line argument to the shell).
The -F option, if present, requires an argument; but based on your example data, the default field separator - any sequence of whitespace - should work fine. Note that -F ' ' is the same as the default; if you want to force splitting on exactly one space, try -F '[ ]'.
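The point about double quotes can be checked without awk at all. A small sketch, where the positional arguments first and second are made up for illustration:

```shell
# Simulate a shell that was started with two positional arguments.
set -- first second

double="$2 == x"   # the shell expands $2 before awk would ever see it
single='$2 == x'   # single quotes pass the text through untouched

printf '%s\n' "$double" "$single"
# prints:
# second == x
# $2 == x
```

Only the single-quoted form hands awk a program that still contains $2.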
Could you please try the following, written and tested with the shown samples in GNU awk.
awk -F'([[:space:]]*)?\\[|\\]([[:space:]]*)?' '$2=="403"{print $1}' Input_file
Explanation: Set the field separator to either optional spaces followed by [, or ] followed by optional spaces, for all lines. Then check whether the 2nd field is 403 and, if so, print the first field, as per the OP's request.
The following will do what you want, with the benefit of letting you pass the desired code as an argument rather than hardcoding it in the awk script.
awk -v http_code=403 '$2 == "["http_code"]"' file.txt
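If only the URL is wanted, the -v approach could be wrapped in a small shell function. This is a sketch; the name urls_with_status is hypothetical, and it assumes the two-column layout shown in the question.

```shell
# Hypothetical helper: print URLs whose bracketed status matches the first argument.
# Usage: urls_with_status CODE [FILE...]   (reads stdin when no file is given)
urls_with_status() {
  code=$1; shift
  # Plain string comparison: "[" code "]" is concatenated, no regex involved.
  awk -v code="$code" '$2 == "[" code "]" { print $1 }' "$@"
}

printf '%s\n' \
  'https://www.test.example.com [503]' \
  'https://www.tst.example.com [403]' \
  'https://www.tt.example.com [302]' | urls_with_status 403
# prints https://www.tst.example.com
```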

sed print more than one matches in a line

I have a file, including some strings and variables, like:
${cat.mouse.dog}
bird://localhost:${xfire.port}/${plfservice.url}
bird://localhost:${xfire.port}/${spkservice.synch.url}
bird://localhost:${xfire.port}/${spkservice.asynch.request.url}
${soabp.protocol}://${hpc.reward113.host}:${hpc.reward113.port}
${configtool.store.folder}/config/hpctemplates.htb
I want to print all the strings between "{}". In some lines there are more than one such string and in this case they should remain in the same line. The output should be:
cat.mouse.dog
xfire.port plfservice.url
xfire.port spkservice.synch.url
xfire.port spkservice.asynch.request.url
soabp.protocol hpc.reward113.host hpc.reward113.port
configtool.store.folder
I tried the following:
sed -n 's/.*{//;s/}.*//p' filename
but it printed only the last occurrence of each line. How can I get all the occurrences, remaining in the same line, as in the original file?
This might work for you (GNU sed):
sed -n 's/${/\n/g;T;s/[^\n]*\n\([^}]*\)}[^\n]*/\1 /g;s/ $//p' file
Replace all ${ by newlines, and if there are none, move on as there is nothing to process. If there are newlines, remove the non-newline characters to the left of each one and the non-newline characters to the right of the next }, globally. To finish off, remove the extra trailing space introduced by the RHS of the global substitution.
If you're not against awk, you can try the following:
awk -v RS='{|}' -v ORS=' ' '/\n/{printf "\n"} (NR+1)%2' file
The record separator RS is set to either { or }. This splits the wanted pattern from the rest.
The script then displays 1 record out of 2 with the statement (NR+1)%2.
In order to keep the alignment as expected, the output record separator is set to a space (ORS=' '), and every time a newline is encountered the statement /\n/{printf "\n"} inserts one.
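A variant that does not depend on a regex RS is to loop over each line with match(). This is a sketch under the assumption that the variables always look like ${name} with no nested braces; the function name extract_vars is made up.

```shell
# Hypothetical helper: print the names inside every ${...} on each line,
# space-separated, skipping lines that contain none.
extract_vars() {
  awk '{
    line = $0; out = ""
    # match() sets RSTART and RLENGTH for the leftmost ${...} left in line
    while (match(line, /[$][{][^}]*[}]/)) {
      name = substr(line, RSTART + 2, RLENGTH - 3)   # drop the ${ and the }
      out = (out == "" ? name : out " " name)
      line = substr(line, RSTART + RLENGTH)          # continue after the match
    }
    if (out != "") print out
  }'
}

printf '%s\n' 'bird://localhost:${xfire.port}/${plfservice.url}' | extract_vars
# prints xfire.port plfservice.url
```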

unterminated address regex while using sed

I am trying to use the sed command to find and print the number that appears between "\MP2=" and "\" in a portion of a line that appears like this in a large .log file
\MP2=-193.0977448\
I am using the command below and getting the following error:
sed "/\MP2=/,/\/p" input.log
sed: -e expression #1, char 12: unterminated address regex
Advice on how to alter this would be greatly appreciated!
Superficially, you just need to double up the backslashes (and it's generally best to use single quotes around the sed program):
sed '/\\MP2=/,/\\/p' input.log
Why? The double-backslash is necessary to tell sed to look for one backslash. The shell also interprets backslashes inside double quoted strings, which complicates things (you'd need to write 4 backslashes to ensure sed sees 2 and interprets it as 'look for 1 backslash') — using single quoted strings avoids that problem.
However, the /pat1/,/pat2/ notation refers to two separate lines. It looks like you really want:
sed -n '/\\MP2=.*\\/p' input.log
The -n suppresses the default printing (probably a good idea on the first alternative too), and the pattern looks for a single line containing \MP2= followed eventually by a backslash.
If you want to print just the number (as the question says), then you need to work a little harder. You need to match everything on the line, but capture just the 'number' and remove everything except the number before printing what's left (which is just the number):
sed -n '/.*\\MP2=\([^\]*\)\\.*/ s//\1/p' input.log
You don't need the double backslash in the [^\] (negated) character class, though it does no harm.
If the starting and ending pattern are on the same line, you need a substitution. The range expression /r1/,/r2/ is true from (an entire) line which matches r1, through to the next entire line which matches r2.
You want this instead:
sed -n 's/.*\\MP2=\([^\\]*\)\\.*/\1/p' file
This extracts just the match, by replacing the entire line with just the match (the escaped parentheses create a group which you can refer back to in the substitution; this is called a back reference. Some sed dialects don't want backslashes before the grouping parentheses.)
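A quick way to check the substitution is to feed it a sample line. In this sketch only the \MP2=-193.0977448\ fragment comes from the question; the surrounding fields are invented for illustration.

```shell
# Sample line: the RMSD and PG fields are made up, the MP2 fragment is from the question.
sample='\RMSD=1.2e-09\MP2=-193.0977448\PG=C01'

# printf with %s passes the backslashes through unchanged (echo might not).
mp2=$(printf '%s\n' "$sample" | sed -n 's/.*\\MP2=\([^\\]*\)\\.*/\1/p')

printf '%s\n' "$mp2"   # prints -193.0977448
```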
awk is a better tool for this:
awk -F= '$1=="MP2" {print $2}' RS='\' input.log
Set the record separator to \ and the field separator to '=', and it's pretty trivial.

Ignoring lines with blank or space after character using sed

I am trying to use sed to extract some assignments being made in a text file. My text file looks like ...
color1=blue
color2=orange
name1.first=Ahmed
name2.first=Sam
name3.first=
name4.first=
name5.first=
name6.first=
Currently, I am using sed to print all the strings after the name#.first's ...
sed 's/name.*.first=//' file
But of course, this also prints all of the lines with no assignment ...
Ahmed
Sam
# I'm just putting this comment here to illustrate the extra carriage returns above; please ignore it
Is there any way I can get sed to ignore the lines with blank or whitespace only assignments and store this to an array? The number of assigned name#.first's is not known, nor are the number of assignments of each type in general.
This is a slight variation on sputnick's answer:
sed -n '/^name[0-9]\.first=\(.\+\)/ s//\1/p'
The first part (/^name[0-9]\.first=\(.\+\)/) selects the lines you want to pass to the s/// command. The empty pattern in the s command re-uses the previous regular expression and the replacement portion (\1) replaces the entire match with the contents of the first parenthesized part of the regex. Use the -n and p flags to control which lines are printed.
sed -n 's/^name[0-9]\.\w\+=\(\w\+\)/\1/p' file
Output
Ahmed
Sam
Explanations
the -n switch suppresses the default behavior of sed: printing all lines
s/// is the skeleton for a substitution
^ match the beginning of a line
name literal string
[0-9] a digit alone
\.\w\+ a literal dot (without the backslash it would mean any character) followed by at least one (\+) word character [a-zA-Z0-9_]
( ) is a capturing group and \1 is the captured group
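To also skip values that consist only of whitespace, and to allow multi-digit numbers, the pattern could be tightened as in the sketch below. The function name first_names is hypothetical, and the regex assumes keys of the exact form name<digits>.first.

```shell
# Hypothetical helper: print the non-blank values of name<N>.first assignments.
# Usage: first_names [FILE...]   (reads stdin when no file is given)
first_names() {
  # [[:space:]]* skips leading blanks; [^[:space:]] requires at least one
  # visible character, so empty and whitespace-only values never match.
  sed -n 's/^name[0-9][0-9]*\.first=[[:space:]]*\([^[:space:]].*\)$/\1/p' "$@"
}

printf '%s\n' 'color1=blue' 'name1.first=Ahmed' 'name2.first=' \
  'name3.first=   ' 'name12.first=Sam' | first_names
# prints:
# Ahmed
# Sam
```

In bash 4 or later, the output could then be collected into an array with something like mapfile -t names < <(first_names file).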

Explained shell statement

The following statement will remove line numbers in a txt file:
cat withLineNumbers.txt | sed 's/^.......//' >> withoutLineNumbers.txt
The input file is created with the following statement (this one i understand):
nl -ba input.txt >> withLineNumbers.txt
I know the functionality of cat and i know the output is written to the 'withoutLineNumbers.txt' file. But the part of '| sed 's/^.......//'' is not really clear to me.
Thanks for your time.
That sed regular expression simply removes the first 7 characters from each line. The regular expression ^....... says "Any 7 characters at the beginning of the line." The sed argument s/^.......// substitutes the above regular expression with an empty string.
Refer to the sed(1) man page for more information.
That sed statement says to delete the first 7 characters; a dot "." means any character. There is an even easier way to do this:
awk '{print $2}' withLineNumbers.txt
You just have to print the 2nd column using awk; no regex is needed.
If your data contains spaces, use:
awk '{$1="";print substr($0,2)}' withLineNumbers.txt
sed is doing a search and replace. The 's' means substitute, the next character ('/') is the separator, the search expression is '^.......', and the replace expression is an empty string (i.e. everything between the last two slashes).
The search is a regular expression. The '^' means match start of line. Each '.' means match any character. So the search expression matches the first 7 characters of each line. This is then replaced with an empty string. So what sed is doing is removing the first 7 characters of each line.
A simpler way to achieve the same thing could be:
cut -b8- withLineNumbers.txt > withoutLineNumbers.txt
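Why 7 characters? With default settings, nl -ba writes a 6-character right-aligned number followed by a tab. A small sketch to check that the sed and cut variants agree, using made-up sample text instead of the files from the question:

```shell
# Number three lines (one of them blank) the same way the question does.
numbered=$(printf 'first line\n\nthird line\n' | nl -ba)

# Strip the 7-character prefix in both ways.
via_sed=$(printf '%s\n' "$numbered" | sed 's/^.......//')
via_cut=$(printf '%s\n' "$numbered" | cut -b8-)

printf '%s\n' "$via_sed"
# prints:
# first line
#
# third line
```

Both variants keep the spaces inside a line, which printing only the 2nd awk column would not.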

Resources