Sed is not replacing all occurrences of pattern - bash

I've got a the following variable LINES with the format date;album;song;duration;singer;author;genre.
August 2013;MDNA;Falling Free;00:31:40;Madonna;Madonna;Pop
August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop
August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop
August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
I want to output author-song, so I made this script:
echo $LINES | sed s_"^[^;]*;[^;]*;\([^;]*\);[^;]*;[^;]*;\([^;]*\)"_"\2-\1"_g
The desired output is:
Madonna-Falling Free
Madonna-I don't give a
Madonna-I'm a sinner
Madonna-Give Me All Your Luvin'
However, I am getting this:
Madonna-Falling Free;Madonna;Pop August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
Why?
EDIT: I need to use sed.

When I run your sed script on your input, I get this output:
Madonna-Falling Free;Pop
Madonna-I don't give a;Pop
Madonna-I'm a sinner;Pop
Madonna-Give Me All Your Luvin';Pop
which is fine except for the extra ;Pop - you just need to add .*$ to the end of your regex so that the entire line is replaced.
Based on your reported output, I'm guessing your input file is using a different newline convention from what sed expects.
In any case, this is a pretty silly thing to use sed for. Much better with awk, for instance:
awk 'BEGIN {FS=";";OFS="-"} {print $5,$3}'
or, slightly more tersely,
awk -F\; -vOFS=- '{print $5,$3}'

If you want sed to see more than one line of input, you must quote the variable to echo:
echo "$LINES" | sed ...
Note that I'm not even going to try to evaluate the correctness of your sed script; using sed here is a travesty, given that awk is so much better suited to the task.

It looks like sed is viewing your entire sample text as a single line. So it is performing the operation requested and then leaving the rest unchanged.
I would look into the newline issue first. How are you populating $LINES?
You should also add to the pattern that seventh field in your input (genre), so that the expression actually does consume all of the text that you want it to. And perhaps anchor the end of the pattern on $ or \b (word boundary) or \s (a spacey character) or \n (newline).

If your format is absolutely permanent, just try below:
echo $line | sed 's#.*;.*;\(.*\);.*;.*;\(.*\);.*#\2-\1#'

Related

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant excel file on one line in a text file. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change the third occurance of a comma into a new line so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With Gnu sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to Posix sed, which causes sed to use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to #EdMorton for the note.)
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
Or, ,[1-9][0-9]\{3\} if you don't want to put [0-9] 3 more times for the YYYY part.
PS: This solution will give you only YYYY for the year (even if the data for YYYY is 19811 (typo mistakes if any), you'll still get 1981
You are looking for 3 fragments, each without a comma and separated by a comma.
The last fields can give problems (not ending with a comma and mayby only two fields.
The next command looks fine.
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace every third , with a newline, print ,delete the first line and repeat.

Bash get last sentence in line

assume, we got the following variable containing a string:
text="All of this is one line. But it consists of multiple sentences. Those are separated by dots. I'd like to get this sentence."
I now need the last sentence "I'd like to get this sentence.". I tried using sed:
echo "$text" | sed 's/.*\.*\.//'
I thought it would delete everything up to the pattern .*.. It doesn't.
What's the issue here? I'm sure this can be resolved rather fast, unfortunatly I did not find any solution for this.
Using awk you can do:
awk -F '\\. *' '{print $(NF-1) "."}' <<< "$text"
I'd like to get this sentence.
Using sed:
sed -E 's/.*\.([^.]+\.)$/\1/' <<< "$text"
I'd like to get this sentence.
Don't forget the built-in
echo "${text##*. }"
This requires a space after the full stop, but the pattern is easy to adapt if you don't want that.
As for your failed attempt, the regex looks okay, but weird. The pattern \.*\. looks for zero or more literal periods followed by a literal period, i.e. effectively one or more period characters.

Delete all lines before last case of a string

How would I go about deleting all the lines before the last occurrence of a string. Like if I had a file that looked like
Icecream is good
And
Chocolate is good
And
They have lots of sugar
If I want all lines after and including the last occurrence of "And" what's the cleanest way to do this? Specifically, I want
And
They have lots of sugar
I was doing sed -n -E -e '/And/,$p' file but I see this gives me the first occurrence.
This might work for you (GNU sed):
sed -n '/And/h;//!H;$!d;x;//p' file
Replace anything in the hold space by the line containing And. Append all other lines to the hold space. At the end of the file, swap the pattern space for the hold space and print out the result as long it matches the required string And.
I know that you asked for sed and that Potong provided a good sed solution. But, for comparison, here is an awk solution:
$ awk 's{s=s"\n"$0;} /And/{s=$0;} END{print s;}' file
And
They have lots of sugar
How it works:
s{s=s"\n"$0;}
If the variable s is not empty, then add to it the current line, $0.
/And/{s=$0;}
If the current line contains And, then set s to the current line, $0.
END{print s;}
After we have reached the end of the file, print s.
$ tac file | awk '!f; /And/{f=1}' | tac
And
They have lots of sugar
$ awk 'NR==FNR{if(/And/)nr=NR;next} FNR>=nr' file file
And
They have lots of sugar

How to parse a config file using sed

I've never used sed apart from the few hours trying to solve this. I have a config file with parameters like:
test.us.param=value
test.eu.param=value
prod.us.param=value
prod.eu.param=value
I need to parse these and output this if REGIONID is US:
test.param=value
prod.param=value
Any help on how to do this (with sed or otherwise) would be great.
This works for me:
sed -n 's/\.us\././p'
i.e. if the ".us." can be replaced by a dot, print the result.
If there are hundreds and hundreds of lines it might be more efficient to first search for lines containing .us. and then do the string replacement... AWK is another good choice or pipe grep into sed
cat INPUT_FILE | grep "\.us\." | sed 's/\.us\./\./g'
Of course if '.us.' can be in the value this isn't sufficient.
You could also do with with the address syntax (technically you can embed the second sed into the first statement as well just can't remember syntax)
sed -n '/\(prod\|test\).us.[^=]*=/p' FILE | sed 's/\.us\./\./g'
We should probably do something cleaner. If the format is always environment.region.param we could look at forcing this only to occur on the text PRIOR to the equal sign.
sed -n 's/^\([^,]*\)\.us\.\([^=]\)=/\1.\2=/g'
This will only work on lines starting with any number of chars followed by '.' then 'us', then '.' and then anynumber prior to '=' sign. This way we won't potentially modify '.us.' if found within a "value"

Remove nth character from middle of string using Shell

I've been searching google for ever, and I cannot find an example of how to do this. I also do not grasp the concept of how to construct a regular expression for SED, so I was hoping someone could explain this to me.
I'm running a bash script against a file full of lines of text that look like this: 2222,H,73.82,04,07,2012
and I need to make them all look like this: 2222,H,73.82,04072012
I need to remove the last two commas, which are the 16th and 19th characters in the line.
Can someone tell me how to do that? I was going to use colrm, which is blessedly simple, but i can't seem to get that installed in CYGWIN. Please and thank you!
I'd use awk for this:
awk -F',' -v OFS=',' '{ print $1, $2, $3, $4$5$6 }' inputfile
This takes a CSV file and prints the first, second and third fields, each followed by the output field separator (",") and then the fourth, fifth and sixth fields concatenated.
Personally I find this easier to read and maintain than regular expression-based solutions in sed and it will cope well if any of your columns get wider (or narrower!).
This will work on any string and will remove only the last 2 commas:
sed -e 's/\(.*\),\([^,]*\),\([^,]*\)$/\1\2\3/' infile.txt
Note that in my sed variant I have to escape parenthesis, YMMV.
I also do not grasp the concept of how to construct a regular
expression for SED, so I was hoping someone could explain this to me.
The basic notation that people are telling you here is: s/PATTERN/REPLACEMENT/
Your PATTERN is a regular expression, which may contain parts that are in brackets. Those parts can then be referred to in the REPLACEMENT part of the command. For example:
> echo "aabbcc" | sed 's/\(..\)\(..\)\(..\)/\2\3\1/'
bbccaa
Note that in the version of sed I'm using defaults to the "basic" RE dialect, where the brackets in expressions need to be escaped. You can do the same thing in the "extended" dialect:
> echo "aabbcc" | sed -E 's/(..)(..)(..)/\2\3\1/'
bbccaa
(In GNU sed (which you'd find in Linux), you can get the same results with the -r options instead of -E. I'm using OS X.)
I should say that for your task, I would definitely follow Johnsyweb's advice and use awk instead of sed. Much easier to understand. :)
It should work :
sed -e 's~,~~4g' file.txt
remove 4th and next commas
echo "2222,H,73.82,04,07,2012" | sed -r 's/(.{15}).(..)./\1\2/'
Take 15 chars, drop one, take 2, drop one.
sed -e 's/(..),(..),(....)$/\1\2\3/' myfile.txt

Resources