Using BASH sed command to strip a line - bash

I have one line in an HTML file, which I located using
grep -m 1 'argument'
That line looks a lot like this
<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>
I need to extract Text1 and Text2 on separate lines. How do I do that? I gather that I need to use sed; I have removed the tag and span at the beginning, leaving me with
Text1 </span> - <span...........</tag>
but I need only Text1, and I really don't know how to remove the non-static Text2.

If the lines always look exactly like the example you provide, you can do it with a regexp.
In all other cases, you should really use an XML parser instead (for example, Perl's XML::Twig, or others).
So here is a regexp, but you've been warned ^^
# Replace each <...> with "|", so you can easily separate the fields
sed -e 's/<[^>]*>/|/g'
You can then fetch each section by using that new "simple" separator, |:
grep 'argument' | sed -e 's/<[^>]*>/|/g' | awk -F'|' '{print $3}' #shows Text1. Change $3 to $5 to fetch the Text2
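The fields printed this way still carry the spaces around Text1 and Text2. A minimal sketch of the same pipeline with trimming added, assuming the line looks exactly like the sample in the question (the variable names line and texts are only for illustration):

```shell
# Sample line copied from the question.
line='<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>'

# Replace every tag with "|", then print fields 3 and 5 with surrounding spaces removed.
texts=$(printf '%s\n' "$line" |
  sed -e 's/<[^>]*>/|/g' |
  awk -F'|' '{
    for (i = 3; i <= 5; i += 2) {
      field = $i
      gsub(/^ +| +$/, "", field)   # trim leading and trailing spaces
      print field
    }
  }')
printf '%s\n' "$texts"
```

This prints Text1 and Text2 on separate lines, but only because the tags sit in fixed positions; any change in the markup shifts the field numbers.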

Here is a much shorter way to do that, using grep with Perl-compatible regular expressions.
$ cat testfile # I've placed your line in this file
<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>
$ grep -Po '<span[^>]*>\K[^<]*' testfile
Text1
Text2
But if you want exactly Text1 and Text2, without the surrounding whitespace, you need a slightly more complicated regexp:
$ grep -Po '<span[^>]*>( )?\s*\K.*?(?=\s*( )?</span>)' testfile
Text1
Text2
Some explanation:
This regex uses lookaround syntax, also known as zero-width assertions.
\K might be unfamiliar too. It is very similar to a zero-width assertion. Here is a quote from the Perl regular expression documentation (perlre):
There is a special form of this construct, called \K , which causes the regex engine to "keep" everything it had matched prior to the \K and not include it in $& . This effectively provides variable-length look-behind. The use of \K inside of another look-around assertion is allowed, but the behaviour is currently not well defined.
Ok, but why?
One of the best things about this approach is that, since you're already using grep, you can probably combine this regex with your search pattern so that you only need one grep command, unlike Oliver's answer, which uses grep, sed and awk.
But still, there are special tools to parse xml, please use them instead of this regex porn.
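For reference, a stripped-down sketch of the \K plus lookahead idea, assuming GNU grep built with PCRE support (-P) and input shaped like the sample line. The variable names are hypothetical, and the optional ( )? groups from the regex above are left out:

```shell
# Sample line copied from the question.
line='<tag option="something" option="something"><span option="something"> Text1 </span> - <span option="something"> Text2 </span></tag>'

# <span[^>]*>\s*  : match the opening span and any whitespace after it
# \K              : drop everything matched so far from the reported match
# .*?             : lazily take the text itself
# (?=\s*</span>)  : stop where optional whitespace and the closing tag follow
texts=$(printf '%s\n' "$line" | grep -Po '<span[^>]*>\s*\K.*?(?=\s*</span>)')
printf '%s\n' "$texts"
```

With -o, grep prints each match on its own line, so the two span contents come out separately and already trimmed.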

Related

Sed command to replace a word in a tab and space-separated line of text

I have the below fields separated by tabs and spaces in file "text.txt". I want to use sed command to find "^#\t*\stext1\t\stext2\t\s100" and replace it with "^#\t\stext1\t\stext2\t\s*1000"
<Field1> <Field2> <Field3> <Field4>
# text1 text2 100
$ text3 text4 200
I have tried using the below sed command:
sed -i "/^\s*\#\s+text1\s+text2\s*/c\#/\t/\ttext1/\ttext2/\t/\t1000" /text.txt
However, nothing is getting replaced in the file.
Your main issue is that you are using an unescaped + in a POSIX BRE regex, where it is treated as a literal + symbol.
You need to use the -E option to enable POSIX ERE syntax, where + is treated as a quantifier. Besides, you have several redundant / chars in the replacement; you need to remove them.
You can use
sed -E -i "/^\s*\#\s+text1\s+text2\s*/c\#\t\ttext1\ttext2\t1000" file
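To see the BRE/ERE difference in isolation, here is a small sketch. The sample row is made up (fields separated by a tab plus a space, as the question describes), and it uses an address plus s instead of the c command, so the original whitespace is kept:

```shell
# Hypothetical sample row: "#", "text1", "text2" and "100" separated by tab-plus-space.
row=$(printf '#\t text1\t text2\t 100')

# BRE (default): "+" is a literal plus sign, so nothing matches and the row is unchanged.
bre=$(printf '%s\n' "$row" |
  sed '/^#[[:space:]]+text1[[:space:]]+text2[[:space:]]+100$/s/100$/1000/')

# ERE (-E): "+" is a quantifier, so the address matches and 100 becomes 1000.
ere=$(printf '%s\n' "$row" |
  sed -E '/^#[[:space:]]+text1[[:space:]]+text2[[:space:]]+100$/s/100$/1000/')

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

The [[:space:]] class is used here instead of \s only because it is the POSIX spelling; with GNU sed either works.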

Adding a new line to a text file after 5 occurrences of a comma in Bash

I have a text file that is basically one giant Excel file on a single line. An example would be like this:
Name,Age,Year,Michael,27,2018,Carl,19,2018
I need to change every third occurrence of a comma into a new line so that I get
Name,Age,Year
Michael,27,2018
Carl,19,2018
Please let me know if that is too ambiguous and as always thank you in advance for all the help!
With GNU sed:
sed -E 's/(([^,]*,){2}[^,]*),/\1\n/g'
To change the number of fields per line, change {2} to one less than the number of fields. For example, to change every fifth comma (as in the title of your question), you would use:
sed -E 's/(([^,]*,){4}[^,]*),/\1\n/g'
In the regular expression, [^,]*, is "zero or more characters other than , followed by a ,"; in other words, it is a single comma-delimited field. This won't work if the fields are quoted strings with internal commas or newlines.
Regardless of what Linux's man sed says, the -E flag is an extension to POSIX sed, which causes sed to use extended regular expressions (EREs) rather than basic regular expressions (see man 7 regex). -E also works on BSD sed, used by default on Mac OS X. (Thanks to @EdMorton for the note.)
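The "one less than the number of fields" rule can be wrapped in a small function. This is only a sketch: split_every is a made-up name, and it assumes GNU sed (for \n in the replacement) and a field count of at least 2:

```shell
# split_every N: hypothetical helper that turns every Nth comma on stdin into a newline.
# Assumes GNU sed (for \n in the replacement) and N >= 2.
split_every() {
  n=$1
  # {N-1} fields with their commas, one more field, then the comma to replace.
  sed -E "s/(([^,]*,){$((n - 1))}[^,]*),/\1\n/g"
}

printf 'Name,Age,Year,Michael,27,2018,Carl,19,2018\n' | split_every 3
```

Calling split_every 5 instead gives the "every fifth comma" variant from the question title.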
With GNU awk for multi-char RS:
$ awk -v RS='[,\n]' '{ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
With any awk:
$ awk -v RS=',' '{sub(/\n$/,""); ORS=(NR%3 ? "," : "\n")} 1' file
Name,Age,Year
Michael,27,2018
Carl,19,2018
Try this:
$ cat /tmp/22.txt
Name,Age,Year,Michael,27,2018,Carl,19,2018,Nooka,35,1945,Name1,11,19811
$ echo "Name,Age,Year"; grep -o "[a-zA-Z][a-zA-Z0-9]*,[1-9][0-9]*,[1-9][0-9]\{3\}" /tmp/22.txt
Michael,27,2018
Carl,19,2018
Nooka,35,1945
Name1,11,1981
The \{3\} interval saves you from writing [0-9] three more times for the YYYY part.
PS: This solution gives you only YYYY for the year: even if the data for the year is 19811 (a typo, say), you'll still get 1981.
You are looking for groups of 3 fragments, each without a comma and separated by commas.
The last group can cause problems (it does not end with a comma and may have only two fields).
The next command handles that, although each group it prints keeps its trailing comma:
grep -Eo "([^,]*[,]{0,1}){0,3}" inputfile
This might work for you (GNU sed):
sed 's/,/\n/3;P;D' file
Replace the third , in the pattern space with a newline, print the first line, delete it, and repeat on what remains.
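The same script spelled out one command per line, as a sketch that assumes GNU sed (both for \n in the replacement and for the indented comment lines); the variable name rows is only for illustration:

```shell
rows=$(printf 'Name,Age,Year,Michael,27,2018,Carl,19,2018\n' | sed '
  # Replace the third comma in the pattern space with a newline.
  s/,/\n/3
  # Print the pattern space up to the first newline.
  P
  # Delete through the first newline and restart on the remainder
  # (with no newline left, this simply ends the cycle).
  D
')
printf '%s\n' "$rows"
```

Because D restarts the script without reading new input, the s command keeps finding "the third comma" in whatever is left of the line.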

parse word from html file

I am having a lot of trouble trying to extract a word from an html file. The line in the html file appears like this:
<span id="result">WORD</span>
I am trying to get the WORD out but I can't figure it out. So far I've got:
grep 'span id="result"' FILE
Which just gets me the line. I've also tried:
sed -n '/<span id="result">/,/<\/span>/p' FILE
which didn't work either.
I know this is probably a very simple question, but I'm just beginning so I could really use some help.
Do not use regex to parse html.
Use a html parser.
My Xidel has the shortest syntax for this:
xidel FILE -e "#result"
This is a task for awk
I guess you have other lines in the same file, so a search for span id is a must.
echo '<span id="result">WORD</span>' | awk -F'[<>]' '/span id/ {print $3}'
WORD
You can try (GNU awk is needed here, because the three-argument form of match is a gawk extension)
gawk -f ext.awk input.html
where input.html is your input html file, and ext.awk is
{
    line = line $0 RS
}
END {
    match(line, /<span id="result">([^<]*)<\/span>/, a)
    print a[1]
}
This will extract the contents even across line breaks.
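If GNU awk is not available, the same idea can be sketched with index and substr, which any POSIX awk provides. The function name extract_result and the variables inside it are made up for this example, and it assumes the opening tag appears literally as <span id="result">:

```shell
# extract_result [FILE...]: print the text between <span id="result"> and the next </span>.
extract_result() {
  awk '
    { doc = doc $0 "\n" }                      # collect the whole input
    END {
      otag = "<span id=\"result\">"
      ctag = "</span>"
      start = index(doc, otag)
      if (start == 0) exit 1                   # opening tag not found
      rest = substr(doc, start + length(otag))
      stop = index(rest, ctag)
      if (stop == 0) exit 1                    # closing tag not found
      print substr(rest, 1, stop - 1)
    }
  ' "$@"
}

printf '<p>other</p>\n<span id="result">WORD</span>\n' | extract_result
```

Like the script above, this works across line breaks, but it is still string matching, not HTML parsing.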
Use grep with a lookbehind assertion:
grep -Po '(?<=<span id="result">)\w+'
The expression in parentheses is a lookbehind assertion; it is not part of the match but serves as a test for the following part of the regex: \w+ is matched only if the expression appears immediately before it. Option -o outputs only the matched word; option -P enables Perl-compatible regular expressions, which provide lookahead and lookbehind.
If you want to modify this regex, please note that with grep, a lookbehind must have a fixed length.

Sed is not replacing all occurrences of pattern

I've got the following variable, LINES, in the format date;album;song;duration;singer;author;genre.
August 2013;MDNA;Falling Free;00:31:40;Madonna;Madonna;Pop
August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop
August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop
August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
I want to output author-song, so I made this script:
echo $LINES | sed s_"^[^;]*;[^;]*;\([^;]*\);[^;]*;[^;]*;\([^;]*\)"_"\2-\1"_g
The desired output is:
Madonna-Falling Free
Madonna-I don't give a
Madonna-I'm a sinner
Madonna-Give Me All Your Luvin'
However, I am getting this:
Madonna-Falling Free;Madonna;Pop August 2013;MDNA;I don't give a;00:45:40;Madonna;Madonna;Pop August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop August 2013;MDNA;Give Me All Your Luvin';01:15:02;Madonna;Madonna;Pop
Why?
EDIT: I need to use sed.
When I run your sed script on your input, I get this output:
Madonna-Falling Free;Pop
Madonna-I don't give a;Pop
Madonna-I'm a sinner;Pop
Madonna-Give Me All Your Luvin';Pop
which is fine except for the extra ;Pop - you just need to add .*$ to the end of your regex so that the entire line is replaced.
Based on your reported output, I'm guessing your input file is using a different newline convention from what sed expects.
In any case, this is a pretty silly thing to use sed for. Much better with awk, for instance (the author is field 6):
awk 'BEGIN {FS=";";OFS="-"} {print $6,$3}'
or, slightly more tersely,
awk -F\; -v OFS=- '{print $6,$3}'
If you want sed to see more than one line of input, you must quote the variable to echo:
echo "$LINES" | sed ...
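A small sketch of the difference, using two of the sample records. The variable name records is a hypothetical stand-in for LINES (bash itself can set LINES to the terminal height, so another name is safer), and the sed expression is the one from the question with .*$ appended:

```shell
# "records" is a hypothetical stand-in for the question's LINES variable.
records="August 2013;MDNA;Falling Free;00:31:40;Madonna;Madonna;Pop
August 2013;MDNA;I'm a sinner;01:00:29;Madonna;Madonna;Pop"

# Unquoted: word splitting turns the newline into a space, so one line comes out.
unquoted_count=$(echo $records | awk 'END { print NR }')

# Quoted: the newline survives, so there is one record per line.
quoted_count=$(printf '%s\n' "$records" | awk 'END { print NR }')

# Same expression as in the question, with .*$ appended to consume the genre field.
author_song=$(printf '%s\n' "$records" |
  sed 's/^[^;]*;[^;]*;\([^;]*\);[^;]*;[^;]*;\([^;]*\).*$/\2-\1/')
printf '%s\n' "$author_song"
```

The unquoted form yields a single line, which is why sed only rewrote the first record in the question's output.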
Note that I'm not even going to try to evaluate the correctness of your sed script; using sed here is a travesty, given that awk is so much better suited to the task.
It looks like sed is viewing your entire sample text as a single line. So it is performing the operation requested and then leaving the rest unchanged.
I would look into the newline issue first. How are you populating $LINES?
You should also add to the pattern that seventh field in your input (genre), so that the expression actually does consume all of the text that you want it to. And perhaps anchor the end of the pattern on $ or \b (word boundary) or \s (a spacey character) or \n (newline).
If your format is fixed, try the following:
echo "$line" | sed 's#.*;.*;\(.*\);.*;.*;\(.*\);.*#\2-\1#'

Explained shell statement

The following statement will remove line numbers in a txt file:
cat withLineNumbers.txt | sed 's/^.......//' >> withoutLineNumbers.txt
The input file is created with the following statement (this one i understand):
nl -ba input.txt >> withLineNumbers.txt
I know the functionality of cat and I know the output is written to the 'withoutLineNumbers.txt' file. But the part '| sed 's/^.......//'' is not really clear to me.
Thanks for your time.
That sed regular expression simply removes the first 7 characters from each line. The regular expression ^....... says "Any 7 characters at the beginning of the line." The sed argument s/^.......// substitutes the above regular expression with an empty string.
Refer to the sed(1) man page for more information.
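A quick way to check this is to run both steps on a couple of made-up lines. By default nl writes a six-character, right-aligned number followed by a tab, which is where the 7 comes from. The sketch below also shows cut -f2- as an alternative that relies on that tab instead of counting characters (variable names are only for illustration):

```shell
# Number two sample lines the same way the question does.
numbered=$(printf 'first line\nsecond line\n' | nl -ba)

# Strip the first 7 characters: 6 for the padded number, 1 for the tab.
stripped=$(printf '%s\n' "$numbered" | sed 's/^.......//')

# Alternative: drop everything up to and including the first tab.
by_field=$(printf '%s\n' "$numbered" | cut -f2-)

printf '%s\n' "$stripped"
```

Both variants give back the original two lines for this input.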
That sed statement says to delete the first 7 characters; a dot "." means any character. There is an even easier way to do this:
awk '{print $2}' withLineNumbers.txt
You just have to print the 2nd column using awk; no regex is needed.
If your data contains spaces:
awk '{$1="";print substr($0,2)}' withLineNumbers.txt
sed is doing a search and replace. The 's' means substitute, the next character ('/') is the separator, the search expression is '^.......', and the replace expression is an empty string (i.e. everything between the last two slashes).
The search is a regular expression. The '^' means match start of line. Each '.' means match any character. So the search expression matches the first 7 characters of each line. This is then replaced with an empty string. So what sed is doing is removing the first 7 characters of each line.
A simpler way to achieve the same thing could be:
cut -b8- withLineNumbers.txt > withoutLineNumbers.txt
