How to convert separators using regex in bash

How do I modify my bash file to achieve the expected result shown below ?
#!/bin/bash
filename=$1
var="$(<$filename)" | tr -d '\n'
sed -i 's/;/,/g' $var
Convert this input file
a,b;c^d"e}
f;g,h!;i8j-
To this output file
a,b,c,d,e,f,g,h,i,j

How to convert separators using regex in bash
You would do exactly that: convert each of the separators using a regex. This consists of the following steps:
most importantly, figuring out the exact definition of what constitutes a "separator"
writing a regex for it
writing an algorithm for it
running and testing the code
For example, assuming a separator is a sequence of any of the \n,;^"}!8- characters, you could do:
sed -zi 's/[\n,;^"}!8-]\+/,/g; s/,$/\n/' input_file
The second regex replaces the trailing , with a trailing newline on the output. If -z is not available with your sed, you can do something similar by first running tr '\n' , and then passing the result of tr to sed.
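For the tr-then-sed route, a minimal sketch might look like the following. The function name convert_separators is made up for illustration, and the separator set is the same assumed one as above. Capturing the result in a variable lets printf supply the final newline, so nothing depends on how sed treats \n in a replacement.

```shell
# Sketch: join all lines with commas, then collapse every run of
# separator characters into a single comma.
convert_separators() {
  local joined
  # tr turns newlines into commas; sed squeezes separator runs and
  # drops the trailing comma left over from the last newline.
  joined=$(tr '\n' ',' < "$1" | sed 's/[,;^"}!8-][,;^"}!8-]*/,/g; s/,$//')
  printf '%s\n' "$joined"
}
```

It would be called as convert_separators input_file > output_file.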
Additionally, in your code:
var is unset on the sed line, because each part of a | pipeline runs in a subshell and the assignment is lost when that subshell exits.
var=$(<$filename) contains the contents of the file, whereas sed wants a filename as its argument, not file contents.
var=.... | ... pipes the output of the assignment to tr. An assignment produces no output, so tr receives nothing, and whatever tr prints is unused anyway.
Remember to check bash scripts with shellcheck.
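Taking those points together, one possible corrected version of the script is sketched below. It assumes GNU sed (for -z and -i) and the same assumed separator set as above; the function wrapper fix_separators is only there to make the example easy to try.

```shell
#!/bin/bash
# Sketch: rewrite the file named by the first argument in place.
fix_separators() {
  local filename="$1"
  # Pass the file NAME to sed (not its contents), quoted against spaces.
  # \n sits inside the bracket expression so line breaks count as separators.
  sed -z -i 's/[\n,;^"}!8-]\+/,/g; s/,$/\n/' "$filename"
}
```

No variable holding the file contents and no pipeline are needed, which removes all three problems listed above.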

For a somewhat portable solution, maybe try
tr -cs A-Za-z , <input_file | sed '$s/,$/\n/' >output_file
The use of \n to force a final newline is still not entirely reliable; there are some sed versions which interpret the sequence as a literal n.
You'd move output_file back on top of input_file after this command if you want to replace the original.
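A sketch of that write-then-move step, under the assumption that a temporary file from mktemp is acceptable (the function name replace_separators_in_place is hypothetical). It strips the trailing comma with a parameter expansion and lets printf write the final newline, which sidesteps the \n portability question entirely.

```shell
# Sketch: apply the tr conversion and replace the original file.
replace_separators_in_place() {
  local file="$1" tmp out
  tmp=$(mktemp) || return 1
  # Every run of non-letters becomes one comma (the tr rule above).
  out=$(tr -cs 'A-Za-z' ',' < "$file")
  # ${out%,} drops the trailing comma; printf adds the final newline.
  # The original is only overwritten if the write succeeded.
  printf '%s\n' "${out%,}" > "$tmp" && mv "$tmp" "$file"
}
```

Note that mv replaces the file, so the result carries the temporary file's permissions rather than the original's.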

Related

How to create a string of filenames separated by comma in shell script?

I am trying to create a string eg (file1.txt,file2.txt,file3.txt).
All these 3 files names are within a file.
ls file*.txt > lstfiles.txt
while read -r line; do
filenames+=$line","
done <lstfiles.txt
This returns me with output:
file1.txt,file2.txt,file3.txt,
How can I detect the last iteration of the loop so I don't add another comma at the end?
Required output:
file1.txt,file2.txt,file3.txt
For your use case I would rather get rid of the while loop and combine the sed and tr commands like so:
sed -e '$ ! s/$/,/g' lstfiles.txt | tr -d '\n'
Here the sed command appends a comma to every line except the last one, and the tr command removes the line breaks.
Probably avoid using ls in scripts though.
printf '%s,' file*.txt |
sed 's/,$/\n/'
assuming your sed recognizes \n to be a newline, and copes with input which doesn't have a final newline.
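If neither ls nor sed is wanted, the shell can do the join itself. This is a minimal sketch (the function name join_by_comma is made up here): inside double quotes, "$*" joins the arguments with the first character of IFS.

```shell
# Sketch: join all arguments with commas, no trailing comma.
join_by_comma() {
  # IFS is changed only inside this function.
  local IFS=,
  printf '%s\n' "$*"
}

join_by_comma file1.txt file2.txt file3.txt
# prints: file1.txt,file2.txt,file3.txt
```

With real files it would be called as join_by_comma file*.txt.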

Strip multiline markdown comment in bash

How do I strip multi-line markdown comments, such as the one below, in bash?
some text
<!-- QUESTION:
How do I remove everything
in-between these tags?
-->
some<!-- Including embedded single-line comments such as this --> text
I've tried sed -e 's/<!--((.*?)\n?)+-->//g' $1, which works only with single line, and cat $1 | tr '\n' '\r' | sed -e 's/<!--.*-->//g' | tr '\r' '\n', which removes everything after the first multiline comment.
<!--((.*?)\n?)+--> captures the required area in my text-editor, but
sed -e 's/<!--((.*?)\n?)+-->//g' $1 doesn't work as expected.
The other examples I can find, which work with C++ comments, are too complicated to decode.
You can accomplish this with a perl one-liner.
Perl Switches:
-0 sets the input record separator to the null character \0
-p prints the result of perl code
-e executes the following code
Inside the regular expression:
g flag means global (perform the replacement as many times as possible)
s flag means treat the input as a single string, so that . also matches newline characters
Match the characters `<!--` followed by anything up to the characters `-->`
including anything after that till the newline. Replace that with nothing.
In Action:
perl -0pe 's|<!--.+?-->.*?\n||gs;' input
Output:
some text
some text

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure how to tell sed dot match new line:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
instead I get original:
one
two
three
sed is a line-based tool. I don't think there is an option for this.
You can use h/H(hold), g/G(get).
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
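The same one-liner, spread out with comments as a sketch of what each hold-space command contributes (the function name join_then_replace is made up for illustration):

```shell
# Sketch: collect the whole input in the hold space, then substitute once.
join_then_replace() {
  sed -n '
    # Line 1: copy the pattern space into the (still empty) hold space.
    1h
    # Every other line: append a newline plus the line to the hold space.
    1!H
    # Last line: fetch the accumulated text and run the substitution on it.
    ${
      g
      s/one.*two/one/p
    }
  ' "$@"
}

printf 'one\ntwo\nthree\n' | join_then_replace
```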
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, a plain . can match any character, including line break characters. As the documentation puts it:
.
Matches any character, including newline.
All you need to use is a -z option:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0 option sets the record separator to the NUL character, so a text file without NUL bytes is read as a whole rather than line by line (slurp mode); -i enables in-place file modification; the s flag makes . match any character including line breaks; and .*? matches any 0 or more characters, as few as possible, because *? is non-greedy. The -CSD -Mutf8 part makes sure your input is decoded and the output is re-encoded correctly.
You can use python this way:
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print(s, end="")'
one
three
$
This reads all of Python's standard input (sys.stdin.read()), then substitutes "one" for "one.*two" with the dot-matches-all setting enabled (using (?s) at the start of the regular expression), and then prints the modified string (end="" prevents print from adding an extra newline).
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
Sed does match all characters (including the \n) using a dot . but usually it has already stripped the \n off as part of the cycle, so it is no longer present in the pattern space to be matched.
Only certain commands (N,H and G) preserve newlines in the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
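Two tiny sketches of these commands in use (the function names are made up for illustration): N pulls the next line into the pattern space so that the embedded newline can be matched, and G appends the empty hold space.

```shell
# N: join each pair of lines, replacing the embedded newline with a space.
join_pairs() { sed 'N;s/\n/ /' "$@"; }

# G: append a newline plus the (empty) hold space, i.e. double-space the input.
double_space() { sed G "$@"; }

printf 'a\nb\nc\nd\n' | join_pairs
# prints:
# a b
# c d
```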
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed operates in a line-by-line manner, so there is no direct way to tell it to make the dot match a newline. However, there are some tricks that achieve the same effect. You can use a loop structure (of a kind) to put all the text in the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
root@u1804:~# echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
Note that in this case, (\n|.) matches both newlines and all other characters. See the example below:
root@u1804:~# echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
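Some non-GNU sed implementations do not accept a label followed by ; on the same line. A commonly used workaround, sketched here under that assumption, is to pass each command as its own -e expression and to use -E instead of -r (the function name slurp_replace is hypothetical):

```shell
# Sketch: the same slurp loop, written so each command is a separate -e.
slurp_replace() {
  # :a     define label a
  # N      append the next input line to the pattern space
  # $!ba   unless this is the last line, branch back to a
  sed -E -e ':a' -e 'N' -e '$!ba' -e 's/one(\n|.)*two/one/' "$@"
}

printf 'oneXXXXXX\nXXXXXXtwo\nthree\n' | slurp_replace
```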

Replace part of a string Shell Scripting

I have lines and want to apply a sed operation to the string that comes after the third '|' character. How can I do this in a shell script?
Input: aaaa|bbbbb|ccccc|hello
Desired Output: aaaa|bbbbb|ccccc|hel
This is to be done on hello, which comes after three '|' characters.
-> sed 's/({.3}).*/\1/g'
You don't specify what you want to do with the last field to transform "hello" into "hel". Here's one way:
sed -r 's/^(([^|]+\|){3})(...).*/\1\3/' file
([^|]+\|) denotes a pipe delimited field (with the pipe)
(([^|]+\|){3}) denotes three such fields
requires sed's -r option (on OSX or BSD-ish implementations of sed, use -E instead)
I capture the next three characters with (...)
then replace all with the first and third set of capturing parentheses
Use the cut command instead of sed:
$ echo "aaaa|bbbbb|ccccc|hello" | cut -d '|' -f 4
hello
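cut on its own only extracts the field. If the goal is to shorten the fourth field while keeping the rest of the line, a sketch with awk could look like this; the function name truncate_fourth_field and the choice of keeping the first three characters are assumptions based on the hello to hel example.

```shell
# Sketch: keep only the first 3 characters of the 4th |-separated field.
truncate_fourth_field() {
  # -F'|' splits on a literal pipe; OFS='|' puts the pipes back
  # when awk rebuilds the line after $4 is modified.
  awk -F'|' -v OFS='|' '{ $4 = substr($4, 1, 3); print }' "$@"
}

echo 'aaaa|bbbbb|ccccc|hello' | truncate_fourth_field
# prints: aaaa|bbbbb|ccccc|hel
```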

Grep (fgrep) bash exact match end of line

I have the below example file
d41d8cd98f00b204e9800998ecf8427e /home/abid/Testing/FileNamesTest/apersand $ file
d41d8cd98f00b204e9800998ecf8427e /home/abid/Testing/FileNamesTest/file[with square brackets]
d41d8cd98f00b204e9800998ecf8427e /home/abid/Testing/FileNamesTest/~$tempfile
017a3635ccb76250b2036d6aea330c80 /home/abid/Testing/FileNamesTest/FileThree
217a3635ccb76250b2036d6aea330c80 /home/abid/Testing/FileNamesTest/FileThreeDays
d41d8cd98f00b204e9800998ecf8427e /home/abid/Testing/FileNamesTest/single quote's
I want to grep the last part of the file (the file name) but I'm after an exact match for the last part of the line (the file name)
grep FileThree$ files.md5
017a3635ccb76250b2036d6aea330c80 /home/abid/Testing/FileNamesTest/FileThree
gives back an exact match and doesn't find "FileThreeDays", which is what I'm after. But because some of the file names contain square brackets, I have to use grep -F or fgrep. However, using fgrep like the above doesn't work: it returns nothing.
How can I exact match the last part of the line using fgrep whilst still honoring the special characters above ~ / $ / ' / [ ] etc...or any other method using maybe awk...
Further....
Using fgrep without the $ returns both of these files. I only want an exact match (as I get with the $ above using grep), but with fgrep the $ doesn't return anything.
grep -F FileThree files.md5
017a3635ccb76250b2036d6aea330c80 /home/abid/Testing/FileNamesTest/FileThree
217a3635ccb76250b2036d6aea330c80 /home/abid/Testing/FileNamesTest/FileThreeDays
I can't tell all the details from your question, but it sounds like you can use grep and just escape the special characters: grep 'File\[Three\]Days$'
If you want to use fgrep, though, you can use some tr tricks to help you. If all you want is the filename (without the directory name), you can do something like
cat files.md5 | tr '/' '\n' | fgrep FileThreeDays
That tr command replaces slashes with newlines, so it will put each filename on its own line. That means that fgrep will only find the filename when it searches for FileThreeDays.
If you want the full filename with directory, it's a little trickier, but a similar approach will work. Assuming that there's always a double space between the SHA and the filename, and that there aren't any filenames with double spaces or tab characters in them, you can try something like this:
sed 's/  /\t/' files.md5 | tr '\t' '\n' | fgrep FileThreeDays
That sed command converts the double spaces to tabs. The tr command turns those tabs into newlines (the same trick as above).
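For an exact match on the last path component without any regex interpretation, a plain string comparison in awk is another option. This is a sketch only; the function name match_basename and the NAME environment variable are made up for illustration. The name is passed through the environment so that awk does not process backslashes in it.

```shell
# Sketch: print lines that end in "/<name>", compared as a literal string.
match_basename() {
  local name="$1"
  shift
  NAME="$name" awk '
    BEGIN { suffix = "/" ENVIRON["NAME"] }
    {
      # Position where the suffix would have to start in this line.
      start = length($0) - length(suffix) + 1
      if (start > 0 && substr($0, start) == suffix) print
    }
  ' "$@"
}
```

Called as match_basename FileThree files.md5, it should print the FileThree line but not the FileThreeDays one, and names containing [ ] $ ~ or ' need no escaping beyond normal shell quoting.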
I would use awk:
awk '{$1="";print}' file
$1="" cuts the first column to an empty string, and print prints the modified line - which only contains the filename now.
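However, this leaves a blank space at the start of each line. If you care about it and want to remove it, set the output field separator to an empty string (be aware that this also removes the spaces inside file names that contain them, because awk rebuilds the whole line with the new separator):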
awk '{$1="";print}' OFS="" file
