Split up line with arbitrarily many groups - bash

I have many files with many entries (one entry per line) which I have to filter through a sequence of greps and seds. The lines are of the form
a
x, y
u --> v, w
s --> p, q, r
One of the steps is splitting up the lines containing --> such that the left-hand side and each of the comma-separated entries on the right-hand side (of which there can be arbitrarily many) end up on separate lines. That is, the above lines should become:
a
x, y
u
v
w
s
p
q
r
Separating the left side from the right side is quickly done:
echo "u --> v, w" | sed 's/\(.\+\)\s*\-\->\s*\(.\+\)/\1\n\2/'
Gives me
u
v, w
But this seems to be a dead end in that I cannot then pipe this on to splitting on the comma, since that would also split the x, y.
So, I am wondering if there is a way to completely split up such lines in a sed command, or do I have to turn to, e.g., awk (or just go to Python)? It would be preferable to keep this a bash pipe sequence.

awk '/-->/ {gsub(/-->|,/,RS)}1' inputfile|column -t
a
x, y
u
v
w
s
p
q
r
Or, as Anubhav suggested, to avoid the pipe:
awk '/-->/ {gsub(/[ \t]*(-->|,)[ \t]*/ , ORS)} 1' inputfile

Using awk you can do this:
awk -F'[ \t]*-->[ \t]*' -v OFS='\n' '{gsub(/,[ \t]*/, OFS, $2)} 1' file
a
x, y
u
v
w
s
p
q
r

You can do this by creating a command group when you match -->. In this group, you replace --> with newline, print up to the newline, discard the portion you printed, then replace commas in the remainder:
#!/bin/sed -f
/\s*-->\s*/{
s//\n/
P
s/.*\n//
s/,\s*/\n/g
}
Results:
a
x, y
u
v
w
s
p
q
r
Alternatively, in GNU sed, you could use the T command to skip processing of the right-hand side unless you match and replace the -->:
#!/bin/sed -f
s/\s*-->\s*/\n/
Tend
P
s/.*\n//
s/,\s*/\n/g
:end
This produces the same output, as required.
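Since the question asks to keep this as one stage in a bash pipe sequence, the command-group script above can also be passed to sed inline instead of living in its own file. A minimal sketch, assuming GNU sed (for \s and \n); the function name split_arrows is only illustrative:

```shell
# Hypothetical helper name; assumes GNU sed for \s and \n.
# Same commands as the command-group script, one -e fragment per line:
# on lines containing -->, replace it with a newline, print the left-hand
# side, drop it, then turn the remaining commas into newlines.
split_arrows() {
  sed -e '/\s*-->\s*/{' \
      -e 's//\n/' \
      -e 'P' \
      -e 's/.*\n//' \
      -e 's/,\s*/\n/g' \
      -e '}'
}

result=$(printf '%s\n' 'a' 'x, y' 'foo, bar --> baz' 's --> p, q, r' | split_arrows)
printf '%s\n' "$result"
```

Lines without --> pass through untouched, so the function can sit between the other greps and seds in the pipeline.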
I've assumed throughout that you don't want to split any commas on the left-hand side, so that
foo, bar --> baz
becomes
foo, bar
baz
If that's not the case (perhaps if you know there will be no comma to the left of -->), then you don't need P or s/.*\n//, and the script is as simple as
/\s*-->\s*/!b
s//\n/
s/,\s*/\n/g

Related

How to insert space between characters after some specific symbols?

I have a text file in the following format:
\Hollands\\\\\\hOlAnz/hOlAns\\\\\\\\
\Hollandse\\\\\\hOlAns#\\\\\\\\
\Hollywood\\\\\\hOliwud/hOliwut/hOliwYd\\\\\\\\
...
and I would like to make it look like this ⬇️:
\Hollands\\\\\\h O l A n z / h O l A n s\\\\\\\\
\Hollandse\\\\\\h O l A n s #\\\\\\\\
\Hollywood\\\\\\h O l i w u d / h O l i w u t / h O l i w Y d\\\\\\\\
What should I do?
Many thanks in advance.
I tried using sed:
sed 's/\{\\\\\\\{1\}\)/\1 /g'
I expected this to insert one blank space (the \1 followed by a space) after each single character (the \{1\}) that comes after the six consecutive backslashes,
but I got this error:
RE error: invalid repetition count(s)
sed is the right tool for simple s/old/new/ operations on individual strings. For anything more than that (e.g. isolating part of a string and then operating further on that part, as you need here), just use awk.
Using GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(.*\\{6}.)(.*)(\\{8})/,a) { $0=a[1] gensub(/./," &","g",a[2]) a[3] } 1' file
\Hollands\\\\\\h O l A n z / h O l A n s\\\\\\\\
\Hollandse\\\\\\h O l A n s #\\\\\\\\
\Hollywood\\\\\\h O l i w u d / h O l i w u t / h O l i w Y d\\\\\\\\
This might work for you (GNU sed):
sed -E 's/(\\{6}[^\\])(.*\\{8})/\1 \n\2/;:a;s/\n([^\\])/\1 \n/;ta;s/ \n//' file
Turn on extended regular expressions with -E.
On each line, use pattern matching to insert a space and a unique delimiter (\n) after the first character that follows the six \'s.
Loop, replacing the newline and the character after it with that character, a space and the newline delimiter, until the delimiter reaches the next \.
Finally, remove the last space and the newline, and print the result.

grep a list into a multi columns file and get fully matching lines

Not sure how to ask this question, but an example should clarify. Suppose I have this file:
$ cat intoThat
a b
a h
a l
a m
b c
b d
b m
c b
c d
c f
c g
c p
d h
d f
d p
and this list:
$ cat grepThis
a
b
c
d
Now I would like to grepThis intoThat, so I would do this:
$ grep -wf grepThis intoThat
which gives output like this:
**a b**
a h
a l
a m
**b c**
**b d**
b m
**c b**
**c d**
c f
c g
c p
d h
d f
d p
The asterisks highlight the lines that I would like grep to return. These are the lines that match fully, i.e. both columns appear in the list. How can I tell grep (or awk or whatever) to return only these lines?
Of course, some lines may not match any pattern, e.g. the intoThat file may contain other letters such as g, h, l, s, t, etc.
With awk, you could do:
awk 'NR==FNR{ seen[$0]++; next } ($1 in seen && $2 in seen)' grepThis intoThat
a b
b c
b d
c b
c d
NR is the total number of records read so far: it is 1 for the first record awk reads and keeps incrementing across all input files until every record (line) has been read.
FNR is the record number within the current file: it is 1 for the first record and increments with each record, but resets to 1 at the start of each subsequent input file.
So NR == FNR is true only while awk is reading the first input file, and the block that follows it acts on the first file only.
seen is an awk associative array (the name is arbitrary) keyed on the whole line $0; its value is the number of times that line has occurred (the same idiom is often used to remove duplicate records in awk).
The next statement skips the rest of the commands for the current record and moves on to the next record, so the rest of the commands only actually run for the file(s) after the first.
In the final condition (...), we just check whether both columns $1 and $2 are present in the array; if so, the line goes to the output.
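To see how NR and FNR behave across two files, here is a small sketch that prints both counters next to each record. The file names list and pairs and the temporary directory are made up for illustration:

```shell
# Illustrative file names; any two input files behave the same way.
tmp=$(mktemp -d)
printf '%s\n' a b         > "$tmp/list"
printf '%s\n' 'a b' 'a h' > "$tmp/pairs"

# Columns: NR (global count), FNR (per-file count), then the record itself.
counters=$(awk '{ print NR, FNR, $0 }' "$tmp/list" "$tmp/pairs")
printf '%s\n' "$counters"

rm -r "$tmp"
```

The two counters agree only on the lines that come from the first file, which is why NR == FNR selects exactly those lines.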

grep with regular expression to find two words when word b is AFTER word a in a sentence

I have a big text file, each line containing a sentence.
I want to use grep (or something similar in batch) to find sentences where word b occurs after word a, either immediately after it or with some word(s) between them.
I don't want grep to return a sentence like this:
f g s b d a
because b is not after a but I want to return a sentence like
f g a d m s b f
because b is after a.
It is OK to return sentences where a is both after and before b:
s a s b s a s
I also don't want sentences with only a or b.
I just want the sentences where b is after a (something can be in the middle).
I can easily do it with Python but I want to use the beauty of bash.
Try this:
grep "a.*b" file
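Note that a.*b matches the characters a and b anywhere on the line, including inside longer words. If a and b stand for whole words, a sketch with word-boundary anchors might look like this. It assumes GNU grep, where \< and \> match the start and end of a word; b_after_a is an illustrative name:

```shell
# Hypothetical helper; \< and \> are word-boundary anchors in GNU grep
# (an assumption about the grep in use).
b_after_a() {
  grep '\<a\>.*\<b\>' "$@"
}

# The last sample line contains a followed by b only inside one word.
matches=$(printf '%s\n' 'f g s b d a' 'f g a d m s b f' 's a s b s a s' 'cab' | b_after_a)
printf '%s\n' "$matches"
```

With these sample lines, only the two sentences where the word b follows the word a are kept; a line like cab, which plain a.*b would also match, is dropped.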

I would like to sort rows of a data file by NF increasing

I would like to sort rows of a data file by NF increasing.
input
z a b c d k l p m
m x y h j i
y w
g t y u
output
y w
g t y u
m x y h j i
z a b c d k l p m
I tried the sort command, but it does not work.
How can I do this?
Thanks for your help.
Typically you solve these types of problems by modifying the input stream to add some data, operating on that data, and then removing it. In this case, we want to add the field count to the input stream, sort (numerically) on the field count, and then remove it (using a space as the field delimiter):
awk '{ print NF, $0 }' | sort -n | cut -d' ' -f2-
You can either pipe your data to awk or pass the filename as another argument to awk.
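As a worked example of this decorate, sort, undecorate pattern, the pipeline can be wrapped in a small function and fed the sample input from the question. The name sort_by_nf is just illustrative:

```shell
# Hypothetical helper: reads the named files, or stdin when none are given.
sort_by_nf() {
  awk '{ print NF, $0 }' "$@" |  # decorate: prepend the field count
    sort -n |                    # sort numerically on that count
    cut -d' ' -f2-               # undecorate: drop the count again
}

sorted=$(printf '%s\n' 'z a b c d k l p m' 'm x y h j i' 'y w' 'g t y u' | sort_by_nf)
printf '%s\n' "$sorted"
```

Lines with the same number of fields end up ordered by sort's fallback comparison of the whole line, not necessarily in their original order.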

How do you make a range in bash case sensitive?

I'm trying to use a for loop like this, where the range consists of letters, both uppercase and lowercase. The problem is, bash doesn't differentiate uppercase and lowercase when it's within a range. How do I make it case sensitive? TIA.
for s in {a..z,A..Z}
do
echo ${s}
done
If you want the letters in that order, just use:
for s in {a..z} {A..Z}
Nothing restricts bash to a single brace expansion per line.
The two forms currently allowed are mutually exclusive, being either a selection (of two or more) or a range:
{<val1>,<val2>[,...]}
{<from>..<to>[..<incr>]}
The brace expression {a..z,A..Z} simply expands, using the first form, to the two words (not ranges):
a..z
A..Z
Looks like you got the syntax wrong.
$ echo $BASH_VERSION
3.2.25(1)-release
$ echo {a..k} {A..K}
a b c d e f g h i j k A B C D E F G H I J K
$ echo {a..k,A..K}
a..k A..K
