ctags regex for multiple declarations in one line - ctags

I'm writing a .ctags file for a custom language. Like most languages, it allows multiple variable declarations on one line, e.g.:
int a, b, c;
I have a basic regex which recognizes 'a':
--regex-mylang=/^[ \t]*int[ \t]*([a-zA-Z_0-9]+)/\1/v,variable/
How do I modify this to have it match 'b' and 'c', as well? I can't find anything in ctags documentation that deals with multiple matches in a single line.

The latest Universal Ctags can capture them by using multiple regex tables:
[jet#localhost]/tmp% cat input.x
int a, b, c;
[jet#localhost]/tmp% cat x.ctags
--langdef=X
--map-X=.x
--kinddef-X=v,var,variables
--_tabledef-X=main
--_tabledef-X=vardef
--_mtable-regex-X=main/int[ \t]+//{tenter=vardef}
--_mtable-regex-X=main/.//
--_mtable-regex-X=vardef/([a-zA-Z0-9]+)/\1/v/
--_mtable-regex-X=vardef/;//{tleave}
--_mtable-regex-X=vardef/.//
[jet#localhost]/tmp% u-ctags --options=x.ctags -o - ./input.x
a ./input.x /^int a, b, c;$/;" v
b ./input.x /^int a, b, c;$/;" v
c ./input.x /^int a, b, c;$/;" v
See https://docs.ctags.io/en/latest/optlib.html#advanced-pattern-matching-with-multiple-regex-tables for more details.
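To make the table switching in the optlib above easier to follow, here is a rough Python simulation of the same two tables run over `int a, b, c;`. It is only a sketch of the idea, not how Universal Ctags is implemented: the function name `scan_vars` and the `TABLES` layout are invented for illustration, it assumes patterns are tried in order at the current input position, and `{tleave}` is simplified to "go back to main".

```python
import re

# Each table is an ordered list of (pattern, emits_tag, next_table).
# next_table=None means "stay in the current table".
TABLES = {
    "main": [
        (re.compile(r"int[ \t]+"), False, "vardef"),  # like {tenter=vardef}
        (re.compile(r".", re.S), False, None),        # skip one character
    ],
    "vardef": [
        (re.compile(r"[a-zA-Z0-9]+"), True, None),    # emit a variable tag
        (re.compile(r";"), False, "main"),            # simplified {tleave}
        (re.compile(r".", re.S), False, None),        # skip separators
    ],
}


def scan_vars(text):
    """Return the names tagged while walking text with the tables above."""
    tags, table, pos = [], "main", 0
    while pos < len(text):
        for pattern, emits_tag, next_table in TABLES[table]:
            match = pattern.match(text, pos)  # anchored at the current position
            if match:
                if emits_tag:
                    tags.append(match.group(0))
                pos = match.end()
                table = next_table or table
                break
        else:
            break  # nothing matched at this position; stop scanning
    return tags


print(scan_vars("int a, b, c;\n"))
# ['a', 'b', 'c']
```

Because the vardef table stays active until the `;` is consumed, every identifier between `int` and `;` produces its own tag, which is what a single line-oriented --regex cannot do.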

After going through this for a few hours, I'm convinced it can't be done. No matter what, the regular expression will only expand to one tag per line. Even if you put \1 \2 \3 ... as the expansion, that would just cause a tag consisting of multiple matches, instead of one tag per match.
ctags handles the equivalent C declaration correctly because its C support is implemented in the ctags source code as an actual code parser, not a regexp.

You're trying to do parsing with a regex, which is not generally possible. Parsing requires the equivalent of storing information on a stack, but a regular expression can embody only a finite number of different states.

It can be partially done with Universal Ctags, with the help of the {_multiline=N} and {scope} flags. N is the number of the group whose position is saved in the generated tags file.
For more information look here: docs/optlib.rst
Configuration: mylang.ctags
--langdef=mylang
--langmap=mylang:.txt
--regex-mylang=/^[[:blank:]]*(int)[[:blank:]]/\1/{placeholder}{scope=set}{_multiline=1}
--regex-mylang=/(;)/\1/{placeholder}{scope=clear}
--regex-mylang=/[[:blank:]]*([[:alnum:]]+)[[:blank:]]*,?/\1/v,variable/{_multiline=1}{scope=ref}
Test file: test.txt
void main() {
int a, b, c, d;
}
Generate tags with: ctags --options=mylang.ctags test.txt
Generated tags file:
!_TAG_FILE_FORMAT 2 /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED 1 /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_MODE u-ctags /u-ctags or e-ctags/
!_TAG_PROGRAM_AUTHOR Universal Ctags Team //
!_TAG_PROGRAM_NAME Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL https://ctags.io/ /official site/
!_TAG_PROGRAM_VERSION 0.0.0 /cb4476eb/
a test.txt /^ int a, b, c, d;$/;" v
b test.txt /^ int a, b, c, d;$/;" v
c test.txt /^ int a, b, c, d;$/;" v
d test.txt /^ int a, b, c, d;$/;" v
int test.txt /^ int a, b, c, d;$/;" v
main test.txt /^void main() {$/;" v
void test.txt /^void main() {$/;" v

--regex-perl=/^\s*?use\s+(\w+[\w\:]*?\w*?)/\1/u,use,uses/
--regex-perl=/^\s*?require\s+(\w+[\w\:]*?\w*?)/\1/r,require,requires/
--regex-perl=/^\s*?has\s+['"]?(\w+)['"]?/\1/a,attribute,attributes/
--regex-perl=/^\s*?\*(\w+)\s*?=/\1/a,aliase,aliases/
--regex-perl=/->helper\(\s?['"]?(\w+)['"]?/\1/h,helper,helpers/
--regex-perl=/^\s*?our\s*?[\$#%](\w+)/\1/o,our,ours/
--regex-perl=/^=head1\s+(.+)/\1/p,pod,Plain Old Documentation/
--regex-perl=/^=head2\s+(.+)/-- \1/p,pod,Plain Old Documentation/
--regex-perl=/^=head[3-5]\s+(.+)/---- \1/p,pod,Plain Old Documentation/

Related

grep a list into a multi columns file and get fully matching lines

I'm not sure how to phrase this question, but an example should clarify it. Suppose I have this file:
$ cat intoThat
a b
a h
a l
a m
b c
b d
b m
c b
c d
c f
c g
c p
d h
d f
d p
and this list:
$ cat grepThis
a
b
c
d
Now I would like to grep the patterns in grepThis against intoThat, which I would do like this:
$ grep -wf grepThis intoThat
which gives output like this:
**a b**
a h
a l
a m
**b c**
**b d**
b m
**c b**
**c d**
c f
c g
c p
d h
d f
d p
The asterisks highlight the lines that I would like grep to return. These are the lines with a full match, i.e. both columns appear in the list. How do I tell grep (or awk, or whatever) to return only these lines?
Of course, some lines may not match any pattern at all; e.g. the intoThat file may contain other letters such as g, h, l, s, t, etc.
With awk, you could do:
awk 'NR==FNR{ seen[$0]++; next } ($1 in seen && $2 in seen)' grepThis intoThat
a b
b c
b d
c b
c d
NR is the total number of records read so far: it is 1 for the first record and keeps incrementing across all input files until every line has been read.
FNR is the record number within the current file: it also starts at 1, but it resets to 1 at the start of each new input file.
So NR == FNR is true only while awk is reading the first input file, and the block that follows it runs on the first file only.
seen is an awk associative array (the name is arbitrary) keyed on the whole line $0; its value counts how many times that line has occurred (the same idiom is often used to remove duplicate records).
The next statement skips the remaining commands for the current record and moves on to the next record, so the rest of the program only runs for the file(s) after the first.
In the final condition ($1 in seen && $2 in seen), we check whether both columns $1 and $2 are present in the array; if so, the line is printed.
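For readers less familiar with awk, the same two-pass logic can be sketched in Python. This is an illustrative equivalent, not part of the original answer; the function name `both_columns_listed` is made up, and the inputs are passed as lists of lines rather than read from the files grepThis and intoThat.

```python
def both_columns_listed(wanted_lines, data_lines):
    """Keep data lines whose first two fields both appear in wanted_lines."""
    # First pass: remember every line of the list (the role of seen[$0]).
    seen = {line.strip() for line in wanted_lines}
    kept = []
    # Second pass: test both columns, like ($1 in seen && $2 in seen).
    for line in data_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in seen and fields[1] in seen:
            kept.append(line.rstrip("\n"))
    return kept


wanted = ["a", "b", "c", "d"]
data = ["a b", "a h", "b c", "b d", "b m", "c b", "c d", "c f", "d h"]
print(both_columns_listed(wanted, data))
# ['a b', 'b c', 'b d', 'c b', 'c d']
```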

Split up line with arbitrary many groups

I have many files with many entries (one entry per line) which I have to filter through a sequence of greps and seds. The lines are of the form
a
x, y
u --> v, w
s --> p, q, r
One of the steps is splitting up the lines containing --> such that the left-hand side and each of the comma-separated entries on the right-hand side (of which there can be arbitrarily many) end up on different lines. I.e., the above lines should become:
a
x, y
u
v
w
s
p
q
r
Separating the left side from the right side is quickly done:
echo "u --> v, w" | sed 's/\(.\+\)\s*\-\->\s*\(.\+\)/\1\n\2/'
Gives me
u
v, w
But this seems to be a dead end in that I cannot then pipe this on to splitting on the comma, since that would also split the x, y.
So, I am wondering if there is a way to completely split up such lines in a sed command, or do I have to turn to, e.g., awk (or just go to Python)? It would be preferable to keep this a bash pipe sequence.
awk '/-->/ {gsub(/-->|,/,RS)}1' inputfile|column -t
a
x, y
u
v
w
s
p
q
r
Or, as Anubhav suggested, to avoid the pipe:
awk '/-->/ {gsub(/[ \t]*(-->|,)[ \t]*/ , ORS)} 1' inputfile
Using awk you can do this:
awk -F'[ \t]*-->[ \t]*' -v OFS='\n' '{gsub(/,[ \t]*/, OFS, $2)} 1' file
a
x, y
u
v
w
s
p
q
r
You can do this by creating a command group when you match -->. In this group, you replace --> with newline, print up to the newline, discard the portion you printed, then replace commas in the remainder:
#!/bin/sed -f
/\s*-->\s*/{
s//\n/
P
s/.*\n//
s/,\s*/\n/g
}
Results:
a
x, y
u
v
w
s
p
q
r
Alternatively, in GNU sed, you could use the T command to skip processing of the right-hand side unless you match and replace the -->:
#!/bin/sed -f
s/\s*-->\s*/\n/
Tend
P
s/.*\n//
s/,\s*/\n/g
:end
This produces the same output, as required.
I've assumed throughout that you don't want to split any commas on the left-hand side, so that
foo, bar --> baz
becomes
foo, bar
baz
If that's not the case (perhaps if you know there will be no comma to the left of -->), then you don't need P or s/.*\n//, and the script is as simple as
/\s*-->\s*/!b
s//\n/
s/,\s*/\n/g
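If the pipeline ever does move to Python, as the question considers, the same splitting rule can be sketched as follows. This is an illustration under the same assumption as the sed scripts (commas to the left of --> are left alone); the function name `split_arrow_lines` is made up for this example.

```python
import re


def split_arrow_lines(lines):
    """Split 'lhs --> x, y' lines into separate entries; pass others through."""
    out = []
    for line in lines:
        if "-->" not in line:
            out.append(line)  # lines without an arrow are kept as they are
            continue
        # Split once on the arrow, swallowing the whitespace around it.
        left, right = re.split(r"\s*-->\s*", line, maxsplit=1)
        out.append(left)
        # Only the right-hand side is split on commas.
        out.extend(re.split(r",\s*", right))
    return out


print(split_arrow_lines(["a", "x, y", "u --> v, w", "s --> p, q, r"]))
# ['a', 'x, y', 'u', 'v', 'w', 's', 'p', 'q', 'r']
```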

grep with regular expression to find two words when word b is AFTER word a in a sentence

I have a big text file, each line containing a sentence.
I want to use grep (or something similar in a shell script) to find sentences where word b occurs after word a, either immediately after it or with one or more words in between.
I don't want grep to return a sentence like this:
f g s b d a
because b is not after a but I want to return a sentence like
f g a d m s b f
because b is after a.
It is OK to return sentences where a is both after and before b:
s a s b s a s
I also don't want sentences with only a or b.
I just want the sentences where b is after a (something can be in the middle).
I can easily do it with Python but I want to use the beauty of bash.
Try this:
grep "a.*b" file
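Note that the pattern a.*b matches any "a" followed later on the line by any "b", including letters inside longer words. If a and b stand for whole words, one possible refinement is to add word boundaries. A minimal Python sketch of that idea, checked against the example sentences from the question (the helper name `b_after_a` is invented here):

```python
import re


def b_after_a(sentence, a="a", b="b"):
    """True if word b occurs somewhere after word a in the sentence."""
    # \b restricts the match to whole words; .* allows anything in between.
    pattern = rf"\b{re.escape(a)}\b.*\b{re.escape(b)}\b"
    return re.search(pattern, sentence) is not None


for sentence in ["f g s b d a", "f g a d m s b f", "s a s b s a s"]:
    print(sentence, "->", b_after_a(sentence))
# f g s b d a -> False
# f g a d m s b f -> True
# s a s b s a s -> True
```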

Efficient way to code a number within a range?

If I have something like:
value is between 1-1000
And if value is within 1-100, output A
within 101-200, output B
within 201-300, output C
within 301-400, output D
within 401-500, output E
else, output F
Can this be done more "efficiently" or better than having if statements for each one?
You could use a mapping between value and output:
outputs = [ A, B, C, D, E, F, F, F, F, F]
output = outputs[(int)((value - 1)/ 100)]
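The lookup above is pseudocode; a runnable version of the same idea might look like this in Python. It is a sketch that assumes the outputs are the single letters A to F and that values outside 1-1000 should be rejected; the function name `classify` is made up.

```python
def classify(value):
    """Map 1-100 to 'A', 101-200 to 'B', ..., 401-500 to 'E', the rest to 'F'."""
    if not 1 <= value <= 1000:
        raise ValueError("value must be between 1 and 1000")
    outputs = "ABCDEFFFFF"  # one entry per block of 100 values
    # Integer division turns 1-100 into index 0, 101-200 into index 1, etc.
    return outputs[(value - 1) // 100]


print([classify(v) for v in (1, 100, 101, 500, 501, 1000)])
# ['A', 'A', 'B', 'E', 'F', 'F']
```

Subtracting 1 before dividing is what keeps the upper edge of each range (100, 200, ...) in the right bucket.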

How to exchange lines in the same file?

I have a text like this (in rows):
A
B
C
D
E
F
and I'd like to swap line B with line D, and line C with line E, obtaining (in rows):
A
D
E
B
C
F
Is there a simple way to do it with bash?
You can use the mapfile builtin to read the entire file into an array of lines. Then in that array reorder however you want and write the array back out to a file.
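The read, reorder, write-back approach is not specific to bash. As an illustration only, here is the same sequence sketched in Python rather than with mapfile; the names `swap_blocks` and `rewrite_file` are made up, and the block positions are 0-based indexes chosen to match the example.

```python
from pathlib import Path


def swap_blocks(lines, first, second, length):
    """Swap two non-overlapping blocks of `length` lines (0-based starts)."""
    result = list(lines)
    # Both right-hand slices are taken from the untouched input list.
    result[first:first + length], result[second:second + length] = (
        lines[second:second + length],
        lines[first:first + length],
    )
    return result


def rewrite_file(path, first, second, length):
    """Read the whole file, swap the two blocks, and write it back."""
    target = Path(path)
    lines = target.read_text().splitlines(keepends=True)
    target.write_text("".join(swap_blocks(lines, first, second, length)))


print(swap_blocks(list("ABCDEF"), 1, 3, 2))
# ['A', 'D', 'E', 'B', 'C', 'F']
```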
