deselect text in vim, like grep -v - visual-studio

I would like to imitate the following pattern of searching in vim:
grep "\<[0-9]\>" * | grep -v "666"
I can highlight all numbers using
/\<[0-9]\>
but then how can I tell vim to remove from the highlighted text the ones that match the expression
/666
Can this be done in Visual Studio at least ?

You cannot sequentially filter the matches like in the shell, so you need to use advanced regular expression features to combine both into a single one.
Basically, you need to assert a non-match of 666 at the match position. That's achieved with the \@! atom (in other regular expression dialects, that's often written as (?!...)):
/\%(\d*666\d*\)\@!\<\d\+\>
Note: If you want to exclude only 666, but not 6666 etc., you need to specify \<666\> instead in the first part.
I've used \d instead of [0-9]; you can further strip down the \ use with the \v "very magic" modifier:
/\v(\d*666\d*)@!<\d+>
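For readers who want to try the idea outside of Vim, here is a minimal sketch of the same negative look-ahead in Python's re module. It assumes that (?!...) stands in for Vim's \@! and \b for \< and \>; the sample string is made up for illustration.

```python
import re

# (?!...) is the Python spelling of Vim's \@! (negative look-ahead),
# and \b plays the role of \< and \> here.
NOT_666 = re.compile(r"(?!\d*666\d*)\b\d+\b")

sample = "12 666 1666 7 6660 45"  # made-up test data
matches = NOT_666.findall(sample)
print(matches)  # numbers that do not contain 666
```

Numbers containing 666 anywhere are rejected because the look-ahead is checked at the start of each candidate number, before \d+ gets a chance to match.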

Of course, /666 doesn't match that expression.
Assuming, though, that you had e.g. \d\+ and wanted to exclude 666, you can use the negative lookahead:
\v((666)@!\d)+
This uses
\v for very magic (reducing the number of \ escapes)
\@! for "negative zero-width look-ahead assertion"

Related

grep wildcards issue ubuntu

I have an input file named test which looks like this
leonid sergeevich vinogradov
ilya alexandrovich svintsov
and when I use grep like this grep 'leonid*vinogradov' test it says nothing, but when I type grep 'leonid.*vinogradov' test it gives me the first string. What's the difference between * and .*? Because I see no difference between any number of any characters and any character followed by any number of any characters.
I use ubuntu 14.04.3.
* doesn't match any number of characters, like in a file glob. It is an operator, which indicates 0 or more matches of the previous character. The regular expression leonid*vinogradov would require a v to appear immediately after 0 or more ds. The . is the regular expression metacharacter representing any single character, so .* matches 0 or more arbitrary characters.
grep uses regex and .* matches 0 or more of any characters.
Whereas 'leonid*vinogradov' is also evaluated as a regex, and it means leoni followed by 0 or more of the letter d, hence your match fails.
grep uses regular expressions (regexp for short), not the wildcards you had in mind. In this case, "." means any character and "*" means any number (including zero) of the previous character, so ".*" means anything here.
Check the link, or google it; it's a powerful tool that you'll find worth knowing.
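The difference is easy to check interactively. The following is a small sketch using Python's re module, which treats * and . the same way grep does for these two patterns; the variable names are only for illustration.

```python
import re

line = "leonid sergeevich vinogradov"

# '*' repeats the preceding character (here: d), it is not a glob.
star_only = re.search(r"leonid*vinogradov", line)
# '.*' is "any character, zero or more times".
dot_star = re.search(r"leonid.*vinogradov", line)

print(star_only)          # no match object, the space after "leonid" is not a 'v'
print(dot_star.group(0))  # the whole line

# What 'leonid*' really accepts: "leoni" followed by any number of d's.
accepted = [s for s in ("leoni", "leonid", "leoniddd", "leonix")
            if re.fullmatch(r"leonid*", s)]
print(accepted)
```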

Retain escape sequences and color when using sed

I have output out of a test reporter that returns nicely colored results, and miscellaneous garbage I want to get rid of. I tried using sed via:
karma start tests/karma.conf.js | sed 's|var.*browserify||'
...which removes the junk, but also kills the colored results. How can I retain them?
Here's an example of the raw output before sed:
^[[1A^[[2KERROR: 'Unhandled promise rejection' /var/folders/xs/wmmjbz4s6mdgcqynwn46qtmr0000gn/T/799ac09c665c85beb20f6d99be27c1cf.browserify?c65c8d7afc187ee2ed8307a171bc8e1511bfb40b:91625:48)
.* will match everything, including color codes.
If you don't want to match them, use a more specific regex, e.g. a character range.
For the given example,
/var/folders/xs/wmmjbz4s6mdgcqynwn46qtmr0000gn/T/799ac09c665c85beb20f6d99be27c1cf.browserify
a more-specific pattern might be one of these, using character classes:
sed 's|var[^[:cntrl:]]*browserify||'
sed 's|var[[:alnum:]./]*browserify||'
I would use the latter, since it would eliminate the possibility of skipping over a complete pathname (if more than one were given on a line).
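To see why the narrower pattern keeps the colors, here is a rough sketch in Python. The input line is invented (two shortened paths with a colored word between them), and since Python's re has no [[:alnum:]] class, the explicit range A-Za-z0-9 is used as a stand-in. Note that re.sub replaces every occurrence, like sed's g flag.

```python
import re

ESC = "\x1b"  # the escape character shown as ^[ in the raw output
# Hypothetical line: two paths with a colored word between them.
line = f"/var/tmp/a.browserify {ESC}[31mFAILED{ESC}[0m /var/tmp/b.browserify"

# Greedy .* runs from the first "var" to the last "browserify",
# swallowing the color codes in between.
greedy = re.sub(r"var.*browserify", "", line)

# A character class that cannot match ESC stops at each path.
specific = re.sub(r"var[A-Za-z0-9./]*browserify", "", line)

print(repr(greedy))
print(repr(specific))
```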

How to split a file containing non-ascii characters into words, in bash?

For example, I have a file with normal text, like:
"Word1 Kuͦn, buͤtten; word4:"
I want to get a file with 1 word per line, keeping the punctuation, and ordered:
,
:
;
Word1
Kuͦn
buͤtten
word4
The code I use:
grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt
This code works almost perfectly, except for one thing: it splits diacritical characters apart from the letters they belong to, as if they were separate words:
,
:
;
Word1
Ku
ͦ
n
bu
ͤ
tten
word4
The letters uͦ, uͤ and others with the same diacritics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?
Edit:
locale output:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Unfortunately, U+366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, unicode category Mn, which generally maps to the Posix ctype cntrl.
Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. Gnu grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have Gnu grep, you're in luck.
To enable "perl-like" regular expressions, you need to invoke grep with the -P option (or use the separate pcregrep utility). However, that is not quite enough because by default grep will use an 8-bit encoding even if the locale specifies a UTF-8 encoding. So you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.
Putting all that together, you might end up with something like the following:
grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'
-P                  patterns are "perl-compatible"
-o                  output each substring matched
(*UTF8)             If the pattern starts with exactly this sequence, pcre is put into UTF-8 mode.
\p{...}             Select a character in a specified Unicode general category
\P{...}             Select a character not in a specified Unicode general category
\p{L}               General category L: letters
\p{N}               General category N: numbers
\p{M}               General category M: combining marks
\p{P}               General category P: punctuation
\p{S}               General category S: symbols
\p{L}\p{M}*         A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N}   ... or a number
More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Report 18 on regular expression matching. But beware that the syntax described in that TR is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].
Documentation about pcre is probably available on your system (man pcre); if not, have a link on me.
If you don't have Gnu grep or in the unlikely case that your version was compiled without pcre support, you might be able to use perl, python or other languages with regex capabilities. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:
perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'
Here, -CIO tells Perl that input and output are in UTF-8, and -lne is a standard incantation which means "automatically output newlines after a print (l); loop through every line of the input (n), executing the following in the loop (e)".
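Since python was mentioned as a possible fallback: its built-in re module does not understand \p{...}, so one option is to walk the text and classify characters with the standard unicodedata module. The function below is only a sketch of the same grouping rule (letters with their combining marks, or numbers, form words; punctuation and symbols stand alone); the name tokenize is made up for this example.

```python
import unicodedata

def tokenize(text):
    """Split text roughly like (\\p{L}\\p{M}*|\\p{N})+|[\\p{P}\\p{S}]."""
    tokens = []
    word = []            # characters of the word being built
    after_letter = False  # True while a combining mark may attach
    for ch in text:
        cat = unicodedata.category(ch)[0]  # first letter: L, M, N, P, S, Z, C
        if cat in ("L", "N"):
            word.append(ch)
            after_letter = (cat == "L")
        elif cat == "M" and after_letter:
            word.append(ch)  # keep the mark with its base letter
        else:
            if word:
                tokens.append("".join(word))
                word = []
            after_letter = False
            if cat in ("P", "S"):
                tokens.append(ch)  # punctuation and symbols stand alone
    if word:
        tokens.append("".join(word))
    return tokens

# \u0366 and \u0364 are the combining o and e from the question.
print(tokenize("Word1 Ku\u0366n, bu\u0364tten; word4:"))
```

Sorting the resulting list case-insensitively (for example with sorted(tokens, key=str.casefold)) would then play the role of sort -f.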

How to display the non-whitespace character count of a visual selection in Vim?

I want to count the characters without whitespace of a visual selection.
Intuitively, I tried the following
:'<,'>w !tr -d [:blank:] | wc -m
But vim does not like it.
This is possible with the following substitute command:
:'<,'>s/\%V\S//gn
The two magical ingredients are
the n flag of the substitute command. What it does is
Report the number of matches, do not actually substitute. (...) Useful to count items.
See :h :s_flags, and check out :h count-items, too.
the zero-width atom \%V. It matches only inside the Visual selection. As a zero-width match it makes an assertion about the following atom \S "non-space", which is to match only when inside the Visual selection. See :h /\%V.
The whole command thus substitutes :s nothing // for every non-whitespace character \S inside the Visual selection \%V, globally g – only that it doesn't actually carry out any substitutions but instead reports how many times it would have!
In order to count the non-whitespace characters within a visual selection in vim, you could do a
:'<,'>s/\S/&/g
Vim will then tell how many times it replaced non-whitespace characters (\S) with itself (&), that is without actually changing the buffer.
You must quote special characters for the shell, and it is better to use [:space:] because it also deletes the newline character. It should be:
:'<,'>w !tr -d '[:space:]' | wc -m
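As a cross-check of what these commands count, here is a tiny Python sketch that applies the same \S idea to a string; the sample text stands in for a visual selection and is made up.

```python
import re

# Stand-in for the selected text: spaces, a tab and newlines included.
selection = "foo bar\n\tbaz  qux\n"

# \S matches one non-whitespace character, as in the Vim pattern.
count = len(re.findall(r"\S", selection))
print(count)
```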

Why does sed not replace overlapping patterns

I have a database unload file with fields separated by the <TAB> character. I am running this file through sed to replace any occurrences of <TAB><TAB> with <TAB>\N<TAB>. This is so that when the file is loaded into MySQL the \N is interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work, or do I need to loop?
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.
Not dissimilar to the perl solution, this works for me using pure sed
With @Robin A. Meade's improvement
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
Explanation
:repeat is a label, used for branch commands, similar to batch
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means if the "s" command did any replaces, then goto the label repeat, else it goes onto the next line and starts over again.
So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
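The same "repeat until nothing changes" idea can be sketched in Python, whose re.sub also never rescans text it has just replaced. This is an illustration only, using the \N marker from the question; the function name fill_empty_fields is made up.

```python
import re

MARK = "\t\\N\t"  # tab, backslash, N, tab

def fill_empty_fields(line):
    """Repeat the global replace until a pass changes nothing (like 't repeat')."""
    while True:
        # A function replacement avoids re.sub's own escape processing.
        new_line = re.sub("\t\t", lambda _m: MARK, line)
        if new_line == line:
            return line
        line = new_line

one_pass = re.sub("\t\t", lambda _m: MARK, "a\t\t\tb")
print(repr(one_pass))                        # one empty field is still left
print(repr(fill_empty_fields("a\t\t\tb")))   # both empty fields filled
```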
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
Explanation
:repeat is a label, used for branch commands, similar to batch
/\t\t/ means match the pattern 2 tabs. If the pattern is matched, the command following the second / is executed.
{} - In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch) the label repeat
Short version
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
MacOS
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
Tabs need to be literal in BSD sed
Newlines need to be both literal and escaped at the same time, hence the single backslash (that's \\ before it is processed by the $, making it a single literal backslash) plus the \n which becomes an actual newline
Both label names (:r) and branch commands (b r when not the end of the expression) must end in a newline. Special characters like semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.
I know you want sed, but sed doesn't like this at all, it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>
As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
Right, even with /g, sed will not match the text it replaced again. Thus, it's read <TAB><TAB> and output <TAB>\N<TAB> and then reads the next thing in from the input stream. See http://www.grymoire.com/Unix/Sed.html#uh-7
In a regex language that supports lookaheads, you can get around this with a lookahead.
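For example, here is a sketch in Python (sed itself has no lookahead). The pattern consumes only the first tab and merely peeks at the second, so that second tab is still available to start the next match. The function name is made up for illustration.

```python
import re

def fill_empty_fields_lookahead(line):
    """Append \\N after every tab that is directly followed by another tab."""
    # (?=\t) checks for the next tab without consuming it.
    return re.sub(r"\t(?=\t)", lambda _m: "\t\\N", line)

print(repr(fill_empty_fields_lookahead("a\t\t\tb")))
```

Because nothing overlaps any more, a single pass is enough and no loop is needed.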
Well, sed simply works as designed. The input line is scanned once, not multiple times. Maybe it helps to look at the consequences if sed rescanned the input line by default to deal with overlapping patterns: in that case even simple substitutions would work quite differently (some might say counter-intuitively), e.g.
s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise
There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.

Resources