Windows SED command - simple search and replace without regex - windows

How should I use 'sed' command to find and replace the given word/words/sentence without considering any of them as the special character?
In other words, how can I treat the find and replace parameters as plain text?
In the following example I want to replace 'sagar' with '+sagar', so I have to give the following command:
sed "s/sagar/\\+sagar/g"
I know that \ should be escaped with another \, but I can't do this manipulation myself,
as there are so many special characters and their combinations.
I am going to take find and replace parameters as input from user screen.
I want to execute the sed from c# code.
Simply put, I do not want to use sed's regular expressions. I want my command text to be treated as plain text.
Is this possible?
If so how can I do it?

While there may be sed versions that have an option like --noregex_matching, most of them don't have that option. Because you're getting the search and replace input by prompting a user, your best bet is to scan the user input strings for reg-exp special characters and escape them as appropriate.
Also, will your users expect for example, their all caps search input to correctly match and replace a lower or mixed case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
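The bracket rewrite above is mechanical, so it can be generated rather than typed. A minimal Python sketch follows. The function name is made up for illustration, it only touches ASCII letters, and it assumes any regex special characters in the input have already been escaped:

```python
import string


def case_insensitive_pattern(text):
    """Rewrite each ASCII letter as a two-letter bracket expression."""
    parts = []
    for ch in text:
        if ch in string.ascii_letters:
            # 's' and 'S' both become '[Ss]'
            parts.append("[" + ch.upper() + ch.lower() + "]")
        else:
            # digits, punctuation and escape sequences pass through untouched
            parts.append(ch)
    return "".join(parts)


print(case_insensitive_pattern("sagar"))
```

Some seds also accept an I flag on the s command for case-insensitive matching, but that is an extension, so the bracket form is the portable one.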
Note that far fewer special characters are used on the replacement side: '&', meaning "the complete string that was matched", and the numbered replacement groups, like \1, \2, .... Given users who have no knowledge or expectation that they can use such characters, the likelihood of them using \1 in their replacement text is pretty low. More likely they will have a valid use for &, so you'll have to scan (at least) for that and replace it with \&. In a basic sed, that's about it. (There may be others in the latest GNU seds, or in some of the seds that have their genesis as PC tools.)
For a replacement string, you shouldn't have to escape the + char at all. You probably do have to escape \. Again, you can scan your user's "naive" input and add escape chars as needed.
Finally, if you're doing this for a "package" that will be distributed, and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around: some have their roots in Unix/Linux, and others, particularly versions of super-sed, (I'm pretty sure) got started as PC standalones and have a very different feature set.
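To make the "scan and escape" idea concrete, here is a rough sketch in Python (the asker would port the same logic to C#). The function names are hypothetical. It assumes basic regular expressions, a single-line input with no embedded newlines, and / as the s command delimiter:

```python
def escape_sed_pattern(text, delimiter="/"):
    """Escape characters that are special in a basic-regex search pattern."""
    specials = "\\.*[^$" + delimiter
    return "".join("\\" + ch if ch in specials else ch for ch in text)


def escape_sed_replacement(text, delimiter="/"):
    """Escape characters that are special on the replacement side."""
    specials = "\\&" + delimiter
    return "".join("\\" + ch if ch in specials else ch for ch in text)


def build_sed_script(search, replace):
    """Build an s/// script that treats both user inputs as plain text."""
    return "s/{}/{}/g".format(
        escape_sed_pattern(search), escape_sed_replacement(replace)
    )


print(build_sed_script("sagar", "+sagar"))
```

Passing the resulting script to sed as a single process argument, rather than through a shell command line, avoids a second layer of quoting on top of the regex escaping.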
IHTH.

Related

Replacing (escaping) characters in Groovy

For a gradle script, I am composing strings that will be used as command line for a subsequent gradle Test-task. One of the strings is the user's password, which eventually will be passed to the called (exec'ed) "java ..." call using the JVM's -D option, e.g. -Dpassword=foobar.
What complicates things here is, that this password can/should of course contain special characters, that may interfere with the use of the string as command line. In other words: I need to escape special characters (which is OS-specific). :-(
Now to my actual question:
I want to use the String.replaceAll method, i.e. replaceAll(list_of_special characters, EscapeCharacter + Ref_to_matched_character),
e.g. simplified something like replaceAll("[#$%^&]", "^$1")
'^' meaning the escape character and '$1' meaning the matched character here.
Is that possible, i.e. can one refer to the matched pattern in the second argument of replaceAll?
yes, it's possible
'a#b$c'.replaceAll('([#$%^&])', '^$1')
returns
a^#b^$c
Thanks for the responses and the reviews improving readability. Meanwhile I got my expression working. For those interested:
// handles the following: `~!@#$%^&*()_+-={}|[]\:;"'<>?,./
// the hyphen is escaped so that +-= is not read as a range (which would also match digits)
escaped = original.replaceAll('[~!@#\\$\\%\\^\\&\\*\\(\\)_\\+\\-={}\\|\\[\\]\\\\:;\"\\\'<>\\?,\\./]', '^$0') // for Windows - cmd.exe
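For comparison, the same idea can be written so that the character class is generated instead of escaped by hand, which avoids accidental ranges inside the brackets. This is a sketch in Python rather than Groovy. The names are invented for illustration, and whether a ^ prefix is sufficient for every cmd.exe quoting context is an assumption carried over from the snippet above:

```python
import re

# The characters listed in the comment above (without the backtick).
SPECIALS = "~!@#$%^&*()_+-={}|[]\\:;\"'<>?,./"

# re.escape() takes care of '-', ']', '^' and '\' inside the class.
_SPECIAL_CLASS = re.compile("[" + re.escape(SPECIALS) + "]")


def escape_for_cmd(text):
    """Prefix every listed special character with the cmd.exe escape '^'."""
    return _SPECIAL_CLASS.sub(lambda match: "^" + match.group(0), text)


print(escape_for_cmd("a#b$c"))
```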

Using sed to remove period at the end of string (zip code)

I have a file of addresses that I am attempting to scrub and I am using sed to get rid of unwanted characters and formatting. In this case, I have zip codes followed by a period:
Mr. John Doe
Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL
33487.
(for the time being, ignore the new lines; I am just focusing on the zip and period for now)
I want to remove the period (.) from the zip as my first step in cleaning this up. I tried to use sub strings in sed as follows (using "|" as a delimiter - it's easier for me to see):
sed 's|\([0-9]{4}\)\.|\1|g' test.txt
Unfortunately, it doesn't remove the period. It just prints it out as part of the sub string based on this post:
Replace period surrounded by characters with sed
A point in the right direction would be greatly appreciated.
You specified 4 digits {4} but have 5, and you have to escape the { and }. For example:
sed 's|\(^[0-9]\{5\}\).*|\1|g' test.txt
Notice that you also have a space after the dot, so you might want to trim everything following five digits but to be safe you might want to specify that they must be at start of line ^.
In my case, if I type info sed which is more complete than man sed, I find this:
'-r'
'--regexp-extended'
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that 'egrep' accepts; they
can be clearer because they usually have less backslashes, but are
a GNU extension and hence scripts that use them are not portable.
*Note Extended regular expressions: Extended regexps.
And under Appendix A Extended regular expressions you can read:
The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, braces ('{}'),
and '|'. While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them _to match a
literal character_. '|' is special here because '\|' is a GNU extension
- standard basic regular expressions do not provide its functionality.
Examples:
'abc?'
becomes 'abc\?' when using extended regular expressions. It
matches the literal string 'abc?'.
'c\+'
becomes 'c+' when using extended regular expressions. It matches
one or more 'c's.
'a\{3,\}'
becomes 'a{3,}' when using extended regular expressions. It
matches three or more 'a's.
'\(abc\)\{2,3\}'
becomes '(abc){2,3}' when using extended regular expressions. It
matches either 'abcabc' or 'abcabcabc'.
'\(abc*\)\1'
becomes '(abc*)\1' when using extended regular expressions.
Backreferences must still be escaped when using extended regular
expressions.
Basic Solution: Use a Range Atom to Handle Your Posted Input
An easy (but slightly naive) way to do this with your posted input is to look for:
start of line
followed by exactly 5 digits (a standard US ZIP Code)
followed by zero or more characters (e.g. a ZIP+4)
followed by zero or more non-period characters (don't match a street address)
followed by a literal period
and just replace the whole match with the captured part of the match. For example:
With BSD sed or without extended expressions:
sed 's/^\([[:digit:]]\{5\}[^.]*\)\./\1/'
With GNU sed and extended regular expressions:
sed -r 's/^([[:digit:]]{5}[^.]*)\./\1/'
Either way, given your posted input you end up with:
Mr. John Doe
Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL
33487
Advanced Solution: Handle ZIP Codes Properly
The main caveat is that the solution above works with your posted sample, but won't match if the ZIP Code is properly at the end of the last line of the address as it should be in a standardized USPS address. That's fine if you've got a custom format, but it will likely cause you problems with standardized or corrected addresses such as:
Mr. John Doe
12345 Exclusively Stuff, 186
Caravelle Drive, Ponte Vedra FL 33487.
The following will work with both your posted input and a more typical USPS address, but your mileage on other non-standard inputs may vary.
# More reliable, but much harder to read.
sed -r 's/([[:digit:]]{5}(-[[:digit:]]{4})?[[:space:]]*)\.[[:space:]]*$/\1/'
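If the clean-up ever moves out of sed, roughly the same pattern can be expressed with Python's re module. This is only a sketch of the expression above. The function name is made up, and it uses [ \t] instead of [[:space:]] so that line breaks are left alone when a whole file is processed at once:

```python
import re

# Five digits, an optional -NNNN suffix, then a period at the end of a line.
ZIP_TRAILING_PERIOD = re.compile(
    r"([0-9]{5}(?:-[0-9]{4})?[ \t]*)\.[ \t]*$", re.MULTILINE
)


def strip_zip_period(text):
    """Remove the period that follows a ZIP Code at the end of a line."""
    return ZIP_TRAILING_PERIOD.sub(r"\1", text)


sample = "Mr. John Doe\nCaravelle Drive, Ponte Vedra FL 33487.\n"
print(strip_zip_period(sample))
```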

Is there any character that is illegal in file paths on every OS?

Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" (and the underscore itself, since it serves as the escape character) as printf("_%02x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
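A minimal sketch of that scheme in Python, assuming names are encoded byte by byte as UTF-8 and that uppercase letters count as outside the whitelist. The function names are invented for illustration:

```python
import re

SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789")


def encode_name(name):
    """Escape every byte outside [a-z0-9], including '_', as _hh."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in SAFE else "_%02x" % byte)
    return "".join(out)


def decode_name(encoded):
    """Reverse encode_name by turning each _hh back into its byte."""
    raw = re.sub(
        rb"_([0-9a-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        encoded.encode("ascii"),
    )
    return raw.decode("utf-8")


def join_names(names):
    """Join encoded names with a double underscore."""
    return "__".join(encode_name(n) for n in names)


def split_names(joined):
    """Split on '__'; it can only occur as a separator, never inside a name."""
    return [decode_name(part) for part in joined.split("__")]


print(join_names(["hello.txt", "goodbye", ".."]))
```

Splitting left to right is safe because an escape is always an underscore followed by two hex digits, so two underscores in a row can only be a separator.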
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not very complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
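Here is a small sketch of the capturing version in Python, including the manual backslash removal. The function name is made up, and re.DOTALL is an added assumption so that an escaped newline is also consumed as one unit:

```python
import re

# Opening backtick, then runs of ordinary characters or backslash pairs,
# then the closing backtick. Only the inside is captured.
INLINE_CODE = re.compile(r"`([^`\\]*(?:\\.[^`\\]*)*)`", re.DOTALL)


def extract_inline_code(text):
    """Return the contents of each backtick pair with the escapes removed."""
    return [
        re.sub(r"\\(.)", r"\1", body, flags=re.DOTALL)
        for body in INLINE_CODE.findall(text)
    ]


print(extract_inline_code("hello `how\\` are` you"))
```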
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
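As a quick check of the triple-backtick variant, the same pattern can be tried in Python. This is only a sketch, and the helper name is hypothetical:

```python
import re

FENCE = "`" * 3

# A lone backtick inside the segment is allowed as long as it does not
# start a run of three.
TRIPLE = re.compile(FENCE + r"([^`]*(?:`(?!``)[^`]*)*)" + FENCE)


def extract_triple(text):
    """Return the contents of each triple-backtick segment."""
    return TRIPLE.findall(text)


print(extract_triple("use " + FENCE + "a ` b" + FENCE + " here"))
```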
Is this what you're after?
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right (do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important to decide which strategy works best for your case).

Colon/Asterisk as a filename delimiter?

I'm looking for a character to use as a filename delimiter (I'm storing multiple filenames in a plaintext string). Windows seems not to allow :, ?, *, <, >, ", |, / and \ in filenames. Obviously, \ and / can't be used, since they mean something within a path. Is there any reason why any of those others shouldn't be used? I'm just thinking that, similar to / or \, those other disallowed characters may have special meaning that I shouldn't assume won't be in path names. Of those other 7 characters, are any definitely safe or definitely unsafe to use for this purpose?
The characters : and " are also used in paths. Colon is the drive unit delimiter, and quotation marks are used when spaces are part of a folder or file name.
The characters * and ? are used as wildcards when searching for files.
The characters < and > are used for redirecting an application's input and output to and from a file.
The character | is used for piping output from one application into input of another application.
I would choose the pipe character for separating file names. It's not used in paths, and its shape has a natural separation quality to it.
An alternative could be to use XML in the string. There is a bit of overhead and some characters need encoding, but the advantage is that it can handle any characters and the format is self explanatory and well defined.
Windows uses the semicolon (;) as a filename delimiter. Look at the PATH environment variable: it is filled with ; between path elements.
(Also, in Python, os.path.pathsep is ";" on Windows, while it is ":" on Unix.)
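A short sketch of using that platform separator to join and split a list of paths in Python. The helper names are made up, and refusing paths that already contain the separator is a design choice of this sketch, not something the platform does for you:

```python
import os


def join_paths(paths):
    """Join paths with the platform's PATH separator."""
    for path in paths:
        if os.pathsep in path:
            # On Unix ':' is a legal filename character, so check first.
            raise ValueError("path contains the separator: %r" % path)
    return os.pathsep.join(paths)


def split_paths(joined):
    """Split a string produced by join_paths back into a list."""
    return joined.split(os.pathsep) if joined else []


print(split_paths(join_paths(["dir/a.txt", "dir/b.txt"])))
```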
I have used * in the past. The reason is portability to Linux/Unix. True, technically it can be used in filenames on those filesystems too. In practice, all common OSes use it as a wildcard, so it's quite uncommon in filenames. Also, people are not surprised if programs break when you put a * in a filename.
Why don't you use a character entered with an ALT key combination, like ‡ (Alt + 0135), as the delimiter?
It is actually possible to create files programmatically with every possible character except \. (At least, this was true at one time and it's possible that Windows has changed its policy since.) Naturally, files containing certain characters will be harder to work with than others.
What were you using to determine which characters Windows allows?
Update: The set of characters allowed by Windows is also determined by the underlying filesystem, and other factors. There is a blog entry on MSDN that explains this in more detail.
If all you need is the appearance of a colon, and will be creating it programmatically, why not make use of a Unicode character that just looks like a colon?
My first choice would be the Modifier Letter Colon (U+A789), as it is a typical LTR character and appears a lot like a colon. It is what I use when I need a full DateTime in the filename, such as file_2017-05-04_16꞉45꞉22_clientNo.jpg
I would stay away from characters like the Hebrew Punctuation Sof Pasuq (U+05C3), as it is an RTL character and may mess with how a system aligns the file name itself.
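For illustration, here is how such a timestamp could be produced in Python, along with a helper for looking up a candidate character's Unicode name and bidirectional class before committing to it. The function names are hypothetical:

```python
import unicodedata
from datetime import datetime

MODIFIER_LETTER_COLON = "\ua789"  # looks like ':' but is allowed on Windows


def timestamp_for_filename(moment):
    """Format a datetime with colon look-alikes instead of real colons."""
    stamp = moment.strftime("%Y-%m-%d_%H:%M:%S")
    return stamp.replace(":", MODIFIER_LETTER_COLON)


def describe(ch):
    """Return the Unicode name and bidirectional class of a character."""
    return unicodedata.name(ch), unicodedata.bidirectional(ch)


print("file_" + timestamp_for_filename(datetime(2017, 5, 4, 16, 45, 22)) + "_clientNo.jpg")
print(describe(MODIFIER_LETTER_COLON))
print(describe("\u05c3"))
```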

Resources