Does Perl's /m regex modifier match differently on Windows? - windows

The following Perl statements behave identically on Unixish machines. Do they behave differently on Windows? If yes, is it because of the magic \n?
split m/\015\012/ms, $http_msg;
split m/\015\012/s, $http_msg;
I got a failure on one of my CPAN modules from a Win32 smoke tester. It looks like it's an \r\n vs \n issue. One change I made recently was to add //m to my regexes.

For these regexes:
m/\015\012/ms
m/\015\012/s
Both /m and /s are meaningless.
/s: makes . match \n too.
Your regex doesn't contain .
/m: makes ^ and $ match next to embedded \n in the string.
Your regex contains no ^ nor $, or their synonyms.
What is possible is indeed if your input handle (socket?) works in text mode, the \r (\015) characters will have been deleted on Windows.
So, what to do? I suggest making the \015 characters optional, and split against
/\015?\012/
No need for /m, /s or even the leading m//. Those are just cargo cult.

There is no magic \n. Both \n and \r always mean exactly one character, and on all ASCII-based platforms that is \cJ and \cM respectively. (The exceptions are EBCDIC platforms (for obvious reasons) and MacOS Classic (where \n and \r both mean \cM).)
The magic that happens on Windows is that when doing I/O through a file handle that is marked as being in text mode, \r\n is translated to \n upon reading and vice versa upon writing. (Also, \cZ is taken to mean end-of-file – surprise!) This is done at the C runtime library layer.
You need to binmode your socket to fix that.
You should also remove the /s and /m modifiers from your pattern: since you do not use the meta-characters whose behaviour they modify (. and the ^/$ pair, respectively), they do nothing – cargo cult.

Why did you add the /m? Are you trying to split on line? To do that with /m you need to use either ^ or $ in the regex:
my #lines = split /^/m, $big_string;
However, if you want to treat a big string as lines, just open a filehandle on a reference to the scalar:
open my $string_fh, '<', \ $big_string;
while( <$string_fh> ) {
... process a line
}

Related

How to replace newline in csv with fart.exe at Windows command-line?

In FART.exe find/replace tool, i’m having some trouble with newline char:
> set lf=^& echo.
> fart myfile.csv lf "],["
Replaced 0 occurence(s) in 0 file(s).
Nothing gets replaced. What's the correct (and simplest) way to do this?
The line-terminators in the file might be CarriageReturn + LineFeed, not sure. How to check for both?
(PS, fart is the fastest replace tool i've tested on Windows. Tested much faster than repl.bat, jrepl.bat, findrepl.bat, sfk.ext, and powershell.)
edited to adapt to comments
fart includes a switch (-C or --c-style) to indicate that input/output strings contain C-style extended characters. In this case quotes are not needed around the search or replacement strings as there are no special characters in the command line:
fart -C myfile.csv \r\n ],[

sed / Batch / Windows: Prevent changig Backslash to slash

I have a variable with a path, like this:
SET "somevar=D:\tree\path\nonsens\oink.txt"
And I have a file, where somethink like this is written
VAR=moresonsense
Now I want to replace the word morenonsense to D:\tree\path\nonsens\oink.txt. This should be the result
VAR=D:\tree\path\nonsens\oink.txt
For this, I am using the tool sed for windows. But using sed in windows gives me the following:
VAR=D: ree/path/nonsens/oink.txt
The spaces between the colon and ree is a tab. I thought, I could fix it with the following line before calling sed:
SET "somevar=%somevar:\\=\\\\%"
But no, this line is not working. So I have some questions:
Is there a possibility, to prevent sed from changing \t to a tab and prevent changing two backslashed \ to a slash /?
Is there another easy way to replace a string with another string within a file with BATCH?
Does someone has another idea how to resolve this problem?
You should not \-escape the \ instances in the variable expansion; use the following:
SET "somevar=%somevar:\=\\%"
I don't know whether that solves all your problems, but SET "somevar=%somevar:\\=\\\\%" definitely does not work as intended, because it'll only match two consecutive \ chars in the input, resulting in a no-op with your input.

Removing diacritical marks from a Greek text in an automatic way

I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has the diacritic removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But what is there any way to automatize the operation of converting the line? I heard about iconv [2] but didn't manage to achieve the desired conversion using it. I'd best like to use a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicoded denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
I'm not so familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics)
However I went through the vowels and found out which combined with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list as a file and passed it to this sed
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed. It takes each of the options and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
Regarding transliterating the Greek: the image from your post is intended to help the user type in Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations. e.g. β is most often transliterated as v. ψ is ps. φ is ph, etc.

Absolute path in perl's copy command gone wrong?

This is my very simple code, which isn't working, for some reason I can't figure out.
#!/usr/bin/perl
use File::Copy;
$old = "car_lexusisf_gray_30inclination_000azimuth.png";
$new = "C:\Users\Lenovo\Documents\mycomp\simulation\cars\zzzorganizedbyviews\00inclination_000azimuth\lexuscopy.png";
copy ($old, $new) or die "File cannot be copied.";
I get the error that the file can't be copied.
I know there's nothing wrong with the copy command because if I set the value of $new to something simple without a path, it works. But what is wrong in the representation of the path as I've written it above? If I copy and past it into the address bar of windows explorer, it reaches that folder fine.
Tip: print out the paths before you perform the copy. You'll see this:
C:SERSenovodocumentsmycompsimulationrszzzorganizedbyviewsinclination_000azimuthexuscopy.png
Not what we wanted. The backslash is an escape character in Perl, which needs to be escaped itself. If the backslash sequence does not form a valid escape, then it's silently ignored. With escaped backslashes, your string would look like:
"C:\\Users\\Lenovo\\Documents\\mycomp\\simulation\\cars\\zzzorganizedbyviews\\00inclination_000azimuth\\lexuscopy.png";
or just use forward slashes instead – in most cases, Unix-style paths work fine on Windows too.
Here is a list of escapes you accidentally used:
\U uppercases the rest
\L lowercases the rest
\ca is a control character (ASCII 1, the start of heading)
\00 is an octal character, here the NUL byte
\l lowercases the next character.
If no interpolation is intended, use single quotes instead of double quotes.

Why does sed not replace overlapping patterns

I have a database unload file with field separated with the <TAB> character. I am running this file through sed to replace any occurences of <TAB><TAB> with <TAB>\N<TAB>. This is so that when the file is loaded into MySQL the \N in interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work or do I need to loop.
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.
Not dissimilar to the perl solution, this works for me using pure sed
With #Robin A. Meade improvement
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
Explanation
:repeat is a label, used for branch commands, similar to batch
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means if the "s" command did any replaces, then goto the label repeat, else it goes onto the next line and starts over again.
So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As #thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
Explanation
:repeat is a label, used for branch commands, similar to batch
/\t\t/ means match the pattern 2 tabs. If the pattern it matched, the command following the second / is executed.
{} - In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch) the label repeat
Short version
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
MacOS
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
Tabs need to be literal in BSD sed
Newlines need to be both literal and escaped at the same time, hence the single slash (that's \ before it is processed by the $, making it a single literal slash ) plus the \n which becomes an actual newline
Both label names (:r) and branch commands (b r when not the end of the expression) must end in a newline. Special characters like semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.
I know you want sed, but sed doesn't like this at all, it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>
As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
Right, even with /g, sed will not match the text it replaced again. Thus, it's read <TAB><TAB> and output <TAB>\N<TAB> and then reads the next thing in from the input stream. See http://www.grymoire.com/Unix/Sed.html#uh-7
In a regex language that supports lookaheads, you can get around this with a lookahead.
Well, sed simply works as designed. The input line is scanned once, not multiple times. Maybe it helps to look at the consequences if sed used rescanning the input line to deal with overlapping patterns by default: in this case even simple substitutions would work quite differently--some might say counter-intuitively--, e.g.
s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise
There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.

Resources