Unix: Optimized command for substituting words in a large file - bash

This question is not related to any code issue. Just need your suggestions.
We have a file which is ~100 GB, and we are applying sed to substitute a few parameters.
The process is taking a long time and eating up CPU as well.
Would replacing sed with awk, tr, perl, or any other Unix utility help in this scenario?
Note:
Any suggestion other than the time command is welcome.

You can do a couple of things to speed it up:
use fixed pattern matching instead of regexes wherever you can
run sed in the C locale, for example LANG=C sed '...'
These two are likely to help a lot; anything else, even switching tools, will bring only minor improvements.
About LANG=C: normally the matching is done in whatever encoding your environment is set to, which is likely UTF-8, and that causes additional lookups of UTF-8 characters. If your patterns use just ASCII, then definitely go for LANG=C.
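For instance, a minimal sketch of that invocation (the pattern and file names are placeholders for your real ones):
# Byte-wise matching in the C locale; only safe when the patterns
# and the replacement text are plain ASCII.
LANG=C sed 's/old_param/new_param/g' hugefile.txt > hugefile.out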
Other things that you can try:
if you have to use regexes, then use the longest fixed character strings you can - this allows the regex engine to skip non-matching parts of the file faster (it can skip bigger chunks)
avoid line-by-line processing if possible - then the regex engine does not have to spend time looking for newline characters

Try different AWKs: mawk has been particularly fast for me.
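For comparison, the same substitution as a rough mawk sketch (placeholder pattern and file names again):
# gsub() does a global substitution on each line, analogous to sed 's/old/new/g'.
mawk '{ gsub(/old_param/, "new_param"); print }' hugefile.txt > hugefile.out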

Very long lines - a character-based (not line-based) grep tool for Windows

Is there a grep-like tool for Windows where I can restrict the number of characters it outputs from a line in which a searched-for pattern is found?
One of the upstream software systems generates huge text files which we then feed as the input to our system.
Sometimes the input files get corrupted and I need to do a quick textual search to find whether particular bits of data are missing. To make it even worse, the input file is just one very, very long line of text - and when I use grep or findstr, the result of the search is a huge chunk of text.
I am wondering: how can I limit the number of characters grep shows before/after the pattern I searched for?
Cheers.
Two things spring to my mind:
Call grep with the --only-matching option so that only the text that matches is emitted. Depending on your regex, this may or may not help.
Write a very simple executable, call it trunc, which reads from stdin line by line and outputs the first n characters to stdout. Then simply pipe the output from grep to trunc.
The latter option is relatively simple. If you don't want to go the whole hog and produce a proper native exe, it could quite easily be done with a Perl/Python/Ruby etc. script.
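Sketches of both suggestions with GNU grep (PATTERN, input.txt, and the character limits are placeholders; the Perl one-liner stands in for a native trunc executable):
# Option 1: emit only the matched text itself.
grep --only-matching 'PATTERN' input.txt
# Option 2 (a common extra trick, not from the answer above): fold up to
# 20 characters of context on each side into the pattern itself.
grep -oE '.{0,20}PATTERN.{0,20}' input.txt
# Option 3: a trunc-style filter that prints only the first 200
# characters of each matching line.
grep 'PATTERN' input.txt | perl -ne 'print substr($_, 0, 200), "\n"'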

Delimiter for metadata in a Windows file name

I'm working on maintenance of an application that transfers a file to another system and uses a structured filename to carry metadata, including a language code. The current app uses a two-character language code and a dash/hyphen as the delimiter.
Ex. Canada-EN-ProdName-ProdCode.txt
I'm converting it to use IETF language codes, so the dash delimiter won't do and needs a replacement. I'm trying to choose a delimiter that avoids future errors and am considering the tilde ~.
Ex. Canada~en-GB~ProdName~ProdCode.txt
This will be used only on Windows Server 2003+ systems. I certainly didn't come up with this scheme of parsing a filename to get metadata. Unfortunately, I can't include the metadata in the file itself, and the destination system expects the language code in IETF format, with the dash.
Any thoughts on potential issues with using the tilde in the filename, or perhaps a better character to use? I'm just looking for a second opinion in case I'm overlooking a possible failure. I believe Windows uses the tilde when shortening a long filename to 8.3 format, but I don't see that as an issue here, as these OSes can handle long filenames.
The tilde is probably fine, but what's wrong with the good old underscore _? It has no special meaning on either Windows or Unix, and it makes names that are relatively easy to read. If there are no other special considerations, I would avoid the tilde solely out of paranoia, since Windows does sometimes use it as a special character, as you mentioned.
For anyone reading this question, I would strongly recommend anything but the tilde in the file name - or at least careful testing for speed problems in any .NET path handling involved.
I used the tilde as a file name delimiter some time ago. I couldn't understand why simply getting a list of files from the folders was taking so long. It was a number of years later (having written a lot of speed-up code that gave marginal benefit) that I discovered there is a problem (with DirectoryInfo(path).Name in .NET, at least) where the mere existence of the tilde forces the underlying code to jump through a lot of hoops.
The slowdown was substantial (it was over a network, so for a fair while I had assumed it was a bandwidth/network issue).
I understand this is a legacy overhang from the days when alternative short (8.3) versions of filenames could be used for Windows files.
I am now stuck with the tilde in these file names, but given that the problem lay in some of the .NET path functions (I don't know whether it still does), I could work around it by spotting the tilde and computing my own answers rather than passing it through.
If in any doubt, just run speed tests with and without the tilde in filenames for, say, just 500-1,000 files.
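A rough shape for such a test in a Unix-like shell is sketched below; note it is only illustrative, since reproducing the slowdown described above would require driving the same .NET calls (e.g. DirectoryInfo) on Windows:
# Create 1,000 files with and without tildes in their names,
# then time a plain listing of each directory.
mkdir -p with_tilde without_tilde
for i in $(seq 1 1000); do
    touch "with_tilde/file~$i.txt" "without_tilde/file_$i.txt"
done
time ls with_tilde > /dev/null
time ls without_tilde > /dev/null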

Convert plain text to LaTeX code programmatically

I'd like to take some user-input text and quickly parse it to produce some LaTeX code. At the moment, I'm replacing % with \% and \n with \n\n, but I'm wondering whether there are other replacements I should be making in the conversion from plain text to LaTeX.
I'm not super worried about safety here (can you even write malicious LaTeX code?), as this should only be used by users to convert their own text into LaTeX, so they should probably be allowed to use their own LaTeX markup in the pre-converted text; but I'd like to make sure the output doesn't include accidental LaTeX commands if possible. If there's a good library for such a conversion, I'd take a look.
Apparently, the following characters
\ { } $ ^ _ % ~ # &
are special in LaTeX, so you should make sure to escape them (prefixing with a backslash will do for some of them; see Thomas' answer for the special cases) or tell your users not to use them unless they deliberately want to use LaTeX commands (or a mix of both, depending on the character).
Some additional pitfalls:
Not every line break in the text might be intended as a new paragraph.
If your users use a language other than English (or Latin), you will need to \usepackage something that deals with the encoding (like utf8) or convert the characters yourself (e.g. ä -> \"a).
As dmckee points out, quotes also need to be treated separately.
EDIT: Since this has become the accepted answer, I also added the points raised in the other answers, so this is now a summary.
As Heinzi said, the following need attention:
\ { } $ ^ _ % ~ # &
Most can be escaped with a backslash, but \ becomes \textbackslash and ~ becomes \textasciitilde.
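A rough sed sketch of that escaping (ordering matters: the backslash rule must run before the rules that introduce new backslashes; the caret is mapped to \textasciicircum here because a bare \^ acts as an accent command):
# 1. \ -> \textbackslash   (first, before new backslashes appear)
# 2. { } $ _ % # &          -> prefix each with a backslash
# 3. ^ -> \textasciicircum
# 4. ~ -> \textasciitilde
# (TeX gobbles the space after a command name, so "a\b" round-trips;
# a space immediately after an original \ or ~ would be lost.)
sed -e 's/\\/\\textbackslash /g' \
    -e 's/\([{}$_%#&]\)/\\\1/g' \
    -e 's/\^/\\textasciicircum /g' \
    -e 's/~/\\textasciitilde /g' input.txt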
I think you might want to leave line breaks alone. LaTeX handles these in exactly the same way as many content management systems; many people have come to expect that "double line break" = "paragraph break". Heck, even stackoverflow itself works that way.
(You cannot write malicious LaTeX code; everything that happens inside LaTeX stays inside LaTeX. Unless you explicitly enable write18 when running latex, but it's disabled by default.)
Heinzi has already shown most of the basic characters that need to be escaped, but the hard part here is ensuring that the quoting comes out right.
She said "He didn't do it".
needs to be converted to
She said ``He didn't do it''.
which looks easy in this trivial case, but is full of gotchas that require careful handling. For modest-size texts, I generally use a naive substitution written in sed and diddle the results by hand. Things are both easier and harder if your "plain text" uses curly quotes.
Here "naive quote substitution" means that quotes followed by word characters are replaced by (one or two, as appropriate) backticks, and all others are replaced by (one or two) single quotes ('). That catches most cases in prose, but you will have to clean up all the triple-quote cases by hand.
Another possible solution is to make all "special" characters into ordinary ones before inserting the user's text. That might avoid many headaches, but might also create new ones...
You can do this by changing the catcode of the character. The TeX Wikibook knows more.
\catcode`\$=12
will turn $ into an ordinary character. However, for some reason some characters don't come out as you'd expect. \ becomes a double open quote, { becomes a dash... and redefining } inside a group ({...}) makes TeX choke entirely.
Long story short: only recommended if you know what you're doing.

How do you convert character case in UNIX accurately? (assuming i18n)

I'm trying to get a feel for how to manipulate characters and character sets in UNIX accurately, given the existence of differing locales - and to do so without requiring special tools outside of standard UNIX items.
My research has shown me the problem of the German sharp-s character: upper-casing turns the one character into two (ß -> SS) - and other problems. Using tr is apparently a very bad idea. The only alternative I see is this:
echo StUfF | perl -ne 'print lc($_);'
but I'm not certain that will work, and it requires Perl - not a bad requirement necessarily, but a very big hammer...
What about awk and grep and sed and ...? That, more or less, is my question: how can I be sure that text will be lower-cased in every locale?
Perl's lc/uc work fine for most languages, but they won't handle Turkish correctly; see this bug report of mine for details. If you don't need to worry about Turkish, though, Perl is good to go.
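For example, a minimal sketch (note the single quotes - with double quotes the shell would expand $_ itself - and -CS, which tells Perl to treat stdin/stdout as UTF-8; the sample string is arbitrary):
# Unicode-aware lowercasing; assumes the input really is UTF-8.
# Works for most locales - Turkish dotted/dotless i is the known exception.
echo 'StUfF ÄÖÜ' | perl -CS -ne 'print lc($_);'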
You can't be sure that text will be handled correctly in every locale. That's simply not possible; there are always some errors in software libraries in the implementation of i18n-related stuff.
If you're not afraid of using C++ or Java, you may take a look at ICU, which implements a broad set of collation, normalization, etc. rules.

What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images from the web into different folders. I want to create a quick-and-dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include URLs, folder paths, filenames and some custom messages.
So are there any characters that cannot be used in the first three? That would be the obvious choice for a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters don't appear in text files. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your files, e.g. #s#.
Tab is the generally preferred choice, though.
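For instance, a sketch of consuming such a tab-delimited batch file in the shell (the field names and batch.tsv are made up for illustration):
# Each line: url <TAB> folder <TAB> filename <TAB> message
# (consecutive tabs collapse under whitespace IFS, so truly empty
# fields would need extra care).
while IFS=$'\t' read -r url folder filename message; do
    echo "would fetch $url into $folder/$filename ($message)"
done < batch.tsv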
Why not just use something that exists already? There are one or two choices: perl, python, ruby, bash, sh, csh, Groovy, ECMAScript, heaven forbid even Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of colliding with the contents of any variable you may have (which precludes #, /, :, etc.). The comma (,) looks good to me (unless your custom messages contain a few), or < and > (subject to the previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty that might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|' - it's one of the rarest characters.
How about String.fromCharCode(1)?
