Understanding LC_ALL=C and its implications for standard English characters - utf-8

Forgive me for the clumsy way I'm approaching this question; everything I've learnt so far on the topic of character encoding has been in the last few hours and I'm aware I'm out of my depth. This may be answered elsewhere on the site, such as in my linked questions, but if it has, those answers are too dense for me to understand exactly what's being concluded in them.
I often need to grep through folders of excessively large text files (totalling more than 100GB). I've read about how using LC_ALL=C can speed this up considerably, but I want to be sure that doing so won't compromise the accuracy of my searches.
The files are old and have passed through many different online sources, so are likely to contain a jumble of characters from many different encodings, including UTF-8. (As an aside, is it possible for a single file to contain characters from multiple encodings?)
The bulk of what concerns me is this: if I want to search for a given b in my data, can I expect every letter b that's present in the data to be encoded as ASCII, or can the same letter also be encoded as UTF-8?
Or to put it another way, are ASCII characters always and exclusively ASCII? If even standard English characters can be encoded as UTF-8, and using LC_ALL=C grep would disregard all UTF-8 characters, then this would have the implication that my searches would miss search terms that are not in ASCII, which would obviously not be the behaviour that I want, and would be a considerable obstacle to adopting LC_ALL=C for grep.

About understanding UTF-8 vs ASCII, the following are very good
http://kunststube.net/encoding/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
As for the difference in grep time on UTF-8 files with a small number of non-ASCII characters: there is basically no difference between using LC_ALL=C or LANG=C and the standard LANG=en_US.UTF-8 or similar.
Test performed on Cygwin 64 bit, repeating 1000 times the search on 20GB of text:
$ time for i in $(seq 1000) ; do grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.289s
user 0m7.813s
sys 0m31.635s
$ time for i in $(seq 1000) ; do LC_ALL=C grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.027s
user 0m7.497s
sys 0m31.010s
$ ls -sh wia-*
10G wia-1024.log 160M wia-16.log 2.5G wia-256.log 40M wia-4.log 639M wia-64.log
1.3G wia-128.log 20M wia-2.log 320M wia-32.log 5.0G wia-512.log 80M wia-8.log
The difference is within the run-to-run variation, which was 53-55 seconds in both cases.
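On the accuracy half of the question, a quick check may help. UTF-8 encodes every ASCII character as the same single byte that ASCII uses, and every byte of a multi-byte UTF-8 character has its high bit set, so a byte-wise search for an ASCII string should find the same occurrences whatever mix of encodings the file contains. A minimal sketch; the sample function and its three lines are made up for illustration:

```sh
# "b" is the single byte 0x62 in ASCII, Latin-1 and UTF-8 alike.
printf 'b' | od -An -tx1

# Hypothetical sample: one UTF-8 line, one Latin-1 line, one pure-ASCII line.
# \303\251 is "é" in UTF-8 and \351 is "é" in Latin-1 (octal escapes).
sample() {
    printf 'caf\303\251 bar\n'
    printf 'caf\351 baz\n'
    printf 'plain bob\n'
}

# Byte-wise search in the C locale: counts the lines containing "b".
sample | LC_ALL=C grep -c 'b'
```

All three lines are counted. What LC_ALL=C does change is the handling of non-ASCII text: "." and bracket expressions match single bytes, and -i only folds ASCII letters, so a search term containing something like "é" would have to match the exact bytes used in the file.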

Related

Removing blankspace at the start of a line (size of blankspace is not constant)

I am a beginner to using sed. I am trying to use it to edit down a uniq -c result to remove the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises (I have tested it with a simple substitute) the first number but also drags in the prior blankspace for some reason.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trimming will not work as some lines contain double/triple digit numbers and so do not have the same amount of blankspace before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
As for the bonus question about a usable uniq -c result:
The uniq command doesn't have a flag to print its output without the leading blanks. There's no other way than to strip it yourself.
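If the goal is a .tsv, the stripping and the tab can be done in one step. A minimal sketch, assuming the raw (uncounted) lines arrive on standard input; count_tsv is a made-up name, not an existing tool:

```sh
# Count duplicate lines and emit "count<TAB>text" with no leading blanks.
# awk saves the count, removes the padded count field, then prints both.
count_tsv() {
    sort | uniq -c | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0 }'
}

printf 'b\na\nb\n' | count_tsv
```

Because the pattern only removes the leading blanks, the digits and the single space after them, spaces inside the phenotype text are left alone.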

How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I use mechanical harddrive with roughly 200 MB/s read speed, so it will take roughly 10 million / 200 = 50000 seconds = 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
- are you running a fixed string search (-F, fgrep) or not: grep uses the Boyer-Moore algorithm, which by itself can't match regular expressions, so what grep does (or at least used to do) is first find a fixed string in your regexp, search for it in the text using BM, and only then do a regexp match (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid)
- how long is your pattern: BM works faster for longer patterns
- how many matches will you have: the fewer the matches, the faster it will be
- what is your CPU and memory: a fast hard drive will help you only during reading, not during computation
- what other options are you using with your grep
14 hours might not even be your lower bound because Boyer-Moore is smart enough to compute an offset at which next possible match might occur so it doesn't need to read-in the whole file. This does depend on the implementation though and is just my speculation. After re-running the below test with a much longer pattern I was able to go down to 0.23sec and I don't think my disk is that fast. But there might be some caching involved instead.
For instance I'm running on a 500MB/s SSD (at least that's what the manufacturer says) and grepping a 200MB file with a very short pattern (few chars) gives me:
With 808320 hits
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
#Edit: in short read about Boyer-Moore :-)
#Edit2: well to check how grep works you should instead check the source code, I described a very general workflow above.
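For reference, the read-time arithmetic from the question can be reproduced like this. It is only a sketch of the sequential I/O part (decimal units, 1 TB = 1,000,000 MB) and ignores everything in the list above; read_hours is a made-up helper name:

```sh
# Hours needed to read SIZE_MB at RATE_MB_PER_S, sequential I/O only.
read_hours() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", size / rate / 3600 }'
}

read_hours 10000000 200    # 10 TB at 200 MB/s
```

That gives roughly 13.9 hours, i.e. the 50000 seconds from the question.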

Parameter expansion slow for large data sets

If I take the first 1,000 bytes from a file, Bash can replace some characters pretty quick
$ cut -b-1000 get_video_info
muted=0&status=ok&length_seconds=24&endscreen_module=http%3A%2F%2Fs.ytimg.com%2F
yts%2Fswfbin%2Fendscreen-vfl4_CAIR.swf&plid=AATWGZfL-Ysy64Mp&sendtmp=1&view_coun
t=3587&author=hye+jeong+Jeong&pltype=contentugc&threed_layout=1&storyboard_spec=
http%3A%2F%2Fi1.ytimg.com%2Fsb%2FLHelEIJVxiE%2Fstoryboard3_L%24L%2F%24N.jpg%7C48
%2327%23100%2310%2310%230%23default%23cTWfBXjxZMDvzL5cyCgHdDJ3s_A%7C80%2345%2324
%2310%2310%231000%23M%24M%23m1lhUvkKk6sTnuyKXnPBojTIqeM%7C160%2390%2324%235%235%
231000%23M%24M%23r-fWFZpjrP1oq2uq_Y_1im4iu2I%7C320%23180%2324%233%233%231000%23M
%24M%23uGg7bth0q6XSYb8odKLRqkNe7ao&approx_threed_layout=1&allow_embed=1&allow_ra
tings=1&url_encoded_fmt_stream_map=fallback_host%3Dtc.v11.cache2.c.youtube.com%2
6quality%3Dhd1080%26sig%3D610EACBDE06623717B1DC2265696B473C47BD28F.98097DEC78411
95A074D6D6EBFF8B277F9C071AE%26url%3Dhttp%253A%252F%252Fr9---sn-q4f7dney.c.youtub
e.com%252Fvideoplayback%253Fms%253Dau%2526ratebypass%253Dyes%2526ipbits%253D8%25
26key%253Dyt1%2526ip%253D99.109.97.214%2
$ read aa < <(cut -b-1000 get_video_info)
$ time set "${aa//%/\x}"
real 0m0.025s
user 0m0.031s
sys 0m0.000s
However if I take 10,000 bytes it slows dramatically
$ read aa < <(cut -b-10000 get_video_info)
$ time set "${aa//%/\x}"
real 0m8.125s
user 0m8.127s
sys 0m0.000s
I read Greg Wooledge’s post but it lacks an explanation as to why Bash parameter expansion is slow.
For the why, you can see the implementation of this code in pat_subst in subst.c in the bash source code.
For each match in the string, the length of the string is counted numerous times (in pat_subst, match_pattern and match_upattern), both as a C string and more expensively as a multibyte string. This makes the function both slower than necessary, and more importantly, quadratic in complexity.
This is why it's slow for larger input. [graph omitted]
As for workarounds, just use sed. It's more likely to be optimized for string replacement operations (though you should be aware that POSIX only guarantees 8192 bytes per line, even though GNU sed handles arbitrarily large ones).
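A minimal sketch of the sed route, doing the same replacement as "${aa//%/\x}"; pct_to_x is a made-up name and the sample string is only an illustration:

```sh
# Replace every "%" with "\x", reading from standard input.
# The backslash is doubled because it is special in a sed replacement.
pct_to_x() {
    sed 's/%/\\x/g'
}

printf '%s\n' 'http%3A%2F%2Fs.ytimg.com' | pct_to_x
```

For the file in the question this would be used as cut -b-10000 get_video_info | pct_to_x, which works on a stream rather than on a shell variable.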
Originally, older shells and other utilities imposed LINE_MAX = 2048 on file input for this kind of reason. For huge variables bash has no problem parking them in memory. But substitution requires at least two concurrent copies. And lots of thrashing: as groups of characters are removed whole strings get rewritten. Over and over and over.
There are tools meant for this - sed is a premier choice. bash is a distant second choice. sed works on streams, bash works on memory blocks.
Another choice:
bash is extensible - you can write custom C code to do stuff well when bash was not meant to do it.
CFA Johnson has good articles on how to do that:
Some ready to load builtins:
http://cfajohnson.com/shell/bash/loadables/
DIY builtins explained:
http://cfajohnson.com/shell/articles/dynamically-loadable/

Is there a good two way hash to convert an email address to a predictable, readable, unix username?

We are working with a number of unix based filesystems, all of which share a similar set of restrictions: certain characters can't be used in the username fields. One of those restrictions is no "@", "_", or "." in the names. Being unix, there are a number of other restrictions.
So the question is if there is a good known algorithm that can take an email address and turn that into a predictable unix filename. We would need to reverse this at some point to get the email.
I've considered doing things like "."->"DOT", "@"->"AT", etc. But there are size limitations and other things that are generally problematic. I could also optimize by being able to map the @xyz.com part of the email to a special char or something. Each implementation would only have at most 3 domains it would need to support. I'm hoping someone has found a solution without a huge number of tradeoffs.
UPDATE:
-The two target filesystems are AFS and NFS.
-Base64 doesn't work as it contains incompatible characters, e.g. "/".
-Readable is preferable.
Seems like the best answer would be to replace the @xyz.com domain with a single non-standard character, and then have a function that could shrink the first part of a name to something that fits in the username length restrictions of the various filesystems. But what is a good function for that?
You could try a modified version of the URL percent (%) encoding scheme used for URIs.
If the percent symbol isn't allowed on your particular filesystem(s), simply replace it with a different, allowed character (and remember to encode any occurrences of that character properly).
Using this method:
mail.address@server.com
Would become:
mail%2Eaddress%40server%2Ecom
Or, if you had to substitute (for example), the letter a instead of the % symbol:
ma61ila2Ea61ddressa40servera2Ecom
Not exactly humanly-readable perhaps, but easily enough processed through an encoding algorithm. For the best space efficiency, your escape character should be a character allowed by the filesystem, yet one that is not likely to appear frequently in an address.
This encoding scheme has the advantage that there is no size increase for most normal characters. The string length will ONLY go up for characters not supported by the filesystem.
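A minimal sketch of this scheme in plain sh and awk. The names encode_name and decode_name are made up, the set of "safe" characters (letters, digits and "-") is an assumption to adjust per filesystem, and the input is assumed to be ASCII:

```sh
# Percent-encode every character outside [a-zA-Z0-9-].
# Uses the global variables rest, out, c and tail (POSIX sh has no "local").
encode_name() {
    rest=$1 out=
    while [ -n "$rest" ]; do
        tail=${rest#?}          # everything after the first character
        c=${rest%"$tail"}       # the first character itself
        rest=$tail
        case $c in
            [a-zA-Z0-9-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" is the character code
        esac
    done
    printf '%s\n' "$out"
}

# Reverse the encoding: turn each %XX back into the character with that code.
decode_name() {
    printf '%s\n' "$1" | awk '{
        s = $0; out = ""; digits = "0123456789ABCDEF"
        while (match(s, /%[0-9A-F][0-9A-F]/)) {
            hi = index(digits, substr(s, RSTART + 1, 1)) - 1
            lo = index(digits, substr(s, RSTART + 2, 1)) - 1
            out = out substr(s, 1, RSTART - 1) sprintf("%c", hi * 16 + lo)
            s = substr(s, RSTART + 3)
        }
        print out s
    }'
}

encode_name 'mail.address@server.com'
decode_name 'mail%2Eaddress%40server%2Ecom'
```

The percent sign itself is encoded as %25, which is what keeps the mapping reversible. Swapping "%" for another escape character would mean changing it in both functions and dropping it from the safe set.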
Check out base64. Encoding and decoding is well defined.
I'd prefer this over rolling my own format any day.
Hmm, from your question I'm not totally clear on this point, but since you wanted some conversion I'm assuming that you want something that is at least human readable?
Each OS may have different restrictions, but are you close enough to the platforms that you would be able to find out/test what is acceptable in a username? If you could find three 'special' characters that you could use just to do a replace on '@', '.', '_' you would be good to go. (Is that comprehensive? If not, you would need to make sure you know all of them, otherwise you could clash.) I searched a bit trying to find whether there was a POSIX standard, but wasn't able to find anything, so that's why I think if you can just test what's valid that would be the most direct route.
With even one special character, you could do URL encoding, either with '%' if it's available, or whatever you choose if not, say '!', then { '@'->'!40', '_'->'!5F', '.'->'!2E' }. (The spec, RFC 1738, http://www.rfc-editor.org/rfc/rfc1738.txt, defines the characters as US-ASCII, so you can just find a table, e.g. in Wikipedia's ASCII article, and look up the correct hex digits there.) Or, you could just do your own simple mapping: since you don't need the whole ASCII set, you could map two characters per escaped character and have, say, '!a', '!u', '!p' for at, underscore, period.
If you have two special characters, say, '%' and '!', you could delimit text that represents the character, say, '%at!', '%us!', and '%pd!'. (This is pretty much HTML-style encoding, but instead of '&' and ';' you are using the available ones, and you're making up your own mnemonics.) Another idea is that you could use runs of a symbol to determine the translated character, where each new character flops which symbol is being used. (This conveniently stops the run if we need to put two of the disallowed characters next to each other.) So assume '%' and '!', with period being 1, underscore 2, and at-sign being three, 'mickey._sample_@fake.out' would become 'mickey%!!sample%%!!!fake%out'. There are other variations but this one is easy to code.
If none of this is an option (e.g. no symbols at all, just [a-zA-Z0-9]), then really I think the Base64 answer sounds about right. Really once we're getting to anything other than a simple replacement (and even that) it's already getting hard to type if that's the goal. But if you really need to try to keep the email mostly readable, what you do is implement some sort of escaping. I'm thinking use '0' as your escape character, so now '0' becomes '00', '@' becomes '01', '.' becomes '02', and '_' becomes '03'. So now, 'mickey01._sample_@fake.out' would become 'mickey0010203sample0301fake02out'. Not beautiful but it should work; since we escaped any raw 0's, just always make sure you define a mapping for whatever you choose as your escape char and you should be fine.
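The '0'-escape idea can be sketched in a few lines of sh. escape0 and unescape0 are made-up names, and the mapping is exactly the one above ('0'->'00', '@'->'01', '.'->'02', '_'->'03'):

```sh
# Escape with "0": raw zeroes must be doubled first, before any escape
# sequences that themselves contain a "0" are introduced.
escape0() {
    printf '%s\n' "$1" | sed 's/0/00/g; s/@/01/g; s/\./02/g; s/_/03/g'
}

# Decode left to right, consuming one two-character escape at a time.
# Uses the global variables rest, out and tail (POSIX sh has no "local").
unescape0() {
    rest=$1 out=
    while [ -n "$rest" ]; do
        case $rest in
            00*) out=${out}0; rest=${rest#00} ;;
            01*) out=${out}@; rest=${rest#01} ;;
            02*) out=${out}.; rest=${rest#02} ;;
            03*) out=${out}_; rest=${rest#03} ;;
            *)   tail=${rest#?}; out=$out${rest%"$tail"}; rest=$tail ;;
        esac
    done
    printf '%s\n' "$out"
}

escape0 'mickey01._sample_@fake.out'
unescape0 'mickey0010203sample0301fake02out'
```

Decoding has to scan from the left rather than use global replacements, because a sequence such as "001" is a raw zero followed by a literal 1, not a "0" followed by an at-sign.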
That's all I can think of atm. :) If there's no need for these usernames to be readable in the raw, it seems Base64 won't work as-is, since it can produce slashes. In that case, just using the 2-digit US-ASCII hex value for each character is a good way to go; there's lots of nice, debugged, heavily field-tested code out there for it and it solves your problem quite handily. :)
Given...
- the limited set of characters allowed in various file systems
- the desire to keep the encoded email address short (both for human readability and for possible concerns with file system limitations)
...a possible approach may be a two-step encoding logic whereby:
- the email is first compressed using a lossless compression algorithm such as Lempel-Ziv, effectively turning it into a "binary" form, stored in a shorter array of bytes
- this array of bytes is then encoded using a Base64-like algorithm
The idea is to minimize the size of the binary representation, so that the expansion associated with the storage inefficiency of the encoding -which can only store roughly 6 bits (and probably a bit less) per character-, doesn't cause the encoded string to be too long.
Without getting overly sophisticated for either the compression or the encoding, such a system would likely produce encoded strings that are maybe 4/5 of the input string size (the email address): the compression should easily halve the size, but the encoding, say Base32, would grow the binary form size by 8/5.
Efforts in improving the compression ratio may allow the selection of more "wasteful" encoding schemes (with smaller character sets), and this may help make the output more human-readable and also more broadly safe on various flavors of file systems. For example, whereas Base64 seems optimal space-wise, using only uppercase letters (base 26) may ensure portability of the underlying scheme to file systems where the file names are not case sensitive.
Another benefit of the initial generic compression is that few, if any, assumptions need to be made about the syntax of valid input key (email addresses here).
Ideas for compression:
LZ seems like a good choice, though one may consider priming its initial buffer with common patterns found in email addresses (for example ".com" or even "a.com", "b.com" etc.). This initial buffer would ensure several instances of "citations" per compressed email address, hence a better compression ratio overall. To further squeeze a few bytes, maybe LZH or other LZ variations could be used.
Aside from the priming of the buffer mentioned above, another customization may be to use a shorter buffer than typical LZ algorithms, since the string we have to compress (email address instances) are themselves very short and would not benefit from say a 512 bytes buffer. (Shorter buffer sizes allow shorter codes for the citations)
Ideas for encoding:
Base64 is not suitable as-is because of the slash (/), plus (+) and equal (=) characters. Alternate characters could be used to replace these; dash (-) comes to mind, but finding three characters allowed by all "flavors" of the targeted file systems may be a stretch.
Nevertheless, Base64 and its 4 output characters per 3 payload bytes ratio provide what is probably the barely achievable upper limit of storage efficiency [for an acceptable character set].
At the lower end of this efficiency is maybe an ASCII representation of the hexadecimal values of the bytes in the array. This format, with a doubling of the payload bytes, may be acceptable length-wise, and is interesting because of its simplicity (there is a direct and simple relation between each nibble (4 bits) in the input and each character in the encoded string).
Base32, whereby A thru Z encode 0 thru 25 and 0 thru 5 encode 26 thru 31, respectively, is essentially a variation of Base64 with a ratio of 8 output characters per 5 payload bytes, and may be a very viable compromise.
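To give a feel for the simplest end of that range, here is a sketch of the hexadecimal representation applied directly to the address, without the compression step. hex_encode is a made-up name and the address is only an example:

```sh
# Write each input byte as two lowercase hex digits (one per nibble).
hex_encode() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
    echo
}

hex_encode 'a@b.c'
```

The output uses only [0-9a-f] and is exactly twice as long as the input, which is the doubling mentioned above; decoding reads the digits back two at a time.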

Fast, Secure Random Numbers

I was searching for a faster alternative to /dev/urandom when I stumbled across this interesting tidbit:
One good trick for generating very good non-random-but-nearly-random bits is to use /dev/random's entropy to seed a fast symmetric stream cipher (my favorite is blowfish), and redirect it's output to the application that needs it.
That's not a beginners technique, but it's easy to set up with a two or three line shell script and some creative pipes.
Further research yielded this comment from Schneier on Security:
If you are going to "inject entropy" there are a number of ways to do it but one of the better ways is to "spread" it across a high speed stream cipher and couple it with a non determanistic sampling system.
Correct me if I'm wrong, but it appears that this method of generating random bits is simply better than /dev/urandom in terms of speed and security.
So, here is my take on the actual code:
time dd if=/dev/zero bs=1M count=400 | openssl bf-ofb -pass "pass:$(tr -dc '[:graph:]' < /dev/urandom | head -c56)" > /dev/null
This speed test takes 400MB of zeroes and encrypts it using blowfish with a 448 bit key made of pseudo-random, printable characters. Here's the output on my netbook:
400+0 records in
400+0 records out
419430400 bytes (419 MB) copied, 14.0068 s, 29.9 MB/s
real 0m14.025s
user 0m12.909s
sys 0m2.004s
That's great! But how random is it? Let's pipe the results to ent:
Entropy = 8.000000 bits per byte.
Optimum compression would reduce the size
of this 419430416 byte file by 0 percent.
Chi square distribution for 419430416 samples is 250.92, and randomly
would exceed this value 50.00 percent of the times.
Arithmetic mean value of data bytes is 127.5091 (127.5 = random).
Monte Carlo value for Pi is 3.141204882 (error 0.01 percent).
Serial correlation coefficient is -0.000005 (totally uncorrelated = 0.0).
It looks good. However, my code has some obvious flaws:
- It uses /dev/urandom for the initial entropy source.
- Key strength is not equivalent to 448 bits because only printable characters are used.
- The cipher should be periodically re-seeded to "spread" out the entropy.
So, I was wondering if I am on the right track. And if anyone knows how to fix any of these flaws that would be great. Also, could you please share what you use to securely wipe disks if it's anything other than /dev/urandom, sfill, badblocks, or DBAN?
Thank you!
Edit: Updated code to use blowfish as a stream cipher.
If you're simply seeking to erase disks securely, you really don't have to worry that much about the randomness of the data you write. The important thing is to write to everything you possibly can - maybe a couple of times. Anything much more than that is overkill unless your 'opponent' is a large government organization with the resources to spare to indulge in the data recovery (and it is not clear cut that they can read it even so - not these days with the disk densities now used). I've used the GNU 'shred' program - but I'm only casually concerned about it. When I did that, I formatted a disk system onto the disk drive, then filled it with a single file containing quasi-random data, then shredded that. I think it was mostly overkill.
Maybe you should read Schneier's 'Cryptography Engineering' book?
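On the key-strength flaw listed in the question, one possible variation is to hand openssl a raw hex key and IV instead of a printable password, so that every bit read from /dev/urandom is used. This is only a sketch under assumptions: it swaps Blowfish for AES-256 in CTR mode (not the cipher from the question), fast_random_bytes is a made-up name, it still seeds from /dev/urandom, and it does nothing about re-seeding. The key is also visible on the command line to other users of the machine while it runs.

```sh
# Print N pseudo-random bytes: an AES-256-CTR keystream keyed from /dev/urandom.
fast_random_bytes() {
    # 32 random bytes as 64 hex digits (the key), 16 as 32 hex digits (the IV).
    key=$(head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n')
    iv=$(head -c 16 /dev/urandom | od -An -v -tx1 | tr -d ' \n')
    # Encrypting zeroes in CTR mode outputs the keystream itself; with -K and
    # -iv no password derivation is done and no salt header is written.
    head -c "$1" /dev/zero | openssl enc -aes-256-ctr -K "$key" -iv "$iv"
}

fast_random_bytes 1048576 | wc -c
```

Whether this is faster than /dev/urandom depends on the machine and the kernel, so it is worth timing both before relying on it.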
