GNU split (UNIX command) creating files not matching pattern after reaching "z" - shell

So I was splitting some large files, and everything worked properly until an 81GB file came along. The split command seems to have done its job, but the last files have names that don't follow the pattern.
And I'm using the command like this:
split -b 125M ./2014.txt 2014/2014_
Does anyone know why it created the file 2014_zaaa instead of 2014_za?

You can only have 676 files named [a-z][a-z], while your command required more.
Here are some options for what split could do:
Crash.
This is the behavior mandated by POSIX, and followed by macOS.
Start writing larger suffixes.
This is a bad choice because after _zz comes _aaa, but now the files will show up in the wrong order in ls, and cat * will no longer join them in the correct order.
Save the last range, _z, for longer suffixes.
This is a good choice because after _yz comes _zaaa, which has room to grow while still remaining in alphabetical order. This is what GNU does, and the behavior you're seeing.
If you want all the names to be uniform without triggering any of these behaviors, just use a larger suffix length with -a 6 to ensure you have enough room.
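For example, a sketch using the same chunk size as above (-a 6 is just an illustration; any suffix length large enough for the expected number of chunks works):
split -b 125M -a 6 ./2014.txt 2014/2014_
That produces names like 2014_aaaaaa, 2014_aaaaab, ..., which stay uniform and sort correctly.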

Related

How do I abstract common code between three bash scripts?

I've got three bash scripts in three different sibling directories.
The first few lines of each do some setup, different between each one.
The last twenty or so lines of the scripts are character for character identical, processing and comparing the files constructed in the first bit.
What I'd like to do is to put the last twenty lines in, say ../common.bash, and do something like
#include "../common.bash"
in each of the three scripts, so as to avoid having to make the same changes in three places every time I fiddle.
So far my best guess is to use cat to construct the scripts out of the four morally-independent pieces.
Is there a better way?
Use the source.
source /path/to/common.bash
You shouldn't use a relative path, because it will be interpreted relative to the user's working directory, not the location of the script.
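If you do want the path to track the script's location rather than hard-coding an absolute one, a common idiom (a sketch; assumes bash and that common.bash sits one directory above the script) is:
# resolve the directory this script lives in, then source relative to it
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$script_dir/../common.bash"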
Use meld
source is probably the answer I wanted, but actually in this case I've found that it's best to use meld to view the three files side by side, and to use meld to propagate favourite changes.
The advantage is that when working on one file, I can see the whole thing at once.
But it won't scale to the inevitable fourth copy, so at that point I'll use source, I guess.

Diff for 3 binary files

I have 3 binary files. Let's call them file1.bin, file2.bin and file3.bin.
file1.bin and file2.bin have some common parts.
file2.bin and file3.bin have some common parts.
I want to find the common parts between file1.bin and file2.bin that are different between file2.bin and file3.bin.
How do you recommend accomplishing that? I have already dumped the binary files to text files using xxd and then did a 3-way diff using vim -d file1.txt file2.txt file3.txt.
However, vim marks a part as changed in all the files even if it has only changed in one file and remains the same in the other two. I want that special kind of occurrence to be marked differently.
Perhaps you can use the built-in unix diff (I think it is part of OSX), but use --unchanged-group-format to list the similarities. Do that for file1 and file2, then do it for file2 and file3. You can then do a regular diff on the two resulting files.
For an idea of how to get the similarities, have a look at this post.
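As a rough sketch of that approach (assuming GNU diff and the xxd dumps file1.txt, file2.txt, file3.txt from the question; the empty group formats suppress everything except the unchanged lines):
$ diff --unchanged-group-format='%=' --old-group-format='' --new-group-format='' --changed-group-format='' file1.txt file2.txt > common12.txt
$ diff --unchanged-group-format='%=' --old-group-format='' --new-group-format='' --changed-group-format='' file2.txt file3.txt > common23.txt
$ diff common12.txt common23.txt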
The tool I work on (ECMerge) does that. You just have to diff the 3 binary files; it will present equal portions in front of each other, with modified bytes appropriately placed in between. No need to get a hex dump first. You can script in JavaScript to output whatever you like based on the diff results and the bytes in the files (it also works from the command line).
Chromium used bsdiff, then switched to courgette for binary diffs, as explained on their blog. You might find useful leads there.

Can I automatically update msgids in gettext's .po files for trivial text changes?

With gettext, the original (usually English) text of messages serves as
the message key ("msgid") for the translations. This means that every time the
original text changes, the msgid must be updated in all the .po files.
For real changes of the text, this is obviously unavoidable, as the
translator must update the translation.
However, if the change of the original does not change its meaning,
re-translation is superfluous (e.g. a change in punctuation, whitespace
changes, or correction of a spelling mistake).
Is there a way to update the .po files automatically in that case?
I tried to use xgettext & msgmerge (with fuzzy matching turned on), but
fuzzy matching sometimes fails, plus this produces lots of ugly
"#,fuzzy" flags.
Note: There is a similar question:
How to efficiently work with gettext PO files when making small edits to large text values
However, it's about large strings, thus about a more specific problem.
One way to avoid the problem is to leave the msgids alone, have a .po file for the original language and make the fix inside that.
It always strikes me as more of a workaround than a proper fix, though. For the next iteration (where there will definitely be more msgid changes) the msgid gets changed anyway, and either the translators pick it up in their usual update or each language is updated by hand.
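To illustrate that original-language .po approach (a sketch; the message text and source reference are made up), the entry keeps the old msgid and carries the corrected text as its msgstr, so the msgids in every other language's .po stay untouched:
#: templates/home.html:12
msgid "Welcom to our site"
msgstr "Welcome to our site"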
I've had exactly this issue when doing minor changes to a django project. What I do is the following:
Change message in code.
Run find and replace on all translation files ("django.po"), replacing the old message (msgid) with the new one.
Run django-admin makemessages.
If I have done things right, the last step is superfluous (i.e., you have already made the change for gettext). django uses the gettext utilities, so it shouldn't matter how you make your message files.
I find and replace like so:
find . -name "*.po" -print | xargs sed -i 's/oldmessageid/newmessageid/g'
Courtesy of http://rushi.vishavadia.com/blog/find-replace-across-multiple-files-in-linux

Prepending to a multi-gigabyte file

What would be the most performant way to prepend a single character to a multi-gigabyte file (in my practical case, a 40GB file)?
There is no limitation on the implementation to do this. Meaning it can be through a tool, a shell script, a program in any programming language, ...
There is no really simple solution. There are no system calls to prepend data, only append or rewrite.
But depending on what you're doing with the file, you may get away with tricks.
If the file is used sequentially, you could make a named pipe and put cat onecharfile.txt bigfile > namedpipe and then use "namedpipe" as file. The same can be achieved by cat onecharfile.txt bigfile | program if your program takes stdin as input.
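A minimal sketch of the named-pipe variant (someprogram is a placeholder for whatever consumes the file; the cat is backgrounded because writing to a FIFO blocks until a reader opens it):
$ mkfifo namedpipe
$ cat onecharfile.txt bigfile > namedpipe &
$ someprogram namedpipe
$ rm namedpipe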
For random access, a FUSE filesystem could be used, but that is probably way too complicated for this.
If you want to get your hands really dirty, figure out how to:
allocate a data block (read up on inode and data block structures)
insert it into the file's chain as the second block (or the first, in which case you're practically done)
write the beginning of the file into that block
write the single character as the first byte in the file
mark the first block as using only one byte of its available payload (this is possible for the last block; I don't know whether it's possible for blocks in the middle of a file's chain).
This has the potential to majorly wreck your filesystem though, so it's not recommended; good fun, however.
Let the file have an initial block of null characters. When you prepend a character, read the block, insert the character right-to-left, and write the block back. When the block is full, do the more expensive full rewrite in order to prepend another null block. That way you reduce, by a large factor, the number of times you have to do a full rewrite.
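A rough sketch of that idea using dd (assumptions: a 4096-byte pad, GNU coreutils, and that you keep track of the next free offset yourself; whatever reads the file also has to know to skip the leading NULs):
# one-time setup: reserve a 4096-byte pad of NUL bytes in front of the data
$ { head -c 4096 /dev/zero; cat my40gbfile; } > padded && mv padded my40gbfile
# prepend one character by writing it into the last free slot of the pad
$ printf 'C' | dd of=my40gbfile bs=1 seek=4095 conv=notrunc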
Added: Keep the file in two subfiles: A (a short one) and B (a long one). Prepend to A any way you like. When A gets "big enough", prepend A to B (by re-writing), and clear A.
Another way: Keep the file as a directory of small files ..., A000003, A000002, A000001.
Just prepend to the largest-numbered file. When it's big enough, make the next file in sequence.
When you need to read the file, just read them all in descending order.
You might be able to invert your implementation depending on your problem: append single characters to the end of your file. When it comes time to read the file, read it in reverse.
Hide this behind enough of an abstraction layer and it may not make a difference to your code how the bytes are physically stored.
If you use Linux, you could try a custom version of read(2) loaded with LD_PRELOAD and have it prepend your data on the first read.
See https://zlibc.linux.lu/zlibc.html for implementation inspiration.
if you mean prepending that character to the start of the entire file, one way (using printf rather than echo so you don't pick up a trailing newline)
$ printf 'C' > tmp
$ cat my40gbfile >> tmp
$ mv tmp my40gbfile
or using sed (note that this inserts C as a new first line, i.e. followed by a newline)
$ sed -i '1i C' my40gbfile
if you mean prepending the character to every line of the file
$ awk '{print "C"$0}' my40gbfile > temp && mv temp my40gbfile
As I understand it, this is handled at the file system level, meaning that if you prepend data to a file, it effectively rewrites the file. This is the same reason the ID3 tags in MP3 files are zero-padded: so that future updates don't rewrite the entire file, but just update those reserved bytes.
So whichever way you use will give roughly similar results. What you can try is to run some tests with a custom copy function that reads/writes in bigger chunks than the default system copy, say 2MB or 5MB, which might improve performance. Ultimately your disk I/O is the bottleneck here.
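As a rough shell-level version of that (a sketch only; obs=4M is an arbitrary block size, and whether it actually beats a plain cat depends on the system and disks):
$ { printf 'C'; cat my40gbfile; } | dd of=my40gbfile.new obs=4M
$ mv my40gbfile.new my40gbfile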
The absolute highest-performance way would seem to be to get down to the level of sectors and how the file is actually stored. I'm not sure whether the OS then becomes a factor, but the target platform might; either way it would be useful to know what you run on.
I think this is a case where C is the obvious choice; this kind of low-level stuff is exactly what a systems programming language is for.
Can you tell us what you end up doing, would be interesting.
Here's the Windows command line ("DOS") way:
Put your 1 char into prepend.txt
copy /b prepend.txt + myHugeFile fileNameOfCombinedFile

Are there any invalid linux filenames?

If I wanted to create a string which is guaranteed not to represent a filename, I could put one of the following characters in it on Windows:
\ / : * ? | < >
e.g.
this-is-a-filename.png
?this-is-not.png
Is there any way to identify a string as 'not possibly a file' on Linux?
There are almost no restrictions - apart from '/' and '\0', you're allowed to use anything. However, some people think it's not a good idea to allow this much flexibility.
An empty string is the only truly invalid path name on Linux, which may work for you if you need only one invalid name. You could also use a string like "///foo", which would not be a canonical path name, although it could refer to a file ("/foo"). Another possibility would be something like "/dev/null/foo", since /dev/null has a POSIX-defined non-directory meaning. If you only need strings that could not refer to a regular file you could use "/" or ".", since those are always directories.
Technically it's not invalid, but filenames beginning with a dash (-) will get you into a lot of trouble, because they get mistaken for command-line options.
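For example (a quick illustration; -foo is just a placeholder name):
$ touch -- -foo    # create a file whose name starts with a dash
$ rm -foo          # fails: rm treats "-foo" as options
$ rm -- -foo       # works: "--" ends option parsing
Prefixing the name with ./ (as in rm ./-foo) is another common workaround.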
I personally find that a lot of the time the problem is not Linux but the applications one is using on Linux.
Take for example Amarok. Recently I noticed that certain artists I had copied from my Windows machine were not appearing in the library. I checked and confirmed that the files were there, and then I noticed that certain characters in the folder names (named for the artist) were represented with a weird-looking square rather than an actual character.
In a shell terminal the filenames look even stranger: /Music/Albums/Einst$'\374'rzende\ Neubauten is an example of how strange.
While these files were definitely there, Amarok could not see them for some reason. I was able to use some shell trickery to rename them to sane versions, which I could then rename with ASCII-only characters using Musicbrainz Picard. Unfortunately, Picard was also unable to open the files until I renamed them, hence the need for a shell script.
Overall this is a tricky area, and it seems to get very thorny if you are trying to synchronise a music collection between Windows and Linux where certain folder or file names contain funky characters.
The safest thing to do is stick to ASCII-only filenames.
