read file name from text file in ksh and compress - shell

I want to write a KornShell (ksh) script that reads prefixes from a text file and, for each prefix, compresses the matching files found in a directory by looping.
For example, I will keep the prefix 'abcd' in the text file.
By reading it, I want to compress the files in the directory whose names match the pattern abcd###.YYYYMMDDXXXXXX.txt.
I do not want to touch files with the same prefix but a different extension or pattern, such as abcd###.YYYYMMDDXXXXXX.dat or abcd###.YYYYMMDDXXXXXX.txt.Z. Only files matching abcd###.YYYYMMDDXXXXXX.txt should be compressed.
How can I implement this in ksh?

Superficially, this should do:
: ${COMPRESS:=xz}
while IFS= read -r prefix
do
    $COMPRESS "$prefix"[0-9][0-9][0-9].[12][09][0-9][0-9][01][0-9][0-3][0-9]??????.txt
done < file
Obviously, I'm having to make some guesses: that # means a digit, that YYYYMMDD is a date, and that X is any character (that's ? in the answer). If X is meant to be any upper-case letter, or something else, adjust accordingly. The year rules will accept 19xx and 20xx (also 10xx and 29xx, but you're unlikely to have files dated like that); the month rules accept 00..19; the day rules accept 00..39. If you have to validate more strictly, then you can't readily use a simple glob pattern.
I used xz as the compress program. I would not use the compress program for compression as it simply doesn't compare with gzip, let alone bzip2 or xz, etc.
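To see the loop end to end, here is a self-contained sketch using gzip (so it runs without xz installed); the demo_compress directory and the sample file names are made up for the demonstration:

```shell
# Sketch of the whole flow: one prefix per line in prefixes.txt,
# compress only names matching <prefix>NNN.YYYYMMDDXXXXXX.txt
rm -rf demo_compress
mkdir -p demo_compress
printf 'abcd\n' > demo_compress/prefixes.txt
: > demo_compress/abcd123.20240101ABCDEF.txt   # should be compressed
: > demo_compress/abcd123.20240101ABCDEF.dat   # should be left alone
(
cd demo_compress
while IFS= read -r prefix; do
  for f in "$prefix"[0-9][0-9][0-9].[12][09][0-9][0-9][01][0-9][0-3][0-9]??????.txt; do
    [ -e "$f" ] || continue     # the glob matched nothing for this prefix
    gzip "$f"
  done
done < prefixes.txt
)
```

The `[ -e "$f" ] || continue` guard handles the case where the glob matches no files and is left unexpanded.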

Related

How to split single mail with procmail?

I have a quarantine folder that I periodically have to download and split by recipient inbox, or even better, save each message in its own text file. I get ca. 10,000 mails per day and I'm coding something with fetchmail and procmail. The problem is that I can't find out how to split message-by-message in procmail; they all end up in the same inbox.
I tried to pass every message in a script via a recipe like:
:0
| script_processing_messages.sh
Which contained
read varname
echo "$varname" > test_file
to see whether I could obtain a whole message in the $varname variable, but no: I only obtain a single line of a message each time.
Right now I use
fetchmail --keep
where .fetchmailrc is
poll mail.mymta.my protocol pop3 username "my@inbox.com" password "****" mda "procmail /root/.procmailrc"
and .procmailrc is
VERBOSE=0
DEFAULT=/root/inbox.quarantine
I would like to obtain a file for each message, so:
1.txt
2.txt
3.txt
[...]
10000.txt
I have many recipients and many domains, so I can't, let's say, write 5,000 rules to match every recipient. It would be good if there was some kind of
^To: $USER
that redirect to
/$USER.inbox
so that procmail itself takes care of reading and dynamically creating these inboxes.
I'm not very experienced with fetchmail and procmail recipes; I'm trying hard but not getting very far.
You seem to have two or three different questions; proper etiquette on Stack Overflow would be to ask each one separately - this also helps future visitors who have just one of your problems.
First off, to split a Berkeley mbox file containing multiple messages and run Procmail on each separately, try
formail -s procmail -m <file.mbox
You might need to read up on the mailbox formats supported by Procmail. A Berkeley mailbox is a single file which contains multiple messages, simply separated by a line beginning with From (with a space after the four alphabetic characters). This separator has to be unique, and so a message which contains those five characters at beginning of a line in the body will need to be escaped somehow (typically by writing a > before From).
To save each message in a separate file, choose a different mailbox format than the single-file Berkeley format. Concretely, if the destination is a directory, Procmail will create a new file in that directory. How exactly the new file is named depends on the contents of the directory (if it contains the Maildir subdirectories new, tmp, and cur, the new file is created in new in accordance with Maildir naming conventions) and on how exactly the directory is specified (trailing slash and dot selects MH format; otherwise, mail directory format).
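For illustration, the three destination styles look like this (the folder names here are made up, and each recipe delivers a copy via the c flag so the examples are independent):

```
:0c
archive.mbox    # plain file: Berkeley mbox, messages appended with From_ separators

:0c
msgdir/         # trailing slash: mail directory format, one file per message

:0
mhdir/.         # trailing slash and dot: MH folder, sequentially numbered files
```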
Saving to one mailbox per recipient has a number of pesky corner cases. What if the message was sent to more than one of your local recipients? What if the recipient address is not visible in the headers? etc (the Procmail Mini-FAQ has a section about this, in the context of virtual hosting of a domain, which this is basically a variation of). But if we simply ignore these, you might be able to pull it off with something like
:0 # whitespace before ] is a literal tab
* ^TO_\/[^ @ ]+@(yourdomain\.example|example\.info)\>
{
# Trim domain part from captured MATCH
:0
* MATCH ?? ^\/[^@]+
./$MATCH/
}
This will capture into $MATCH the first address which matches the regex, then perform another regex match on the captured string to capture just the part before the @ sign. This obviously requires that the addresses you want to match are all in a set of specific domains (here, I used yourdomain.example and example.info; obviously replace those with your actual domain names) and that capturing the first matching address is sufficient (so if a message was To: alice@yourdomain.example and Cc: bob@example.info, whichever one of those is closer to the top of the message will be picked out by this recipe, and the other one will be ignored).
In some more detail, the \/ special token causes Procmail to copy the text which matched the regex after this point into the internal variable MATCH. As this recipe demonstrates, you can then perform a regex match on that variable itself to extract a substring of it (or, in other words, discard part of the captured match).
The action ./$MATCH/ uses the captured string in MATCH as the name of the folder to save into. The leading ./ specifies the current directory (which is equal to the value of the Procmail variable MAILDIR) and the trailing / selects mail directory format.
If your expected recipients cannot be constrained to be in a specific set of domains or otherwise matched by a single regex, my recommendation would be to ask a new question with more limited scope, and enough details to actually identify what you want to accomplish.
I found a solution to a part of my problem.
It seems that there is no way to have procmail itself recognize the envelope recipient without specifying it in a recipe, so I just obtained a list of recipients and created a huge recipe file.
But then I just discovered that to save single mails and to avoid huge mailboxes filled with a lot of mails, one could just write a recipe like:
:0
* ^To: recipient@mail.it
/inbox/folder/recipient@mail.it/
Note the / at the end: it makes procmail create a directory with one file per message instead of writing everything to a single file.

Is there any character that is illegal in file paths on every OS?

Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9], reserve _ as the escape character, and encode any character ch outside the whitelisted set as printf("_%02x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
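As a concrete sketch of that scheme in Ruby (the method names are my own invention), with _ reserved as the escape character and __ as the separator:

```ruby
# Escape everything outside [a-z0-9]; "_" itself becomes "_5f".
def encode_name(name)
  name.gsub(/[^a-z0-9]/) { |ch| "_%02x" % ch.ord }
end

# Reverse the escaping: each "_hh" becomes the byte 0xhh.
def decode_name(encoded)
  encoded.gsub(/_([0-9a-f]{2})/) { $1.to_i(16).chr }
end

# Join with "__": unambiguous, because inside an encoded name a lone "_"
# is always followed by exactly two hex digits, never by another "_".
def join_names(names)
  names.map { |n| encode_name(n) }.join("__")
end

def split_names(joined)
  joined.split("__", -1).map { |s| decode_name(s) }
end
```

Round-tripping ['hello.txt', 'goodbye', '..'] through join_names and split_names reproduces the encoded string from the paragraph above.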
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.

How to reversibly escape a URL in Ruby so that it can be saved to the file system

The use-case example is saving the contents of http://example.com as a filename on your computer, but with the unsafe characters (i.e. : and /) escaped.
The classic way is to use a regex to strip all non-alphanumeric-dash-underscore characters out, but then that makes it impossible to reverse the filename into a URL. Is there a way, possibly a combination of CGI.escape and another filter, to sanitize the filename for both Windows and *nix? Even if the tradeoff is a much longer filename?
edit:
Example with CGI.escape
CGI.escape 'http://www.example.com/Hey/whatsup/1 2 3.html#hash'
#=> "http%3A%2F%2Fwww.example.com%2FHey%2Fwhatsup%2F1+2+3.html%23hash"
A couple of things: are % signs completely safe as file characters? Unfortunately, CGI.escape doesn't convert spaces in a malformed URL to %20 on the first pass, so I suppose any translation method would require changing all spaces to + with a gsub before applying CGI.escape.
One of the ways is by "hashing" the filename. For example, the URL for this question is: https://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste. You could use the Ruby standard library's digest/md5 to hash the name. Simple and elegant.
require "digest/md5"
foldername = "https://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste"
hashed_name = Digest::MD5.hexdigest(foldername) # => "5045cccd83a8d4d5c4fc01f7b4d8c502"
This works for the same reason MD5 hashing is used to validate the authenticity/completeness of downloads: for all practical purposes, the MD5 digest of a given string always returns the same hex string.
However, I won't call this "reversible". You need to have a custom way to look up the URLs for each of the hashes that get generated. May be, a .yml file with that data.
update: As @the Tin Man suggests, a simple SQLite db would be much better than a .yml file when there are a large number of files that need storing.
Here is how I would do it (adjust the regular expression as needed):
url = "http://stackoverflow.com/questions/18234870/how-to-reversibly-escape-a-url-in-ruby-so-that-it-can-be-saved-to-the-file-syste"
filename = url.each_char.map {|x|
x.match(/[a-zA-Z0-9-]/) ? x : "_#{x.unpack('H*')[0]}"
}.join
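Since the mapping above prefixes every escaped byte with _ (and _ itself gets escaped, as it is not in [a-zA-Z0-9-]), it can be reversed. A sketch, assuming ASCII input; the method name is made up:

```ruby
# Turn each "_hh" escape produced by the mapping above back into its byte.
def unescape_filename(filename)
  filename.gsub(/_([0-9a-f]{2})/) { [$1].pack('H*') }
end
```

Note that this assumes single-byte (ASCII) characters; a multi-byte UTF-8 character would be escaped as more than two hex digits and would need a smarter decoder.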
EDIT:
if the length of the resulting file name is a concern then I would store the files in sub-directories with the same names as the url path segments.

How to change the extension of each file in a list with multiple extensions in GNU make?

In a GNU makefile, I am wondering whether it is possible, given an input file list, to produce an output file list with new extensions.
In input, I get this list:
FILES_IN=file1.doc file2.xls
And I would like to build this variable in my makefile from FILES_IN variable:
FILES_OUT=file1.docx file2.xlsx
Is it possible? How?
It seems difficult because I would have to parse the file list and detect each extension (.doc, .xls) to replace it with the correct one.
Substituting extensions in a list of whitespace-separated file names is a common requirement, and there are built-in features for this. If you want to add an x at the end of every name in the list:
FILES_OUT = $(FILES_IN:=x)
The general form is $(VARIABLE:OLD_SUFFIX=NEW_SUFFIX). This takes the value of VARIABLE and replaces OLD_SUFFIX at the end of each word that ends with this suffix by NEW_SUFFIX (non-matching words are left unchanged). GNU make calls this feature (which exists in every make implementation) substitution references.
If you just want to change .doc into .docx and .xls into .xlsx using this feature, you need to use an intermediate variable.
FILES_OUT_1 = $(FILES_IN:.doc=.docx)
FILES_OUT = $(FILES_OUT_1:.xls=.xlsx)
You can also use the slightly more general syntax $(VARIABLE:OLD_PREFIX%OLD_SUFFIX=NEW_PREFIX%NEW_SUFFIX). This feature is not unique to GNU make, but it is not as portable as the plain suffix-changing substitution.
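For example, the pattern form can change a prefix and a suffix in one step (the paths here are made up):

```make
SRCS = src/foo.c src/bar.c
OBJS = $(SRCS:src/%.c=build/%.o)   # build/foo.o build/bar.o
```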
There is also a GNU make feature that lets you chain multiple substitutions on the same line: the patsubst function.
FILES_OUT = $(patsubst %.xls,%.xlsx,$(patsubst %.doc,%.docx,$(FILES_IN)))

Best way to read output of shell command

In Vim, What is the best (portable and fast) way to read output of a shell command? This output may be binary and thus contain nulls and (not) have trailing newline which matters. Current solutions I see:
Use system(). Problems: does not work with NULs.
Use :read !. Problems: won’t save trailing newline, tries to be smart detecting output format (dos/unix/mac).
Use ! with redirection to a temporary file, then readfile(, "b") to read it. Problems: two filesystem round-trips; the 'shellredir' option also redirects stderr by default; and it is likely less portable ('shellredir' is mentioned here because it is likely to be set to a valid value).
Use system() and filter outputs through xxd. Problems: very slow, least portable (no equivalent of 'shellredir' for pipes).
Any other ideas?
You are using a text editor. If you care about NULs, trailing EOLs and (possibly) conflicting encodings, perhaps you need a hex editor anyway?
If I need this amount of control of my operations, I use the xxd route indeed, with
:se binary
One nice option you seem to have missed is inserting via the expression register in insert mode:
<C-r>=system('ls -l')<Enter>
This may or may not be smarter/less intrusive about character encoding business, but you could try it if it is important enough for you.
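Another variant of the temporary-file route, which sidesteps 'shellredir' (and its stderr behaviour) entirely by doing the redirection inside the command string, is sketched below; 'ls -l' is just a placeholder command:

```vim
" Redirect ourselves, so 'shellredir' never comes into play
let tmp = tempname()
call system('ls -l > ' . shellescape(tmp))
let lines = readfile(tmp, 'b')   " binary read preserves trailing-newline info
call delete(tmp)
```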
Or you could use the Perl or Python support to effectively use popen. A rough idea:
:perl open(F, "ls /tmp/ |"); my @lines = (<F>); $curbuf->Append(0, @lines)