Short version: how can I control the order in which du scans directories?
NOTE: Sorting afterwards is not sufficient, I need to change the order that du scans, not the ordering of the results.
For a bit of background; I have a simple rsync-based backup with a bunch of snapshots in a directory, all named by the date/time they were created.
So a sample might look like this:
2019-08
2019-09
2019-09-20
2019-09-21
2019-09-23-161447
2019-09-23-172658
i.e. I've got hourlies (at the end), collapsed into dailies (middle) and then eventually into monthlies (top), which works great since these names sort naturally with ls.
The problem, however, is that when I run du . the ordering is unpredictable, or at least it's not the order I want (oldest to newest), so the results aren't as useful as I'd like; sometimes I'll get a daily or even an hourly being scanned ahead of the monthlies, and the ordering only sometimes matches the one I want.
What I want, ideally, is for du to always evaluate the snapshots from oldest to newest, as this way du will give the sizes of all later snapshots as a size difference/delta (since unchanged files are hard-linked, it only counts the changes in later snapshots).
Is it possible to force a natural sort order for the order in which du scans directories?
For some reason it hadn't occurred to me that if I just pass all the snapshots as arguments, then du would still ignore additional hard-links, so a possible answer in my case is to do something like:
du -sh /path/to/snapshots/*-*
Since my parent directory contains no other entries with hyphens, this expands to every single snapshot within it, and seems to be given in a natural sort order. Depending how similar your own use-case is you may need to adjust the pattern, or obtain the list some other way.
Before I mark this answer as correct though, can someone please confirm: is this a reliable way to order the snapshots? It seems to give me a natural sort order, but I haven't been able to find a source confirming that this will always be the case. If it isn't reliable, or even if it is, are there any other/more flexible alternatives?
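One more explicit alternative I'm considering (a sketch only, assuming the snapshot names contain no whitespace, as in the sample above) is to take the ordering out of the shell's hands entirely by building and sorting the list myself:
# List only the snapshot directories, sort them lexically (oldest to newest
# for these zero-padded names), then hand the whole list to du in that order.
# Note: if the list is long enough for xargs to split it across several du
# invocations, the hard-link de-duplication resets between runs.
find /path/to/snapshots -mindepth 1 -maxdepth 1 -type d | sort | xargs du -sh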
I've quickly realized that bioinformatics is not a subject whose terms are clearly defined and easily accessible. I have an apparent discrepancy in some of my results.
I used samtools view -b -h -f 8 fileName.bam > mateUnmapped.bam on several BAM files. I am under the impression that this command extracts only reads whose partner does not align to the draft genome (it also includes the header; the output is in BAM format).
When I use samtools flagstat on the resulting files, I get an interesting result: the number of 'singletons' does not match the total number of reads, which seems odd to me.
The only reconciliation I can find is here:
http://seqanswers.com/forums/showthread.php?t=46711
One person replying to the question posed in that forum claims that singletons are sometimes defined as sequences which do not have a partner read at all. However, that still doesn't explain my result. Flagstat says about 40% of my reads are singletons, but based on the 'view' command I used, I feel like they should ALL be singletons.
Can a seasoned bioinformatician help me out?
In general genome assembly, a singleton is a read that did not assemble into a contig or map to a reference. It is a contig of only one read.
In samtools, a singleton refers to a read that mapped but the mate didn't.
Flagstat says about 40% of my reads are singletons, but I feel like
based on the 'view' command I used, they should ALL be singletons.
I'm not a samtools expert, but I think -f 8 means show reads whose mates did not map. That doesn't say anything about the read itself, just its mate. So you are probably getting reads where both mates didn't map at all (60%) AND reads where only one of the mates mapped (40%).
You might want to try running with -f 8 -F 4 to get reads that mapped but whose mates did not.
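A hedged sketch of that suggestion (the output file name here is just illustrative): keep reads that are themselves mapped (-F 4 excludes unmapped reads) while still requiring the mate to be unmapped (-f 8), then recount with flagstat.
samtools view -b -h -f 8 -F 4 fileName.bam > mateUnmapped_readMapped.bam
samtools flagstat mateUnmapped_readMapped.bam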
On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N, however, I find that I am wasting storage space (on file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm, so I am unable to create a master program which simply aggregates the output data into a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls itself which is failing; it's the shell, which expands the glob into more arguments than the kernel will pass to a single command. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
e.g.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
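If the file names may contain whitespace, a null-delimited variant of the same pipeline is safer, and sort can be slotted in without breaking it (GNU find, sort and xargs assumed):
find . -type f -name '*.dat' -print0 | sort -z | xargs -0 ls -l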
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
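For what it's worth, on clusters where util-linux's flock(1) is available and the filesystem honours advisory locks, that multiplexing can be sketched as follows ($job_output_file is a hypothetical name for the job's buffered output):
(
    flock -x 9                                  # take an exclusive lock on fd 9
    cat "$job_output_file" >> /shared/results.txt
) 9> /shared/results.txt.lock                   # fd 9 is tied to the lock file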
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
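A sketch of the sub-directory idea, where JOB_ID stands in for whatever numeric id your scheduler gives each job (hypothetical here):
bucket=$(( JOB_ID % 1000 ))                     # spread jobs over 1000 buckets
mkdir -p "results/$bucket"
mv output.txt "results/$bucket/output_${JOB_ID}.txt"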
Good luck.
Edit: As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position in the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above, and you can also just try again (the length prefix lets the demultiplexer discard the truncated record).
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
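The same idea can be sketched from the shell, since >> opens the file with O_APPEND; whether a single short printf really lands as one write(2) depends on the shell and the filesystem, so verify that on your cluster before trusting it (JOB_ID and result_line are hypothetical names):
printf '%s\t%s\n' "$JOB_ID" "$result_line" >> /shared/results.tsv
# Later, demultiplex by the id in column 1 (assumes the result itself contains no tabs):
sort -k1,1 /shared/results.tsv |
    awk -F'\t' '$1 != prev { close(out); out = "job_" $1 ".out"; prev = $1 } { print $2 > out }'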
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned about space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
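Roughly, that looks like the following; --results saves each job's stdout and stderr under out/, keyed by its arguments (check the man page for the exact directory layout):
parallel --results out/ computation ::: 1 2 ::: I II ::: . ..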
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.
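A rough sketch of that idea, where TASK_ID stands in for whatever sequential id your scheduler hands each job and computation is the program being run (both names are hypothetical):
NUM_FILES=64                                    # at least the number of nodes
OUT="results_$(( TASK_ID % NUM_FILES )).txt"
./computation >> "$OUT"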
I'm looking for a shell script idea to work on for practice with shell scripting. Can you please suggest intermediate ideas to work on?
I'm a developer and I prefer working on an idea that deals with files.
For shell scripting, think of a task that you do frequently - and think how you would automate that task.
You can start off with a basic script that just about does what you need. Then you realize that there are small variations on the task, and you start to allow the script to handle those. And it gently becomes more complex.
Almost all of the scripts I have (some hundreds of them) started off as "I've done that before; how can I avoid having to do it again?".
Can you give an example?
No - because I don't know what tasks you do sufficiently often to be (minor) irritants that could be salved by writing a script.
Yes - because I've got scripts that I wrote incrementally, in an attempt to work around some issue or other in my environment.
One task that I'm working on - still a work in progress - is:
Identify duplicate files
Starting at some nominated directory (default, $HOME), find all the files, and for each file, establish a checksum (MD5, SHA1, SHA256 - it is not critical which) for the file; record the file name and checksum (and maybe device number and inode number).
Establish which checksums are repeated - hence identifying identical files.
Eliminate the unique checksums.
Group the duplicate files together with appropriate identifying information.
This much is fairly easy - it requires some medium-grade shell scripting and you might have to find a command to generate the checksum (but you might be OK with sum or cksum, though neither of those reaches even the level of MD5). I've done this in both shell and Perl.
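For the first part, a sketch along these lines works, assuming GNU coreutils (md5sum prints a 32-character hash, so uniq -w32 groups on it and --all-repeated=separate keeps only the duplicated groups, blank-line separated):
top="${1:-$HOME}"                               # nominated directory, defaulting to $HOME
find "$top" -type f -exec md5sum {} + |
    sort |
    uniq -w32 --all-repeated=separate > duplicate-groups.txt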
The hard part - where I've not yet gotten a good solution - is then dealing with the duplicates. I have some 8,500 duplicated hashes, with about 27,000 file names in total. Some of the duplicates are images like smileys used in chat transcripts - there are a lot of copies of that particular image. Others are duplicate PDF files collected from various machines at various times; I need to organize them so I have one copy of the file on disk, with perhaps links in the other locations. But some of the other locations should go - they were convenient ways to get the material from retired machines onto my current machine.
I have not yet got a good solution to the second part.
Here are two scripts from my personal library. They are simple enough not to require a full blown programming language, but aren't trivial, particularly if you aim to get all the details right (support all flags, return same exit code, etc.).
cvsadd
Write a script to perform a recursive cvs add so you don't have to manually add each sub-directory and its files. Make it so it detects the file types and adds the -kb flag for binary files as needed.
For bonus points: Allow the user to optionally specify a list of directories or files to restrict the search to. Handle file names with spaces correctly. If you can't figure out if a file is text or binary, ask the user.
#!/bin/bash
#
# Usage: cvsadd [FILE]...
#
# Mass `cvs add' script. Adds files and directories recursively, automatically
# figuring out if they are text or binary. If no file names are specified, looks
# for unversioned files and directories in the current directory.
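Not a solution, just a sketch of the text-versus-binary test the exercise hinges on ($file is a hypothetical loop variable; grep -I treats binary data as non-matching, so empty files land in the binary branch too):
if grep -Iq . "$file"; then
    cvs add "$file"          # looks like text
else
    cvs add -kb "$file"      # binary (or empty): add with the -kb flag
fi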
svnfind
Write a wrapper around find which performs the same job, recursively finding files matching arbitrary criteria, but ignores .svn directories.
For bonus points: Allow other actions besides the default -print. Support the -H, -L, and -P options. Don't erroneously filter out files which simply happen to contain the substring .svn. Make usage identical to the regular find command.
#!/bin/bash
#
# Usage: svnfind [-H] [-L] [-P] [path...] [expression]
#
# Attempts to behave identically to a plain `find' command while ignoring .svn
# directories. Usage is identical to `find'.
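If you want a starting point, the core idiom such a wrapper is built around is find's -prune, roughly:
# Skip .svn trees entirely, then apply whatever tests and actions you want
# to everything else (here: print regular files).
find . -name .svn -prune -o -type f -print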
You could try some simple CGI scripting. It can be done in shell and involves a lot of here documents, parsing and extracting of form values, a bit of escaping and whatever you want to do as payload. (I do not recommend exposing such a script to the hostile internet, though.)
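For a flavour of what that looks like, here is a toy sketch (deliberately naive: no URL-decoding or escaping, so definitely not for the hostile internet):
#!/bin/bash
printf 'Content-Type: text/plain\r\n\r\n'
# QUERY_STRING is set by the web server; pull out a single "name" value.
name=${QUERY_STRING#*name=}
name=${name%%&*}
printf 'Hello, %s\n' "${name:-stranger}"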
How can I get files in the same sequence as the Windows file system shows them? There are so many sort keys: name, size, last modified date/time, tag (Win 7), rating (Win 7), so simulating the Windows sorting behaviour with the CFileFind API is quite difficult. How can I get the files in the same sequence as the Windows file system?
I'm not sure what CFileFind does, but FindFirstFile and friends return files in the order they exist in the NTFS directory.
I'm not sure why that would be the most desirable, though, it's not exactly "intuitive" by anyone's definition...
Raymond Chen did a pretty detailed article on "Why do NTFS and Explorer disagree on filename sorting?"
However, note that FindFirstFile() and its relatives don't actually sort the results - they just give the files back to you in whatever order the filesystem hands them up. NTFS has an ordering for its own purposes (and I'm not sure that that ordering is specified - that it appears ordered to you is probably just a happy coincidence). FAT file systems and network filesystems will have their own ordering (or no ordering - the files might just be in the directory in whatever order they happened to be created - I think FAT systems are like that).
If you need a particular order for files returned by FindFirstFile() and friends, you'll need to do that sorting yourself.
From the FindFirstFile() docs: "FindFirstFile does no sorting of the search results. For additional information, see FindNextFile."
And from the docs for FindNextFile(): "The order in which the search returns the files, such as alphabetical order, is not guaranteed, and is dependent on the file system. If the data must be sorted, the application must do the ordering after obtaining all the results."
CFileFind() makes no promises about the order of the filenames returned - I'd be astonished if it did any sorting either (since it would have to get all possible files from the destination directory before returning the first one to be able to pull it off).
First off; I am not necessarily looking for Delphi code, spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirectories), though they were using byte-by-byte methods. Second off, I am not looking for a difftool, I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. However, the first time I need to copy all the files, and I also need to check the backup directory against the original when the application starts (in case something happened while the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed-up files and then running through the original directory, computing the CRC for every file, and finally comparing the two CRCs. Then somehow look for files which are in one directory and not the other (again, vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
And in your change notification, you should probably add the files to a queue and use a timer object to copy the new queued files every ~30sec or something, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
Regardless of you "not looking for a difftool", are you opposed to using Cygwin with it's "diff" command for the shell? If you are open to this its quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine, grepping out not only information about differences between files but also, by matching 'Only', files which are in one directory but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose:
generate a CRC/checksum for every file
send the name, path, and CRC/checksum for each file over the internet to the other machine
compare
Perhaps the easiest way to do that is to use rsync with the --dry-run or --list-only option.
(Or use one of the many applications that use the rsync algorithm,
or compile the rsync algorithm into your application).
cd some_backup_directory
rsync -r --dry-run myname@remote_host:latest_version_directory .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
For extra safety, you can give rsync the "--checksum" option to ignore the length and modification time and force it to compare (the checksum of) the actual contents of the file.
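Putting those options together, a content-based dry run might look something like this (-i / --itemize-changes just makes the reported differences easier to read):
cd some_backup_directory
rsync -r -n -i --checksum myname@remote_host:latest_version_directory .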