get progress of a file being uploaded - unix - bash

I have a requirement to monitor the progress of a file being uploaded using a script. In PuTTY we can view the upload percentage, bytes transferred, upload speed and ETA on the right-hand side. I want to develop similar functionality. Is there any way to achieve this?

Your question is lacking any information about how your file is transferred. Most clients have some way to display progress, but that depends on the individual client used (scp, sftp, ncftp, ...).
But there is a way to monitor progress independently of what is doing the transfer: pv (pipe viewer).
This tool has the sole purpose of generating monitoring information. It can be used much like cat. You either use it to "lift" a file to pv's stdout...
pv -petar <file> | ...
...or you use it in the middle of a pipe -- but then you need to provide the "expected size" manually in order to get a proper progress bar, since pv cannot determine the size of the transfer beforehand. I used an expected size of 2 gigabytes here (-s 2G)...
cat <file> | pv -petar -s 2G | ...
The options used are -p (progress bar), -e (ETA), -t (elapsed time), -a (average rate), and -r (current rate). Together they make for a nice mnemonic.
Other nice options:
-L, which limits the maximum rate through the pipe (throttling).
-W, which makes pv wait until data is actually transferred before showing a progress bar (e.g. if the tool you are piping the data into asks for a password first).
This is most likely not exactly what you're looking for (chances are the transfer client you're using has its own options for showing progress), but it's the one tool I know of that works for virtually any kind of data transfer, and it might help others visiting this question in the future.
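As a concrete sketch of the original use case -- the filename, user, host and remote path are placeholders -- an upload over SSH with pv reporting progress, rates and ETA could look like this:

# pv reads the local file (so it knows the total size), prints progress on stderr,
# and pipes the data through ssh to the remote host.
pv -petar backup.tar.gz | ssh user@remote 'cat > /tmp/backup.tar.gz'

# The same idea with the transfer throttled to 100 KB/s via -L:
pv -petar -L 100k backup.tar.gz | ssh user@remote 'cat > /tmp/backup.tar.gz'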

Related

Batch Convert Sony raw ".ARW" image files to .jpg with raw image settings on the command line

I am looking to convert 15 million 12.8 MB Sony .ARW files to .jpg.
I have figured out how to do it using sips on the command line, but what I need is to make adjustments to the raw image settings: Contrast, Highlights, Blacks, Saturation, Vibrance, and most importantly Dehaze. I would be applying the same settings to every single photo.
It seems like ImageMagick should work if I can figure out how to incorporate Dehaze, but I can't seem to get ImageMagick to work.
I have done benchmark testing comparing Lightroom Classic, Photoshop, Bridge, RAW Power and a few other programs. RAW Power is the fastest by far (on an M1 Mac mini with 16 GB RAM), but it doesn't allow me to process multiple folders at once.
I do a lot of scripting / actions with Photoshop, but in this case Photoshop is by far the slowest option. I believe this is because it opens each photo.
That's 200 TB of input images, without even allowing any storage space for the output images. It's also 173 solid days of 24 hr/day processing, assuming you can do 1 image per second - which I doubt.
You may want to speak to Fred Weinhaus (@fmw42) about his Retinex script (search for "hazy" on that page), which does a rather wonderful job of haze removal. Your project sounds distinctly commercial.
© Fred Weinhaus - Fred's ImageMagick scripts
If/when you get a script that does what you want, I would suggest using GNU Parallel to get decent performance. I would also think you may want to consider porting, or having ported, Fred's algorithm to C++ or Python to run with OpenCV rather than ImageMagick.
So, say you have a 24-core Mac Pro and a bash script called ProcessOne that takes the name of a Sony ARW image as a parameter, you could run:
find . -iname \*.arw -print0 | parallel --progress -0 ProcessOne {}
and that will recurse through the current directory finding all Sony ARW files and passing them to GNU Parallel, which will then keep all 24 cores busy until the whole lot are done. You can specify fewer or more parallel jobs with, say, parallel -j 8 ...
Note 1: You could also list the names of additional servers in your network and it will spread the load across them too. GNU Parallel is capable of transferring the images to the remote servers along with the jobs, but I'd question whether that makes sense for this task - you'd probably want to put a subset of the images on each server, with its own local disk I/O, and run the servers independently yourself rather than distributing everything from a single point.
Note 2: You will want your disks well configured to handle multiple, parallel I/O streams.
Note 3: If you do write a script to process an image, write it so that it accepts multiple filenames as parameters; then you can run parallel -X and it will pass as many filenames as your sysctl parameter kern.argmax allows, so you won't need a whole bash or OpenCV C/C++ process per image (see the sketch below).
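As a rough illustration only -- ImageMagick has no direct Dehaze equivalent, the adjustment values below are placeholders, and raw decoding assumes your ImageMagick 7 build has a dcraw/libraw delegate -- a ProcessOne-style script that accepts multiple filenames might look like this:

#!/bin/bash
# Hypothetical ProcessOne: convert each Sony ARW given as an argument to JPEG,
# applying the same (placeholder) adjustments to every image.
for f in "$@"; do
    magick "$f" -contrast-stretch 0.5%x0.5% -modulate 100,115,100 -quality 92 "${f%.*}.jpg"
done

It could then be driven exactly as above with find . -iname \*.arw -print0 | parallel -0 -X ./ProcessOne {}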

"tail -F" equivalent in lftp

I'm currently looking for tips to simulate a tail -F in lftp.
The goal is to monitor a log file the same way I could do with a proper ssh connection.
The closest command I found for now is repeat cat logfile.
It works, but it's not ideal when the file is big, because it re-displays the whole file every time.
The lftp program specifically will not support this, but if the server supports it, it is possible to pull only the last $x bytes of a file with, e.g., curl --range (see this serverfault answer). Combined with some logic to only grab as many bytes as have been added since the last poll, this could let you do it relatively efficiently. I doubt there are any off-the-shelf FTP clients with this functionality, but someone else may know better.
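A rough sketch of that polling idea, assuming an FTP server that honours resumed transfers (which is what curl's -r/--range relies on); the URL, credentials and poll interval are placeholders:

# Poll the remote log and print only the bytes added since the last pass.
url='ftp://user:pass@example.com/logs/logfile'
cache=/tmp/logfile.partial
: > "$cache"
while true; do
    offset=$(( $(wc -c < "$cache") ))       # how many bytes we already have
    # Fetch everything past that offset; if the file has not grown,
    # curl prints nothing (or fails quietly) and the loop simply retries.
    curl -s -r "$offset-" "$url" | tee -a "$cache"
    sleep 5
done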

Adjust rate limit of a running pipe viewer (pv) process

I am using pipe viewer (pv) to limit the transfer rate while uploading VM backups to an online storage. Here is how I use it in a bash script:
ssh root@xenserver "xe vm-export uuid=${CurrentSnapshotUUID} filename=" | ${gpgEncrypt} | pv --quiet --rate-limit 300k | /usr/local/bin/aws s3 cp - ${bucketS3}/${CurrentVM}_${TodayDate}.xva.gpg
This works like a charm, but there is a constraint: I cannot upload at 300 KByte/s during peak hours, because the extra traffic gets pretty expensive.
Unfortunately, I cannot split the data into several parts and upload them one after the other; it's one huge data stream generated by the VM export that I need to process in one go. So I need to find a way to lower the rate limit at certain times without interrupting pv.
Does anyone have an idea how I can achieve this?
Cheers,
Rob
Thanks to Andrew Wood, the author of pv, I found the answer to my question. You can change the rate limit of an already-running pv process with PID 123 like this:
pv --remote 123 --rate-limit 200k
What a cool feature. Case closed!
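To automate the peak-hour adjustment, the same call could be driven from cron. This is only a sketch: looking up the PID with pgrep assumes a single pv instance is running under the same user as the cron jobs.

# Hypothetical crontab entries: throttle harder during peak hours, restore afterwards.
0 8 * * *   pv --remote "$(pgrep -nx pv)" --rate-limit 100k
0 20 * * *  pv --remote "$(pgrep -nx pv)" --rate-limit 300k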

Emulating 'named' process substitutions

Let's say I have a big gzipped file data.txt.gz, but often the ungzipped version needs to be given to a program. Of course, instead of creating a standalone unpacked data.txt, one could use the process substitution syntax:
./program <(zcat data.txt.gz)
However, depending on the situation, this can be tiresome and error-prone.
Is there a way to emulate a named process substitution? That is, to create a pseudo-file data.txt that would 'unfold' into the process substitution zcat data.txt.gz whenever it is accessed - much like a symbolic link forwards a read operation to another file, except that here it would need to be a temporary named pipe.
Thanks.
PS. Somewhat similar question
Edit (from comments): The actual use case is a large gzipped corpus that, besides being used in its raw form, sometimes needs to be processed with a series of lightweight operations (tokenized, lowercased, etc.) and then fed to some "heavier" code. Storing a preprocessed copy wastes disk space, and repeatedly retyping the full preprocessing pipeline can introduce errors. At the same time, running the pipeline on the fly incurs only a tiny computational overhead, hence the idea of a long-lived pseudo-file that hides the details under the hood.
As far as I know, what you are describing does not exist, although it's an intriguing idea. It would require kernel support so that opening the file would actually run an arbitrary command or script instead.
Your best bet is to save the long command in a shell function or script to reduce the hassle of invoking the process substitution.
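For example, a minimal sketch (the function name is arbitrary):

# Wrap the decompression step in a function so the pipeline never has to be retyped.
data() { zcat data.txt.gz; }
./program <(data)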
There's a spectrum of options, depending on what you need and how much effort you're willing to put in.
If you need a single-use file, you can just use mkfifo to create it, start a redirection of your archive into the FIFO, and pass the FIFO's filename to whatever needs to read from it.
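A minimal sketch of that single-use approach, using the names from the question:

mkfifo data.txt                  # the pseudo-file handed to the consumer
zcat data.txt.gz > data.txt &    # blocks until something opens the pipe for reading
./program data.txt               # reads the decompressed stream once, sequentially
rm data.txt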
If you need to repeatedly access the file (perhaps simultaneously), you can set up a socket using netcat that serves the decompressed file over and over.
With "traditional netcat" this is as simple as while true; do nc -l -p 1234 -c "zcat myfile.tar.gz"; done. With BSD netcat it's a little more annoying:
# Make a dummy FIFO
mkfifo foo
# Use the FIFO to track new connections
while true; do cat foo | zcat myfile.tar.gz | nc -l 127.0.0.1 1234 > foo; done
Anyway, once the server (or file-based domain socket) is up, you just do nc localhost 1234 to read the decompressed file. You can of course use nc localhost 1234 as part of a process substitution somewhere else.
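For instance, with the listener above running on port 1234:

./program <(nc localhost 1234)   # the program sees the decompressed data as a file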
Depending on your needs, you may want to make the shell script more sophisticated (caching, etc.), or just dump this thing and go for a regular web server in some scripting language you're comfortable with.
Finally, and this is probably the most "exotic" solution, you can write a FUSE filesystem that presents virtual files backed by whatever logic your heart desires. At that point you should have a good hard think about whether the maintainability and complexity costs of where you're going are really offset by saving someone a few extra calls to zcat.

List files accessed per-process with xperf?

With xperf I can generate a trace and get a "flat" listing of all files read like so:
xperf -on FileIO+FILE_IO+FILE_IO_INIT+FILENAME -stackwalk FileRead+FileWrite+FileDelete
xperf -start FileIOSession -heap -PidNewProcess "C:\Python27\x86\python.exe scratchy.py" -WaitForNewProcess -BufferSize 1024 -MinBuffers 128 -MaxBuffers 512 -stackwalk HeapAlloc+HeapRealloc -f ./tempheap.etl
xperf -stop FileIOSession -stop -d fileio.etl
xperf -i fileio.etl -o fio_output.txt -a filename
Unfortunately, the fio_output.txt file contains a list of every file imaginable that was accessed (by my web browser, IDE, etc.). More frustratingly, if I manually open xperfview and open the File I/O Summary Table, I can see my process (python.exe in this case) and the one file it reads (for test purposes), but I can't seem to find a way to output that same data on the CLI, which is what I need: an unattended, automated method of generating file access info.
If you want to view this data then you should load the trace into WPA, open the file I/O table, and arrange the columns appropriately. Since you want to group by process you should have the process column first, then the orange bar, and then whatever data columns you want.
If you want to export the data to programmatically parse it then you should use wpaexporter.exe, new in WPT 8.1. See this blog post which I wrote describing how to do this:
https://randomascii.wordpress.com/2013/11/04/exporting-arbitrary-data-from-xperf-etl-files/
Using wpaexporter lets you decide exactly what data columns you want to export instead of being constrained by the limited set of trace processing actions that xperf.exe gives you.
I suspect you can get this data out of tracerpt.exe instead - I'd give that a try.
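For example, a minimal invocation like this could dump the trace to CSV for scripted filtering (narrowing the rows down to the process of interest would be a separate step in your own script):

tracerpt fileio.etl -o fileio.csv -of CSV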

Resources