Performance problems reading lots of small files using libssh2

I am trying to read lots of small files using libssh2.
I am currently using libssh2_scp_recv/libssh2_channel_read and I have also tried libssh2_sftp_open/libssh2_sftp_read.
With large files I can get a speed similar to scp, but with small files most of the time is spent opening a handle to the remote file (libssh2_scp_recv) rather than downloading it (libssh2_channel_read).
How does scp do it?
Is there a simple way to batch-download multiple files so that I can saturate my connection?

Not unless you write your own SFTP layer on top of libssh2 that is able to use pipelining.
Maybe an easier solution would be to use several threads, each establishing an independent SSH connection, to retrieve several files in parallel; a sketch of that idea follows.
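As an illustration only: here is the parallel-connection idea sketched in Python with paramiko (libssh2 itself is a C library, so treat this as a sketch of the approach, not of the libssh2 API). The host, credentials, file list, and worker count are hypothetical placeholders.

# Sketch: N worker threads, each with its own SSH connection and SFTP
# session, each streaming its share of the files over that one session
# (so the per-file handle-open cost is at least overlapped N ways).
from concurrent.futures import ThreadPoolExecutor
import os
import paramiko

HOST, USER = "example.com", "user"                      # hypothetical
FILES = [f"/remote/file{i}.dat" for i in range(1000)]   # hypothetical
WORKERS = 4

def fetch_batch(batch):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USER)
    try:
        sftp = client.open_sftp()
        for remote_path in batch:
            sftp.get(remote_path, os.path.basename(remote_path))
    finally:
        client.close()

# Split the file list into one batch per worker and run them in parallel;
# tune WORKERS to whatever actually saturates your link.
batches = [FILES[i::WORKERS] for i in range(WORKERS)]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(fetch_batch, batches))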

Related

Move/copy millions of images from Macos to external drive to ubuntu server

I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my MacBook Pro but want to get them to our DGX1 (GPU cluster) somehow. I thought it would be faster to copy to a fast external SSD (2x NVMe in RAID 0) and then plug that drive directly into a local terminal and copy it to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hours now.
I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the terminal in macOS to copy the files using cp.
I tried "cp /path/to/dataset /path/to/external-ssd"
Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.
Using 7zip to archive the dataset increased the "file" transfer speed, but it took over four days(!) to extract the files, and that was for a dataset an order of magnitude smaller.
Using the command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8k IOs on the disk. It's been 24 hours and it isn't quite halfway done.
Is there a better way to do this?
rsync is the preferred tool for this kind of workload. It is used for both local and network copies.
Main benefits are (excerpt from manpage):
delta-transfer algorithm, which reduces the amount of data sent
if it is interrupted for any reason, then you can restart it easily with very little cost. It can even restart part way through a large file
options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied.
Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.
Regarding usage and syntax, for local transfers it is almost the same as cp:
rsync -az /path/to/dataset /path/to/external-ssd
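Since the files ultimately need to reach the server anyway, note that rsync can also do the network leg directly over SSH (the user, host, and path below are hypothetical placeholders):
rsync -az /path/to/dataset user@dgx1:/scratch/
A useful property at this scale: if the transfer is interrupted, rerunning the same command skips the files that already arrived intact.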

Julia invoke script on existing REPL from command line

I want to run a Julia script from the Windows command line, but it seems that every time I run julia code.jl, a new instance of Julia is created, and the startup time (package loading, compiling?) is quite long.
Is there a way for me to skip this startup time by running the script on the current REPL/Julia instance? (That usually saves me 50% of the running time.)
I am using Julia 1.0.
Thank you,
You can use include:
julia> include("code.jl")
There are several possible solutions. All of them involve different ways of sending commands to a running Julia session. The first few that come to my mind are:
use sockets as explained in https://docs.julialang.org/en/v1/manual/networking-and-streams/#A-simple-TCP-example-1
set up an HTTP server, e.g. using https://github.com/JuliaWeb/HTTP.jl
use named pipes, as explained in Named pipe does not wait until completion in bash
communicate through the file system (e.g. make Julia scan some folder for .jl files; if it finds any, they get executed and then moved to another folder or deleted) - this is probably the simplest to implement correctly (a sketch of the sending side follows below)
In all the solutions you can send the command to Julia by executing some shell command.
No matter which approach you prefer, the key challenge is hardening the setup to handle errors properly (i.e. the situation where you send some command to the Julia session and it crashes, or where you send requests faster than Julia is able to handle them). This is especially important if you want the Julia server to be detached from the terminal.
As a side note: when using the Distributed module from stdlib in Julia for multiprocessing you actually do a very similar thing (but the communication is Julia to Julia) so you can also have a look how this module is implemented to get the feeling how it can be done.
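To make the file-system option above concrete, here is a minimal sketch of the sending side in Python (the folder name and naming convention are hypothetical; they just need to match whatever the Julia watcher expects). The temp-file-plus-rename dance matters for the error handling mentioned above: the rename is atomic, so the watcher never picks up a half-written file.

import os
import tempfile
import time

JOB_DIR = "/tmp/julia-jobs"  # hypothetical folder the Julia session scans

def submit(julia_code):
    # Write to a temporary name first, then rename into place: os.replace
    # is atomic, so the watching Julia loop never sees a partial .jl file.
    os.makedirs(JOB_DIR, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=JOB_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(julia_code)
    os.replace(tmp_path, os.path.join(JOB_DIR, f"job-{time.time_ns()}.jl"))

submit('println("hello from the already-running session")')

On the Julia side, the watcher is then just a loop that lists the folder, calls include on each .jl file it finds, and deletes it afterwards.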

Way to represent unknown file size in FTP LIST?

I have an FTP server running on a device which generates a synthetic file, i.e. the file is not actually present on the file system. It is generated on the fly whenever it is requested for download, and its size is not known until it is generated.
Certain FTP clients (e.g. Windows Explorer) are apparently using the info from the FTP LIST command to determine the expected file size after they download it. When I request the download, Windows Explorer is first trying the FTP SIZE extension (RFC 3659), which the server doesn't support. Then it downloads the entire file (I see the whole contents in Wireshark), but it truncates the file to the size that was reported in the LIST.
What is the expected behavior here? Is Windows Explorer acting outside of spec with this optimization / safety measure?
Our current work-around is to show the file size as bigger than it will ever be in the file listing.
Is there a way to represent an unknown file size in an ls -l / FTP LIST? As I read RFC 959, the LIST format is not standardized, but I'm guessing the POSIX ls '-l' format is pretty common, and that Windows Explorer is parsing that.
How should I be handling this situation?
What is the expected behavior here? Is Windows Explorer acting outside of spec with this optimization / safety measure?
The spec does not say anything about checking the file size against the transferred size, but if you strictly follow the spec you never know whether the file was transferred fully. All the client can observe is that the data connection closed, which might also be due to problems on the server side.
So the client is trying to work around these limitations.
Is there a way to represent an unknown file size in an 'ls -l' / FTP LIST? As I read RFC 959, the LIST format is not standardized, but I'm guessing the POSIX format is pretty common, and that Windows Explorer is parsing that.
Right, there is no standardized format, but the output of ls -l is commonly used. Since that output describes a file system of static files which all have a known size, there is no way to express an "unknown" size. But since the format is not standardized, you might try simply returning the file list without any sizes (as NLST does).
Apart from that, your client just implements the typical use case for FTP, i.e. transferring static files, whose sizes are known. You are using FTP in an atypical way, so you have to deal with these missed expectations and cheat. Since FTP is a nightmare anyway when firewalls or private networks are involved, you might be better off switching to another widely used protocol, like HTTP or WebDAV.
The workaround on the client side could be to use wget or the ftp command. ftp example:
open a console
ftp user@yourserver
cd to_your_path
get your_filename
Neither wget nor ftp's get command needs the size to download a file; they simply read until the data connection closes.
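The same size-agnostic behavior is easy to script. A minimal sketch using Python's ftplib (host, credentials, path, and file name are hypothetical placeholders): retrbinary streams the RETR data until the server closes the data connection and never consults LIST or SIZE.

from ftplib import FTP

with FTP("yourserver") as ftp:
    ftp.login("user", "password")
    ftp.cwd("to_your_path")
    # Stream the file to disk; no expected size is checked anywhere.
    with open("your_filename", "wb") as f:
        ftp.retrbinary("RETR your_filename", f.write)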
One more solution could be to run your own FTP server. You can find plenty of servers with basic features, GUI client support, and simple code in a scripting language like PHP; it is easy to change the code to support this. I have done the same for one of my projects.

How can I copy a file from VMS to Windows and back again?

I am trying to copy C source files from a VMS Alpha to a Windows machine to allow easier editing of the code. (The VMS editor is just a plain text editor, and it would be nice to have syntax highlighting etc.)
I can copy them across using Exceed FTP, and it handles the version-suffixed duplicate filenames that VMS has:
File.c;1
File.c;2
File.c;3
But when I open a file I've transferred, all the line breaks have been lost and the entire file is just one line.
Can anyone recommend a solution to this or offer any hints?
Thanks in advance
ps. I need to be able to copy the files back to vms and still maintain format.
It may be of little interest by now, but in case you are still wondering about "one-line" text files after an FTP transfer:
The short answer: force the FTP transfer mode to ASCII (or text) in your FTP client. This makes sure the C files you are transferring (in fact, all files) are treated as text; otherwise they are assumed to be binary, so you get a raw byte stream.
Long answer: there are two FTP transfer modes, ASCII/text and binary/image, and the default is sometimes client- or server-specific.
Many clients have an auto mode that interprets the file extension to set the proper transfer mode (.TXT, .CSV, etc.).
When you access the VMS server via an FTP client, too often the [Windows-based] client is not VMS-friendly, so it does not parse the file list properly. Thus it gets confused by the version number appended to the "usual" file name:
filename.ext;ver ==> file.c;1
So instead of seeing .C (and assuming text), it sees .C;1 and thinks the file is binary.
I use the FileZilla FTP client to/from VMS, and so far it handles this properly (though the version support is not always what I'd like).
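If you want to take the client's auto-detection out of the picture entirely, scripting the transfer makes the mode explicit. A minimal sketch with Python's ftplib (host, credentials, and file names are hypothetical): retrlines issues TYPE A before the RETR, so the transfer is ASCII regardless of what an auto-mode client would guess from ".c;1".

from ftplib import FTP

with FTP("vmshost") as ftp:
    ftp.login("user", "password")
    # retrlines forces ASCII mode (TYPE A); the server then converts
    # VMS records into text lines for us.
    with open("file.c", "w", newline="\n") as f:
        ftp.retrlines("RETR FILE.C;1", lambda line: f.write(line + "\n"))

Going the other way, ftplib's storlines() does the ASCII-mode upload, which is what preserves the record format when you copy edited files back to VMS.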
Copying a file to and from your Windows desktop every time you want to edit gets old very quickly.
You may be able to implement a much nicer alternative. There is software under VMS that permits a VMS directory tree to be treated as a "network disk" under Windows. Once you've set it up, and set up Windows to recognize the network disk, you can just open the file with a Windows text editor without moving it from VMS to Windows. You can also browse the directory tree, which appears like a tree of folders.
When you issue a save from your text editor, the saved copy supersedes the previous version over in VMS land. And it mediates correctly between RMS record format and embedded-newline format. It's a whole lot more convenient than FTP for this purpose.
After doing a quick Google search, I think the name of the VMS software is PATHWORKS. But I'm not sure.
A few points I have on this:
PATHWORKS is fairly old and (as far as I recall - I don't use it) doesn't work well with recent Windows versions, e.g. in supporting Active Directory. Within the last few years HP has ported Samba to VMS, and this is the way to go if you want to make areas of disk visible to Windows machines. It should be easy to find on the HP web site.
If you want to try the FTP/SFTP route, I would try SFTP and go for a VMS version of at least 8.2. The TCP/IP suite was rewritten (or re-ported from a Unix version) at that point.
VMS supports a number of formats for text files. As well as the complex record structure described above, there is STREAM_LF, which is the same as a Unix file, and STREAM_CRLF.
I found some interesting information about the OpenVMS text file structure. It corresponds with a vague memory I have of how VMS handles text files: they're not stored as streams of bytes, as on Windows and Unix systems, but as a sequence of records (each record being a text line). Records can be either fixed or variable width. Whatever reads the file is responsible for the "carriage control", what we normally call newlines these days.
You might check the options in Exceed FTP to make sure that you're transferring the file in an appropriate ASCII mode. There might be special options you need to set on the FTP server to read and write the files in the appropriate mode too.
I'm no expert - let's get that out in the open ;)
I have been having similar problems FTPing files from an OVMS AlphaServer to a Win7 desktop so I can migrate them to SQL.
FTP [Attachmate & Win CLI] worked fine under WinNT.
I suspect Win7 does not like the name.ext;version format of the OVMS files.
Filezilla - doesn't work.
PuTTY - doesn't work
Window CLI FTP - doesn't work [partial file transfer; times out & truncates file].
Using Attachmate's "Reflections for the Web 2011" to emulate Vax terminal - works fine.
I think I'll have to go back to Attachmate for assistance, but I'm partially hamstrung by our [Australian Fed Govt] IT services, which has the final say.
Some editors, such as BBEdit on the Mac, support directly opening/saving files via FTP/SFTP/etc. (BBEdit also supports the various line endings used on different platforms, which would help with your other problem). I expect there must be a Windows editor with similar functionality - my Windows-using colleagues all rave about something called CodeWrite (or CodeWright?), so I guess I would take a look at something like that.

Unmovable Files on Windows XP

When I defragment my XP machine I notice that there is a block of "Unmovable Files". Is there a file attribute I can use to make my own files unmovable?
Just to clarify, I want a way to programmatically tell Windows that a file that I create should be unmovable. Is this possible, and if so, how can I do it?
Thanks,
Terry
A lot of system files cannot be moved after the system boots, such as the page file and registry database files.
There is a utility that runs before Windows boots to defragment those files. I have it set to run at every boot, and it works well for me on several machines.
Note that the very first time you boot up with this utility set to run, it may take several minutes to defrag. After that first run though, it finishes in just 3 or 4 seconds.
Edit: To respond to your clarification - that link says Windows has marked the page file and registry files as open for exclusive access, so you should be able to do the same thing with the LockFile API call. However, that's not an attribute of the file itself; you'd have to actually run some background program that locks the file for exclusive access.
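The answer mentions LockFile; a simpler variant of the same idea is to open the file with no sharing at all, so no other process (a defragmenter included) can open it while your program runs. A minimal sketch of that suggestion, calling Win32 via Python's ctypes, with a hypothetical path:

import ctypes
from ctypes import wintypes

GENERIC_READ = 0x80000000
OPEN_EXISTING = 3
FILE_ATTRIBUTE_NORMAL = 0x80
INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.argtypes = [
    wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
    wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
]
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CloseHandle.argtypes = [wintypes.HANDLE]

handle = kernel32.CreateFileW(
    r"C:\data\keep-locked.dat",   # hypothetical file to hold open
    GENERIC_READ,
    0,                            # dwShareMode = 0: exclusive access
    None, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, None,
)
if handle == INVALID_HANDLE_VALUE:
    raise ctypes.WinError(ctypes.get_last_error())

# The lock lasts only while this process holds the handle open.
input("File held open exclusively; press Enter to release it...")
kernel32.CloseHandle(handle)

Note this is a live lock held by a running process, not an attribute stored on the file, which is exactly the limitation the edit above describes.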
There are no file attributes that you can place on your files to mark them as immovable. The only way (I think) that a file cannot be moved during defragmentation is for some other process to have the file open (for read or write; I'm not even sure whether it needs to be open in exclusive mode).
Quite frankly, I cannot think of a reason you'd want your files not to move, unless you have specific requirements about where on the disk platter they reside. Defragmentation should generally lead to faster disk access, and that seems desirable in all cases :-)
This usually means that the file is in use by some process. If you're defragmenting, you'll likely see this with a lot of system files. If the file should legitimately be movable and is stuck (it's being held by a process that runs at startup but shouldn't be, for example), the most useful way of resolving the problem is to remove all permissions on the file, reboot, restore the permissions, and then get rid of the file/run the program that's trying to use it.
I suppose the ugly way is to have an application run at startup, check every few seconds whether defrag is running, and if so open the file in exclusive mode.
This is really ugly and I don't recommend it unless there is no cleaner solution.
Terry, the answers all mention ways to prevent files from becoming unmovable during defragmentation. From your question it appears that you in fact want to make your personal files unmovable. Can you please clarify what is appealing about making your files unmovable?
I assume you're using the defragger that comes with Windows. Some commercial ones like DiskKeeper can move some of these files (usually system files). You can try their trial versions.
Contig might serve your purpose: http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx
I'm relatively certain I ran across some methods/attributes you could access programmatically to do exactly what you want. This was back in the NT4 days, though, and my memory isn't that good.
For a little more complete solution, try Raxco's PerfectDisk. While it is a commercial product, it does a very good job and supports boot-time defrag of system files. The first defrag takes longer than, say, DiskKeeper, but it's a single-pass defragger and supports defragging with very little free space left on the drive. Overall it's a much smarter defrag program than any other I've seen, and it supports systems of any size.
http://www.raxco.com/
First, try to move (or delete) the files in Safe Mode. If you cannot, try to move (or delete) the files from Linux.
But be careful: if those are Windows system files, you will fail to boot Windows.
Some reasons why files are unmovable: the file is too big, the file is open or in use, insufficient security privileges, the file is being accessed by another computer, and many other things.
