Why is uploading a model to a HuggingFace repository so slow? - huggingface-transformers

I have a problem: I'm trying to push a model to a HuggingFace repository, but it has said "uploading" for the past 16 hours, and that's just the pytorch_model.bin file, which is about 850 MB. I am using Git LFS. I have tried manually adding the files to the repository, which takes an eternity that I am not willing to wait out; there's no completion percentage, so you can't tell whether it's progressing or hanging.
I have tried using the git commands; same long wait.
However, if I try to upload to GitHub rather than HuggingFace, it doesn't take an eternity: 30 minutes at most. I feel as if I've wasted my time doing all this preprocessing and training for nothing.
Any suggestions, or has anyone run into similar problems?
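For reference, a minimal sketch of a programmatic upload via the huggingface_hub Python client, as an alternative to pushing through git + LFS; the repo id and file paths below are placeholders, not taken from the question:

# Hypothetical example: upload one large checkpoint with huggingface_hub.
# Requires `pip install huggingface_hub` and a prior `huggingface-cli login`.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="pytorch_model.bin",   # local checkpoint (~850 MB here)
    path_in_repo="pytorch_model.bin",      # destination path inside the repo
    repo_id="your-username/your-model",    # placeholder repo id
)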

Related

Doing restartable downloads in Ruby

I've been trying to figure out how to use the Down gem to do restartable downloads in Ruby.
So the scenario is downloading a large file over an unreliable link. The script should download as much of the file as it can in the timeout allotted (say it's a 5GB file, and the script is given 30 seconds). I would like that 30 second progress (partial file) to be saved so that next time the script is run, it will download another 30 seconds worth. This can happen until the complete file is downloaded and the partial file is turned into a complete file.
I feel like everything I need to accomplish this is in this gem, but it's unclear to me which features I should be using, and how much of it I need to code myself (streaming? or caching?). I'm a Ruby beginner, so I'm guessing I use the caching, save the progress to a file myself, and enumerate for as many rounds as I have time.
How would you solve the problem? Would you use a different gem/method?
You probably don't need to build that yourself. Existing tools like curl and wget already have that functionality.
If you really want to build it yourself, you could perhaps take a look at how curl and wget do it (they're open-source, after all) and implement the same in Ruby.
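The question asks about Ruby's Down gem, but the underlying technique is the same one curl and wget use: check how many bytes are already on disk and ask the server for the rest with an HTTP Range header. A rough, language-neutral sketch of that mechanism in Python (the URL and filename are hypothetical):

# Restartable download sketch: append to a partial file, resuming from its current size.
import os
import requests

url = "https://example.com/big-file.bin"   # placeholder
dest = "big-file.partial"

pos = os.path.getsize(dest) if os.path.exists(dest) else 0
headers = {"Range": f"bytes={pos}-"} if pos else {}

with requests.get(url, headers=headers, stream=True, timeout=30) as r:
    r.raise_for_status()
    with open(dest, "ab") as f:            # append to whatever is already downloaded
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

Rerunning the script after an interruption picks up where it left off, provided the server honours Range requests (most static file hosts do).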

Artifactory Delay

We noticed that artifacts uploaded to Artifactory do not become available via pip straight away. It takes at least 5 minutes before they can be downloaded and installed via pip. It seems like they are not indexed straight away, or are waiting for some timeslot to do so. We could not find any configuration related to this, which is not helpful.
I found this, which might be helpful to you:
When you upload many Pypi packages to the same repository within a close period of time the indexing does not happen immediately. It waits for a "quiet period" which can be adjusted. This can be done in the $ARTIFACTORY_HOME/etc/artifactory.system.properties file by setting the values of the artifactory.pypi.index.quietPeriodSecs and the artifactory.pypi.index.sleepMilliSecs properties to an amount of seconds that meets your needs. If those parameters do not exist, please add them to the file. You will need to restart Artifactory for this setting to take effect.
From what I can tell, if these values aren't in that file, both default to 60. Also sleepMilliSecs appears to be a number of seconds, not milliseconds as the name would suggest.
I believe how this works is: Artifactory waits for the repository to "settle", until there haven't been any changes (deployed or removed packages) for at least quietPeriodSecs seconds. It checks for this every sleepMilliSecs seconds.
Five minutes seems like a long time. If you're making a series of changes less than a minute apart, that might explain why it's taking a while. Also, the larger your repository is, the longer the indexing will take once it starts, so that might also be a factor.
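For illustration, adding the two properties mentioned above to $ARTIFACTORY_HOME/etc/artifactory.system.properties would look roughly like this (the values are examples, not recommendations), followed by an Artifactory restart:

# artifactory.system.properties - example values only, tune to your upload pattern
artifactory.pypi.index.quietPeriodSecs=30
artifactory.pypi.index.sleepMilliSecs=30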

Resetting TFS History and other tracked elements while keeping all current files

I've been given the dubious, hacky task of dropping all "TFS stuff" from about a month back all the way to when it first started. Why? Because it's currently a behemoth: over half a million files and close to six digits of changesets across loads of inter/cross-connected branches for a "legacy" code repository where no one cares about the "old stuff"'s history or access to old files, but they need what's currently in the repo to stay there.
The problem is, in its current state it makes interacting with TFS a huge pain. The few devs who still touch this code base often have IDE crashes when trying to do simple things like accessing the source control explorer, checking in, etc. Really basic things have become sketchy from TFS bloat. If I want to merge a single changeset from one active branch to another, the lists take 5-10 minutes to load instead of a few seconds.
While researching, I ran across this old question. It seems like it may be what I need, but I'd hate to find out the hard way that I used the commands incorrectly and have to re-load the snapshot for our TFS server, as this repo is worked on in multiple timezones.
How do I reset/purge history, shelvesets, work items, and anything else before, say, C20,000? Meaning, I don't care about anything before C20,000, but I want all the files that exist as of C20,000 to remain, even if one of those files was last touched in, say, C12, and is still in the repo unchanged.
If the linked-to answer provides the answer I need, but not the clarity, I'm fine closing this as a dupe so long as the other answer gets updated with clarification.
I'd like to do something along the lines of this pseudo-code:
tfs nuke $collection/$ProjectName beforeChangeset:C20000 /keepCurrentFiles
I'm guessing that will require multiple commands for things like files and shelvesets, but that's the gist of what I'm trying to accomplish.

What's the best way to migrate from SourceSafe to ClearCase?

We currently have a fat SourceSafe DB with ten years of code in it. We're looking for an easy and stable way to import all of this into a new ClearCase/Jazz environment.
What is the best method of doing this, and are there any tools out there to do this automagically?
I know this doesn't answer your question directly, but we had a similar problem several years ago when we moved from VSS to Perforce. We looked at the ways in which we could migrate the histories for all the files, but any solution we found had problems and would have taken a long time to execute.
In the end we simply decided to import the current version of the code into Perforce as the baseline and leave the old history in VSS. In the early days we did refer back to VSS occasionally, but after a few months we didn't need to.
If there's a problem with a file, you only need the last couple of revisions to be able to see what's changed and why. So if the file changes fairly frequently, you'll soon build up a useful history in the new repository. If the file doesn't change, then by definition it's stable and you don't need the history.
If you back up the old repository, you can always go back to it if you really need to dig out the old history.
In theory, clearexport_ssafe is the right tool:
The clearexport_ssafe utility reads the files and sub-projects in your Microsoft® Visual SourceSafe current project and generates a data file, which clearimport uses to create equivalent VOB elements.
By default clearexport_ssafe exports the files and subprojects in the Visual SourceSafe current project, but it does not export any files contained in subprojects. To export all files in all subprojects, specify the -r option.
In practice, the migrations I made (not from VSS though) involved the import of a few recent labels, and then the HEAD, into ClearCase.
That means the main tool I use for any import (from any other VCS) is clearfsimport.
You may lose some metadata (like the author of a version, and labels), but at least it is source-agnostic, and since your massive import only concerns a handful of labels from the source, you end up quickly with an operational VOB.
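For what it's worth, a typical clearfsimport invocation along those lines might look something like the one below; the snapshot directory, view path and label are placeholders, and the options should be checked against the ClearCase manual for your release:

clearfsimport -preview -recurse -nsetevent -mklabel IMPORT_BASELINE /export/vss_snapshot /view/myview/vobs/myvob

Running it with -preview first shows what would be imported; dropping that flag performs the actual import into the VOB directory.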
From IBM's web site:
http://www-01.ibm.com/support/docview.wss?ratlid=cctocbody&rs=984&uid=swg21222507
and this:
http://www.cmcrossroads.com/component/option,com_fireboard/func,view/id,63051/catid,31/Itemid,593/
However, ChrisF's answer is the same that I would suggest.
The effort involved generally is not worth it given the "benefit" of migrating the history.
I would just take snapshots of the current "tips" of the branches and put those under your new version control system.
I went through this exercise at least 3 times in my career. One conversion to Perforce and two to SVN.
I think I recall that we did some partial history imports, but then just dropped it all as the information we needed was in some other form. The actual repository history of changes just wasn't important enough to go through the pain. We did keep the database around for a year or so in case anyone wanted to look. I don't recall anyone complaining about it.
(I'm also curious why anyone would choose ClearCase over the rest of the ones out there - my guess is for integration with other Rational/IBM stuff)
EDIT
I would ask ClearCase/IBM. They'll have the most up-to-date information.
I actually lived through a VSS to ClearCase conversion. Rational had a conversion tool that we ran. It took FOREVER (2-3 days, but see below) to complete on our VSS database of maybe 2 years (maybe it was 3 years, but not close to your 10 years). But it worked, as far as I recall. It maintained the history and labels.
The slowness problem was likely due to a flaky RAID controller in our new source control server. The imports worked fine, but ClearCase would detect corruption in its data after a few days of working (often after a label). After several re-imports, firmware updates, and a new server it all worked out.
I'd still plan to give the import a weekend to run. Try to get someone who can remote in occasionally to check its progress.
On a side note, I've also done VSS conversions to Perforce and TFS. In general, I suggest giving the import tools a try. If they work, great. If they give you grief, just do what everyone else answering is suggesting: just start over by adding all the files as new.
I'd take the most significant labels and import them into ClearCase by 'clearfsimport'.

What is the initial cost of setting up CruiseControl?

What is the initial cost of setting up CruiseControl?
The key point here is not the time you have to invest in setting up CruiseControl. You can do this in an hour or so. The question is whether you have a code repository (SVN, TFS) and a build script ready (something - an MSBuild script or so - that will clean, rebuild, test and deploy your app). If not, you will have to invest some time in that - depending on how complicated your project is - but it will surely take a lot more time than setting up a CruiseControl server.
Not more than two to three hours' worth if you're new to it. The first time I used it, I had something that checked out the latest version from Subversion, compiled it using MSBuild and then uploaded it, in less than that time.
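As a rough illustration of that kind of setup, a stripped-down ccnet.config project block might look something like the following; the server names, paths and interval are placeholders, and the exact elements should be checked against the CC.NET documentation for your version:

<cruisecontrol>
  <project name="MyProject">
    <triggers>
      <intervalTrigger seconds="900" />
    </triggers>
    <sourcecontrol type="svn">
      <trunkUrl>http://svn.example.com/myproject/trunk</trunkUrl>
      <workingDirectory>C:\Builds\MyProject</workingDirectory>
    </sourcecontrol>
    <tasks>
      <msbuild>
        <executable>C:\Windows\Microsoft.NET\Framework\v4.0.30319\MSBuild.exe</executable>
        <projectFile>MyProject.sln</projectFile>
        <targets>Build</targets>
      </msbuild>
    </tasks>
  </project>
</cruisecontrol>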
I'd recommend Hudson over CruiseControl any day of the week. I can't think of anything CruiseControl can do which Hudson doesn't do (better). Especially the web-based frontend is far superior. You can run Hudson directly on your machine (using JNLP) and have your project set up in minutes.
It takes a little while to get it up and running, but you can get a solution building, using the task that builds your .sln file, in less than a day even if you're a complete newbie on the subject.
It gets a little more complicated when you add unit testing in various frameworks, customizing the dashboard, labelling your builds, etc., but it's a matter of days, not weeks, to get anything up and running.
Software - free.
Hardware - cost depends. If you only want to run nightly it can probably share server space with something else. We use a dedicated server with builds every 15 mins.
Set up time - Once learnt you're looking at a few hours to set up a new server. If you're new to CC allow a day or two.
If you've never used an integration server before you're going to have a learning curve for the entire team - allow a few weeks.
We've recently moved to a new server and we set up a fresh installation - it took a few hours. That's for four projects, two different source control providers, and includes custom tasks like reporting and building help files.
I'd recommend a dedicated machine for CruiseControl; it doesn't have to be amazingly powerful, but bear in mind it has to be able to compile your code.
We used an old developer's machine, which was put aside after an upgrade.
As far as the cost in time, a day should have you up and running.
How do you define 'cost'? It's free to download so there's no monetary cost.
In terms of time it should take between 1/2 - 1 day, depending on how complicated your configuration is.
If you have a simple project with no dependencies then a couple of hours. If you are actually doing 'integration' of many projects with many dependencies then several weeks and possibly code changes. IMHO CC.Net doesn't scale well to large numbers of projects...
You should be able to set it up in about 3 hours and it's totally free.
Still, you can spend money on external tools like Simian etc., but that's totally optional. Setting up CCnet really is a matter of going through the configuration documentation, and that's it.
I blogged about my experiences with CCnet before: http://www.tigraine.at/2008/10/08/another-take-on-contiuous-integration/
Jay Flowers runs a project called CI Factory which enables you to put together a CruiseControl.NET installation with optional modules in no time at all.
http://jayflowers.com/joomla/
Also, you might wish to listen to the .NET rocks podcast interview with him:
http://www.dnrtv.com/default.aspx?showID=64
