Curl::Multi.download is broken? - ruby

When I do
ruby-1.9.2-p180 :002 > Curl::Multi.download(["http://www.bbc.com"])
=> nil
The downloaded file is of zero size
ls -l
-rw-r--r-- 1 staff 0 Jan 5 19:51 www.bbc.com
But the same works with Curl::Easy.download
Curl::Easy.download("http://www.bbc.com")
=> #<Curl::Easy http://www.bbc.com>
ls -l
-rw-r--r-- 1 staff 111549 Jan 5 19:53 www.bbc.com
Am I missing something?

You're not. Curl::Multi.http (and Curl::Multi.download as well, since it delegates to Curl::Multi.http) has been broken in Curb since version 0.7.15.
This baffled me for a while, but I figured out what was wrong and have submitted a patch (see https://github.com/taf2/curb/pull/115).
Hopefully this gets merged upstream; in the meantime, you should stick to Curb 0.7.14 or below.
UPDATE: My pull request has been merged: https://github.com/taf2/curb/commit/dceafcd931c4fb7ad2c8d356537358fd1286e4ae - you can grab curb/master for the fix.

Where does Hugging Face's transformers library save models?

Running the below code downloads a model - does anyone know what folder it downloads it to?
!pip install -q transformers
from transformers import pipeline
model = pipeline('fill-mask')
Update 2021-03-11: The cache location has since changed and is now located in ~/.cache/huggingface/transformers, as detailed in the answer by @victorx.
This post should shed some light on it (plus some investigation of my own, since it is already a bit older).
As mentioned there, the default location on a Linux system is ~/.cache/torch/transformers/ (I'm currently using transformers v2.7, but this is unlikely to change anytime soon). The cryptic file names in this directory seemingly correspond to hashes of the Amazon S3 URLs.
Also note that the pipeline tasks are just a "rerouting" to other models. To know which one you are currently loading, see here. For your specific case, pipeline('fill-mask') actually uses a distilroberta-base model.
As of Transformers version 4.3, the cache location has been changed.
The exact place is defined in this code section: https://github.com/huggingface/transformers/blob/master/src/transformers/file_utils.py#L181-L187
On Linux, it is at ~/.cache/huggingface/transformers.
The file names there are basically SHA hashes of the original URLs from which the files are downloaded. The corresponding json files can help you figure out what are the original file names.
On Windows 10, replace ~ with C:\Users\username, or in cmd do cd /d "%HOMEDRIVE%%HOMEPATH%".
So the full path will be: C:\Users\username\.cache\huggingface\transformers
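The .json sidecar files mentioned above can be read directly to map a hashed file name back to its source URL. Here is a minimal stdlib sketch; note that the sidecar layout with "url"/"etag" keys is an assumption based on that era of transformers caches, and the example builds a fake cache entry rather than touching a real one:

```python
import json, os, tempfile

def map_cache(cache_dir):
    """Return {hashed_file_name: original_url} from the .json sidecars."""
    mapping = {}
    for name in os.listdir(cache_dir):
        if name.endswith(".json"):
            with open(os.path.join(cache_dir, name)) as f:
                meta = json.load(f)
            # the sidecar is named <hash>.json; the cached blob is just <hash>
            mapping[name[:-len(".json")]] = meta.get("url")
    return mapping

# Demo on a fabricated cache entry (the hash and URL are made up).
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "abc123.json"), "w") as f:
        json.dump({"url": "https://huggingface.co/bert-base-uncased/resolve/main/config.json",
                   "etag": '"deadbeef"'}, f)
    print(map_cache(d))
```

Point map_cache at your real cache directory to see which model file each hash corresponds to.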
As of transformers 4.22, the path appears to be (tested on CentOS):
~/.cache/huggingface/hub/
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="sentence-transformers/all-MiniLM-L6-v2", filename="config.json")
ls -lrth ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/
total 4.0K
lrwxrwxrwx 1 alex alex 52 Jan 25 12:15 config.json -> ../../blobs/72b987fd805cfa2b58c4c8c952b274a11bfd5a00
lrwxrwxrwx 1 alex alex 76 Jan 25 12:15 pytorch_model.bin -> ../../blobs/c3a85f238711653950f6a79ece63eb0ea93d76f6a6284be04019c53733baf256
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 vocab.txt -> ../../blobs/fb140275c155a9c7c5a3b3e0e77a9e839594a938
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 special_tokens_map.json -> ../../blobs/e7b0375001f109a6b8873d756ad4f7bbb15fbaa5
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 tokenizer_config.json -> ../../blobs/c79f2b6a0cea6f4b564fed1938984bace9d30ff0
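The friendly file names in the snapshot directory above are just symlinks into a content-addressed blobs directory. A small stdlib sketch of that layout (directory and hash names are shortened and made up for the demo):

```python
import os, tempfile

# Recreate the hub cache layout in a temp dir: a content-addressed blobs
# directory, plus a snapshot directory of symlinks pointing into it.
with tempfile.TemporaryDirectory() as root:
    blobs = os.path.join(root, "blobs")
    snap = os.path.join(root, "snapshots", "7dbbc903")
    os.makedirs(blobs)
    os.makedirs(snap)
    blob = os.path.join(blobs, "72b987fd")          # named after a content hash
    with open(blob, "w") as f:
        f.write('{"model_type": "bert"}')
    # relative symlink, like the ../../blobs/... links in the listing above
    os.symlink(os.path.relpath(blob, snap), os.path.join(snap, "config.json"))
    # opening the friendly name follows the symlink to the blob
    with open(os.path.join(snap, "config.json")) as f:
        print(f.read())
```

This is why deleting a blob breaks every snapshot that links to it, while identical files shared by several snapshots are stored only once.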

Unix - piping ls -la | less from a C executable gives double the total file size vs the shell

Edit: Stephen has answered this question in the comments below.
So basically I have made two separate child processes (using two separate methods, each with its own fork) to execute the command ls -la | less using a pipe.
The first one executes ls like this:
execl("/bin/ls", "ls", "-la", NULL);
The second child process executes less like this:
execlp("less", "less", NULL);
And the results come up fine, apart from one little part:
Results using shell command:
total 15
drwxr-xr-x 2 daniel staff 4 2015-02-27 18:58 .
drwxr-xr-x 15 daniel staff 24 2015-02-27 18:58 ..
-rwxr-xr-x 1 daniel staff 9280 2015-02-27 18:58 pipes
-rw-r--r-- 1 daniel staff 1419 2015-02-27 18:58 pipes.c
Results using my executable:
total 30
drwxr-xr-x 2 daniel staff 4 Feb 27 18:58 .
drwxr-xr-x 15 daniel staff 24 Feb 27 18:58 ..
-rwxr-xr-x 1 daniel staff 9280 Feb 27 18:58 pipes
-rw-r--r-- 1 daniel staff 1419 Feb 27 18:58 pipes.c
Now, the date being in a different format I don't care about, but the total size is twice as large with my executable (30 vs 15). Why is this happening?
Make sure that the ls you are running from the shell and the ls that is running in your program are the same program.
Your program is specifying /bin/ls as the program to run; you can find out what is being run when you type the command at the shell prompt by using the shell command which ls (also see type ls).
If these are different it could be due to the POSIX vs. GNU blocksize used in the total size computation.
This answer indicates that POSIX and GNU tools differ in the block size they use to compute the total. Try adding the -h flag to your command and compare the results (this flag makes the units explicit). My guess is that the execution environment of these processes is disturbing the result; for instance, it may not be the same ls that you are running.
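The factor-of-two difference can be reproduced directly: st_blocks is always counted in 512-byte units, and a total printed in 512-byte blocks (the POSIX convention) is twice one printed in 1024-byte blocks (the GNU ls default). A stdlib sketch, not an exact reimplementation of either ls:

```python
import os, tempfile

def ls_totals(path):
    """Sum st_blocks (512-byte units on POSIX) over the entries in path,
    and report the total both ways an ls implementation might print it."""
    blocks = 0
    for name in os.listdir(path):
        blocks += os.lstat(os.path.join(path, name)).st_blocks
    total_posix = blocks        # counted in 512-byte blocks
    total_gnu = blocks // 2     # counted in 1024-byte blocks
    return total_posix, total_gnu

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "data"), "wb") as f:
        f.write(b"x" * 8192)    # an 8 KiB file
    total_posix, total_gnu = ls_totals(d)
    # the 512-byte-block total is (about) twice the 1024-byte-block one
    print(total_posix, total_gnu)
```

The same file listing thus yields two different "total" lines depending only on the block size convention, which matches the 30 vs 15 you observed.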

Python 3 suds cache not working

I'm trying to write a script for accessing Sharepoint via Python.
The following libraries have been installed: suds.jurko, ntlm.
The following code succeeds, but takes close to 20 seconds:
#!/usr/bin/env python3
from suds.client import Client
from suds.transport.https import WindowsHttpAuthenticated
from suds.cache import ObjectCache
url = 'http://blah/_vti_bin/Lists.asmx?WSDL'
user = "blah"
passwd = "blah"
ntlm = WindowsHttpAuthenticated(username=user, password=passwd)
client = Client(url, transport=ntlm)
I tried adding cache:
oc = ObjectCache()
oc.setduration(days=10)
client = Client(url, transport=ntlm, cache=oc)
I see /tmp/suds created and I see cached files under there, but it looks like it just creates more files on every run, instead of using the cached files:
-rw-r--r-- 1 pchernik smsvcs 3 Feb 5 13:27 version
-rw-r--r-- 1 pchernik smsvcs 309572 Feb 5 13:27 suds-536283349122900148-document.px
-rw-r--r-- 1 pchernik smsvcs 207647 Feb 5 13:27 suds-4765026134651708722-document.px
-rw-r--r-- 1 pchernik smsvcs 21097 Feb 5 13:27 suds-1421279777216033364-document.px
-rw-r--r-- 1 pchernik smsvcs 207644 Feb 5 13:27 suds-6437332842122298485-document.px
-rw-r--r-- 1 pchernik smsvcs 309572 Feb 5 13:27 suds-3510377615213316246-document.px
-rw-r--r-- 1 pchernik smsvcs 21097 Feb 5 13:28 suds-7540886319990993060-document.px
-rw-r--r-- 1 pchernik smsvcs 207617 Feb 5 13:30 suds-1166110448227246785-document.px
-rw-r--r-- 1 pchernik smsvcs 309548 Feb 5 13:30 suds-2848176348666425151-document.px
-rw-r--r-- 1 pchernik smsvcs 21076 Feb 5 13:31 suds-6077994449274214633-document.px
Is suds normally this slow?
Any ideas on fixing the caching issues?
Are there any other Python 3 libraries I can use for this instead of suds?
Any ideas / suggestions are appreciated.
I had the same issue; try setting your cachingpolicy to 1:
client = Client(url, transport=ntlm, cache=oc, cachingpolicy=1)
This will cache your WSDL objects instead of your XML files.
From suds documentation:
cachingpolicy
The caching policy determines how data is cached. The default is 0 (version 0.4+).
0 = XML documents such as WSDL & XSD.
1 = WSDL object graph.
Edit: I re-read your question and realized I was missing something important: your cache is getting regenerated. I believe this is due to not specifying a location for the cache. This is from the documentation of the FileCache class in cache.py:
If no cache location is specified, a temporary default location will be
used. Such default cache location will be shared by all FileCache
instances with no explicitly specified location within the same
process. The default cache location will be removed automatically on
process exit unless user sets the remove_default_location_on_exit
FileCache class attribute to False.
So, even if you want to use the default cache location, you will need to explicitly define it when you create your cache object. This is what I've done in my code:
import os
# Configure cache location and duration ('days=0' = infinite)
cache_dir = os.path.join(os.path.abspath(os.sep), r'tmp\suds')
self.cache = ObjectCache(cache_dir, days=0)
You could also try setting the remove_default_location_on_exit attribute as suggested in the FileCache documentation, but I have not tried this method.
I had the same issue, but I noticed that the version of suds-jurko on PyPI has the following function in reader.py that generates the name of the cache file:
def mangle(self, name, x):
    """
    Mangle the name by hashing the I{name} and appending I{x}.
    @return: the mangled name.
    """
    h = abs(hash(name))
    return '%s-%s' % (h, x)
In Python 3, hash() is randomized with a per-process seed, so the same WSDL URL hashes to a different cache-file name on every run. This has been fixed in the current version of suds-jurko at https://bitbucket.org/jurko/suds/ by using hashlib's md5 instead of hash(). You could either install it from there instead of PyPI, or just edit your reader.py file (adding import hashlib at the top) and change the hash line to
h = hashlib.md5(name.encode()).hexdigest()
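The difference is easy to see in isolation. A standalone version of the mangling function with the md5 fix applied (signature simplified to drop self, so it can run outside the class):

```python
import hashlib

def mangle(name, x):
    """Build a stable cache-file name: md5 of the name plus a suffix.
    Unlike hash(), md5 gives the same digest in every Python 3 process,
    so cached files are actually found again on the next run."""
    h = hashlib.md5(name.encode()).hexdigest()
    return '%s-%s' % (h, x)

print(mangle('http://blah/_vti_bin/Lists.asmx?WSDL', 'document'))
```

Running this twice (even across interpreter restarts) prints the same name, which is exactly what the hash()-based version fails to do under hash randomization.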
Are you sure you are using suds-jurko? This resembles very much the issue described here:
Suds is not reusing cached WSDLs and XSDs, although I expect it to
You could profile your application or run it with logging enabled (as suggested in the linked question).
As an alternative you could try osa: https://pypi.python.org/pypi/osa/
Edit: I did not see that you had already installed suds-jurko.

Use newsyslog to rotate log files, but only if they have a certain size

I'm on OS X 10.9.4 and trying to use newsyslog to rotate my app development log files.
More specifically, I want to rotate the files daily but only if they are not empty (newsyslog writes one or two lines to every logfile it rotates, so let's say I only want to rotate logs that are at least 1kb).
I created a file /etc/newsyslog.d/code.conf:
# logfilename [owner:group] mode count size when flags [/pid_file] [sig_num]
/Users/manuel/code/**/log/*.log manuel:staff 644 7 1 $D0 GN
The way I understand the man page for the configuration file is that size and when conditions should work in combination, so logfiles should be rotated every night at midnight only if they are 1kb or larger.
Unfortunately this is not what happens. The log files are rotated every night, no matter whether they contain only the rotation message from newsyslog or anything else:
~/code/myapp/log (master) $ ls
total 32
drwxr-xr-x 6 manuel staff 204B Aug 8 00:17 .
drwxr-xr-x 22 manuel staff 748B Jul 25 14:56 ..
-rw-r--r-- 1 manuel staff 64B Aug 8 00:17 development.log
-rw-r--r-- 1 manuel staff 153B Aug 8 00:17 development.log.0
~/code/myapp/log (master) $ cat development.log
Aug 8 00:17:41 localhost newsyslog[81858]: logfile turned over
~/code/myapp/log (master) $ cat development.log.0
Aug 7 00:45:17 Manuels-MacBook-Pro newsyslog[34434]: logfile turned over due to size>1K
Aug 8 00:17:41 localhost newsyslog[81858]: logfile turned over
Any tips on how to get this working would be appreciated!
What you're looking for (rotate files daily unless they haven't logged anything) isn't possible using newsyslog. The man page you referenced doesn't say anything about size and when being combined, other than to say that if when isn't specified, then it is as if only size were specified. The reality is that the log is rotated when either condition is met. If the utility is like its FreeBSD counterpart, it won't rotate logs smaller than 512 bytes unless the binary flag is set.
macOS's newer replacement for newsyslog, ASL, also doesn't have the behavior you desire. As far as I know, the only utility that has it is logrotate, via its notifempty configuration option. You can install logrotate on your Mac using Homebrew.

How do I monitor independent applications with god?

I'm currently looking into options for process monitoring of Rails/Ruby projects and quite like god.
But I can't really find anything on how to monitor multiple applications (for example, 2 Rails projects running on one machine) with god.
As far as I see it, I just set up god (system ruby) and have each project add its own configuration (maybe somehow in a deploy hook).
This should also work with the projects running different Ruby versions (rbenv, rvm) or Bundler, since the god ruby does not have to access any project code.
Does anyone already use it like this? Or is there a better approach?
I use god to watch all my stuff (unicorn, redis, resque workers). Basic setup is like this:
God is installed globally, loads on system startup and reads its config file, /etc/god/all.god.
/etc/god/all.god
files = Dir.glob "/etc/god/**/*.god"
files.each do |f|
  next if f == '/etc/god/all.god'
  God.load f
end
This file loads all config files in /etc/god and its children. Deploy scripts put config files there and tell god to (re)load them.
$ ls -l /etc/god
total 16
-rw-r--r-- 1 root root 108 2012-02-23 16:26 all.god
drwxr-xr-x 2 sergio sergio 4096 2012-03-20 20:59 app1_production
drwxr-xr-x 2 sergio sergio 4096 2012-03-27 00:58 app2_production
drwxr-xr-x 2 root root 4096 2012-04-23 01:37 util
$ ls -l /etc/god/app1_production/
total 0
lrwxrwxrwx 1 sergio sergio 55 2012-03-20 20:59 redis.god -> /srv/app1_production/current/config/god/redis.god
lrwxrwxrwx 1 sergio sergio 56 2012-03-20 20:59 resque.god -> /srv/app1_production/current/config/god/resque.god
lrwxrwxrwx 1 sergio sergio 57 2012-03-20 20:59 unicorn.god -> /srv/app1_production/current/config/god/unicorn.god
Here's the head of unicorn.god:
rails_env = "production"
pid_dir = "/srv/app1_#{rails_env}/shared/pids"
rails_root = "/srv/app1_#{rails_env}/current"
God.watch do |w|
  w.name = "unicorn-#{rails_env}"
  w.interval = 30.seconds # default

  # unicorn needs to be run from the rails root
  w.start = "cd #{rails_root} && /home/sergio/.rvm/bin/r193_bundle exec unicorn_rails -c #{rails_root}/config/unicorn/unicorn.#{rails_env}.rb -E #{rails_env} -D"

  # QUIT gracefully shuts down workers
  w.stop = "kill -QUIT `cat #{pid_dir}/unicorn.pid`"

  # USR2 causes the master to re-create itself and spawn a new worker pool
  w.restart = "kill -USR2 `cat #{pid_dir}/unicorn.pid`"
end
As you can see, unicorns are launched via rvm wrappers and therefore each new app can use its own ruby. Also, you provide your own start, stop and restart commands, so you can use god to watch any piece of software.
This approach works very well for me (so far).
