Where does hugging face's transformers save models? - huggingface-transformers

Running the below code downloads a model - does anyone know what folder it downloads it to?
!pip install -q transformers
from transformers import pipeline
model = pipeline('fill-mask')

Update 2021-03-11: The cache location has now changed and is located in ~/.cache/huggingface/transformers, as also detailed in victorx's answer.
This post should shed some light on it (plus some investigation of my own, since it is already a bit older).
As mentioned, the default location on a Linux system is ~/.cache/torch/transformers/ (I'm currently using transformers v2.7, but this is unlikely to change anytime soon). The cryptic file names in this directory seemingly correspond to the Amazon S3 hashes.
Also note that the pipeline tasks are just a "rerouting" to other models. To know which one you are currently loading, see here. For your specific task, pipeline('fill-mask') actually utilizes a distilroberta-base model.
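If you want to double-check which checkpoint a pipeline task resolved to, you can inspect the loaded model object directly; a minimal sketch (the exact attribute may vary slightly across transformers versions):
from transformers import pipeline

pipe = pipeline('fill-mask')
# The model class and the checkpoint name it was loaded from
print(type(pipe.model).__name__)
print(pipe.model.config._name_or_path)  # e.g. 'distilroberta-base'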

As of Transformers version 4.3, the cache location has been changed.
The exact place is defined in this section of the code: https://github.com/huggingface/transformers/blob/master/src/transformers/file_utils.py#L181-L187
On Linux, it is at ~/.cache/huggingface/transformers.
The file names there are basically SHA hashes of the original URLs from which the files are downloaded. The corresponding json files can help you figure out what the original file names are.
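If you want to map those hashed file names back to their origin, one option is to read the sidecar .json files next to each blob; a rough sketch, assuming each sidecar stores the original download URL under a "url" key (as these transformers versions did):
import json
import os

cache_dir = os.path.expanduser("~/.cache/huggingface/transformers")
for name in os.listdir(cache_dir):
    if name.endswith(".json"):
        with open(os.path.join(cache_dir, name)) as f:
            meta = json.load(f)
        # Each sidecar maps the hashed blob name back to its original URL
        print(name[:-len(".json")], "->", meta.get("url"))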

On Windows 10, replace ~ with C:\Users\username, or in cmd do cd /d "%HOMEDRIVE%%HOMEPATH%".
So the full path will be: C:\Users\username\.cache\huggingface\transformers
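A quick, platform-independent way to see where ~ resolves on your machine (purely illustrative):
import os

# Prints e.g. /home/<user>/.cache/huggingface/transformers on Linux
# and C:\Users\<username>\.cache\huggingface\transformers on Windows
print(os.path.expanduser(os.path.join("~", ".cache", "huggingface", "transformers")))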

As of transformers 4.22, the path appears to be (tested on CentOS):
~/.cache/huggingface/hub/

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="sentence-transformers/all-MiniLM-L6-v2", filename="config.json")
ls -lrth ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/
total 4.0K
lrwxrwxrwx 1 alex alex 52 Jan 25 12:15 config.json -> ../../blobs/72b987fd805cfa2b58c4c8c952b274a11bfd5a00
lrwxrwxrwx 1 alex alex 76 Jan 25 12:15 pytorch_model.bin -> ../../blobs/c3a85f238711653950f6a79ece63eb0ea93d76f6a6284be04019c53733baf256
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 vocab.txt -> ../../blobs/fb140275c155a9c7c5a3b3e0e77a9e839594a938
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 special_tokens_map.json -> ../../blobs/e7b0375001f109a6b8873d756ad4f7bbb15fbaa5
lrwxrwxrwx 1 alex alex 52 Jan 25 12:30 tokenizer_config.json -> ../../blobs/c79f2b6a0cea6f4b564fed1938984bace9d30ff0
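Instead of poking through that layout by hand, recent versions of huggingface_hub also ship a helper for inspecting the cache; a small sketch using scan_cache_dir (attribute names as documented for recent releases):
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()
print("Total cache size: %.1f MB" % (cache_info.size_on_disk / 1e6))
for repo in cache_info.repos:
    # Each repo corresponds to a models--{org}--{name} folder under ~/.cache/huggingface/hub
    print(repo.repo_id, repo.repo_path, "%.1f MB" % (repo.size_on_disk / 1e6))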

Related

python 3 suds cache not working

I'm trying to write a script for accessing Sharepoint via Python.
The following libraries have been installed: suds.jurko, ntlm.
The following code succeeds, but takes close to 20 seconds:
#!/usr/bin/env python3
from suds.client import Client
from suds.transport.https import WindowsHttpAuthenticated
from suds.cache import ObjectCache
url = 'http://blah/_vti_bin/Lists.asmx?WSDL'
user = "blah"
passwd = "blah"
ntlm = WindowsHttpAuthenticated(username=user, password=passwd)
client = Client(url, transport=ntlm)
I tried adding cache:
oc = ObjectCache()
oc.setduration(days=10)
client = Client(url, transport=ntlm, cache=oc)
I see /tmp/suds created and I see cached files under there, but it looks like it just creates more files on every run, instead of using the cached files:
-rw-r--r-- 1 pchernik smsvcs 3 Feb 5 13:27 version
-rw-r--r-- 1 pchernik smsvcs 309572 Feb 5 13:27 suds-536283349122900148-document.px
-rw-r--r-- 1 pchernik smsvcs 207647 Feb 5 13:27 suds-4765026134651708722-document.px
-rw-r--r-- 1 pchernik smsvcs 21097 Feb 5 13:27 suds-1421279777216033364-document.px
-rw-r--r-- 1 pchernik smsvcs 207644 Feb 5 13:27 suds-6437332842122298485-document.px
-rw-r--r-- 1 pchernik smsvcs 309572 Feb 5 13:27 suds-3510377615213316246-document.px
-rw-r--r-- 1 pchernik smsvcs 21097 Feb 5 13:28 suds-7540886319990993060-document.px
-rw-r--r-- 1 pchernik smsvcs 207617 Feb 5 13:30 suds-1166110448227246785-document.px
-rw-r--r-- 1 pchernik smsvcs 309548 Feb 5 13:30 suds-2848176348666425151-document.px
-rw-r--r-- 1 pchernik smsvcs 21076 Feb 5 13:31 suds-6077994449274214633-document.px
Is suds normally this slow?
Any ideas on fixing the caching issues?
Are there any other python 3 libraries I can use for this instead of suds?
Any ideas / suggestions are appreciated.
Thanks,
-Pavel
I had the same issue, try setting your cachingpolicy to 1:
client = Client(url, transport=ntlm, cache=oc, cachingpolicy=1)
This will cache your WSDL objects instead of your XML files.
From suds documentation:
cachingpolicy
The caching policy, determines how data is cached. The default is 0. version 0.4+
0 = XML documents such as WSDL & XSD.
1 = WSDL object graph.
Edit: I re-read your question and realized I am missing something important; your cache is getting re-generated. I believe this is due to not specifying a location for the cache. This is from the documentation of the FileCache class in cache.py:
If no cache location is specified, a temporary default location will be
used. Such default cache location will be shared by all FileCache
instances with no explicitly specified location within the same
process. The default cache location will be removed automatically on
process exit unless user sets the remove_default_location_on_exit
FileCache class attribute to False.
So, even if you want to use the default cache location, you will need to explicitly define it when you create your cache object. This is what I've done in my code:
import os
# Configure cache location and duration ('days=0' = infinite)
cache_dir = os.path.join(os.path.abspath(os.sep), r'tmp\suds')
self.cache = ObjectCache(cache_dir, days=0)
You could also try setting the remove_default_location_on_exit attribute as suggested in the FileCache documentation, but I have not tried this method.
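Putting the two pieces together (explicit cache location plus cachingpolicy=1), a minimal sketch using the same placeholder URL and credentials as in the question and a Unix-style cache path (adjust the path for your platform):
import os
from suds.client import Client
from suds.cache import ObjectCache
from suds.transport.https import WindowsHttpAuthenticated

url = 'http://blah/_vti_bin/Lists.asmx?WSDL'
ntlm = WindowsHttpAuthenticated(username="blah", password="blah")

# Explicit cache location so it is not removed on process exit; days=0 = keep forever
cache_dir = os.path.join('/tmp', 'suds-cache')
cache = ObjectCache(cache_dir, days=0)

# cachingpolicy=1 caches the parsed WSDL object graph instead of the raw XML documents
client = Client(url, transport=ntlm, cache=cache, cachingpolicy=1)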
I had the same issue, but I noticed the version of suds-jurko in pypi has the following function in reader.py that generates the name of the cache file:
def mangle(self, name, x):
"""
Mangle the name by hashing the I{name} and appending I{x}.
@return: the mangled name.
"""
h = abs(hash(name))
return '%s-%s' % (h, x)
In Python 3, hash() mixes a random seed into string hashes, so the value changes on every run. This has been fixed in the current version of suds-jurko at https://bitbucket.org/jurko/suds/ by using hashlib/md5 instead. You could either install it from there instead of PyPI, or just edit your reader.py file and change it to
h = hashlib.md5(name.encode()).hexdigest()
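With the import added, the patched function in reader.py would look roughly like this:
import hashlib

def mangle(self, name, x):
    """
    Mangle the name by hashing the I{name} and appending I{x}.
    @return: the mangled name.
    """
    # md5 of the name is stable across runs, unlike Python 3's randomized str hash
    h = hashlib.md5(name.encode()).hexdigest()
    return '%s-%s' % (h, x)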
Are you sure you are using suds-jurko? It very much resembles the issue described here:
Suds is not reusing cached WSDLs and XSDs, although I expect it to
You could profile your application or run it with logging enabled (as suggested in the linked question).
As an alternative you could try osa: https://pypi.python.org/pypi/osa/
Edit: I did not see that you had already installed suds-jurko.

Use newsyslog to rotate log files, but only if they have a certain size

I'm on OS X 10.9.4 and trying to use newsyslog to rotate my app development log files.
More specifically, I want to rotate the files daily but only if they are not empty (newsyslog writes one or two lines to every logfile it rotates, so let's say I only want to rotate logs that are at least 1kb).
I created a file /etc/newsyslog.d/code.conf:
# logfilename [owner:group] mode count size when flags [/pid_file] [sig_num]
/Users/manuel/code/**/log/*.log manuel:staff 644 7 1 $D0 GN
The way I understand the man page for the configuration file is that size and when conditions should work in combination, so logfiles should be rotated every night at midnight only if they are 1kb or larger.
Unfortunately this is not what happens. The log files are rotated every night, regardless of whether they contain only the rotation message from newsyslog or anything else:
~/code/myapp/log (master) $ ls
total 32
drwxr-xr-x 6 manuel staff 204B Aug 8 00:17 .
drwxr-xr-x 22 manuel staff 748B Jul 25 14:56 ..
-rw-r--r-- 1 manuel staff 64B Aug 8 00:17 development.log
-rw-r--r-- 1 manuel staff 153B Aug 8 00:17 development.log.0
~/code/myapp/log (master) $ cat development.log
Aug 8 00:17:41 localhost newsyslog[81858]: logfile turned over
~/code/myapp/log (master) $ cat development.log.0
Aug 7 00:45:17 Manuels-MacBook-Pro newsyslog[34434]: logfile turned over due to size>1K
Aug 8 00:17:41 localhost newsyslog[81858]: logfile turned over
Any tips on how to get this working would be appreciated!
What you're looking for (rotate files daily unless they haven't logged anything) isn't possible using newsyslog. The man page you referenced doesn't say anything about size and when being combined, other than to say that if when isn't specified, then it is as if only size was specified. The reality is that the log is rotated when either condition is met. If the utility is like its FreeBSD counterpart, it won't rotate logs smaller than 512 bytes unless the binary flag is set.
macOS's newer replacement for newsyslog, ASL, also doesn't have the behavior you desire. As far as I know, the only utility that has this is logrotate, using its notifempty configuration option. You can install logrotate on your Mac using Homebrew.

Hadoop Log File Analysis from 2 separate machines

I am new to Hadoop. I have to find the trend of symbols traded among users.
I have 2 machines b040n10 and b040n11. The files in the machine are as mentioned below:
b040n10:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 482342353 Feb 8 2014 A.log
-rw-r--r-- 1 root root 481231231 Feb 8 2014 B.log
b040n11:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 412312312 Feb 8 2014 C.log
-rw-r--r-- 1 root root 412356315 Feb 8 2014 D.log
There is a field called "symbol_name" on all these logs (example below).
IP=145.45.34.2;symbol_name=ABC;timestamp=12:13:05
IP=145.45.34.2;symbol_name=XYZ;timestamp=12:13:56
IP=145.45.34.2;symbol_name=ABC;timestamp=12:14:56
I have Hadoop running on my Laptop and I have 2 machines connected to my Laptop (can be used as Datanodes).
My task now is to get the list of symbol_name and the Symbol count.
As mentioned below:
ABC - 2
XYZ - 1
Should I now:
1. copy all the files (A.log, B.log, C.log, D.log) from b040n10 and b040n11 to my laptop, and
2. issue a copyFromLocal command to the HDFS system and analyze the data?
Or is there a better way to find out the symbol_name and count without copying these files to my laptop?
The question is a basic one, but I am new to Hadoop; please help me understand and use Hadoop better. Please let me know if more information is needed.
Thanks
Copying the files from Hadoop to your local laptop defeats the entire purpose of Hadoop, which is to move the processing to the data, not the other way around: when you really have "big data", you won't be able to move the data around to process it locally.
Your problem is a typical MapReduce case: all you need is a job that counts the occurrences of each symbol. Just search for a MapReduce WordCount example and adapt it to your case; a rough sketch follows below.
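For illustration, here is a rough Hadoop Streaming sketch in Python (the field parsing assumes the 'symbol_name=XYZ;' format shown above; the script names are just examples):
#!/usr/bin/env python3
# mapper.py: emit "symbol<TAB>1" for every symbol_name field in a log line
import sys

for line in sys.stdin:
    for field in line.strip().split(';'):
        if field.startswith('symbol_name='):
            print(field.split('=', 1)[1] + '\t1')

#!/usr/bin/env python3
# reducer.py: sum the counts per symbol (Hadoop Streaming sorts mapper output by key)
import sys

current, total = None, 0
for line in sys.stdin:
    symbol, _, count = line.rstrip('\n').partition('\t')
    if symbol != current:
        if current is not None:
            print('%s - %d' % (current, total))
        current, total = symbol, 0
    total += int(count)
if current is not None:
    print('%s - %d' % (current, total))
You can sanity-check the pair locally with cat A.log | ./mapper.py | sort | ./reducer.py before submitting the job through the hadoop-streaming jar.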

What are sufficient file information for file hashing?

Since hashing a complete binary file would be far too expensive to compute in a reasonably short time:
What file information is sufficient for hashing a file? The following lists the properties the resulting hash should ideally have:
collision-free in respect to other files in the directory
fast
catch all file changes
As a rule of thumb, the less information we need to create enough entropy, the better. Since the speed of retrieval for specific information may depend largely on the given environment (OS, file IO of the language, IO of the used library, etc.), it should be disregarded here.
(This is my first attempt at a community wiki. My reason for making it one is that the information asked here is very generic but (hopefully) informative. I also would like this question to be marked as a community wiki, so it can be improved where fit. )
General Overview
Our goal here is to capture as many differences between two file states as possible while not using redundant data. Thus each informational source must be a disjoint subset of the information in the file's state.
The following items represent sources of information about a file:
the name of the file
the directory-path relative to the specified document-root (aka absolute from document-root)
the file's permissions
the file's owner (user/group)
the last change time
the size of the file
the hostname of the machine the file resides on
the actual saved binary data
Per Item Considerations
Name of File
The name of the file is part of its absolute filesystem path (the last component) and, as La-comadreja said, it is unique in that no two files on a system can have the same absolute path. Using the file's name in combination with the rest of its absolute path (see directory-path for more information) is highly encouraged to avoid hash collisions with other files.
Directory-Path
While the file's absolute path will be perfectly unique, it should be noted that in certain circumstances hashing the absolute path may be inappropriate. For instance, comparing the hashes of two files on different machines will most likely fail when the files do not have identical absolute paths on both machines. This becomes even more problematic on machines with different OSes and/or architectures. It is therefore encouraged to specify a document-root and resolve an absolute path from there.
Permissions
If you want to track changes to a file's permissions, the tests below indicate that you would need to incorporate them into your hash directly, as they do not change any other information about the file (most notably the timestamp). Note however that permissions are handled quite differently on different machines, so caution must be exercised here (for instance, use a canonical permission translation scheme).
Ownership
Ownership, just like permissions, is handled very differently across architectures and filesystems. A change of ownership does not change other information (as indicated by the tests below).
Timestamp
The timestamp of a file is also something that is not uniformly implemented across all (or at least the most common) systems. First of all, there are different timestamps on different filesystems we could be looking at: creation date, modified date, access date, etc. For our purpose the modified date is most suitable, as it is supported by most of the available filesystems [1] and holds exactly the information we need: the last change to a file. However, comparing files across different OSes may pose a problem, as Windows and Unix handle timestamps (in general) differently (see here [2] for a detailed article about the problem). Note that the modification date of a file changes whenever the file is edited (disregarding edge cases), so the timestamp also captures changes in file size (note that the converse does not hold true; see file size).
File size
The file size in bytes is an extremely good indication of whether a file has been edited (except for permission, ownership, and name changes), as each edit changes the file's content and thus usually its size. However, this does not hold true if additions to a file are exactly as large as deletions. Thus the file's timestamp may be a better indicator. Also, calculating a file's binary size may be quite computationally intensive.
Hostname
If one wants to compare files across multiple hosts and regard identical files on different hosts as different, then the hostname of the machine (or another suitable unique identifier for the host) should be included in the hash.
Binary Data
The binary data of the file has, of course, all the information necessary to check whether a file was changed. However, it is also too resource-intensive to be of any practical use. It is highly discouraged to use this information.
Suggestions
The following sources should be used to compare files:
the name of the file
the directory path
the timestamp (see above for problems)
The following extra sources can be used to track more information:
permissions (see above)
ownership (see above)
hostname (when comparing across different machines)
The following sources of information should be disregarded:
file size
binary data
Tests
I did some tests on Debian checking whether changing one piece of information would change another. Most interestingly, a rename, a permission change, and an owner change did not cause a timestamp or file size change. (Note that these tests were only run on Debian Linux; other OSes will likely behave differently.)
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz
-rw-r--r-- 1 alex alex 14 Apr 26 11:04 foo
$ mv baz baz2
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz2
-rw-r--r-- 1 alex alex 14 Apr 26 11:04 foo
$ chmod 777 foo
$ ls -l
-rw-r--r-- 1 alex alex 30 Apr 26 11:04 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz2
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
$ mv baz2 baz
$ echo "Another string" >> bar
$ ls -l
-rw-r--r-- 1 alex alex 45 Apr 26 11:17 bar
-rw-r--r-- 1 alex alex 0 Apr 26 11:03 baz
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
$ sudo chown root baz
$ ls -l
-rw-r--r-- 1 alex alex 45 Apr 26 11:17 bar
-rw-r--r-- 1 root alex 0 Apr 26 11:03 baz
-rwxrwxrwx 1 alex alex 14 Apr 26 11:04 foo
Assuming all the files are on the same machine, directory path and file name should produce a unique combination because two files in the same directory cannot have the same name. Directory path, filename and timestamp of last change should capture each change.
If the files are on different machines, the machine name should be included in the directory path.
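As a concrete illustration of the suggestions above, here is a minimal Python sketch that hashes the document-root-relative path, the modification timestamp, and optionally permissions, ownership, and hostname (the function and parameter names are made up for this example, not a standard API):
import hashlib
import os
import socket

def file_state_hash(path, document_root, perms=False, owner=False, host=False):
    """Hash a file's identifying metadata (not its binary contents)."""
    st = os.stat(path)
    # Name + directory path relative to the chosen document-root
    parts = [os.path.relpath(path, document_root)]
    # Last-modified time captures content edits (and hence most size changes)
    parts.append(str(st.st_mtime))
    if perms:
        parts.append(oct(st.st_mode & 0o777))    # canonical permission representation
    if owner:
        parts.append('%d:%d' % (st.st_uid, st.st_gid))
    if host:
        parts.append(socket.gethostname())       # distinguish identical paths on different hosts
    return hashlib.sha256('\0'.join(parts).encode()).hexdigest()

# Example: recompute and compare the hash before and after a change
# print(file_state_hash('/var/www/index.html', '/var/www', perms=True))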

How to make mkdir make folders in the order I type them, and not have them show up alphabetized

In Terminal, I want to make a bunch of folders that show up in a certain order. It isn't alphabetical, but an unrelated order. When I do this:
mkdir this folder is going to be
The folders all show up correctly in Finder, but alphabetized. I have confirmed that the folder's view options are set to Sort By: None, Arrange By: None.
Is there a different way to accomplish this?
If you really want "no order" you will get unpredictable results. Seems like you want file creation date or file modification date (oldest first) order.
This command:
$ mkdir this folder is going to be
is misleading, because you have no idea in what order 'mkdir' actually creates the folders internally (unless you read the source), so you have no idea what order will actually result from the filesystem's point of view (it's not necessarily the order you expect).
To be clearer, you need to issue the command once per folder:
$ mkdir this
$ mkdir folder
$ mkdir is
$ mkdir going
$ mkdir to
$ mkdir be
then you can list in reverse modified-date order:
$ ls -tr
this folder is going to be
$ ls -ltr
total 0
drwxr-xr-x 2 user staff 68 6 Jan 20:41 this
drwxr-xr-x 2 user staff 68 6 Jan 20:41 folder
drwxr-xr-x 2 user staff 68 6 Jan 20:41 is
drwxr-xr-x 2 user staff 68 6 Jan 20:41 going
drwxr-xr-x 2 user staff 68 6 Jan 20:41 to
drwxr-xr-x 2 user staff 68 6 Jan 20:41 be
On the native Mac filesystem HFS+, there is also a 'creation date' attribute, which is probably what you want, but this is not very portable across other filesystems.
In the Finder, use Arrange By > Date Created,
or Arrange By > None, view in list view with the 'Date Created' column showing, and click on that column.
You can't; folders are inherently not ordered.
