The National Speech Corpus is a Natural Language Processing corpus of Singaporeans speaking English, which can be found here: https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus.
When you sign up for the free corpus, you are directed to a Dropbox folder. The corpus is 1 TB and (as of this writing) has four parts. I only wanted to download PART1, but even this has 1446 zip files that are each quite large. My question is: how do I programmatically download many large files from Dropbox onto a Linux (Ubuntu 16.04) VM using only the command line?
The directory tree for the relevant part looks like:
root
|-LEXICON
|-PART1
  |-DATA
    |-CHANNEL0
      |-WAVE
        |-SPEAKER0001.zip
        |-SPEAKER0002.zip
        ...
        |-SPEAKER1446.zip
I looked into a few different approaches:
Downloading the WAVE parent directory using a shared link via the wget command, as described in this question. However, this didn't work; I received this error:
Reusing existing connection to www.dropbox.com:443
HTTP request sent, awaiting response... 400 Bad Request
2021-01-06 23:09:06 ERROR 400: Bad Request.
I assumed this was because the WAVE directory was too large for Dropbox to zip.
Based on this post, it was suggested that I could download the HTML of the WAVE parent directory and find all of the direct links to the individual zip files, but the direct links to the individual files were not in the HTML.
Based on the same post as in (2), I could also try to create shared links for each zip file using the dropbox API, though this seemed too cumbersome.
Downloading the Linux Dropbox client and syncing the relevant files as outlined in these installation instructions.
In the end, the 4th option did work for me, but I wanted to post this investigation for anyone who needs to download this dataset in the future. Also, I wanted to see if anyone else had better approaches.
As I described, the approach that worked for me was to use Dropbox's Linux client to sync the files onto my Linux VM. You can follow these instructions to download the Linux client. They worked for me on my Ubuntu 16.04 VM.
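At the time of writing, the headless install from that page boiled down to roughly the following two commands (worth double-checking against the current instructions before running them):
cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
~/.dropbox-dist/dropboxd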
One issue I encountered with the sync client was how to selectively exclude directories. I only had 630 GB on my VM and the entire National Speech Corpus is 1 TB, so I needed to exclude files before the Dropbox sync filled up my disk.
You can selectively exclude directories using the dropbox.py Python script linked at the bottom of the installation page. A link to the script is here. Calling the script from my home directory (where the Dropbox sync folder is automatically created) worked using the command:
python dropbox.py exclude add ~/Dropbox/<path_to_excluded_dir>
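For example, to keep only PART1 you could exclude the other parts; the paths below are illustrative (run ls ~/Dropbox to confirm the actual directory names in your synced folder):
python dropbox.py exclude add ~/Dropbox/PART2 ~/Dropbox/PART3 ~/Dropbox/PART4
python dropbox.py exclude list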
You may want to stop and start the Dropbox client, which can be done with:
python dropbox.py start
python dropbox.py stop
Finally, see the script's built-in help for more information:
python dropbox.py --help
With this approach, I was able to easily download the desired files without overwhelming my VM.
Related
I need to be able to upload a local folder (created daily) to a remote FTP everyday.
I've messed with WinSCP file masks (e.g. put -filemask="*>=today" C:\local\ /) and ran into issues where it would upload the latest folder (which contained subfiles), but it would also upload the rest of the folders in the directory (they were empty). I then realized file masks only work on files, not folders.
I then came across this thread: Download files newer than X days from SFTP server with WinSCP, skipping folders that do not contain any matching files
The user had the same issue, except he was going remote -> local whereas I need the opposite; the solution was to use PowerShell.
Considering that thread is a couple of years old, does WinSCP scripting now support such a feature? Unfortunately, I'm a bit of a novice with PowerShell.
Thanks for your time.
WinSCP does not support time constraints for folders.
But what has changed (since the other question) is that you can now prevent WinSCP from creating the empty folders. Use the -rawtransfersettings switch with the ExcludeEmptyDirectories setting.
put -rawtransfersettings ExcludeEmptyDirectories=1 -filemask="*>=today" C:\local\ /
If you really need to upload the latest folder (as opposed to uploading the folder with the latest files), using the WinSCP .NET assembly from your favourite language (like PowerShell) is still the way to go, as shown in the other question.
I'm using Node.js to start Watchman on Windows 2016 with a number of file type filters on a specific directory. This directory is being used for staging. Uploaded files will be routed to other folders depending on the filename.
The problem that I'm having is that Watchman is picking up files that are still being uploaded. This causes the moving process to fail because the files are locked. I'm thinking about using this package (@ronomon/opened) to check whether a file is still open before marking it as a candidate for moving. Is there a better way to do it?
Thanks,
Paul
Please take a look at this issue that sounds almost identical to your question; it has some other alternatives and details beyond what I've put below: https://github.com/facebook/watchman/issues/562#issuecomment-355450096
Summarizing that issue here: you need to allow for the filesystem to settle. There is a settle option you can set in your .watchmanconfig to control this:
{"settle": 60000}
You'd place that file in the upload directory (and make sure that you don't mistake it for an uploaded file and move it out) and then re-create your watch.
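Re-creating the watch can be done from the watchman CLI; something along these lines, where the path is a placeholder for your staging directory:
watchman watch-del /path/to/staging
watchman watch /path/to/staging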
I am in an interesting situation where I maintain the code for a program that is used and distributed primarily by our sister company. We are ready to distribute the program to all of the 3rd party users, and since it is technically our sister company's program, we want to host it on their website. (In the interest of anonymity, I'll use 'program' everywhere instead of the actual application name, and 'www.SisterCompany.com' instead of their actual URL.)
So I get everything ready to go, set up the Publish settings to check for updates at program start and the minimum required version, and I set the Installation Folder URL and Update Location to "http://www.SisterCompany.com/apps/program/", with the actual Publishing Folder Location as "C:\LocalProjects\Program\Publish\". Everything else is pretty standard.
After publish, I confirm that everything installs and works correctly when running directly from the publish location on my C: drive. So I put everything on our FTP server, and the guy at our sister company pulls it down and places everything in the '/apps/program/' directory on their webserver.
This is where it goes bad. When I try to install it from their site, I get the "File, Program.exe.config, has a different computed hash than specified in manifest" error. I tested it a bit, and I even get that error when trying to install from any network location on our network other than my local C: drive.
After doing the initial publish in Visual Studio, I have changed no files (which is the usual explanation I found when searching for this error).
What could be causing this? Is it because I set the Installation Folder URL to a location that it isn't initially published to?
Let me know if any additional info is needed.
Thanks.
After bashing my head against this all weekend, I have finally found the answer. After unsigning the project and removing the hash on the offending file (an XML file), I got the program to install, but it was giving me 'Windows Side by Side' errors. I drilled down into the app cache where the file was, and instead of a config XML file, it was one of the HTML files from the website the ClickOnce installer was hosted on. It turns out the web server didn't like serving up .xml (or, as it also turned out, .mdb) files.
This MSDN article ended up giving me the final solution:
I had to make sure that the 'Use ".deploy" file extension' option was selected so that the web server wouldn't mangle files with extensions it didn't like.
I couldn't figure out why that one file's hash would be different. Turns out it wasn't even the same file at all.
Is it possible that one of the FTP transfers is happening in text mode, rather than binary?
For me the problem was that the .config transformations were done after generating the manifest.
To anyone else who's still having trouble, five years later:
The first problem was configuring the MIME type, which on nginx (/etc/nginx/mime.types) should look like this:
application/x-ms-manifest application
See Click Once Server and Client Configuration.
The weirder problem to me was that I was using git to handle the push to the server, i.e.
git remote add live ssh://user@mybox/path/to/publish
git commit -am "committing..."; git push live master
Works great for most things, but it was probably being registered as a "change," which prevented the app from installing locally. Once I started using scp instead:
scp -r * user@mybox:/path/to/dir/
It worked without a hitch.
It is unfortunate that there is not a lot of helpful information out there about this.
A client's Magento site had weird characters at the top of Magento Connect.
We tried installing a plugin and got an error.
It turns out the problem was a bunch of (hidden) duplicate PHP files in lib/Mage/Connect. For example, there's Remote.php but there was also ._Remote.php. This forum post was how we found out the details.
(Deleting the duplicate files corrected the problem).
I'm wondering -- has anyone else experienced this duplicate PHP file problem in Magento before? Any idea what the cause is?
These files are most likely metadata files from OS X's HFS+ file system. See this entire thread on the Apple Stack Exchange for some good starting points if you're interested in the details.
Oversimplifying things: when you create a tar archive on OS X, these files are included along with the "real" file. This allows Macintosh-specific metadata to survive the trip into a file format that wasn't created specifically for the Mac. If you untar the files on a Mac, the metadata is preserved. If you untar the files on a non-Mac, the ._ files show up as ordinary files in case the metadata is needed later.
My guess is that at some point someone tarred up those files on their Mac to move them to the production server, which brought the ._ files along for the ride. You can avoid this in the future by running
export COPYFILE_DISABLE=true
from the terminal prior to copying the files. Details on this here.
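If the stray files have already landed on the server, a cleanup along these lines (using the lib/Mage/Connect path from the question) should remove them; you may want to run it with -print first to review what would be deleted:
find lib/Mage/Connect -name '._*' -type f -delete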
(It's pretty bizarre that PHP would attempt to include those files instead of the correct ones. Did you debug this far enough to know why, or what Connect thought it was doing?)
After reading Michael Lopp's book "Being Geek," I started using Dropbox as a means of synchronizing files between my home computer and work computer. It's been fantastic, it really makes it painless to keep track of the latest version of files you're working on.
My question has to do with people's experience with this tool, especially programmers who may have used it to develop larger projects.
Right now, I see 3 main uses of Dropbox:
1. synchronize files between home and work computers
2. version control (you have to log into the dropbox site to access previous versions)
3. off-site backup
Right now I'm using it as my main backup tool, which I'm not sure is a good idea. But right now I have a local (working) copy of my entire project "checked out" on each computer (my home laptop and my work computer), and additionally, my entire project is kept on the dropbox site. So I'm thinking, if anything happens to one of my computers, or both, I'll still have that off-site backup available and I'll simply have to reinstall dropbox to access all my files.
Does anyone have experience with doing this? Has anyone done a major file recovery using dropbox? Or is this even widely used? Thanks for your feedback in advance.
Using Dropbox to maintain several files and their associated metadata when those files are versioned in a VCS is always a bit tricky because of potential corruption issues (if one of the metadata files that is part of the repository isn't correctly synchronized, you can end up with a non-working repo).
That is why, with Dropbox, I always use:
a DVCS (like Git): I can work directly in a working tree within a Dropbox folder, or I can clone that repo anywhere else outside Dropbox if I need to,
a single bundle file to which I can push at any time the changes from my local repo, wherever that repo might be.
That way, the only file that really needs to be in sync in Dropbox is that single bundle file (representing a bare repo as one file).
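As a rough sketch of that workflow (the bundle path and branch name here are just examples):
git bundle create ~/Dropbox/myproject.bundle --all
git clone ~/Dropbox/myproject.bundle -b master myproject
Re-running the bundle create command is how you "push" fresh changes into the Dropbox copy.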
See "Git with DropBox" for more.