Remove macros etc. from Office documents via Ruby

Is there a way to specify components to remove from MS Office or OpenOffice documents via Ruby? I'm talking about removing macros and meta information, and also removing or replacing images. I've looked at a number of conversion programs with a view to converting from and to the same file format, but I can't find any that allow such options to be specified.
I've looked at:
Convert_office
Abiword - I've modified the original gem to allow conversion to doc as well as pdf.

Docx files are really zip files. You can unzip (inflate) them into a directory, delete or change the files you need, update any references to those files, and zip the result back up. Most of the files inside the zip are XML text files, so you can edit them with LibXML-Ruby or Nokogiri.
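The steps above can be sketched as follows. This is only an illustration in Python using its standard library, not a tested recipe for Word; the same steps map onto a zip library plus Nokogiri in Ruby. The function name strip_parts and its parameters are hypothetical, the part names you pass in (for example word/vbaProject.bin for macros or docProps/core.xml for metadata) are assumptions to check against the real archive listing, and a replacement image is assumed to be in the same format as the one it overwrites.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def _drop_children(xml_bytes, ns, tag, is_unwanted):
    """Remove root-level <tag> elements for which is_unwanted(elem) is true."""
    ET.register_namespace("", ns)  # keep the default namespace on output
    root = ET.fromstring(xml_bytes)
    for child in root.findall(f"{{{ns}}}{tag}"):
        if is_unwanted(child):
            root.remove(child)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def strip_parts(src, dst, drop, image_bytes=None):
    """Copy the package at src to dst without the parts named in drop.

    If image_bytes is given, every part under word/media/ is overwritten
    with it; the part name stays the same, so no references change.
    """
    drop = set(drop)
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            name = info.filename
            if name in drop:
                continue  # leave the unwanted part out of the new archive
            data = zin.read(name)
            if name == "[Content_Types].xml":
                # Drop content-type overrides that point at removed parts.
                data = _drop_children(
                    data, CT_NS, "Override",
                    lambda e: e.get("PartName", "").lstrip("/") in drop)
            elif name.endswith(".rels"):
                # Targets are relative to the folder that contains _rels/.
                base = posixpath.dirname(posixpath.dirname(name))

                def points_at_dropped(e, base=base):
                    if e.get("TargetMode") == "External":
                        return False
                    target = posixpath.join(base, e.get("Target", ""))
                    return posixpath.normpath(target).lstrip("/") in drop

                data = _drop_children(data, REL_NS, "Relationship",
                                      points_at_dropped)
            elif image_bytes is not None and name.startswith("word/media/"):
                data = image_bytes
            zout.writestr(name, data)
```

A call such as strip_parts("in.docm", "out.docm", {"word/vbaProject.bin"}) would write a copy without that part and without the relationship and content-type entries that referred to it. Whether an office suite opens the result cleanly is something to verify on real documents.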

Related

Wget download many files in the sublink of a webpage

I am trying to download many files (~30,000) using wget, all files are in the following webpage:
http://galex.stsci.edu/gr6/?page=tilelist&survey=ais&showall=Y
However, the real data is under a sublink after I click Fits and then some file under this sublink is displayed. For example, the sublink of the first file is the following:
http://galex.stsci.edu/gr6/?page=downloadlist&tilenum=50270&type=coaddI&subvis=28&img=1
I only want to download one file in this sublink: Intensity Map of band NUV. In this above case, it is the second file that I want to download.
All files have the same structure. How could I use wget to download all the files under sublink?
The Intensity Map files for the NUV band share a common ending, so you should be able to download only those files by running wget -r -A "*nd-int.fits.gz" against the target site. This uses wget's recursive retrieval option, -r, and its accept list option, -A, which restricts downloads to files whose names match the given suffixes or patterns. Whether recursive retrieval can actually crawl the whole of your target site is something you'll have to test.
If the above doesn't work, the website seems to have handy tools for filtering available files, such as a catalog search.

ChemSpider refuses to accept the .MOL file I present it

I converted a .pdb file to a .MOL file with BABEL (converter software). I do get the .MOL file, but when I submit it online for a similar-structure search, ChemSpider doesn't even load the file.
To use PubChem's database you need SMILES format, which BABEL cannot properly convert my .pdb file to. So I'm out of luck there.
Any way I can search my .MOL file on any chemical database that's out there?
Thanks

Get website image info/metadata

So I have scoured google for mention of anybody trying to use powershell to get information about files from a URL/URI but with no luck. I have found ways to get metadata of files from a local source but nothing for an image hosted on a website.
What I want to do:
I have a list of image URLs, e.g. www.website/images/img.jpg, and want to grab the metadata without having to download the entire image. I would then store this info and export it to a CSV to look over later.
So far my code has been limited to calling System.Net.WebClient.DownloadFile() and then operating on the files locally. Is it possible to do this remotely?
I suppose you're referring to EXIF metadata. Those are embedded in the file, so unless the remote host provides an API that exposes this information you must download the file to be able to read the information.
Judging from what I gleaned from the standard, the information is stored near the beginning of the file, so you could try to download just the first couple hundred bytes. However, the size of the EXIF block doesn't seem to be fixed, so you'll want to retrieve a large enough chunk. Also, standard EXIF parsers might not work on incomplete images, so you might need to write your own parser.
All in all I'd say downloading the entire file and extracting the information with standard tools is your best option.
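For anyone who wants to try the partial download anyway, here is a minimal sketch of the idea in Python (the same HTTP Range header can be sent from PowerShell). It assumes the server honours Range requests and that the image is a JPEG whose EXIF data sits in an APP1 segment before the image data. The function names fetch_head and find_exif_segment and the default chunk size are hypothetical choices for illustration, and the returned bytes still need a TIFF/EXIF parser.

```python
import struct
import urllib.request


def fetch_head(url, nbytes=131072):
    """Request only the first nbytes of url via an HTTP Range header."""
    req = urllib.request.Request(
        url, headers={"Range": f"bytes=0-{nbytes - 1}"})
    with urllib.request.urlopen(req) as resp:
        # A server that ignores Range sends the whole body; cap the read.
        return resp.read(nbytes)


def find_exif_segment(data):
    """Return the EXIF payload (from the TIFF header on), or None.

    None means the segment is absent or not fully inside this chunk.
    """
    if data[:2] != b"\xff\xd8":          # JPEG start-of-image marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None                  # lost sync with the segment chain
        marker = data[pos + 1]
        if marker == 0xDA:               # start of scan: image data follows
            return None
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and body[:6] == b"Exif\x00\x00":
            if len(body) < length - 2:
                return None              # chunk ended inside the segment
            return body[6:]
        pos += 2 + length                # length counts its own two bytes
    return None
```

If find_exif_segment returns None for a chunk, the simple fallback is the one recommended above: download the whole file and use a standard tool.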

Is there a gem that can be used to list the contents of non-zip archives?

I'm working on a project that needs to get a file list from a variety of different archives files (tar.gz, rar, tar.bz2, and zip) without expanding the archive. Rubyzip works well for zip files, but I can't find any equivalent for the other formats. Any suggestions?
Edit: I forgot to mention that this needs to be cross-platform, so I can't fall back on outside tools.
I don't know of anything that handles all of those formats, but you could do this with a shell call and a little bit of parsing of the result.
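As a point of comparison for the cross-platform requirement, listing without an external tool can look like the sketch below. It is written in Python with only the standard library, so it covers zip, tar.gz and tar.bz2 but not rar, which would still need a third-party library or an outside tool. The function name list_archive is hypothetical. In Ruby, the standard library's Zlib::GzipReader together with Gem::Package::TarReader may serve the same purpose for tar.gz files.

```python
import tarfile
import zipfile


def list_archive(path):
    """Return the member names of a zip or tar archive without extracting."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    if tarfile.is_tarfile(path):
        # "r:*" lets tarfile detect gzip, bzip2 or xz compression itself.
        with tarfile.open(path, "r:*") as tf:
            return tf.getnames()
    raise ValueError(f"unsupported or unrecognised archive: {path}")
```

Only the archive's index or headers are read here; no member is written to disk.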

Can VS_VERSION_INFO be added to non-exe files?

My Windows co-workers were asking me if I could modify my non-Windows binary files so that when their "Properties" are examined under Windows, they show a "Version" tab like the one shown for an exe compiled with Visual Studio.
Specifically, I have some gzipped binary files and was wondering if I could modify them to satisfy this demand. If there's a better way, that would be fine, too.
Is there a way I could make my binaries appear to be exe files?
I tried simply appending the VS_VERSION_INFO block from notepad.exe to the end of one of my binaries in the hope that Windows scans for the block, but it didn't work.
I tried editing the other information regarding Author, Subject, Revision, etc. That doesn't modify the file; it just creates another data fork (what's the Windows term?) for the file in NTFS.
It is not supported by Windows, since each file type has its own file format. But that doesn't mean you can't accomplish it. The resources stored inside dlls and exes are part of the file format.
Display to the user:
If you want this information to be displayed to the user, this is probably best accomplished with a property page shell extension. You would create a similar-looking page, but it wouldn't be the exact same page. There is a really good multi-part tutorial on shell extensions that also covers property pages.
Where to actually store the resource:
Instead of appending a block to the file, you could store the resource in a separate alternate data stream on the same file. This would leave the original file stream uncorrupted on disk and would not change its primary file size.
Alternate data streams allow more than one data stream to be associated with a filename. Each stream is addressed by appending a colon and a stream identifier to the filename.
You can create them for example by doing:
notepad test.txt:adsname1
notepad test.txt:adsname2
notepad test.txt
Getting the normal Win32 APIs working:
If you wanted the normal API to work, you'd have to intercept the Win32 APIs: LoadLibraryEx, FindResource, LoadResource and LockResource. This is probably not worth the trouble though since you are already creating your own property page.
Can't think of any way to do this short of a shell extension. The approach I've taken in the past is a separate "census" program that knows how to read version information from any kind of file.
Zip files can be converted into exe files by using a program that turns a zip file into a self-extracting zip (I know that WinZip does this, and there are most likely free utilities for it as well; one came up in a search, but I haven't actually tried it). Once you've got an exe, you should be able to use a tool like Resource Hacker to change the version information.
It won't work. For it to work, either Windows would have to know every file format, or every file format would have to tolerate version information being appended to it.
No, resource section is only expected inside PE (portable executable; exe, dll, sys).
It is more than just putting the data inside the file: there is a table in the file header that points to the data.
What you can do, if you have an NTFS drive, is use an NTFS stream to store custom properties. This way the content of the binary file remains the same, but you will need a custom shell extension to show the content of the stream.
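To make the point about the header table concrete, here is a small sketch in Python that checks whether a file is a PE image at all and, if so, reads the resource table entry from the optional header's data directories. The function name resource_directory is hypothetical, and the offsets are the ones given in the PE/COFF layout as I understand it, so treat this as an illustration rather than a validated parser.

```python
import struct


def resource_directory(data):
    """Return (rva, size) of the PE resource table, or None.

    None is returned for non-PE data and for PE images without resources.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":        # DOS header magic
        return None
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew field
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        return None
    opt = pe_off + 4 + 20            # skip signature and COFF file header
    if len(data) < opt + 2:
        return None
    (magic,) = struct.unpack_from("<H", data, opt)
    # Offset of the data directories: 96 for PE32, 112 for PE32+.
    dirs = {0x10B: 96, 0x20B: 112}.get(magic)
    if dirs is None or len(data) < opt + dirs + 3 * 8:
        return None
    (count,) = struct.unpack_from("<I", data, opt + dirs - 4)
    if count < 3:                    # the resource table is entry index 2
        return None
    rva, size = struct.unpack_from("<II", data, opt + dirs + 2 * 8)
    return (rva, size) if size else None
```

A gzipped file starts with the bytes 1f 8b rather than MZ, so this returns None for it no matter what is appended at the end, which matches the failed experiment described in the question.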

Resources