Determine file type in Ruby - ruby

How does one reliably determine a file's type? File extension analysis is not acceptable. There must be a rubyesque tool similar to the UNIX file(1) command?
This is regarding MIME or content type, not file system classifications, such as directory, file, or socket.

There is a ruby binding to libmagic that does what you need. It is available as a gem named ruby-filemagic:
gem install ruby-filemagic
Require libmagic-dev.
The documentation seems a little thin, but this should get you started:
$ irb
irb(main):001:0> require 'filemagic'
=> true
irb(main):002:0> fm = FileMagic.new
=> #<FileMagic:0x7fd4afb0>
irb(main):003:0> fm.file('foo.zip')
=> "Zip archive data, at least v2.0 to extract"
irb(main):004:0>

If you're on a Unix machine try this:
mimetype = `file -Ib #{path}`.gsub(/\n/,"")
I'm not aware of any pure Ruby solutions that work as reliably as 'file'.
Edited to add: depending what OS you are running you may need to use 'i' instead of 'I' to get file to return a mime-type.

I found shelling out to be the most reliable. For compatibility on both Mac OS X and Ubuntu Linux I used:
file --mime -b myvideo.mp4
video/mp4; charset=binary
Ubuntu also prints video codec information if it can which is pretty cool:
file -b myvideo.mp4
ISO Media, MPEG v4 system, version 2

You can use this reliable method base on the magic header of the file :
def get_image_extension(local_file_path)
png = Regexp.new("\x89PNG".force_encoding("binary"))
jpg = Regexp.new("\xff\xd8\xff\xe0\x00\x10JFIF".force_encoding("binary"))
jpg2 = Regexp.new("\xff\xd8\xff\xe1(.*){2}Exif".force_encoding("binary"))
case IO.read(local_file_path, 10)
when /^GIF8/
'gif'
when /^#{png}/
'png'
when /^#{jpg}/
'jpg'
when /^#{jpg2}/
'jpg'
else
mime_type = `file #{local_file_path} --mime-type`.gsub("\n", '') # Works on linux and mac
raise UnprocessableEntity, "unknown file type" if !mime_type
mime_type.split(':')[1].split('/')[1].gsub('x-', '').gsub(/jpeg/, 'jpg').gsub(/text/, 'txt').gsub(/x-/, '')
end
end

This was added as a comment on this answer but should really be its own answer:
path = # path to your file
IO.popen(
["file", "--brief", "--mime-type", path],
in: :close, err: :close
) { |io| io.read.chomp }
I can confirm that it worked for me.

If you're using the File class, you can augment it with the following functions based on #PatrickRichie's answer:
class File
def mime_type
`file --brief --mime-type #{self.path}`.strip
end
def charset
`file --brief --mime #{self.path}`.split(';').second.split('=').second.strip
end
end
And, if you're using Ruby on Rails, you can drop this into config/initializers/file.rb and have available throughout your project.

For those who came here by the search engine, a modern approach to find the MimeType in pure ruby is to use the mimemagic gem.
require 'mimemagic'
MimeMagic.by_magic(File.open('tux.jpg')).type # => "image/jpeg"
If you feel that is safe to use only the file extension, then you can use the mime-types gem:
MIME::Types.type_for('tux.jpg') => [#<MIME::Type: image/jpeg>]

You could give shared-mime a try (gem install shared-mime-info). Requires the use ofthe Freedesktop shared-mime-info library, but does both filename/extension checks as well as "magic" checks... tried giving it a whirl myself just now but I don't have the freedesktop shared-mime-info database installed and have to do "real work," unfortunately, but it might be what you're looking for.

Pure Ruby solution using magic bytes and returning a symbol for the matching type:
https://github.com/SixArm/sixarm_ruby_magic_number_type
I wrote it, so if you have suggestions, let me know.

I recently found mimetype-fu.
It seems to be the easiest reliable solution to get a file's MIME type.
The only caveat is that on a Windows machine it only uses the file extension, whereas on *Nix based systems it works great.

The best I found so far:
http://bogomips.org/mahoro.git/

The ruby gem is well.
mime-types for ruby

You could give a go with MIME::Types for Ruby.
This library allows for the identification of a file’s likely MIME content type. The identification of MIME content type is based on a file’s filename extensions.

Related

Ruby 'require' error: cannot load such file

I've one file, main.rb with the following content:
require "tokenizer.rb"
The tokenizer.rb file is in the same directory and its content is:
class Tokenizer
def self.tokenize(string)
return string.split(" ")
end
end
If i try to run main.rb I get the following error:
C:\Documents and Settings\my\src\folder>ruby main.rb
C:/Ruby193/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- tokenizer.rb (LoadError)
from C:/Ruby193/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `require '
from main.rb:1:in `<main>'
I just noticed that if I use load instead of require everything works fine. What may the problem be here?
I just tried and it works with require "./tokenizer".
Just do this:
require_relative 'tokenizer'
If you put this in a Ruby file that is in the same directory as tokenizer.rb, it will work fine no matter what your current working directory (CWD) is.
Explanation of why this is the best way
The other answers claim you should use require './tokenizer', but that is the wrong answer, because it will only work if you run your Ruby process in the same directory that tokenizer.rb is in. Pretty much the only reason to consider using require like that would be if you need to support Ruby 1.8, which doesn't have require_relative.
The require './tokenizer' answer might work for you today, but it unnecessarily limits the ways in which you can run your Ruby code. Tomorrow, if you want to move your files to a different directory, or just want to start your Ruby process from a different directory, you'll have to rethink all of those require statements.
Using require to access files that are on the load path is a fine thing and Ruby gems do it all the time. But you shouldn't start the argument to require with a . unless you are doing something very special and know what you are doing.
When you write code that makes assumptions about its environment, you should think carefully about what assumptions to make. In this case, there are up to three different ways to require the tokenizer file, and each makes a different assumption:
require_relative 'path/to/tokenizer': Assumes that the relative path between the two Ruby source files will stay the same.
require 'path/to/tokenizer': Assumes that path/to/tokenizer is inside one of the directories on the load path ($LOAD_PATH). This generally requires extra setup, since you have to add something to the load path.
require './path/to/tokenizer': Assumes that the relative path from the Ruby process's current working directory to tokenizer.rb is going to stay the same.
I think that for most people and most situations, the assumptions made in options #1 and #2 are more likely to hold true over time.
Ruby 1.9 has removed the current directory from the load path, and so you will need to do a relative require on this file, as David Grayson says:
require_relative 'tokenizer'
There's no need to suffix it with .rb, as Ruby's smart enough to know that's what you mean anyway.
require loads a file from the $LOAD_PATH. If you want to require a file relative to the currently executing file instead of from the $LOAD_PATH, use require_relative.
I would recommend,
load './tokenizer.rb'
Given, that you know the file is in the same working directory.
If you're trying to require it relative to the file, you can use
require_relative 'tokenizer'
I hope this helps.
Another nice little method is to include the current directory in your load path with
$:.unshift('.')
You could push it onto the $: ($LOAD_PATH) array but unshift will force it to load your current working directory before the rest of the load path.
Once you've added your current directory in your load path you don't need to keep specifying
require './tokenizer'
and can just go back to using
require 'tokenizer'
This will work nicely if it is in a gem lib directory and this is the tokenizer.rb
require_relative 'tokenizer/main'
For those who are absolutely sure their relative path is correct, my problem was that my files did not have the .rb extension! (Even though I used RubyMine to create the files and selected that they were Ruby files on creation.)
Double check the file extensions on your file!
What about including the current directory in the search path?
ruby -I. main.rb
I used jruby-1.7.4 to compile my ruby code.
require 'roman-numerals.rb'
is the code which threw the below error.
LoadError: no such file to load -- roman-numerals
require at org/jruby/RubyKernel.java:1054
require at /Users/amanoharan/.rvm/rubies/jruby-1.7.4/lib/ruby/shared/rubygems/custom_require.rb:36
(root) at /Users/amanoharan/Documents/Aptana Studio 3 Workspace/RubyApplication/RubyApplication1/Ruby2.rb:2
I removed rb from require and gave
require 'roman-numerals'
It worked fine.
The problem is that require does not load from the current directory. This is what I thought, too but then I found this thread. For example I tried the following code:
irb> f = File.new('blabla.rb')
=> #<File:blabla.rb>
irb> f.read
=> "class Tokenizer\n def self.tokenize(string)\n return string.split(
\" \")\n end\nend\n"
irb> require f
LoadError: cannot load such file -- blabla.rb
from D:/dev/Ruby193/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `req
uire'
from D:/dev/Ruby193/lib/ruby/1.9.1/rubygems/custom_require.rb:36:in `req
uire'
from (irb):24
from D:/dev/Ruby193/bin/irb:12:in `<main>'
As it can be seen it read the file ok, but I could not require it (the path was not recognized). and here goes code that works:
irb f = File.new('D://blabla.rb')
=> #<File:D://blabla.rb>
irb f.read
=> "class Tokenizer\n def self.tokenize(string)\n return string.split(
\" \")\n end\nend\n"
irb> require f
=> true
As you can see if you specify the full path the file loads correctly.
First :
$ sudo gem install colored2
And,you should input your password
Then :
$ sudo gem update --system
Appear
Updating rubygems-update
ERROR: While executing gem ... (OpenSSL::SSL::SSLError)
hostname "gems.ruby-china.org" does not match the server certificate
Then:
$ rvm -v
$ rvm get head
Last
What language do you want to use?? [ Swift / ObjC ]
ObjC
Would you like to include a demo application with your library? [ Yes / No ]
Yes
Which testing frameworks will you use? [ Specta / Kiwi / None ]
None
Would you like to do view based testing? [ Yes / No ]
No
What is your class prefix?
XMG
Running pod install on your new library.
you need to give the path.
Atleast you should give the path from the current directory. It will work for sure.
./filename

Why isn't this glob working for a Rake FileList for my server?

How come I get an empty filelist from:
files = FileList.new("#{DEPLOYMENT_PATH}\**\*")
Where DEPLOYMENT_PATH is \\myserver\anndsomepath
How to get a filelist from a server like this? Is this an issue of Ruby/Rake?
UPDATE:
I tried:
files = FileList.new("#{DEPLOYMENT_PATH}\\**\\*")
files = Dir.glob("#{DEPLOYMENT_PATH}\\**\\*")
files = Dir.glob("#{DEPLOYMENT_PATH}\**\*")
UPDATE AGAIN: It works if I put server as:
//myserver/andsomepath
and get files like this:
files = FileList.new("#{DEPLOYMENT_PATH}/**/*")
Ruby' File.join is designed to be your helper when dealing with file paths, by building them in a system-independent way:
File.join('a','b','c')
=> "a/b/c"
So:
DEPLOYMENT_PATH = File.join('', 'myserver', 'andsomepath')
=> "/myserver/andsomepath"
Ruby determines the file path separator by sensing the OS, and is supposed to automatically supply the right value. On Windows XP, Linux and Mac OS it is:
File::SEPARATOR
=> "/"
File.join(DEPLOYMENT_PATH, '**', '*')
=> "/myserver/andsomepath/**/*"
While you can ignore the helper, it is there to make your life easier. Because you are working against a server, you might want to look into File::ALT_SEPARATOR, or just reassigning to SEPARATOR and ignore the warning, letting Ruby do the rest.
What happens if you do
Dir.glob("#{DEPLOYMENT_PATH}\**\*")
Edit: I think Ruby prefers you doing Unix-style slashes, even when you're on Windows. I assume the rationale is that it's better for the same code to work on both Unix and Windows, even if it looks weird on Windows.
tl;dr: If it works with / but not with \, then use what works.
Because:
> "\*" == "*"
=> true
Use "\\**\\*" instead.

Unicode filenames on Windows in Ruby

I have a piece of code that looks like this:
Dir.new(path).each do |entry|
puts entry
end
The problem comes when I have a file named こんにちは世界.txt in the directory that I list.
On a Windows 7 machine I get the output:
???????.txt
From googling around, properly reading this filename on windows seems to be an impossible task. Any suggestions?
I had the same problem & just figured out how to get the entries of a directory in UTF-8 in Windows. The following worked for me (using Ruby 1.9.2p136):
opts = {}
opts[:encoding] = "UTF-8"
entries = Dir.entries(path, opts)
entries.each do |entry|
# example
stat = File::stat(entry)
puts "Size: " + String(stat.size)
end
You're out of luck with pure ruby (either 1.8 or 1.9.1) since it uses the ANSI versions of the Windows API.
It seems like Ruby 1.9.2 will support Unicode filenames on Windows. This bug report has 1.9.2 as target. According to this announcement Ruby 1.9.2 will be released at the end of July 2010.
If you really need it earlier you could try to use FindFirstFileW etc. directly via Win32API.new or win32-api.
My solution was to use Dir.glob instead of Dir.entries. But it only works with * parameter. It does not work when passing a path (c:/dir/*). Tested in 1.9.2p290 and 1.9.3p0 on Windows 7.
There are many other issues with unicode paths on Windows. It is still an open issue. The patches are currently targeted at Ruby 2.0, which is rumored to be released in 2013.

Ruby: How to determine if file being read is binary or text

I am writing a program in Ruby which will search for strings in text files within a directory - similar to Grep.
I don't want it to attempt to search in binary files but I can't find a way in Ruby to determine whether a file is binary or text.
The program needs to work on both Windows and Linux.
If anyone could point me in the right direction that would be great.
Thanks,
Xanthalas
libmagic is a library which detects filetypes. For this solution I assume, that all mimetype's which start with text/ represent text files. Eveything else is a binary file. This assumption is not correct for all mime types (eg. application/x-latex, application/json), but libmagic detect's these as text/plain.
require "filemagic"
def binary?(filename)
begin
fm= FileMagic.new(FileMagic::MAGIC_MIME)
!(fm.file(filename)=~ /^text\//)
ensure
fm.close
end
end
gem install ptools
require 'ptools'
File.binary?(file)
An alternative to using the ruby-filemagic gem is to rely on the file command that ships with most Unix-like operating systems. I believe it uses the same libmagic library under the hood but you don't need the development files required to compile the ruby-filemagic gem. This is helpful if you're in an environment where it's a bit of work to install additional libraries (e.g. Heroku).
According to man file, text files will usually contain the word text in their description:
$ file Gemfile
Gemfile: ASCII text
You can run the file command through Ruby can capture the output:
require "open3"
def text_file?(filename)
file_type, status = Open3.capture2e("file", filename)
status.success? && file_type.include?("text")
end
Updating above answer with such example, when file name includes "text":
file /tmp/ball-texture.png
/tmp/ball-texture.png: PNG image data, 11 x 18, 8-bit/color RGBA, non-interlaced
So updated code will be like:
def text_file?(filename)
file_type, status = Open3.capture2e('file', filename)
status.success? && file_type.split(':').last.include?('text')
end

Ruby path management

What is the best way to manage the require paths in a ruby program?
Let me give a basic example, consider a structure like:
\MyProgram
\MyProgram\src\myclass.rb
\MyProgram\test\mytest.rb
If in my test i use require '../src/myclass' then I can only call the test from \MyProgram\test folder, but I want to be able to call it from any path!
The solution I came up with is to define in all source files the following line:
ROOT = "#{File.dirname(__FILE__)}/.." unless defined?(ROOT) and then always use require "#{ROOT}/src/myclass"
Is there a better way to do it?
As of Ruby 1.9 you can use require_relative to do this:
require_relative '../src/myclass'
If you need this for earlier versions you can get it from the extensions gem as per this SO comment.
Here is a slightly modified way to do it:
$LOAD_PATH.unshift File.expand_path(File.join(File.dirname(__FILE__), "..", "src"))
By prepending the path to your source to $LOAD_PATH (aka $:) you don't have to supply the root etc. explicitly when you require your code i.e. require 'myclass'
The same, less noisy IMHO:
$:.unshift File.expand_path("../../src", __FILE__)
require 'myclass'
or just
require File.expand_path "../../src/myclass", __FILE__
Tested with ruby 1.8.7 and 1.9.0 on (Debian) Linux - please tell me if it works on Windows, too.
Why a simpler method (eg. 'use', 'require_relative', or sg like this) isn't built into the standard lib? UPDATE: require_relative is there since 1.9.x
Pathname(__FILE__).dirname.realpath
provides a the absolute path in a dynamic way.
Use following code to require all "rb" files in specific folder (=> Ruby 1.9):
path='../specific_folder/' # relative path from current file to required folder
Dir[File.dirname(__FILE__) + '/'+path+'*.rb'].each do |file|
require_relative path+File.basename(file) # require all files with .rb extension in this folder
end
sris's answer is the standard approach.
Another way would be to package your code as a gem. Then rubygems will take care of making sure your library files are in your path.
This is what I ended up with - a Ruby version of a setenv shell script:
# Read application config
$hConf, $fConf = {}, File.expand_path("../config.rb", __FILE__)
$hConf = File.open($fConf) {|f| eval(f.read)} if File.exist? $fConf
# Application classpath
$: << ($hConf[:appRoot] || File.expand_path("../bin/app", __FILE__))
# Ruby libs
$lib = ($hConf[:rubyLib] || File.expand_path("../bin/lib", __FILE__))
($: << [$lib]).flatten! # lib is string or array, standardize
Then I just need to make sure that this script is called once before anything else, and don't need to touch the individual source files.
I put some options inside a config file, like the location of external (non-gem) libraries:
# Site- and server specific config - location of DB, tmp files etc.
{
:webRoot => "/srv/www/myapp/data",
:rubyLib => "/somewhere/lib",
:tmpDir => "/tmp/myapp"
}
This has been working well for me, and I can reuse the setenv script in multiple projects just by changing the parameters in the config file. A much better alternative than shell scripts, IMO.

Resources