Ruby: How to determine if file being read is binary or text - ruby

I am writing a program in Ruby which will search for strings in text files within a directory - similar to Grep.
I don't want it to attempt to search in binary files but I can't find a way in Ruby to determine whether a file is binary or text.
The program needs to work on both Windows and Linux.
If anyone could point me in the right direction that would be great.
Thanks,
Xanthalas

libmagic is a library which detects filetypes. For this solution I assume, that all mimetype's which start with text/ represent text files. Eveything else is a binary file. This assumption is not correct for all mime types (eg. application/x-latex, application/json), but libmagic detect's these as text/plain.
require "filemagic"
def binary?(filename)
begin
fm= FileMagic.new(FileMagic::MAGIC_MIME)
!(fm.file(filename)=~ /^text\//)
ensure
fm.close
end
end

gem install ptools
require 'ptools'
File.binary?(file)

An alternative to using the ruby-filemagic gem is to rely on the file command that ships with most Unix-like operating systems. I believe it uses the same libmagic library under the hood but you don't need the development files required to compile the ruby-filemagic gem. This is helpful if you're in an environment where it's a bit of work to install additional libraries (e.g. Heroku).
According to man file, text files will usually contain the word text in their description:
$ file Gemfile
Gemfile: ASCII text
You can run the file command through Ruby can capture the output:
require "open3"
def text_file?(filename)
file_type, status = Open3.capture2e("file", filename)
status.success? && file_type.include?("text")
end

Updating above answer with such example, when file name includes "text":
file /tmp/ball-texture.png
/tmp/ball-texture.png: PNG image data, 11 x 18, 8-bit/color RGBA, non-interlaced
So updated code will be like:
def text_file?(filename)
file_type, status = Open3.capture2e('file', filename)
status.success? && file_type.split(':').last.include?('text')
end

Related

Ruby writing zip file works on Mac but not on windows / How to recieve zip file in Net::HTTP

actually i'm writing a ruby script which accesses an API based on HTTP-POST calls.
The API returns a zip file containing textdocuments when i call it with specific POST-Parameters. At the moment i'm doing that with the Net::HTTP Package.
Now my problem:
It seems to return the zip-file as a string as far as i know. I can see "PK" (which i suppose is part of the PK-Header of zip-files) and the text from the documents.
And the Content-Type Header is telling me "application/x-zip-compressed; name="somename.zip"".
When i save the zip file like so:
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "w") do |file|
file.write result.body
end
I can unzip it on my macbook without further problems. But when i run the same code on my Win10 PC it tells me that the file is corrupt or not a ZIP-file.
Has it something to do with the encoding? Can i change it, so it's working on both?
Or is it a complete wrong approach on how to recieve a zip-file from a POST-request?
PS:
My ruby-version on Mac:
ruby 2.2.3p173
My ruby-version on Windows:
ruby 2.2.4p230
Many thanks in advance!
The problem is due to the way Windows handles line endings (\r\n for Windows, whereas OS X and other Unix based operating systems use just \n). When using File.open, using the mode of just w makes the file subject to line ending changes, so any occurrences of byte 0x0A (or \n) are converted into bytes 0x0D 0x0A (or \r\n), which effectively breaks the zip.
When opening the file for write, use the mode wb instead, as this will suppress any line ending changes.
http://ruby-doc.org/core-2.2.0/IO.html#method-c-new-label-IO+Open+Mode
Many thanks! Just as you posted the solution i found it out myself..
So much trouble because of one missing 'b' :/
Thank you very much!
The solution (see Ben Y's answer):
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "wb") do |file|
file.write result.body
end

Encoding problems with ruby while reading in command line arguments with optparse

I'm writing a small programm in ruby, which essentially changes some files within a zip-file. The zip-file is specified as a parameter on the command line and interpreted via the OptionParser.
The problem is, that when specifiying a file, which contains non-ascii characters, the file cannot be opened, saying that it could not be found. This problem occurs using cmd.exe under Windows.
Here is a minimal example:
# example.rb
require "zip"
require "optparse"
zip_file_name = String.new
# read and interprete command line arguments:
OptionParser.new do |opts|
opts.on("-f", "--file FILE", String, "The zip-file, which will be modified") do |f|
zip_file_name = f
end
end.parse!
# Open the zip file:
Zip::File.open(zip_file_name) do |zipfile|
end
If you create a zip-file test.zip and run example.rb -f test.zip everything is okay (it does finish without errors). Doing the same with a zip-file täst.zip gives me an error. I tried doing zip_file_name.encode!(Encoding::UTF_8), but this didn't solve the problem.
It seems to be an encoding problem (the encoding of zip_file_name is cp850) but the transcoding does not seem to work correctly.
So my question would be: How can I change my program to also allow non-ascii characters for specifying files on the command line?
Adding zip_file_name.force_encoding(Encoding::Windows_1252) before opening the file solves the issue (on Western Europe Windows).
Apparently, the CP850 file names encoding is a wrong assumption from Ruby. On my Windows system, it seems that filenames are encoded in Windows_1252 (a custom version of Latin1 or ISO 8859-1).

A ruby script to run tail on a log file?

I want to write a ruby script that read from a config file that will have filenames, and then when I run the script it will take the tail of each file and output the console.
What's the best way to go about doing this?
Take a look at File::Tail gem.
You can invoke linux tail -number_of_lines file_name command from your ruby script and let it print on console or capture output and print it yourself (if you need to do something with these lines before you print it)
We have a configuration file that contain a list of the log files; for example, like this:
---
- C:\fe\logs\front_end.log
- C:\mt\logs\middle_tier.log
- C:\be\logs\back_end.log
The format of the configuration file is a yaml simple sequence , therefore suppose we named this file 'settings.yaml'
The ruby script that take the tail of each file and output the console could be like this:
require 'yaml'
require 'file-tail'
logs = YAML::load(File.open('settings.yaml'))
threads = []
logs.each do |the_log|
threads << Thread.new(the_log) { |log_filename|
File.open(log_filename) do |log|
log.extend(File::Tail)
log.interval = 10
log.backward(10)
log.tail { |line| p "#{File.basename(the_log,".log")} - #{line}" }
end
}
end
threads.each { |the_thread| the_thread.join }
Note: displaying each line I wanted to prefix it with the name of the file from which it originates, ...this for me is a good option but you can edit the script to change as you like ; is the same for the tails parameters.
if file-tail is missing in your environment, follow the link as #Mark Thomas posts in his answear; i.e you need to:
> gem install file-tail
I found the file-tail gem to be a bit buggy. I would write to a file and it would read the entire file again instead of just thelines appended. This happened even though I had log.backward set to 0. I ended up writing my own and figured that I would share it here in case any one else is looking for a Ruby alternative to the file-tail gem. You can find the repo here. It uses non_blocking io, so it will catch amendments to the file immediately. There is one caveat that can be easily fixed if you can program in the Ruby programming language; log.backward is hard coded to be -1.

Automatically open a file as binary with Ruby

I'm using Ruby 1.9 to open several files and copy them into an archive. Now there are some binary files, but some are not. Since Ruby 1.9 does not open binary files automatically as binaries, is there a way to open them automatically anyway? (So ".class" would be binary, ".txt" not)
Actually, the previous answer by Alex D is incomplete. While it's true that there is no "text" mode in Unix file systems, Ruby does make a difference between opening files in binary and non-binary mode:
s = File.open('/tmp/test.jpg', 'r') { |io| io.read }
s.encoding
=> #<Encoding:UTF-8>
is different from (note the "rb")
s = File.open('/tmp/test.jpg', 'rb') { |io| io.read }
s.encoding
=> #<Encoding:ASCII-8BIT>
The latter, as the docs say, set the external encoding to ASCII-8BIT which tells Ruby to not attempt to interpret the result at UTF-8. You can achieve the same thing by setting the encoding explicitly with s.force_encoding('ASCII-8BIT'). This is key if you want to read binary into a string and move them around (e.g. saving them to a database, etc.).
Since Ruby 1.9.1 there is a separate method for binary reading (IO.binread) and since 1.9.3 there is one for writing (IO.binwrite) as well:
For reading:
content = IO.binread(file)
For writing:
IO.binwrite(file, content)
Since IO is the parent class of File, you could also do the following which is probably more expressive:
content = File.binread(file)
File.binwrite(file, content)
On Unix-like platforms, there is no difference between opening files in "binary" and "text" modes. On Windows, "text" mode converts line breaks to DOS style, and "binary" mode does not.
Unless you need linebreak conversion on Windows platforms, just open all the files in "binary" mode. There is no harm in reading a text file in "binary" mode.
If you really want to distinguish, you will have to match File.extname(filename) against a list of known extensions like ".txt" and ".class".

Determine file type in Ruby

How does one reliably determine a file's type? File extension analysis is not acceptable. There must be a rubyesque tool similar to the UNIX file(1) command?
This is regarding MIME or content type, not file system classifications, such as directory, file, or socket.
There is a ruby binding to libmagic that does what you need. It is available as a gem named ruby-filemagic:
gem install ruby-filemagic
Require libmagic-dev.
The documentation seems a little thin, but this should get you started:
$ irb
irb(main):001:0> require 'filemagic'
=> true
irb(main):002:0> fm = FileMagic.new
=> #<FileMagic:0x7fd4afb0>
irb(main):003:0> fm.file('foo.zip')
=> "Zip archive data, at least v2.0 to extract"
irb(main):004:0>
If you're on a Unix machine try this:
mimetype = `file -Ib #{path}`.gsub(/\n/,"")
I'm not aware of any pure Ruby solutions that work as reliably as 'file'.
Edited to add: depending what OS you are running you may need to use 'i' instead of 'I' to get file to return a mime-type.
I found shelling out to be the most reliable. For compatibility on both Mac OS X and Ubuntu Linux I used:
file --mime -b myvideo.mp4
video/mp4; charset=binary
Ubuntu also prints video codec information if it can which is pretty cool:
file -b myvideo.mp4
ISO Media, MPEG v4 system, version 2
You can use this reliable method base on the magic header of the file :
def get_image_extension(local_file_path)
png = Regexp.new("\x89PNG".force_encoding("binary"))
jpg = Regexp.new("\xff\xd8\xff\xe0\x00\x10JFIF".force_encoding("binary"))
jpg2 = Regexp.new("\xff\xd8\xff\xe1(.*){2}Exif".force_encoding("binary"))
case IO.read(local_file_path, 10)
when /^GIF8/
'gif'
when /^#{png}/
'png'
when /^#{jpg}/
'jpg'
when /^#{jpg2}/
'jpg'
else
mime_type = `file #{local_file_path} --mime-type`.gsub("\n", '') # Works on linux and mac
raise UnprocessableEntity, "unknown file type" if !mime_type
mime_type.split(':')[1].split('/')[1].gsub('x-', '').gsub(/jpeg/, 'jpg').gsub(/text/, 'txt').gsub(/x-/, '')
end
end
This was added as a comment on this answer but should really be its own answer:
path = # path to your file
IO.popen(
["file", "--brief", "--mime-type", path],
in: :close, err: :close
) { |io| io.read.chomp }
I can confirm that it worked for me.
If you're using the File class, you can augment it with the following functions based on #PatrickRichie's answer:
class File
def mime_type
`file --brief --mime-type #{self.path}`.strip
end
def charset
`file --brief --mime #{self.path}`.split(';').second.split('=').second.strip
end
end
And, if you're using Ruby on Rails, you can drop this into config/initializers/file.rb and have available throughout your project.
For those who came here by the search engine, a modern approach to find the MimeType in pure ruby is to use the mimemagic gem.
require 'mimemagic'
MimeMagic.by_magic(File.open('tux.jpg')).type # => "image/jpeg"
If you feel that is safe to use only the file extension, then you can use the mime-types gem:
MIME::Types.type_for('tux.jpg') => [#<MIME::Type: image/jpeg>]
You could give shared-mime a try (gem install shared-mime-info). Requires the use ofthe Freedesktop shared-mime-info library, but does both filename/extension checks as well as "magic" checks... tried giving it a whirl myself just now but I don't have the freedesktop shared-mime-info database installed and have to do "real work," unfortunately, but it might be what you're looking for.
Pure Ruby solution using magic bytes and returning a symbol for the matching type:
https://github.com/SixArm/sixarm_ruby_magic_number_type
I wrote it, so if you have suggestions, let me know.
I recently found mimetype-fu.
It seems to be the easiest reliable solution to get a file's MIME type.
The only caveat is that on a Windows machine it only uses the file extension, whereas on *Nix based systems it works great.
The best I found so far:
http://bogomips.org/mahoro.git/
The ruby gem is well.
mime-types for ruby
You could give a go with MIME::Types for Ruby.
This library allows for the identification of a file’s likely MIME content type. The identification of MIME content type is based on a file’s filename extensions.

Resources