How can I detect the programming language of a snippet?

I have a string containing some text. The text may or may not be code. Using GitHub's Linguist, I have been able to detect the likely programming language, but only if I give it a list of candidates.
# test_linguist_1.rb
#!/usr/bin/env ruby
require 'linguist'
s = "int main(){}"
candidates = [Linguist::Language["Python"], Linguist::Language["C"], Linguist::Language["Ruby"]]
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect
Execution:
$ ./test_linguist_1.rb
[#<Linguist::Language name=C>, #<Linguist::Language name=Python>, #<Linguist::Language name=Ruby>]
Notice that I gave it a list of candidates. How can I avoid having to define a list of candidates?
I tried the following:
# test_linguist_2.rb
#!/usr/bin/env ruby
require 'linguist'
s = "int main(){}"
candidates = Linguist::Language.all
# I also tried only Popular
# candidates = Linguist::Language.popular
b = Linguist::Blob.new('', s)
langs = Linguist::Classifier.call(b, candidates)
puts langs.inspect
Execution:
$ ./test_linguist_2.rb
/home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:131:in `token_probability': undefined method `[]' for nil:NilClass (NoMethodError)
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:120:in `block in tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `inject'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:119:in `tokens_probability'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:105:in `block in classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `each'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:104:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:78:in `classify'
from /home/marvelez/.rvm/gems/ruby-2.2.1/gems/github-linguist-4.8.9/lib/linguist/classifier.rb:20:in `call'
from ./test_linguist.rb:21:in `block in <main>'
from ./test_linguist.rb:14:in `each'
from ./test_linguist.rb:14:in `<main>'
Additional:
Is this the best way to use GitHub Linguist? FileBlob is an alternative to Blob, but it requires writing my string to a file. This is problematic for two reasons: 1) it is slow, and 2) the chosen file extension then guides Linguist, and we do not know the correct file extension.
Are there better tools to do this? GitHub Linguist perhaps works well on files but not on strings.

Taking a quick look at the source code of Linguist, it appears to use a number of strategies to determine the language, and it calls each strategy in turn. Classifier is the last strategy to be called, by which time it has (hopefully) picked up language "candidates" (as you've discovered for yourself) from the prior strategies. So I think for the particular sample you've shared with us, you have to pass a filename of some kind, even if a file doesn't actually exist, or a list of language candidates. If neither is an option for you, this may not be a feasible solution for your problem.
$ ruby -r linguist -e 'p Linguist::Blob.new("foo.c", "int main(){}").language'
#<Linguist::Language name=C>
It returns nil without a filename, and #<Linguist::Language name=C++> with "foo.cc" and the same code sample.
The good news is that you picked a really bad sample to test with. :-) Other strategies look at modelines and shebangs, so more complex samples have a better chance at succeeding. Take a look at these:
$ ruby -r linguist -e 'p Linguist::Blob.new("", "#!/usr/bin/env perl
print q{Hello, world!};
").language'
#<Linguist::Language name=Perl>
$ ruby -r linguist -e 'p Linguist::Blob.new("", "# vim: ft=ruby
puts %q{Hello, world!}
").language'
#<Linguist::Language name=Ruby>
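The idea of trying one strategy after another can be pictured with a minimal sketch. This is not Linguist's code: SnippetDetector, its lookup tables, and the order of the strategies are all hypothetical, and where the real library would fall back to its trained classifier this sketch simply returns nil.

```ruby
# Hypothetical illustration of a "try each strategy in turn" detector.
# None of these names or tables come from Linguist itself.
module SnippetDetector
  SHEBANG_LANGUAGES = { "perl" => "Perl", "ruby" => "Ruby", "python" => "Python" }.freeze
  EXTENSION_LANGUAGES = { ".c" => "C", ".cc" => "C++", ".rb" => "Ruby" }.freeze

  module_function

  # Vim modeline such as "# vim: ft=ruby"
  def from_modeline(_filename, source)
    filetype = source[/\bvim:.*\bft=(\w+)/, 1]
    filetype && filetype.capitalize
  end

  # Shebang such as "#!/usr/bin/env perl" or "#!/usr/bin/ruby"
  def from_shebang(_filename, source)
    first_line = source.lines.first.to_s
    return nil unless first_line.start_with?("#!")

    words = first_line[2..-1].split
    interpreter = File.basename(words.first.to_s)
    interpreter = words[1].to_s if interpreter == "env"
    SHEBANG_LANGUAGES[interpreter]
  end

  # File extension; an empty filename yields no match
  def from_extension(filename, _source)
    EXTENSION_LANGUAGES[File.extname(filename)]
  end

  # Return the first answer any strategy produces, or nil if none applies.
  def detect(filename, source)
    [:from_modeline, :from_shebang, :from_extension].each do |strategy|
      language = send(strategy, filename, source)
      return language if language
    end
    nil
  end
end

p SnippetDetector.detect("", "#!/usr/bin/env perl\nprint q{Hello, world!};\n")
p SnippetDetector.detect("foo.c", "int main(){}")
p SnippetDetector.detect("", "int main(){}")
```

The last call has no shebang, modeline, or filename to go on, which is the same situation as the bare "int main(){}" sample.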
However, if there isn't a shebang or a modeline, we're still out of luck. It turns out that there's a training dataset that is computed and serialized to disk at install time, and automatically loaded during language detection. Unfortunately, I think there's a bug in the library that is preventing this training dataset from being used if there aren't any candidates by the time it gets to this step. Fixing the bug lets me do this:
$ ruby -Ilib -r linguist -e 'p Linguist::Blob.new("", "int main(){}").language'
#<Linguist::Language name=XC>
(I don't know what XC is, but adding some other tokens to the string such as #include <stdio.h> or int argc, char* argv[] gives C. I'm sure most of your samples will have more meat to analyze.)
It's a real simple fix and I've submitted a PR for it. You can use my fork of the gem in the meantime if you'd like. Otherwise, we'll need to look into using Linguist::Classifier directly, as you've started exploring, but that has the potential to get messy.
To use my fork, add/modify your Gemfile to read as such:
gem 'github-linguist',
  require: 'linguist',
  git: 'https://github.com/mwpastore/linguist.git',
  branch: 'fix-no-candidates'
I'll try to come back and update this answer when the PR has been merged and a new version of the gem has been released with the fix. If I have to do any force-pushes to meet the repository guidelines and/or make the maintainers happy, you may have to run bundle update to pick up the changes. Let me know if you have any questions.

Taking another quick look at Linguist source, Linguist::Language.all seems to be what you're looking for.
EDIT: Tried the Linguist::Language.all myself. The failure is due to yet another bug: some languages seem to have faulty data. For example, this also fails:
candidates = [Linguist::Language['ADA']]
This is apparently because tokens.ADA doesn't exist in lib/linguist/samples.json. It is not the only such language.
To avoid the bug, you can filter the languages:
non_buggy_languages = Linguist::Samples.cache['tokens'].keys
candidates = non_buggy_languages.map { |l| Linguist::Language[l] }
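To see why a language without token data produces "undefined method `[]' for nil:NilClass", here is a small stand-alone sketch. TOKENS, token_count, and safe_candidates are hypothetical stand-ins for the training data and the classifier lookup, not Linguist's actual structures.

```ruby
# Hypothetical miniature of the training data: language name => token counts.
TOKENS = {
  "C"    => { "int" => 40, "main" => 12 },
  "Ruby" => { "def" => 30, "end" => 55 }
}.freeze

# Look up how often a token was seen for a language.
# TOKENS[language] is nil for a language with no samples, so the second
# lookup becomes nil[token] and raises NoMethodError.
def token_count(token, language)
  TOKENS[language][token].to_i
end

# Keep only the candidates that actually have token data.
def safe_candidates(names)
  names.select { |name| TOKENS.key?(name) }
end

candidates = %w[C Ruby Ada]

begin
  candidates.each { |language| token_count("int", language) }
rescue NoMethodError => e
  puts "unfiltered candidates fail with #{e.class}"
end

puts "filtered candidates: #{safe_candidates(candidates).inspect}"
```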

Related

Filter ruby warnings

I want to create a gem to filter warnings in Ruby, and I'd like to do this for both "syntax" and "runtime" warnings. I am struggling to work out how it's possible to filter the syntax-level warnings (or whether it is possible at all).
For example, if I run the following script
# myscript.rb
@blah
with ruby -w myscript.rb
myscript.rb:1: warning: possibly useless use of a variable in void context
myscript.rb:1: warning: instance variable @blah not initialized
Now, imagine this is part of a larger project. I would like to filter out any warnings from myscript. How would I go about doing this? Runtime warnings would be easy to filter using silence_warnings-style code from ActiveSupport: https://github.com/rails/rails/blob/3be9a34e78835a8dafc3438f60afb412613773b9/activesupport/lib/active_support/core_ext/kernel/reporting.rb
But I don't know how (or if it's possible) to hook into Ruby's syntax-level warnings, since they seem to be emitted before you have the chance to monkey patch anything. All I can think of is to wrap the Ruby script in another process that filters all the warnings. For example:
myfilterprogram ruby -w myscript.rb
This would catch anything printed to STDERR and filter it accordingly.
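A wrapper of that kind could look roughly like the following sketch. The run_filtered helper and the pattern are hypothetical; it relies on Open3.capture3 from the standard library, which buffers all output until the child exits, so it would not suit long-running or interactive programs.

```ruby
require "open3"
require "rbconfig"

# Run a command and return [stdout, kept_stderr_lines, exit_status].
# Lines on standard error that match `ignore` are dropped.
def run_filtered(command, ignore)
  stdout, stderr, status = Open3.capture3(*command)
  kept = stderr.lines.reject { |line| line.match?(ignore) }
  [stdout, kept, status]
end

# Demonstration: the child prints two warning-like lines to standard error.
child_code = "warn 'myscript.rb:1: warning: noisy'; " \
             "warn 'other.rb:2: warning: keep'; puts 'done'"
stdout, kept, status = run_filtered([RbConfig.ruby, "-e", child_code], /\Amyscript\.rb:/)

print stdout
$stderr.print kept.join
puts "exit status: #{status.exitstatus}"
```

In a real wrapper the command would be something like [RbConfig.ruby, "-w", "myscript.rb"], and the wrapper would exit with the child's status.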

You may not be able to monkey patch before the main file is read, but you can make your main file call subfiles after doing monkeypatching.
myruby (executable)
#!/usr/bin/env ruby
module Kernel
  def warn *args
    args # => captured warnings
  end
end
load ARGV[0]
Usage is:
myruby foo.rb
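On newer Rubies there is also the built-in Warning module, whose warn method can be overridden by extending it. The following is a sketch under assumptions rather than a tested replacement for the script above: WarningFilter, IGNORED_PATHS, and capture_warnings are hypothetical names, the category: keyword assumes Ruby 3.0 or later, and whether parse-time warnings from a loaded file reach this hook depends on the Ruby version.

```ruby
require "stringio"

# Hypothetical filter: drop warnings that mention an ignored file.
module WarningFilter
  IGNORED_PATHS = ["myscript.rb"].freeze

  def warn(message, category: nil)
    return if IGNORED_PATHS.any? { |path| message.include?("#{path}:") }

    super # fall through to the default behaviour (print to $stderr)
  end
end

Warning.extend(WarningFilter)

# Helper for the demonstration: collect whatever reaches $stderr in a block.
def capture_warnings
  original = $stderr
  $stderr = StringIO.new
  yield
  $stderr.string
ensure
  $stderr = original
end

shown = capture_warnings do
  Warning.warn("myscript.rb:1: warning: noisy\n")
  Warning.warn("other.rb:2: warning: still shown\n")
end
puts shown

# A wrapper script would then finish with: load ARGV[0]
```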

Error: undefined method "each" for String when running elastic-mapreduce specifying distributed cache file

I've got the following error:
Error: undefined method `each' for "s3n://dico-count-words/Cache/dicoClazz.p#dicoClazzCache.p":String
It happens when I run the following command line to launch a MapReduce job on an Amazon EMR cluster via elastic-mapreduce, specifying a distributed cache file:
./elastic-mapreduce --create --stream \
> --input s3n://dico-count-words/Words \
> --output s3n://dico-count-words/Output \
> --mapper s3n://dico-count-words/Code/mapper.py \
> --reducer s3n://dico-count-words/Code/reducer.py \
> --log-uri s3n://dico-count-words/Logs \
> --cache s3n://dico-count-words/Cache/dicoClazz.p#dicoClazz.p
I followed the instructions I found here.
I haven't had any issue running similar commands to create other clusters that didn't need a distributed cache file. I have also managed to run this very job using the AWS console, but I would prefer to do it via the CLI.
I think it might be an issue with Ruby similar to this one. But I don't know anything about Ruby, so it's just a guess. It's also the first time I have used AWS, and therefore elastic-mapreduce.
For your information, this is the version of ruby I have:
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]
Do you have any ideas about where that error is coming from? Any suggestions to fix it?
Many thanks.

each is not available on the Ruby String class (String#each was removed in Ruby 1.9).
As an example, let's look at the following:
x = "test"
x.each {|character| puts character}
>>>NoMethodError: undefined method `each' for "test":String
This is what you're seeing in your code, and that's to be expected. Open irb in your terminal and try the following:
2.0.0-p247 :001 > x = "test"
2.0.0-p247 :002 > x.each <now hit tab twice>
2.0.0-p247 :003 > x.each_
x.each_byte x.each_char x.each_codepoint x.each_line
You should see something close to the above. I happen to be using ruby 2.0.0-p247. Your version may differ.
irb supports autocompletion. Here we started typing each, and irb suggests the available methods that start with each. As you can see, plain each is not in fact an option. Instead, think of a string as an array of characters. That is, "test" can be thought of as ["t", "e", "s", "t"]. Given this representation, it's clear what x.each_char will do: it will yield each character in the string. Thus:
x.each_char {|c| puts c}
will of course print each character of the string.
If we can see some of the code and data moving through it, it would be easier to suggest a solution for you. However, the above explanation is the reason you are throwing an error when calling .each on a String.
I hope this is helpful.
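One common cause of this error is code that expects a list but is handed a single String. I have not looked at the elastic-mapreduce source, so the following is only a sketch of the usual defensive pattern: each_cache_file is a hypothetical helper, and Array() wraps a lone String in an array so that each works either way.

```ruby
# Hypothetical helper: accept either one cache spec or a list of them.
def each_cache_file(value)
  return enum_for(:each_cache_file, value) unless block_given?

  # Array("a") => ["a"], Array(["a", "b"]) => ["a", "b"], Array(nil) => []
  Array(value).each { |spec| yield spec }
end

each_cache_file("s3n://dico-count-words/Cache/dicoClazz.p#dicoClazz.p") do |spec|
  puts "cache file: #{spec}"
end
```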

In a Ruby / an IRB, how can I use the yard documentation gem to interact with the code's stats and parsed information?

The yard gem is a tool for generating docs of ruby code.
It's done via the command line, and the docs get generated.
However, I was wondering if it's possible to interact with the parsed code and statistics via IRB.
You can go into the IRB and call up yard like this:
require 'yard'
YARD
However, I can't seem to interact with the code or get any parsed stats. For example, getting the list of methods in the code would be great, or an object containing the method lists from the parser.
Docs: (http://rubydoc.info/gems/yard/YARD)

See documentation on the YARD::Registry (linked to from the Architecture Overview document) for accessing code objects that got parsed out after you ran yard.
Example for printing all methods in your registry:
require 'yard'
YARD::Registry.load!
puts YARD::Registry.all(:method).map(&:path)
A more advanced example, getting all undocumented objects:
require 'yard'
YARD::Registry.load!
puts YARD::Registry.all.select {|o| o.docstring.blank? }.map(&:path)
You can see more properties of CodeObjects in the CodeObjects Architecture Overview in addition to the YARD::CodeObjects::Base class API docs. That will give you more information on what you can query. You may also want to look at the YARD::Docstring class if you plan on introspecting tags.
Note that if you want to actually generate the registry (you haven't yet run yard), you can do so with the Registry class, but it would probably be better to use YARD::CLI::Yardoc.run for that.

Looking at the yard source, it seems the gem is not meant to be used for that purpose, but you can still extract some information.
require 'yard'
stats = YARD::CLI::Stats.new
stats.run(files) # it allows pattern matching, for instance 'lib/*.rb'
stats.all_objects # all the objects recognized by the parser
stats.all_objects.select{|o| o.type == :method} # to get only methods
stats.all_objects.select{|o| o.type == :class} # to get only classes
I'm not sure if that is enough for you, but I don't think you can get information at a deeper level.

It really depends on what you need to do, but I strongly suggest you take a look at pry, which lets you do great things:
[1] pry(main)> require 'cgi'
=> true
[2] pry(main)> show-method CGI::escape
From: /home/carlesso/.rbenv/versions/2.1.2/lib/ruby/2.1.0/cgi/util.rb # line 7:
Owner: CGI::Util
Visibility: public
Number of lines: 6
def escape(string)
  encoding = string.encoding
  string.b.gsub(/([^ a-zA-Z0-9_.-]+)/) do |m|
    '%' + m.unpack('H2' * m.bytesize).join('%').upcase
  end.tr(' ', '+').force_encoding(encoding)
end
and even more strange stuff:
[4] pry(main)> cd CGI
[5] pry(CGI):1> ls
constants:
Cookie CR EOL HtmlExtension HTTP_STATUS InvalidEncoding LF MAX_MULTIPART_COUNT MAX_MULTIPART_LENGTH NEEDS_BINMODE PATH_SEPARATOR QueryExtension REVISION Util
Object.methods: yaml_tag
CGI::Util#methods:
escape escapeElement escapeHTML escape_element escape_html h pretty rfc1123_date unescape unescapeElement unescapeHTML unescape_element unescape_html
CGI.methods: accept_charset accept_charset= parse
CGI#methods: accept_charset header http_header nph? out print
class variables: @@accept_charset
locals: _ __ _dir_ _ex_ _file_ _in_ _out_ _pry_
You can also edit something: edit CGI::escape will open your $EDITOR at the relevant file and line (in my case, vim is opened at .rbenv/versions/2.1.2/lib/ruby/2.1.0/cgi/util.rb, line 7).
Where available, help shows a command's usage:
[10] pry(CGI):1> help Pry.hist
Usage: hist [--head|--tail]
hist --all
hist --head N
hist --tail N
hist --show START..END
hist --grep PATTERN
hist --clear
hist --replay START..END
hist --save [START..END] FILE
Aliases: history
Show and replay Readline history.
-a, --all Display all history
-H, --head Display the first N items
-T, --tail Display the last N items
-s, --show Show the given range of lines
-G, --grep Show lines matching the given pattern
-c, --clear Clear the current session's history
-r, --replay Replay a line or range of lines
--save Save history to a file
-e, --exclude-pry Exclude Pry commands from the history
-n, --no-numbers Omit line numbers
-h, --help Show this message.
But, again, it really depends on your needs. A little bit of "metaprogramming" can help too: .methods, .instance_variables, and .constants can be useful.
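As a small illustration of those reflection calls, here is a sketch on a made-up class. Snippet and its members are hypothetical and exist only so there is something to inspect.

```ruby
# Hypothetical class used only to have something to introspect.
class Snippet
  LANGUAGES = %w[C Ruby].freeze

  def initialize(source)
    @source = source
    @language = nil
  end

  def source
    @source
  end

  def detect
    @language
  end
end

snippet = Snippet.new("int main(){}")

# Methods defined directly on the class (false leaves out inherited ones).
p Snippet.instance_methods(false).sort

# Instance variables currently set on the object.
p snippet.instance_variables.sort

# Constants defined in the class.
p Snippet.constants

# Every method the object responds to, including inherited ones.
p snippet.methods.include?(:detect)
```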

Running ruby gem sprockets from command line

I am finding very little documentation on running sprockets from the command line.
Does anyone know how to setup the .sprocketsrc file?
Examples would be great especially on how to configure the minification.

If you read the source directly, you can see at https://github.com/sstephenson/sprockets/blob/master/bin/sprockets#L8 that it uses Shellwords, which comes with the standard Ruby library: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/shellwords/rdoc/Shellwords.html and http://www.ruby-doc.org/stdlib-1.9.3/libdoc/shellwords/rdoc/Shellwords.html#method-c-shellsplit
So we can infer from:
unless ARGV.delete("--noenv")
  if File.exist?(path = "./.sprocketsrc")
    rcflags = Shellwords.split(File.read(path))
    ARGV.unshift(*rcflags)
  end
end
that it basically prepends whatever it finds in .sprocketsrc to the command-line arguments.
https://github.com/sstephenson/sprockets/blob/master/bin/sprockets#L22 gives us the list of options, so to configure the tool you can create a .sprocketsrc with something like
--include=assets/javascripts --output build/assets/javascripts
Sadly, the command line doesn't appear to have any option for configuring minification.
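To make the effect of Shellwords.split concrete, here is a minimal sketch of the same prepending logic. merge_rc_flags and the application.js argument are hypothetical; only Shellwords.split is taken from the script quoted above.

```ruby
require "shellwords"

# Mimic the quoted snippet: flags read from the rc file are placed
# in front of the arguments given on the command line.
def merge_rc_flags(rc_contents, argv)
  Shellwords.split(rc_contents) + argv
end

rc = "--include=assets/javascripts --output build/assets/javascripts\n"
p merge_rc_flags(rc, ["application.js"])

# Quoting in the rc file works as it would in a shell.
p Shellwords.split('--output "build dir/assets"')
```

Because the rc flags come first, an option repeated on the command line appears later in the argument list.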

Telling rspec to not load files

I'm trying to add some commit hooks to my git repo. I want to leverage RSpec and create commit message specs that will run each time I commit. I have figured out how to run RSpec outside of the 'spec' command, but I now have an interesting problem.
Here is my current code:
.git/hooks/commit-msg
#!/usr/bin/env ruby
require 'rubygems'
require 'spec/autorun'
message = File.read(ARGV[0])
describe "failing" do
  it "should fail" do
    true.should == false
  end
end
This throws an error when it gets to the describe call. Basically, it thinks that the commit message file it receives is the file to load and run the specs against. Here is the actual error:
./.git/COMMIT_EDITMSG:1: undefined local variable or method `commit-message-here' for main:Object (NameError)
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner/example_group_runner.rb:15:in `load'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner/example_group_runner.rb:15:in `load_files'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner/example_group_runner.rb:14:in `each'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner/example_group_runner.rb:14:in `load_files'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner/options.rb:133:in `run_examples'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner.rb:61:in `run'
from /Users/roykolak/.gem/ruby/1.8/gems/rspec-1.3.0/lib/spec/runner.rb:45:in `autorun'
from .git/hooks/commit-msg:12
I am looking for a way to tell rspec to not load files. I have a suspicion that I will need to create my own spec runner. I came to this conclusion after viewing these lines in rspec-1.3.0/lib/spec/runner/example_group_runner.rb
def load_files(files)
  $KCODE = 'u' if RUBY_VERSION.to_f < 1.9
  # It's important that loading files (or choosing not to) stays the
  # responsibility of the ExampleGroupRunner. Some implementations (like)
  # the one using DRb may choose *not* to load files, but instead tell
  # someone else to do it over the wire.
  files.each do |file|
    load file
  end
end
But, I would like some feedback before I do that. Any thoughts?

Do you really need all the special stuff that RSpec provides (should and the various matchers) just to verify the contents of a single file? It seems like overkill for the problem.
spec/autorun eventually calls Spec::Runner.autorun, which parses ARGV as if it held normal arguments for a spec command line. When you install a bare “spec” file as a Git hook, it gets arguments that are appropriate for whatever Git hook is being used, not spec-style arguments (spec filenames/directories/patterns and spec options).
You might be able to hack around the problem like this:
# Save original ARGV, replace its elements with spec arguments
orig_argv = ARGV.dup
%w(--format nested).inject(ARGV.clear, :<<)
require 'rubygems'
require 'spec/autorun'
# rest of your code/spec
# NOTE: to refer to the Git hook arguments use orig_argv instead of ARGV
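If RSpec does turn out to be overkill, a commit-msg hook can check the message with plain Ruby. The rules below (a 50-character subject limit and a blank second line) and the names MAX_SUBJECT_LENGTH and commit_message_errors are hypothetical examples, not anything Git or RSpec prescribes.

```ruby
# Hypothetical rule: keep the subject line short.
MAX_SUBJECT_LENGTH = 50

# Return a list of problems found in a commit message (empty when it passes).
def commit_message_errors(message)
  # Ignore the comment lines Git adds to the message template.
  lines = message.lines.map(&:chomp).reject { |line| line.start_with?("#") }
  subject = lines.first.to_s

  errors = []
  errors << "subject line is empty" if subject.strip.empty?
  if subject.length > MAX_SUBJECT_LENGTH
    errors << "subject line is longer than #{MAX_SUBJECT_LENGTH} characters"
  end
  errors << "second line must be blank" if lines[1] && !lines[1].strip.empty?
  errors
end

p commit_message_errors("Fix language detection\n\nExplain the change here.\n")
p commit_message_errors("Fix\nmore text straight away\n")

# In .git/hooks/commit-msg the check would be wired up roughly like this:
#   errors = commit_message_errors(File.read(ARGV[0]))
#   abort(errors.join("\n")) unless errors.empty?
```

Git passes the path of the message file as the first argument, and a non-zero exit status from the hook aborts the commit.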
