Ruby - Matching Twitter URL from any html page using Regex - ruby

I am trying to fetch the Twitter URL from this page, for instance; however, my result is nil. I am pretty sure my regex is not too bad, but my code fails. Here it is:
doc = `(curl --url "http://www.rabbitreel.com/")`
twitter_url = ("/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+").match(doc)
puts twitter_url
# => nil
Maybe, I misused regex syntax. My initial idea was simple: I wanted to match a regular Twitter url structure. I even tried http://rubular.com to test my regex, and it seemed to be fine when I entered a Twitter url.

http://ruby-doc.org/core-2.2.0/String.html#method-i-match
tells you that the object you're calling match on should be the string you're parsing, and the parameter should be the regex pattern. So if anything, you should call :
doc.match("/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+")
I prefer
doc[/your_regex/]
syntax, because it directly delivers a String, and not a MatchData, which needs another step to get the information out of.
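To make the difference between the two call styles concrete, here is a minimal sketch. The HTML fragment is a hypothetical stand-in for the fetched page, not the real content of the site.

```ruby
# Hypothetical HTML fragment standing in for the fetched page.
doc = '<a href="https://twitter.com/rabbitreel">Twitter</a>'
pattern = /https?:\/\/twitter\.com\/\w+/

# String#match takes the pattern as its argument and returns MatchData (or nil).
match_data = doc.match(pattern)
via_match = match_data && match_data[0]

# String#[] with a Regexp returns the matched String directly (or nil).
via_index = doc[pattern]

puts via_match
puts via_index
```

Both variables end up holding the same URL string; the second form just skips the MatchData step.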
For Regexen, I always try to begin as simply as possible and build up step by step:
[3] pry(main)> doc[/twitter/]
=> "twitter"
[4] pry(main)> doc[/twitter\.com/]
=> "twitter.com"
[5] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[6] pry(main)> doc[/twitter\.com\/\//] #OOPS. One \/ too many
=> nil
[7] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[8] pry(main)> doc[/twitter\.com\/\S+/]
=> "twitter.com/rabbitreel\""
[9] pry(main)> doc[/twitter\.com\/[^"]+/]
=> "twitter.com/rabbitreel"
[10] pry(main)> doc[/http:\/\/twitter\.com\/[^"]+/]
=> nil
[11] pry(main)> doc[/https?:\/\/twitter\.com\/[^"]+/]
=> "https://twitter.com/rabbitreel"
[12] pry(main)> doc[/https?:\/\/twitter\.com\/[^" ]+/]
=> "https://twitter.com/rabbitreel"
[13] pry(main)> doc[/https?:\/\/twitter\.com\/\w+/] #DONE
=> "https://twitter.com/rabbitreel"
EDIT:
Sure, Regexen cannot parse an entire HTML document.
Here, we only want to find the first occurrence of a Twitter URL. So, depending on the requirements, on possible input and the chosen platform, it could make sense to use a Regexp.
Nokogiri is a huge gem, and it might not be possible to install it.
Independently from this fact, it would be a very good idea to check that the returned String really is a correct Twitter URL.
I think this Regexp:
/https?:\/\/twitter\.com\/\w+/
is safe.
[31] pry(main)> malicious_doc = "https://twitter.com/userid#maliciouswebsite.com"
=> "https://twitter.com/userid#maliciouswebsite.com"
[32] pry(main)> malicious_doc[/https?:\/\/twitter\.com\/\w+/]
=> "https://twitter.com/userid"
Using Nokogiri doesn't prevent you from checking for malicious input.
The proposed solution from @mudasobwa is interesting, but isn't safe yet:
[33] pry(main)> Nokogiri::HTML('<html><body><a href="http://maliciouswebsitethatisnottwitter.com/">Link</a></body></html>').css('a').map { |e| e.attributes.values.first.value }.select {|e| e =~ /twitter.com/ }
=> ["http://maliciouswebsitethatisnottwitter.com/"]
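One way to do the check suggested above is to parse the candidate with the standard URI library and compare the host exactly. This is only a sketch: `twitter_url?` is a hypothetical helper name, and it assumes that only the bare `twitter.com` host should be accepted (so `www.twitter.com` would be rejected).

```ruby
require 'uri'

# Hypothetical helper: true only when the string parses as an http(s) URL
# whose host is exactly "twitter.com".
def twitter_url?(url)
  uri = URI.parse(url)
  %w[http https].include?(uri.scheme) && uri.host == 'twitter.com'
rescue URI::InvalidURIError
  false
end

puts twitter_url?('https://twitter.com/rabbitreel')
puts twitter_url?('http://maliciouswebsitethatisnottwitter.com/')
```

Because the host is compared as a whole, a domain that merely ends in or contains "twitter.com" does not pass.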

NB: as of Nov 2021, the rabbitreel.com domain is on sale, so please read the comments about the possibility of its serving malicious content.
One should never use regexps to parse HTML and here is why.
Below is a robust solution using Nokogiri HTML parsing library:
require 'nokogiri'
doc = Nokogiri::HTML(`(curl --url "http://www.rabbitreel.com/")`)
doc.css('a').map { |e| e.attributes.values.first.value }
.select {|e| e =~ /twitter.com/ }
#⇒ [
# [0] "https://twitter.com/rabbitreel",
# [1] "https://twitter.com/rabbitreel"
# ]
Or, alternatively, with xpath:
require 'nokogiri'
doc = Nokogiri::HTML(`(curl --url "http://www.rabbitreel.com/")`)
doc.xpath('//a[contains(@href, "twitter.com")]')
.map { |e| e.attributes['href'].value }

Related

Ruby Twitter, Retrieving Full Tweet Text

I'm using the Ruby Twitter gem to retrieve the full text of a tweet.
I first tried this, and as you can see the text was truncated.
[5] pry(main)> t = client.status(782845350918971393)
=> #<Twitter::Tweet id=782845350918971393>
[6] pry(main)> t.text
=> "A #Gameofthrones fan? Our #earlybird Dublin starter will get you
touring the GOT location in 2017
#traveldealls… (SHORTENED URL WAS HERE)"
Then I tried this:
[2] pry(main)> t = client.status(782845350918971393, tweet_mode: 'extended')
=> #<Twitter::Tweet id=782845350918971393>
[3] pry(main)> t.full_text
=>
[4] pry(main)> t.text
=>
Both the text and full text are empty when I use the tweet_mode: 'extended' option.
I also tried editing the bit of the gem that makes the request, the response was the same.
perform_get_with_object("/1.1/statuses/show/#{extract_id(tweet)}.json?tweet_mode=extended", options, Twitter::Tweet)
Any help would be greatly appreciated.
Here's a workaround I found helpful:
Below is the way I am handling this issue ATM. Seems to be working. I am using both Streaming (with Tweetstream) and REST APIs.
status = @client.status(1234567890, tweet_mode: "extended")
if status.truncated? && status.attrs[:extended_tweet]
  # Streaming API, and REST API default
  t = status.attrs[:extended_tweet][:full_text]
else
  # REST API with extended mode, or untruncated text in Streaming API
  t = status.attrs[:text] || status.attrs[:full_text]
end
From https://github.com/sferik/twitter/pull/848#issuecomment-329425006
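The branching in the workaround can be tried without the gem by modelling the payloads as plain hashes. This is a sketch only: `full_text_from` is a hypothetical helper, the `:truncated` key stands in for `status.truncated?`, and the two sample hashes are invented to mirror the two cases named in the comments above.

```ruby
# Hypothetical helper working on a plain Hash shaped like status.attrs.
def full_text_from(attrs)
  extended = attrs[:extended_tweet]
  if attrs[:truncated] && extended
    # Truncated payload that carries the long text in a nested hash.
    extended[:full_text]
  else
    # Extended-mode payload, or a short tweet that was never truncated.
    attrs[:text] || attrs[:full_text]
  end
end

# Invented sample payloads, for illustration only.
truncated_payload = {
  truncated: true,
  text: 'A shortened tweet...',
  extended_tweet: { full_text: 'A shortened tweet, now with the rest of it' }
}
extended_payload = { truncated: false, full_text: 'The whole tweet' }

puts full_text_from(truncated_payload)
puts full_text_from(extended_payload)
```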

How does File.expand_path work?

I'm not sure I understand the order in which operations are done with File.expand_path. Below is an example pry session:
[1] pry(main)> File.expand_path('.')
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[2] pry(main)> File.expand_path('..')
=> "/Users/max/Dropbox/work/src/github.com/mbigras"
[3] pry(main)> File.expand_path('..', "cats")
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[4] pry(main)> File.expand_path('..', __FILE__)
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[5] pry(main)> File.expand_path('../lib', __FILE__)
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie/lib"
[7] pry(main)> File.expand_path('./lib')
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie/lib"
[8] pry(main)> File.expand_path('./lib', __FILE__)
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie/(pry)/lib"
[9] pry(main)>
[1] makes sense, I'm expanding the path of the current working directory.
[2] makes sense, I'm expanding the path of the parent directory of the cwd
[3] doesn't make sense, I accept from reading another answer that for some reason ruby implicitly takes the File.dirname of the second arg and in the case of File.dirname('cats') it expands to the cwd . because 'cats' isn't nested. But then why doesn't File.expand_path('..', '.') have the same result?
[18] pry(main)> File.expand_path('..', 'cats')
=> "/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[19] pry(main)> File.dirname('cats')
=> "."
[20] pry(main)> File.expand_path('..', '.')
=> "/Users/max/Dropbox/work/src/github.com/mbigras"
[4] doesn't make sense but for the same reason as [3]. In this case the "random string" is "(pry)" because p __FILE__ #=> "(pry)" while inside a pry session.
[5] doesn't make sense, why would File.expand_path go to seemingly no one's parent directory and then magically come back to the cwd and decide to go into lib?
[7] makes sense, but doesn't help me understand [5]
[8] doesn't make sense, why is the "random string" now wedged between the cwd . and lib
From the docs:
File.expand_path("../../lib/mygem.rb", __FILE__)
#=> ".../path/to/project/lib/mygem.rb"
So first it resolves the parent of __FILE__, that is bin/, then go to the parent, the root of the project and appends lib/mygem.rb.
The order of operations doesn't really add up to me.
1. Take File.dirname(__FILE__)
2. Go to the parent which is the root
3. Append lib/mygem.rb
Steps 2 and 3 don't help. Why are we going to the parent? Why did we even do Step 1 in the first place? Why is there ../..? Doesn't that mean go two levels up from the current working directory?
Would love some guiding principles to understand these examples.
Edit to add the Gold:
File.expand_path goes to the first parameter from the directory specified by the second parameter (Dir.pwd if not present). - Eric Duminil
Theory
I think you missed a .. while reading the answer you link to.
No File.dirname is ever done implicitly by expand_path.
File.expand_path('../../Gemfile', __FILE__)
# ^^ Note the difference between here and
# vv there
File.expand_path('../Gemfile', File.dirname(__FILE__))
What confused me at first was that the second parameter is always considered to be a directory by File.expand_path, even if it doesn't exist, even if it looks like a file or even if it is an existing file.
File.expand_path goes to the first parameter from the directory specified by the second parameter (Dir.pwd if not present).
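The two spellings compared above can be checked against each other with a made-up absolute path. The path below is hypothetical and need not exist, and the sketch assumes a Unix-style filesystem.

```ruby
# Hypothetical file location; expand_path never touches the filesystem.
file = '/tmp/project/bin/mygem'

# The second argument is treated as a directory, so one extra ".." is needed
# to climb out of the "directory" named after the file itself.
from_file = File.expand_path('../../Gemfile', file)
from_dir  = File.expand_path('../Gemfile', File.dirname(file))

# Without a second argument the starting directory is Dir.pwd.
same_as_pwd = File.expand_path('lib') == File.expand_path('lib', Dir.pwd)

puts from_file
puts from_dir
puts same_as_pwd
```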
Examples
[3]
File.expand_path('..', "cats")
It is executed in the current directory, which is "/Users/max/Dropbox/work/src/github.com/mbigras/foobie".
Is cats an existing file, an existing directory or a non-existent directory?
It doesn't matter to File.expand_path : "cats" is considered to be an existing directory, and File.expand_path starts inside it.
This command is equivalent to launching :
File.expand_path('..')
inside the "/Users/max/Dropbox/work/src/github.com/mbigras/foobie/cats" directory.
So expand_path goes back one directory, and lands back to :
"/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[4]
Same thing. It is equivalent to File.expand_path('..') from the (probably) non-existing :
"/Users/max/Dropbox/work/src/github.com/mbigras/foobie/(pry)"
So it is :
"/Users/max/Dropbox/work/src/github.com/mbigras/foobie"
[5]
Going from [4], it just goes to the subfolder lib.
[8]
Starting from "/Users/max/Dropbox/work/src/github.com/mbigras/foobie/(pry)"
, it just goes to the subfolder lib.
Once again, File.expand_path never checks if the corresponding folders and subfolders exist.
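The four cases can be reproduced outside pry by passing the starting directory explicitly. In this sketch `base` is a hypothetical stand-in for the current working directory, and Unix-style paths are assumed.

```ruby
# Hypothetical stand-in for the cwd; none of these directories need to exist.
base = '/home/max/foobie'

case3 = File.expand_path('..', "#{base}/cats")       # start in cats, go up once
case4 = File.expand_path('..', "#{base}/(pry)")      # start in (pry), go up once
case5 = File.expand_path('../lib', "#{base}/(pry)")  # up once, then into lib
case8 = File.expand_path('./lib', "#{base}/(pry)")   # stay in (pry), then into lib

puts case3, case4, case5, case8
```

Only the last case keeps `(pry)` in the result, because it is the only one that never leaves the starting directory.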

chef 11: any way to turn attributes into a ruby hash?

I'm generating a config for my service in chef attributes. However, at some point, I need to turn the attribute mash into a simple ruby hash. This used to work fine in Chef 10:
node.myapp.config.to_hash
However, starting with Chef 11, this does not work. Only the top level of the attribute is converted to a hash, with the nested values remaining immutable mash objects. Modifying them leads to errors like this:
Chef::Exceptions::ImmutableAttributeModification
------------------------------------------------
Node attributes are read-only when you do not specify which precedence level to set. To set an attribute use code like `node.default["key"] = "value"'
I've tried a bunch of ways to get around this issue which do not work:
node.myapp.config.dup.to_hash
JSON.parse(node.myapp.config.to_json)
The json parsing hack, which seems like it should work great, results in:
JSON::ParserError
unexpected token at '"#<Chef::Node::Attribute:0x000000020eee88>"'
Is there any actual reliable way, short of including a nested parsing function in each cookbook, to convert attributes to a simple, ordinary, good old ruby hash?
After a resounding lack of answers both here and on the Opscode Chef mailing list, I ended up using the following hack:
class Chef
  class Node
    class ImmutableMash
      def to_hash
        h = {}
        self.each do |k,v|
          if v.respond_to?('to_hash')
            h[k] = v.to_hash
          else
            h[k] = v
          end
        end
        return h
      end
    end
  end
end
I put this into the libraries dir in my cookbook; now I can use attribute.to_hash in both Chef 10 (which already worked properly and which is unaffected by this monkey-patch) and Chef 11. I've also reported this as a bug to Opscode.
If you don't want to have to monkey-patch your Chef, speak up on this issue:
http://tickets.opscode.com/browse/CHEF-3857
Update: monkey-patch ticket was marked closed by these PRs
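The same recursive idea can be written as a standalone function that also descends into arrays, which the `respond_to?('to_hash')` test above does not cover. This is a sketch, not Chef code: `deep_to_hash` and `FrozenMash` are hypothetical names, and it assumes the attribute containers are Hash and Array subclasses, which you should verify for your Chef version.

```ruby
# Hypothetical helper: recursively copy a nested structure into plain,
# unfrozen Hash and Array objects.
def deep_to_hash(value)
  case value
  when Hash
    value.each_with_object({}) { |(k, v), h| h[k] = deep_to_hash(v) }
  when Array
    value.map { |v| deep_to_hash(v) }
  else
    value
  end
end

# FrozenMash is a made-up stand-in for a read-only attribute container.
class FrozenMash < Hash; end

inner = FrozenMash.new
inner['bar'] = 'baz'
inner.freeze

config = FrozenMash.new
config['foo'] = [inner]
config.freeze

plain = deep_to_hash(config)
plain['foo'][0]['bar'] = 'changed' # allowed: the copy is not frozen

puts plain.class
puts plain['foo'][0].class
```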
I hope I am not too late to the party but merging the node object with an empty hash did it for me:
chef (12.6.0)> {}.merge(node).class
=> Hash
I had the same problem and after much hacking around came up with this:
json_string = node[:attr_tree].inspect.gsub(/\=\>/,':')
my_hash = JSON.parse(json_string, {:symbolize_names => true})
inspect does the deep parsing that is missing from the other methods proposed and I end up with a hash that I can modify and pass around as needed.
This has been fixed for a long time now:
[1] pry(main)> require 'chef/node'
=> true
[2] pry(main)> node = Chef::Node.new
[....]
[3] pry(main)> node.default["fizz"]["buzz"] = { "foo" => [ { "bar" => "baz" } ] }
=> {"foo"=>[{"bar"=>"baz"}]}
[4] pry(main)> buzz = node["fizz"]["buzz"].to_hash
=> {"foo"=>[{"bar"=>"baz"}]}
[5] pry(main)> buzz.class
=> Hash
[6] pry(main)> buzz["foo"].class
=> Array
[7] pry(main)> buzz["foo"][0].class
=> Hash
[8] pry(main)>
It was probably fixed sometime in or around Chef 12.x or Chef 13.x; it is certainly no longer an issue in Chef 15.x/16.x/17.x.
The above answer is a little unnecessary. You can just do this:
json = node[:whatever][:whatever].to_hash.to_json
JSON.parse(json)

How can I parse json and write that data to a database using Sinatra and DataMapper

I'm doing a proof of concept thing here and having a bit more trouble than I thought I was going to. Here is what I want to do and how I am currently doing it.
I am sending my Sinatra app a json file which contains the simple message below.
[
{
title: "A greeting!",
message: "Hello from the Chairman of the Board"
}
]
From there I have a post handler which I am using to take the params and write them to a SQLite database:
post '/note' do
  data = JSON.parse(params) #<---EDIT - added, now gives error.
  @note = Note.new :title => params[:title],
                   :message => params[:message],
                   :timestamp => (params[:timestamp] || Time.now)
  @note.save
end
When I send the message the timestamp and the id are saved to the database however the title and message are nil.
What am I missing?
Thanks
Edit:
Now when I run my app and send it the json file I get this error:
C:/Users/Norm/ruby/Ruby192/lib/ruby/1.9.1/webrick/server.rb:183:in `block in start_thread'
TypeError: can't convert Hash into String
Edit 2: Some success.
I have the above json in a file called test.json, which is the way the json will be posted. In order to post the file I used HTTPClient:
require 'httpclient'
HTTPClient.post 'http://localhost:4567/note', [ :file => File.new('.\test.json') ]
After thinking about it some more, I thought posting the file was the problem, so I tried sending it a different way. The example below worked once I changed my post /note handler to this:
data = JSON.parse(request.body.read)
My new send.rb
require 'net/http'
require 'rubygems'
require 'json'
@host = 'localhost'
@port = '4567'
@post_ws = "/note"
@payload = {
  "title" => "A greeting from...",
  "message" => "... Sinatra!"
}.to_json
def post
  req = Net::HTTP::Post.new(@post_ws, initheader = {'Content-Type' =>'application/json'})
  #req.basic_auth @user, @pass
  req.body = @payload
  response = Net::HTTP.new(@host, @port).start {|http| http.request(req) }
  puts "Response #{response.code} #{response.message}:
#{response.body}"
end
thepost = post
puts thepost
So I am getting closer. Thanks for all the help so far.
Sinatra won't parse the JSON automatically for you, but luckily parsing JSON is pretty straightforward:
Start with requiring it as usual. require 'rubygems' if you're not on Ruby 1.9+:
>> require 'json' #=> true
>> a_hash = {'a' => 1, 'b' => [0, 1]} #=> {"a"=>1, "b"=>[0, 1]}
>> a_hash.to_json #=> "{\"a\":1,\"b\":[0,1]}"
>> JSON.parse(a_hash.to_json) #=> {"a"=>1, "b"=>[0, 1]}
That's a round trip: create some JSON, then parse it. The IRB output shows the hash and embedded array were converted to JSON, then parsed back into the hash. You should be able to break that down for your nefarious needs.
To get the fields we'll break down the example above a bit more and pretend that we've received JSON from the remote side of your connection. So, the received_json below is the incoming data stream. Pass it to the JSON parser and you'll get back a Ruby data hash. Access the hash as you would normally and you get the values:
>> received_json = a_hash.to_json #=> "{\"a\":1,\"b\":[0,1]}"
>> received_hash = JSON.parse(received_json) #=> {"a"=>1, "b"=>[0, 1]}
>> received_hash['a'] #=> 1
>> received_hash['b'] #=> [0, 1]
The incoming JSON is probably a parameter in your params[] hash but I am not sure what key it would be hiding under, so you'll need to figure that out. It might be called 'json' or 'data' but that's app specific.
Your database code looks ok, and must be working if you're seeing some of the data written to it. It looks like you just need to retrieve the fields from the JSON.
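Applied to the payload from the question, retrieving the fields could look like the sketch below. It assumes the body arrives as a string with quoted keys (Ruby's JSON parser rejects unquoted ones) and that, as in the question, the top level is an array holding one object. The `body` variable is a hypothetical stand-in for whatever the request delivers.

```ruby
require 'json'

# Hypothetical request body: the question's payload, with quoted keys.
body = '[{"title": "A greeting!", "message": "Hello from the Chairman of the Board"}]'

data = JSON.parse(body)   # => Array containing one Hash with String keys
note_attrs = data.first

title     = note_attrs['title']
message   = note_attrs['message']
timestamp = note_attrs['timestamp'] || Time.now # payload has no timestamp

puts title
puts message
```

Note that the parsed keys are strings, so `note_attrs[:title]` would return nil unless `symbolize_names: true` is passed to `JSON.parse`.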

regex to remove the webpage part of a url in ruby

I am trying to remove the webpage part of the URL
For example,
www.example.com/home/index.html
to
www.example.com/home
any help appreciated.
Thanks
It's probably a good idea not to use regular expressions when possible. You may summon Cthulhu. Try using the URI library that's part of the standard library instead.
require "uri"
result = URI.parse("http://www.example.com/home/index.html")
result.host # => www.example.com
result.path # => "/home/index.html"
# The following line is rather unorthodox - is there a better solution?
File.dirname(result.path) # => "/home"
result.host + File.dirname(result.path) # => "www.example.com/home"
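If the scheme should be kept, one option is to overwrite the path on the parsed URI and turn it back into a string. This is a sketch that assumes the input always includes a scheme such as `http://`; without one, `URI.parse` does not fill in a host.

```ruby
require 'uri'

uri = URI.parse('http://www.example.com/home/index.html')

# Replace the path with its directory part, leaving scheme and host alone.
uri.path = File.dirname(uri.path)
trimmed = uri.to_s

puts trimmed
```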
If your heart is set on using regex and you know that your URLs will be pretty straightforward, you could use (.*)/.* to capture everything before the last / in your URL.
irb(main):007:0> url = "www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):008:0> regex = "(.*)/.*"
=> "(.*)/.*"
irb(main):009:0> url =~ /#{regex}/
=> 0
irb(main):010:0> $1
=> "www.example.com/home"
irb(main):001:0> url="www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):002:0> url.split("/")[0..-2].join("/")
=> "www.example.com/home"
