Ruby JSON extractor failing, possibly due to overly large JSON - ruby

I was in the process of creating a script to extract all of the comments from a Reddit Thread as a JSON:
require "rubygems"
require "json"
require "net/http"
require "uri"
require 'open-uri'
require 'neatjson'
#The URL.
url = ("https://www.reddit.com/r/AskReddit/comments/46n0zc.json")
#Sets up the JSON reader.
result = JSON.parse(open(url).read)
children = result["data"]["children"]
#Prints the jsons.
children.each do |child|
puts "Author: " + child["data"]["author"]
puts "Body: " + child["data"]["body"]
puts "ID: " + child["data"]["id"]
puts "Upvotes: " + child["data"]["ups"].to_s
puts ""
end
And for some reason it gives me an error. However, the error is not in the actual JSON printer, but in the reader:
005----extractallredditpostcomments.rb:17:in `[]': no implicit conversion of String into Integer (TypeError)
from 005----extractallredditpostcomments.rb:17:in `<main>'
For some reason,
children = result["data"]["children"]
isn't working, which is strange because it worked fine yesterday.
What I'm wondering is: could this be caused by the size of the JSON? If you actually go to the link (https://www.reddit.com/r/AskReddit/comments/46n0zc.json) you can see that the file is huge. I had so much trouble finding the keys I need in a page that size that it took me hours, and I'm still not sure I have the correct ones, which could be causing the error as well. I'm not sure what's failing here.
Oh, and one last thing: I tried simplifying the program by removing the printer:
#Sets up the JSON reader.
result = JSON.parse(open(url).read)
children = result["data"]["children"]
puts children
#Prints the jsons.
#children.each do |child|
# puts "Author: " + child["data"]["author"]
# puts "Body: " + child["data"]["body"]
# puts "ID: " + child["data"]["id"]
# puts "Upvotes: " + child["data"]["ups"].to_s
# puts ""
#end
And it still fails:
005----extractallredditpostcomments.rb:13:in `[]': no implicit conversion of String into Integer (TypeError)
from 005----extractallredditpostcomments.rb:13:in `<main>'

A quick look at the returned JSON value shows that it is a JSON array of two JSON objects and not a JSON object. It looks somewhat like this:
[
{
"data": {
"after": null,
"before": null,
"children": [
{
"data": {
"approved_by": null,
"archived": false,
...
},
"kind": "Listing"
},
{
"data": {
"after": null,
"before": null,
"children": [
{
"data": {
"approved_by": null,
"archived": false,
"author": "finkledinkle7",
"author_flair_css_class": null,
"author_flair_text": null,
"banned_by": null,
"body": "My mother was really sick in 2008. I was turning 25 with a younger brother and sister.\n\nLost both of my grandparents on mom's side to cancer a few years prior. Mom had to watch as her parents slowly passed away. It destroyed her not having her mother around as t ...
}
]
This means that the line children = result["data"]["children"] in your program won't work, because it treats result as a JSON object (a Ruby Hash) when it is actually an Array; indexing an Array with a String is what raises the TypeError. It looks like you should use children = result[1]["data"]["children"].
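The indexing described above can be tried without hitting the network. The sketch below is only an illustration: the payload is a trimmed, made-up stand-in for the Reddit response (the "id" and "ups" values and the "kind" strings are assumptions), kept just large enough to show that the top level is an Array of two listings and that the comments live in the second one.

```ruby
require "json"

# Made-up, trimmed stand-in for the real response: an Array of two Listings.
raw = <<~PAYLOAD
  [
    {"kind": "Listing", "data": {"children": [{"kind": "t3", "data": {"title": "the post"}}]}},
    {"kind": "Listing", "data": {"children": [
      {"kind": "t1", "data": {"author": "finkledinkle7", "body": "My mother was really sick in 2008.", "id": "abc123", "ups": 10}}
    ]}}
  ]
PAYLOAD

result = JSON.parse(raw)

# The top level is an Array, so result["data"] would raise the TypeError
# from the question. The second element holds the comment listing.
comments = result[1]["data"]["children"]

comments.each do |child|
  data = child["data"]
  puts "Author: #{data["author"]}"
  puts "Body: #{data["body"]}"
  puts "ID: #{data["id"]}"
  puts "Upvotes: #{data["ups"]}"
  puts ""
end
```

Checking result.class before indexing is a cheap way to catch this kind of mismatch early.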

Related

Why is my ruby class behaving unexpectedly?

I have the following ruby class:
class Scheme
attr_reader :id, :uri, :labels, :in_schemes, :top_concepts
def initialize(id, uri, labels, in_schemes)
@id = id,
@uri = uri,
@labels = labels
@in_schemes = in_schemes
@top_concepts = Array.new
end
end
And I have the following method that traverses the given directory looking for files (they have names like "01", "01.01.00", etc.) containing a series of language-specific category labels (Sample below):
def make_schemes(catdir)
concept_schemes = Array.new
Dir.foreach(catdir) do |file|
unless file == "." || file == ".."
id = file.gsub(/\./,"").to_s
uri = "/unbist/scheme/#{id}"
labels = Array.new
in_schemes = Array.new
File.read("#{catdir}/#{file}").split(/\n/).each do |line|
label = JSON.parse(line)
labels << label
end
if id.size > 2
in_schemes = ["/unbist","/unbist/#{id[0..1]}"]
else
in_schemes = ["/unbist"]
end
p "Making new concept scheme with id: #{id}"
concept_scheme = Scheme.new(id, uri, labels, in_schemes)
p concept_scheme
concept_schemes << concept_scheme
end
end
return concept_schemes
end
Sample category file, named "#{dir}/01". Each line is proper JSON, but the whole file, for reasons beyond the scope of this question, is not.
{ "text": "ﻢﺳﺎﺌﻟ ﻕﺎﻧﻮﻨﻳﺓ ﻮﺴﻳﺎﺴﻳﺓ", "language": "ar" }
{ "text": "政治和法律问题", "language": "zh" }
{ "text": "political and legal questions", "language": "en" }
{ "text": "questions politiques et juridiques", "language": "fr" }
{ "text": "ПОЛИТИЧЕСКИЕ И ЮРИДИЧЕСКИЕ ВОПРОСЫ", "language": "ru" }
{ "text": "cuestiones politicas y juridicas", "language": "es" }
The output I am getting is strange. The id variable in the make_schemes method is set properly prior to constructing the new Scheme, but the Scheme initializer seems to get confused somewhere and assigns the entire set of variables to the object's id variable. Here is some output for the above sample (cleaned up, with newlines added for readability):
"Making new concept scheme with id: 01"
#<Scheme:0xa00ffc8
@uri="/scheme/01",
@labels=[{"text"=>"مسائل قانونية وسياسية", "language"=>"ar"}, {"text"=>"政治和法律问题", "language"=>"zh"}, {"text"=>"political and legal questions", "language"=>"en"}, {"text"=>"questions politiques et juridiques", "language"=>"fr"}, {"text"=>"ПОЛИТИЧЕСКИЕ И ЮРИДИЧЕСКИЕ ВОПРОСЫ", "language"=>"ru"}, {"text"=>"cuestiones politicas y juridicas", "language"=>"es"}],
@id=["01", "/scheme/01", [{"text"=>"مسائل قانونية وسياسية", "language"=>"ar"}, {"text"=>"政治和法律问题", "language"=>"zh"}, {"text"=>"political and legal questions", "language"=>"en"}, {"text"=>"questions politiques et juridiques", "language"=>"fr"}, {"text"=>"ПОЛИТИЧЕСКИЕ И ЮРИДИЧЕСКИЕ ВОПРОСЫ", "language"=>"ru"}, {"text"=>"cuestiones politicas y juridicas", "language"=>"es"}]],
@in_schemes=["/"],
@top_concepts=[]>
What am I missing here? What is causing this? I have a constructor for a different class that works fine with similar logic. I'm baffled. Maybe there's an approach that would work better?
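Try removing the trailing commas. Change:
@id = id,
@uri = uri,
to:
@id = id
@uri = uri
As is, you're telling Ruby:
@id = id, @uri = uri, @labels = labels
Which, as I read it, means you're assigning the array [id, uri, labels] to @id. The nested assignments still set @uri and @labels on the way, which matches the output you posted.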
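To see the effect of the trailing commas in isolation, here is a small sketch. BuggyScheme and FixedScheme are hypothetical names used only for this comparison, and the classes are cut down to three attributes.

```ruby
# Hypothetical cut-down classes, for comparison only.
class BuggyScheme
  attr_reader :id, :uri, :labels

  def initialize(id, uri, labels)
    # The trailing commas join these three lines into a single statement:
    #   @id = id, @uri = uri, @labels = labels
    @id = id,
    @uri = uri,
    @labels = labels
  end
end

class FixedScheme
  attr_reader :id, :uri, :labels

  def initialize(id, uri, labels)
    @id = id
    @uri = uri
    @labels = labels
  end
end

buggy = BuggyScheme.new("01", "/scheme/01", ["label"])
fixed = FixedScheme.new("01", "/scheme/01", ["label"])

p buggy.id   # => ["01", "/scheme/01", ["label"]]
p buggy.uri  # => "/scheme/01"
p fixed.id   # => "01"
```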

For loop inside <<-eos Ruby

I'm a rookie in the Ruby language. I'm trying to write a JSON file with Ruby so that I can import it into a MongoDB collection afterwards. I need the document to keep proper indentation so that I can fill it in comfortably later.
At the moment I'm doing it this way, but I'm sure it isn't the recommended way:
out_file = File.new('file.json', "w+")
str = <<-eos
{
"key1": #{@value1},
"key2" : #{@value2},
"key3" : {
"subkey_3_1" : {
"key" : #{@value},
"questions" : #{@invalid_questions}
},
"subkey_3_2" : {
"key" : #{value},
"array_key" : [
for i in 1..50
# Here, must be create 50 hash pair-value like this.
{},
{},
{},
...
end
]
}
}
}
eos
out_file.puts(str)
out_file.close
This is the final structure that I want. Thanks, and sorry for not explaining it right from the start.
How can I define it in Ruby?
str = <<-eos
"key" : [
#{(1..50).map { |i|
...something content...
}.join(",\n")}
]
eos
However, why do you want a string here? I don't know what you are trying to do, but there must be a better way of doing it.
UPDATE:
Yep, as mentioned by @ArupRakshit, you need to create the hash first and call to_json on it. If you don't have this method, require 'json' from the standard library; alternatively, install the activesupport gem and require 'active_support/core_ext' (no need to do this in a Rails app). Do not build the JSON manually.
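A minimal sketch of the hash-first approach, using only the standard json library. The values, the "index" entries, and the temp-directory output path are all placeholders invented for this illustration; JSON.pretty_generate is used instead of to_json because the question asks for an indented file.

```ruby
require "json"
require "tmpdir"

# Placeholder values standing in for @value1, @value2, @value and
# @invalid_questions from the question (all made up).
value1 = 1
value2 = "two"
value = "some value"
invalid_questions = [3, 7]

document = {
  "key1" => value1,
  "key2" => value2,
  "key3" => {
    "subkey_3_1" => { "key" => value, "questions" => invalid_questions },
    "subkey_3_2" => {
      "key" => value,
      # Build the 50 entries in plain Ruby; the "index" field is hypothetical.
      "array_key" => (1..50).map { |i| { "index" => i } }
    }
  }
}

# pretty_generate handles indentation, quoting and commas.
json = JSON.pretty_generate(document)

# Written to the temp directory here; adjust the path as needed.
path = File.join(Dir.tmpdir, "file.json")
File.write(path, json)
```

Because the structure is ordinary Ruby data until the last step, the loop that fills array_key is a normal map call rather than something embedded in a heredoc.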

Objectify Ruby Hashes from/to JSON API

I just released a ruby gem to use some JSON over HTTP API:
https://github.com/solyaris/blomming_api
My naive Ruby code just converts complex/nested JSON data structures returned by API endpoints (json_data) to Ruby hashes (hash_data), in a flat one-to-one translation (JSON to Ruby hash and vice versa). That's fine, but...
I would like a more "high-level" programming interface.
Maybe instantiating a Resource class for every endpoint, but I'm unsure what a smart implementation would look like.
Let me explain with some abstract code.
Let's say I have a complex/nested JSON received from an API,
usually an Array of Hashes, recursively nested as below (an imaginary example):
json_data = '[{
"commute": {
"minutes": 0,
"startTime": "Wed May 06 22:14:12 EDT 2014",
"locations": [
{
"latitude": "40.4220061",
"longitude": "40.4220061"
},
{
"latitude": "40.4989909",
"longitude": "40.48989805"
},
{
"latitude": "40.4111169",
"longitude": "40.42222869"
}
]
}
},
{
"commute": {
"minutes": 2,
"startTime": "Wed May 28 20:14:12 EDT 2014",
"locations": [
{
"latitude": "43.4220063",
"longitude": "43.4220063"
}
]
}
}]'
At the moment, what I do when I receive similar JSON from an API is just:
# from JSON to hash
hash_data = JSON.load json_data
# and to assign values:
coords = hash_data.first["commute"]["locations"].last
coords["longitude"] = "40.00" # was "40.42222869"
coords["latitude"] = "41.00" # was "40.4111169"
That's OK, but the syntax is awful and confusing.
Instead, I would probably enjoy something like:
# create object Resource from hash
res = Resource.create( hash_data )
# ... some processing
# assign "nested" variables: longitude, latitude of object: res
coords = res.first.commute.locations.last
coords.longitude = "40.00" # was "40.42222869"
coords.latitude = "41.00" # was "40.4111169"
# ... some processing
# convert the modified object res into a hash again:
modified_hash = res.save
# and finally convert it back to JSON:
modified_json = JSON.dump modified_hash
I read some interesting posts:
http://pullmonkey.com/2008/01/06/convert-a-ruby-hash-into-a-class-object/
http://www.goodercode.com/wp/convert-your-hash-keys-to-object-properties-in-ruby/
and, copying Kerry Wilson's code, I sketched the implementation below:
class Resource
def self.create (hash)
new ( hash)
end
def initialize ( hash)
hash.to_obj
end
def save
# or to_hash()
# todo! HELP! (see later)
end
end
class ::Hash
# add keys to hash
def to_obj
self.each do |k,v|
v.to_obj if v.kind_of? Hash
v.to_obj if v.kind_of? Array
k=k.gsub(/\.|\s|-|\/|\'/, '_').downcase.to_sym
## create and initialize an instance variable for this key/value pair
self.instance_variable_set("@#{k}", v)
## create the getter that returns the instance variable
self.class.send(:define_method, k, proc{self.instance_variable_get("@#{k}")})
## create the setter that sets the instance variable
self.class.send(:define_method, "#{k}=", proc{|v| self.instance_variable_set("@#{k}", v)})
end
return self
end
end
class ::Array
def to_obj
self.map { |v| v.to_obj }
end
end
#------------------------------------------------------------
BTW, I studied the ActiveResource project a bit (it was part of Rails, if I understood correctly).
ARes could be great for my purposes, but the problem is that ARes presumes a fully REST API a bit too strictly...
In my case the server APIs are not completely RESTful in the way ARes would expect...
All in all, I would have to do a lot of work to subclass or modify ARes behaviours,
and for the moment I have discarded the idea of using ActiveResource.
QUESTIONS:
1. Could someone help me implement the save() method in the above code (I'm really bad with recursive methods... :-( )?
2. Does a gem exist that does the hash_to_object() and object_to_hash() translation sketched above?
3. What do you think about this "automatic" objectifying of an "arbitrary" hash coming from a JSON-over-HTTP API?
I mean: I see the great advantage that I do not need to hard-wire data structures on the client side, which keeps me flexible against possible server-side variations.
But on the other hand, this automatic objectifying has the possible downside of opening up security issues... like malicious JSON injection (possibly over an untrusted network...)
What do you think about all this? Any suggestion is welcome!
Sorry for my long post and my Ruby metaprogramming hazards :-)
giorgio
UPDATE 2: I'm still interested in reading opinions about question point 3:
Pros/cons of creating a Resource class for every received JSON
Pros/cons of creating static (predefined attributes) vs. automatic/dynamic nested objects
UPDATE 1: long reply to Simone:
Thanks, you are right, Mash has a sweet .to_hash() method:
require 'json'
require 'hashie'
json_data = '{
"commute": {
"minutes": 0,
"startTime": "Wed May 06 22:14:12 EDT 2014",
"locations": [
{
"latitude": "40.4220061",
"longitude": "40.4220061"
},
{
"latitude": "40.4989909",
"longitude": "40.48989805"
},
{
"latitude": "40.4111169",
"longitude": "40.42222869"
}
]
}
}'
# convert to a hash
hash = JSON.load json_data
puts hash
res = Hashie::Mash.new hash
# assign "nested" variables: longitude, latitude of object: res
coords = res.commute.locations.last
coords.longitude = "40.00" # was "40.42222869"
coords.latitude = "41.00" # was "40.4111169"
puts; puts "longitude: #{res.commute.locations.last.longitude}"
puts "latitude: #{res.commute.locations.last.latitude}"
modified_hash = res.to_hash
puts; puts modified_hash
This feature is provided by a few gems. One of the best known is Hashie, specifically the class Hashie::Mash.
Mash is an extended Hash that gives simple pseudo-object functionality that can be built from hashes and easily extended. It is designed to be used in RESTful API libraries to provide easy object-like access to JSON and XML parsed hashes.
Mash also supports multi-level objects.
Depending on your needs and level of nesting, you may get away with an OpenStruct.
I was working with a simple test stub. Hashie would have worked well, but it was a bigger tool than I needed (and an added dependency).
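For the OpenStruct route mentioned above, a rough sketch could look like the following. It assumes the json library's object_class option is acceptable for your data, and to_plain is a hypothetical helper written here to play the role of the save/to_hash step, since OpenStruct#to_h alone does not descend into nested values.

```ruby
require "json"
require "ostruct"

json_data = '{
  "commute": {
    "minutes": 0,
    "locations": [
      { "latitude": "40.4220061", "longitude": "40.4220061" },
      { "latitude": "40.4111169", "longitude": "40.42222869" }
    ]
  }
}'

# Ask the parser to build OpenStructs instead of Hashes, at every level.
res = JSON.parse(json_data, object_class: OpenStruct)

coords = res.commute.locations.last
coords.longitude = "40.00"
coords.latitude = "41.00"

# Hypothetical helper: recursively turn OpenStructs back into plain Hashes
# with string keys, so the result can be dumped as JSON again.
def to_plain(value)
  case value
  when OpenStruct
    value.to_h.each_with_object({}) { |(k, v), h| h[k.to_s] = to_plain(v) }
  when Array
    value.map { |v| to_plain(v) }
  else
    value
  end
end

modified_hash = to_plain(res)
modified_json = JSON.generate(modified_hash)
puts modified_json
```

This avoids an extra gem, at the cost of writing the reverse conversion yourself; Hashie::Mash already ships with to_hash.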

Trouble Getting Value From API Hash - Ruby

I am making an API call and receiving the following response:
{
"id": "http://www.google.com/",
"shares": 8262403,
"comments": 827
}
When I do:
api_call["shares"]
It just returns
shares
...and I want the value of "shares" so I have the share count. What am I missing?
You need to parse the response as JSON:
require 'json'
str = '{
"id": "http://www.google.com/",
"shares": 8262403,
"comments": 827
}'
JSON.parse(str)['shares'] # => 8262403
When I do api_call["shares"], it just returns shares.
This is because your response arrives as a String, so you are calling the String#[] method on it. The docs for str[match_str] → new_str or nil say: if a match_str is given, that string is returned if it occurs in the string.
str['shares'] # => "shares"
This is the documented behaviour: your response string contains the substring shares, so str['shares'] returns that substring.
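If it is unclear whether the HTTP client hands back a raw String or an already-parsed Hash, a small guard makes the difference visible. This is only a sketch: api_call is assumed to hold the response body exactly as shown in the question.

```ruby
require "json"

# Assumed: the raw response body, as a String.
api_call = '{"id": "http://www.google.com/", "shares": 8262403, "comments": 827}'

# On a String, [] with a String argument is a substring lookup.
substring = api_call["shares"]  # => "shares"
missing = api_call["likes"]     # => nil, no such substring

# Parse only when the value is still a String.
parsed = api_call.is_a?(String) ? JSON.parse(api_call) : api_call

shares = parsed["shares"]       # => 8262403
puts shares
```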

How to dump strings in YAML using literal scalar style?

I have a big string of formatted data (e.g. JSON) that I want to dump to YAML using Psych in ruby while preserving formatting.
Basically, I want for JSON to appear in YAML using literal style:
---
json: |
{
"page": 1,
"results": [
"item", "another"
],
"total_pages": 0
}
However, when I use YAML.dump it doesn't use literal style. I get something like this:
---
json: ! "{\n \"page\": 1,\n \"results\": [\n \"item\", \"another\"\n ],\n \"total_pages\":
0\n}\n"
How can I tell Psych to dump scalars in the style I want?
Solution:
Big thanks to Aaron Patterson for his solution that I'm expanding on here: https://gist.github.com/2023978
Although a bit verbose, that gist is a working way of tagging certain strings in ruby to be output using literal style in YAML.
require 'psych'
# Construct an AST (recent Psych versions build the visitor with YAMLTree.create)
visitor = Psych::Visitors::YAMLTree.create
visitor << DATA.read
ast = visitor.tree
# Find all scalars and modify their formatting
ast.grep(Psych::Nodes::Scalar).each do |node|
node.plain = false
node.quoted = true
node.style = Psych::Nodes::Scalar::LITERAL
end
begin
# Call the `yaml` method on the ast to convert to yaml
puts ast.yaml
rescue
# The `yaml` method was introduced in later versions, so fall back to
# constructing a visitor
Psych::Visitors::Emitter.new($stdout).accept ast
end
__END__
{
"page": 1,
"results": [
"item", "another"
],
"total_pages": 0
}
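The script above dumps the JSON as a top-level scalar. To get the json: | layout from the question, the same technique can be applied to a hash while restyling only the multi-line scalars. The sketch below assumes a Psych version that provides YAMLTree.create; the payload is an illustrative value built with JSON.pretty_generate.

```ruby
require "json"
require "psych"

# Illustrative payload: a multi-line string with no trailing spaces.
payload = JSON.pretty_generate({ "page" => 1, "results" => ["item", "another"], "total_pages" => 0 }) + "\n"
data = { "json" => payload }

# Build the YAML AST for the whole hash.
visitor = Psych::Visitors::YAMLTree.create
visitor << data
ast = visitor.tree

# Restyle only scalars that contain a newline, so keys stay plain.
ast.grep(Psych::Nodes::Scalar).each do |node|
  next unless node.value.include?("\n")
  node.plain = false
  node.quoted = true
  node.style = Psych::Nodes::Scalar::LITERAL
end

yaml = ast.yaml
puts yaml
```

Lines with trailing whitespace can make the emitter fall back to a quoted style, so it may help to strip trailing spaces from the string first.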
