JSON to CSV via FasterCSV - ruby

I'm new to Ruby and had a question. I'm trying to create a .rb file that converts JSON to CSV.
I came across some disparate sources that got me to make:
require "rubygems"
require 'fastercsv'
require 'json'
csv_string = FasterCSV.generate({}) do |csv|
  JSON.parse(File.open("small.json").read).each do |hash|
    csv << hash
  end
end
puts csv_string
Now, it does in fact output text but they are all squashed together without spaces, commas etc. How do I make it more customised, clear for a CSV file so I can export that file?
The JSON would look like:
{
"results": [
{
"reportingId": "s",
"listingType": "Business",
"hasExposureProducts": false,
"name": "Medeco Medical Centre World Square",
"primaryAddress": {
"geoCodeGranularity": "PROPERTY",
"addressLine": "Shop 9.01 World Sq Shopng Cntr 644 George St",
"longitude": "151.206172",
"suburb": "Sydney",
"state": "NSW",
"postcode": "2000",
"latitude": "-33.876416",
"type": "VANITY"
},
"primaryContacts": [
{
"type": "PHONE",
"value": "(02) 9264 8500"
}
]
},xxx
}
The CSV to just have something like:
reportingId, s, listingType, Business, name, Medeco Medical...., addressLine, xxxxx, longitude, xxxx, latitude, xxxx, state, NSW, postcode, 2000, type, phone, value, (02) 92648544

Since your JSON structure is a mix of hashes and lists, nested to different depths, the conversion is not as trivial as the code you show. However (assuming your input files always look the same), it shouldn't be hard to write an appropriate converter. At the lowest level, you can transform a hash into a CSV row with
hash.to_a.flatten
E.g.
require 'json'
require 'fastercsv'
input = JSON.parse(File.read("small.json"))
# The block form closes out.csv automatically when it finishes
FasterCSV.open("out.csv", "w") do |writer|
  writer << input["results"][0]["primaryAddress"].to_a.flatten
end
will give you
type,VANITY,latitude,-33.876416,postcode,2000,state,NSW,suburb,Sydney,longitude,151.206172,addressLine,Shop 9.01 World Sq Shopng Cntr 644 George St,geoCodeGranularity,PROPERTY
Hope that points you in the right direction.
Btw, your JSON looks invalid. You should change the },xxx line to }].
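A minimal sketch of such a converter, assuming Ruby 1.9 or later, where FasterCSV became the standard csv library. The flatten_pairs helper is a hypothetical name (not part of any library), and the sample is a trimmed copy of the listing in the question:

```ruby
require 'json'
require 'csv'

# Recursively flatten nested hashes and arrays into one flat
# [key, value, key, value, ...] list. Keys that only hold a nested
# hash or array (e.g. "primaryAddress") are dropped; their children are kept.
def flatten_pairs(node)
  case node
  when Hash
    node.flat_map do |key, value|
      value.is_a?(Hash) || value.is_a?(Array) ? flatten_pairs(value) : [key, value]
    end
  when Array
    node.flat_map { |item| flatten_pairs(item) }
  else
    [node]
  end
end

# Trimmed sample input (assumption: the real file has the same shape)
json = <<~DATA
  {"results": [{"reportingId": "s", "listingType": "Business",
    "primaryAddress": {"state": "NSW", "postcode": "2000"},
    "primaryContacts": [{"type": "PHONE", "value": "(02) 9264 8500"}]}]}
DATA

# One CSV row per entry in "results"
rows = JSON.parse(json)["results"].map { |listing| flatten_pairs(listing) }
csv_string = CSV.generate { |csv| rows.each { |row| csv << row } }
puts csv_string
```

To write a file instead of a string, CSV.open("out.csv", "w") takes the same kind of block.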


Looping through URL array to parse html, does not loop

I am trying to extract product descriptions. The first loop runs through each product, and a nested loop visits each product page and grabs the description.
for page in range(1, 2):
guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
soup = BeautifulSoup(guitarPage, 'lxml')
guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
this is the loop for each product
for guitar in guitars:
title_text = guitar.h3.text.strip()
print('Guitar Name: ', title_text)
price = guitar.find(class_='price bold small').text.strip()
print('Guitar Price: ', price)
priceSave = guitar.find('span', {'class': 'price save'})
if priceSave is not None:
priceOf = priceSave.text
print(priceOf)
else:
print("No discount!")
image = guitar.img.get('src')
print('Guitar Image: ', image)
productLink = guitar.find('a').get('href')
linkProd = url + productLink
print('Link of product', linkProd)
here i am adding the links collected to an array
productsPage.append(linkProd)
here is my attempt at entering each product page and extracting the description
for products in productsPage:
response = requests.get(products)
soup = BeautifulSoup(response.content, "lxml")
productsDetails = soup.find("div", {"class":"description-preview"})
if productsDetails is not None:
description = productsDetails.text
# print('product detail: ', description)
else:
print('none')
time.sleep(0.2)
if None not in(title_text,price,image,linkProd, description):
products = {
'title': title_text,
'price': price,
'discount': priceOf,
'image': image,
'link': linkProd,
'description': description,
}
result.append(products)
with open('datas.json', 'w') as outfile:
json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
# print(result)
print('--------------------------')
time.sleep(0.5)
The outcome should be
{
"title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
"price": "£399.00",
"discount": null,
"image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
"description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
},
but the description works for the first one and does not change later on.
[
{
"title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha APX600FM Flame Maple Amber",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
"price": "£399.00",
"discount": "Save £267.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
}
]
This is the result I am getting. It changes all the time; sometimes it shows the previous product's description.
It does loop, but there seem to be protective measures in place on the server side, and which pages fail changes from run to run. I checked the pages that failed, and they did contain the searched-for content. No single measure was sufficient in my testing (I didn't try sleeps over 2 seconds, but I did try some IP and User-Agent changes with sleeps <= 2).
You could try alternating IPs and User-Agents, retrying with back-off, and varying the time between requests.
Changing proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Changing User-Agent: https://pypi.org/project/fake-useragent/

Highlight part of code block

I have a very large code block in my .rst file, which I would like to highlight just a small portion of and make it bold. Consider the following rst:
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
**Example 1: Explain showing a table scan operation**::
EXPLAIN FORMAT=JSON
SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "53.80" # This query costs 53.80 cost units
},
"table": {
"table_name": "Country",
"access_type": "ALL", # ALL is a table scan
"rows_examined_per_scan": 239, # Accessing all 239 rows in the table
"rows_produced_per_join": 11,
"filtered": "4.76",
"cost_info": {
"read_cost": "51.52",
"eval_cost": "2.28",
"prefix_cost": "53.80",
"data_read_per_join": "2K"
},
"used_columns": [
"Code",
"Name",
"Continent",
"Region",
"SurfaceArea",
"IndepYear",
"Population",
"LifeExpectancy",
"GNP",
"GNPOld",
"LocalName",
"GovernmentForm",
"HeadOfState",
"Capital",
"Code2"
],
"attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
}
}
}
When it converts to html, it syntax highlights by default (good), but I also want to specify a few lines that should be bold (the ones with comments on them, but possibly others too.)
I was thinking of adding a trailing character sequence on the line (e.g. ###) and then writing a post-parser script to modify the generated HTML files. Is there a better way?
In Sphinx, the code-block directive has an emphasize-lines option. The following emphasizes the three commented lines in your code (lines 7, 11 and 12 of the block).
**Example 1: Explain showing a table scan operation**

.. code-block:: python
   :emphasize-lines: 7, 11, 12

   EXPLAIN FORMAT=JSON
   SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
   {
     "query_block": {
       "select_id": 1,
       "cost_info": {
         "query_cost": "53.80" # This query costs 53.80 cost units
       },
       "table": {
         "table_name": "Country",
         "access_type": "ALL", # ALL is a table scan
         "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
         "rows_produced_per_join": 11,
         "filtered": "4.76",
         "cost_info": {
           "read_cost": "51.52",
           "eval_cost": "2.28",
           "prefix_cost": "53.80",
           "data_read_per_join": "2K"
         },
         "used_columns": [
           "Code",
           "Name",
           "Continent",
           "Region",
           "SurfaceArea",
           "IndepYear",
           "Population",
           "LifeExpectancy",
           "GNP",
           "GNPOld",
           "LocalName",
           "GovernmentForm",
           "HeadOfState",
           "Capital",
           "Code2"
         ],
         "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
       }
     }
   }

Ruby, Parsing a JSON response for an array of values

I'm new to Ruby, so please excuse any ignorance. I was wondering how to parse a JSON response for every value belonging to a specific key. The response is in the format:
[
{
"id": 10008,
"name": "vpop-fms-inventory-ws-client",
"msr": [
{
"key": "blocker_violations",
"val": 0,
"frmt_val": "0"
},
]
},
{
"id": 10422,
"name": "websample Maven Webapp",
"msr": [
{
"key": "blocker_violations",
"val": 0,
"frmt_val": "0"
}...
There are other entries in the response, but to avoid a huge block of code I've shortened it. The code I've written is:
require 'uri'
require 'net/http'
require 'JSON'
url = URI({my url})
http = Net::HTTP.new(url.host, url.port)
request = Net::HTTP::Get.new(url)
request["cache-control"] = 'no-cache'
request["postman-token"] = '69430784-307c-ea1f-a488-a96cdc39e504'
response = http.request(request)
parsed = response.read_body
h = JSON.parse(parsed)
num = h["msr"].find {|h1| h1['key']=='blocker_violations'}['val']
I am essentially looking for the val for each blocker violation (the JSON response contains hundreds of entries, so I'm expecting hundreds of blocker values). I had hoped num would contain an array of all the vals. Any insight would be a great help!
EDIT! I'm getting a console output of
scheduler caught exception:
no implicit conversion of String into Integer
C:/dashing/test_board/jobs/issue_types.rb:20:in `[]'
C:/dashing/test_board/jobs/issue_types.rb:20:in `block (2 levels) in <top (requi
red)>'
C:/dashing/test_board/jobs/issue_types.rb:20:in `select'
I suspect that might have too much to do with the question, but some help is appreciated!
You need to do 2 things. Firstly, you're being returned an array and you're only interested in a subset of the elements. This is a common pattern that is solved by a filter, or select in Ruby. Secondly, the condition by which you wish to select these elements also depends on the values of another array, which you need to filter using a different technique. You could attempt it like this:
res = [
{
"id": 10008,
"name": "vpop-fms-inventory-ws-client",
"msr": [
{
"key": "blocker_violations",
"val": 123,
"frmt_val": "0"
}
]
},
{
"id": 10008,
"name": "vpop-fms-inventory-ws-client",
"msr": [
{
"key": "safe",
"val": 0,
"frmt_val": "0"
}
]
}
]
# define a lambda function that we will use later on to filter out the blocker violations
violation = -> (h) { h[:key] == 'blocker_violations' }
# Select only those objects who contain any msr with a key of blocker_violations
violations = res.select {|h1| h1[:msr].any? &violation }
# Which msr value should we take? Here I just take the first.
values = violations.map {|v| v[:msr].first[:val] }
The problem you may have with this code is that msr is an array. So theoretically, you could end up with 2 objects in msr, one that is a blocker violation and one that is not. You have to decide how you handle that. In my example, I include it if it has a single blocker violation through the use of any?. However, you may wish to only include them if all msr objects are blocker violations. You can do this via the all? method.
The second problem you then face is, which value to return? If there are multiple blocker violations in the msr object, which value do you choose? I just took the first one - but this might not work for you.
Depending on your requirements, my example might work or you might need to adapt it.
Also, if you've never come across the lambda syntax before, you can read more about it here
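To tie this back to the parsed response: JSON.parse returns an Array of Hashes with string keys, so calling h["msr"] on the top-level array raises "no implicit conversion of String into Integer", and lookups need "msr" and "key" rather than :msr and :key. A minimal sketch, assuming every blocker value is wanted rather than just the first one per project. The body string stands in for response.read_body, and the second project's values are made up for illustration:

```ruby
require 'json'

# Stand-in for response.read_body (assumption: same shape as the real response)
body = <<~DATA
  [
    {"id": 10008, "name": "vpop-fms-inventory-ws-client",
     "msr": [{"key": "blocker_violations", "val": 0, "frmt_val": "0"}]},
    {"id": 10422, "name": "websample Maven Webapp",
     "msr": [{"key": "blocker_violations", "val": 3, "frmt_val": "3"},
             {"key": "safe", "val": 7, "frmt_val": "7"}]}
  ]
DATA

projects = JSON.parse(body)

# For every project, keep only the blocker_violations measures and collect
# their vals into one flat array. A missing "msr" is treated as empty.
blocker_values = projects.flat_map do |project|
  (project["msr"] || [])
    .select { |measure| measure["key"] == "blocker_violations" }
    .map { |measure| measure["val"] }
end

p blocker_values
```

If the symbol-key style from the example above is preferred, JSON.parse(body, symbolize_names: true) returns symbol keys instead.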

Append multiple attribute values inside csv

I have stored the data in the following JSON/XML (sample below). I am looking to store the values of des_facet, org_facet, per_facet and geo_facet in my CSV as arrays. At the moment, my hash map stores these values in separate columns.
hash = article.attributes.select {|k,v| !["author","images","guid","link"].include?(k) }
hash_new = []
hash.values.map do |v|
hash_new.push("\""+v.to_s+"\"")
end
hash_new.map(&:to_s).join(", ")
Sample JSON:
{
"articles": [{
"results": [{
"title": "Ad Blockers and the Nuisance at the Heart of the Modern Web",
"summary": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"source": "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html",
"date": "2015-08-20T00:00:00-5:00",
"section": "Technology",
"item_type": "Article",
"updated_date": "2015-08-19T16:05:01-5:00",
"created_date": "2015-08-19T05:00:06-5:00",
"material_type_facet": "News",
"abstract": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"byline": "By FARHAD MANJOO",
"kicker": "",
"des_facet": ["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"],
"org_facet": ["Adblock Plus"],
"per_facet": "",
"geo_facet": ""
}]
}]
}
I want the respective CSV for the same format. Currently below is what I get.
"Ad Blockers and the Nuisance at the Heart of the Modern Web", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html", "2015-08-20T00:00:00-5:00", "Technology", "Article", "2015-08-19T16:05:01-5:00", "2015-08-19T05:00:06-5:00", "News", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "By FARHAD MANJOO", "", "["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"]", "["Adblock Plus"]", "", ""
I am not sure how to do this, as I am quite new to Ruby. I have thought of using grep to look for values matching /[\]]/.
You should avoid writing CSV by hand; Ruby ships with a CSV class that does all the escaping for you.
require 'csv'
unwanted_attributes = ["author", "images", "guid", "link"]
sanitized_attributes = article.attributes.select { |attribute_name, _|
  !unwanted_attributes.include?(attribute_name)
}
csv_string = CSV.generate do |csv|
  csv << sanitized_attributes.values
end
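For the facet arrays specifically, one option is to join each array into a single string before handing the row to CSV, so that each facet occupies exactly one column. A minimal sketch, assuming a plain Hash in place of article.attributes and "; " as the separator inside a cell (both are illustrative choices, and the "author" value is made up):

```ruby
require 'csv'

# Stand-in for article.attributes (assumption: a Hash of name => value)
attributes = {
  "title" => "Ad Blockers and the Nuisance at the Heart of the Modern Web",
  "section" => "Technology",
  "des_facet" => ["Online Advertising", "Privacy"],
  "org_facet" => ["Adblock Plus"],
  "per_facet" => "",
  "author" => "ignored"
}
unwanted = ["author", "images", "guid", "link"]

# Drop unwanted attributes, then collapse array values into one cell each
row = attributes
  .reject { |name, _| unwanted.include?(name) }
  .values
  .map { |value| value.is_a?(Array) ? value.join("; ") : value }

# CSV takes care of quoting and escaping
csv_string = CSV.generate { |csv| csv << row }
puts csv_string
```

Pick a separator that cannot occur inside the facet values themselves, otherwise the cell cannot be split back into the original array.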

Hash to JSON then joining with another JSON in Ruby

I have a JSON like this:
[
{
"Low": 8.63,
"Volume": 14211900,
"Date": "2012-10-26",
"High": 8.79,
"Close": 8.65,
"Adj Close": 8.65,
"Open": 8.7
},
{
"Low": 8.65,
"Volume": 12167500,
"Date": "2012-10-25",
"High": 8.81,
"Close": 8.73,
"Adj Close": 8.73,
"Open": 8.76
},
{
"Low": 8.68,
"Volume": 20239700,
"Date": "2012-10-24",
"High": 8.92,
"Close": 8.7,
"Adj Close": 8.7,
"Open": 8.85
}
]
And have calculated a simple moving average for each day of the closing prices and called it a variable sma9day. I'd like to join the moving average values with the original JSON, so I get something like this for each day:
{
"Low": 8.68,
"Volume": 20239700,
"Date": "2012-10-24",
"High": 8.92,
"Close": 8.7,
"Adj Close": 8.7,
"Open": 8.85,
"SMA9": 8.92
}
With the sma9day variable I did this:
h = { "SMA9" => sma9day }
sma9json = h.to_json
puts sma9json
which outputs this:
{"SMA9":[8.92,8.93,8.93]}
How do I put it in a format compatible with the JSON and join the two? I'll need to match/join from the top down, as the last 8 records in the JSON will not have 9-day moving average values (in these cases I'd still like the SMA9 key to be there, but with nil or zero as the value).
Thank you.
LATEST UPDATE:
I now have this, which gets me very close, however it returns the entire string in the SMA9 field in the JSON...
require 'json'
require 'simple_statistics'
json = File.read("test.json")
quotes = JSON.parse(json)
# Calculations
def sma9day(quotes, i)
  close = quotes.collect {|quote| quote['Close']}
  sma9day = close.each_cons(9).collect {|close| close.mean}
end
quotes = quotes.each_with_index do |day, i|
  day['SMA9'] = sma9day(quotes, i)
end
p quotes[0]
=> {"Low"=>8.63, "Volume"=>14211900, "Date"=>"2012-10-26", "High"=>8.79, "Close"=>8.65, "Adj Close"=>8.65, "Open"=>8.7, "SMA9"=>[8.922222222222222, 8.93888888888889, 8.934444444444445, 8.94222222222222, 8.934444444444445, 8.937777777777777, 8.95, 8.936666666666667, 8.924444444444443, 8.906666666666666, 8.912222222222221, 8.936666666666666, 8.946666666666665, 8.977777777777778, 8.95111111111111, 8.92, 8.916666666666666]}
When I try to do sma9day.round(2) before the end of the calculations, it gives a method error (presumably because of the array?), and when I did sma9day[0].round(2), it does correctly round, but every record has the same SMA of course.
Any help is appreciated. Thanks
Presumably to do the calculation in ruby, you somehow parsed the json, and got a ruby Hash out of it.
To get this straight, you have an array of sma9day values, and an array of objects, and you want to iterate through them.
To do that, something like this should get you started:
hashes = JSON.parse(json)
# The averages you calculated, in the same top-down order as the records
sma9day_values = [8.92, 8.93, 8.93]
hashes.each_with_index do |hash, index|
  # Records past the end of sma9day_values (the last 8) fall back to 0
  hash["SMA9"] = sma9day_values[index] || 0
end
puts hashes.to_json
Edit:
You really need to try a beginning Ruby tutorial. The problem is that you are calling round(2) on an array. The variable i in the sma9day(quotes, i) method is never used (hint). Maybe try something like sma9day[i].round(2).
Also, the return value of each_with_index is not something you need to assign. Don't do that; just call each_with_index on the array, i.e.
quotes = quotes.each_with_index do |day, i| #bad
quotes.each_with_index do |day, i| #good
I took your input and compiled a solution in this gist. I hope it helps.
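Putting the pieces together, here is a minimal sketch that does not need the statistics gem. It assumes the records are ordered newest first, as in the question, so the i-th window average belongs to the i-th record and the trailing records get nil. WINDOW is set to 3 only to keep the sample short (the question uses 9), and the 2012-10-23 row is invented to give a fourth record:

```ruby
require 'json'

WINDOW = 3 # the question uses 9; 3 keeps this sample short

# Trimmed sample input; the 2012-10-23 row is invented for illustration
json = <<~DATA
  [
    {"Date": "2012-10-26", "Close": 8.65},
    {"Date": "2012-10-25", "Close": 8.73},
    {"Date": "2012-10-24", "Close": 8.70},
    {"Date": "2012-10-23", "Close": 8.80}
  ]
DATA

quotes = JSON.parse(json)
closes = quotes.map { |quote| quote["Close"] }

# Compute every moving average once, outside the loop
averages = closes.each_cons(WINDOW).map do |window|
  (window.sum / window.size).round(2)
end

# Match top down: indexes past the end of averages yield nil
quotes.each_with_index do |day, i|
  day["SMA#{WINDOW}"] = averages[i]
end

puts JSON.pretty_generate(quotes)
```

Records without a full window serialize as null; append || 0 to the assignment if zero is preferred.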
