Identify text-based format

Is the data below in a well-known format, or is this a custom format invented by the generator?
[{
"tmsId": "MV006574730000",
"rootId": "11214341",
"subType": "Feature Film",
"title": "Doctor Strange 3D",
"releaseYear": 2016,
"releaseDate": "2016-11-04",
"titleLang": "en",
"descriptionLang": "en",
"entityType": "Movie",
"genres": ["Action", "Adventure", "Fantasy"],
"longDescription": "Dr. Stephen Strange's (Benedict Cumberbatch) life changes after a car accident robs him of the use of his hands.
When traditional medicine fails him, he looks for healing, and hope,
in a mysterious enclave. He quickly learns that the enclave is at the
front line of a battle against unseen dark forces bent on destroying
reality. Before long, Strange is forced to choose between his life of
fortune and status or leave it all behind to defend the world as the
most powerful sorcerer in existence.",
"shortDescription": "Dr. Stephen Strange discovers the world of magic after meeting the Ancient One.",
"topCast": ["Benedict Cumberbatch", "Chiwetel Ejiofor", "Rachel McAdams"],
"directors": ["Scott Derrickson"],
"officialUrl": "http://marvel.com/doctorstrange",
"ratings": [{
"body": "Motion Picture Association of America",
"code": "PG-13"
}],

This is indeed JSON. The chunk you posted here is not the complete data, though: some closing brackets are missing. If you delete the last comma "," and append "}]", it passes validation.
You can check this yourself at jsonlint.com.
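The same repair can be checked programmatically. A minimal Python sketch (the truncated payload is abbreviated to a few fields here for brevity):

```python
import json

# Truncated payload as received (abbreviated; ends with a dangling comma)
truncated = '[{ "tmsId": "MV006574730000", "entityType": "Movie", "ratings": [{ "body": "MPAA", "code": "PG-13" }],'

# Strip the trailing comma, then close the object and the outer array
repaired = truncated.rstrip().rstrip(',') + '}]'

# json.loads raises JSONDecodeError if the string is still invalid
data = json.loads(repaired)
print(data[0]["entityType"])  # -> Movie
```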


Google Natural Language Syntax Analysis

I'm not sure this is a coding issue/question.
I'm using Google's NLP to analyze the syntax of some sentences and I'm seeing some inconsistencies with Plural vs Singular designation. Perhaps I'm doing something wrong or misunderstanding what I see as an inconsistency.
For example.
The dolphins jump over the wall
The word dolphins is labeled "SINGULAR", but I was expecting "PLURAL". I thought maybe that's because it refers to the group as ONE "school of fish" (although dolphins are mammals).
So I tried Crows
The crows jump over the wall
The crows are jumping over the wall
Both of these return crows as "SINGULAR", which I thought would at least be consistent, since a group of crows is ONE "murder of crows".
OK, fine. Then I tried cows (a group of cows is ONE herd):
The cows jump over the wall
But in this sentence, the word cows is labeled "PLURAL".
I'm no linguistics expert, so that may be the cause of my confusion.
Or is this "inconsistency" due to analyzing the sentence ONLY using the analyzeSyntax API without analyzing its sentiment or the entities?
This is the log for The cows jump over the wall.
{ theSentence: 'The cows jump over the wall.',
theTags: [ 'DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT' ],
theLabels: [ 'DET', 'NSUBJ', 'ROOT', 'PREP', 'DET', 'POBJ', 'P' ],
theNumbers:
[ 'NUMBER_UNKNOWN',
'PLURAL',
'SINGULAR',
'NUMBER_UNKNOWN',
'NUMBER_UNKNOWN',
'SINGULAR',
'NUMBER_UNKNOWN' ]
This is the log for The crows jump over the wall.
{ theSentence: 'The crows jump over the wall.',
theTags: [ 'DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT' ],
theLabels: [ 'DET', 'NSUBJ', 'ROOT', 'PREP', 'DET', 'POBJ', 'P' ],
theNumbers:
[ 'NUMBER_UNKNOWN',
'SINGULAR',
'SINGULAR',
'NUMBER_UNKNOWN',
'NUMBER_UNKNOWN',
'SINGULAR',
'NUMBER_UNKNOWN' ]
Update: I tried https://language.googleapis.com/v1beta2/documents:analyzeSyntax and got the same results.
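For anyone reproducing the logs above, the per-token number attribute can be pulled out of an analyzeSyntax-style response like this. A sketch only: the `response` dict is hard-coded to mirror the REST response shape rather than coming from a live API call, so verify the field names against your client library.

```python
# Hard-coded stand-in for an analyzeSyntax REST response (illustrative only)
response = {
    "tokens": [
        {"text": {"content": "The"},  "partOfSpeech": {"tag": "DET",  "number": "NUMBER_UNKNOWN"}},
        {"text": {"content": "cows"}, "partOfSpeech": {"tag": "NOUN", "number": "PLURAL"}},
        {"text": {"content": "jump"}, "partOfSpeech": {"tag": "VERB", "number": "SINGULAR"}},
    ]
}

# Map each token's text to its grammatical-number label
numbers = {t["text"]["content"]: t["partOfSpeech"]["number"] for t in response["tokens"]}
print(numbers)  # {'The': 'NUMBER_UNKNOWN', 'cows': 'PLURAL', 'jump': 'SINGULAR'}
```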

Looping through URL array to parse html, does not loop

I am trying to extract product descriptions: the first loop runs through each product listing, and a nested loop enters each product page and grabs the description.
for page in range(1, 2):
    guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
    soup = BeautifulSoup(guitarPage, 'lxml')
    guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
this is the loop for each product
    for guitar in guitars:
        title_text = guitar.h3.text.strip()
        print('Guitar Name: ', title_text)
        price = guitar.find(class_='price bold small').text.strip()
        print('Guitar Price: ', price)
        priceSave = guitar.find('span', {'class': 'price save'})
        if priceSave is not None:
            priceOf = priceSave.text
            print(priceOf)
        else:
            print("No discount!")
        image = guitar.img.get('src')
        print('Guitar Image: ', image)
        productLink = guitar.find('a').get('href')
        linkProd = url + productLink
        print('Link of product', linkProd)
here i am adding the links collected to an array
        productsPage.append(linkProd)
here is my attempt at entering each product page and extracting the description
        for products in productsPage:
            response = requests.get(products)
            soup = BeautifulSoup(response.content, "lxml")
            productsDetails = soup.find("div", {"class": "description-preview"})
            if productsDetails is not None:
                description = productsDetails.text
                # print('product detail: ', description)
            else:
                print('none')
            time.sleep(0.2)
            if None not in (title_text, price, image, linkProd, description):
                products = {
                    'title': title_text,
                    'price': price,
                    'discount': priceOf,
                    'image': image,
                    'link': linkProd,
                    'description': description,
                }
                result.append(products)
                with open('datas.json', 'w') as outfile:
                    json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
                # print(result)
            print('--------------------------')
            time.sleep(0.5)
The outcome should be
{
"title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
"price": "£399.00",
"discount": null,
"image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
"description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
},
but the description works for the first one and does not change later on.
[
{
"title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha APX600FM Flame Maple Amber",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
"price": "£399.00",
"discount": "Save £267.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
}
]
This is the result I am getting. It changes every run; sometimes a product shows the previous product's description.
It does loop, but there seem to be protective measures in place server-side, and the pages that fail change between runs. I checked the pages that failed, and they did contain the searched-for content. No single countermeasure sufficed in my testing (I didn't try sleeps over 2 seconds, but I did try some IP and User-Agent changes with sleeps <= 2).
You could try alternating IPs and User-Agents, backing off on retries, and varying the time between requests.
Changing proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Changing User-Agent: https://pypi.org/project/fake-useragent/
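A back-off-and-retry wrapper for those ideas might look like the following. This is a sketch only: the User-Agent strings and the delay schedule are made up for illustration, and it uses the stdlib `urllib` rather than `requests` to stay self-contained.

```python
import random
import time
import urllib.request

# Illustrative User-Agent strings only; substitute real, current ones
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def backoff_delays(retries, base=1.0):
    """Exponential back-off schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def fetch_with_retries(url, retries=3):
    """GET a URL, rotating User-Agents and sleeping longer after each failure."""
    for delay in backoff_delays(retries):
        req = urllib.request.Request(
            url, headers={"User-Agent": random.choice(USER_AGENTS)}
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except OSError:
            time.sleep(delay)  # back off before the next attempt
    return None
```

Rotating proxies would slot into the same loop via a `urllib.request.ProxyHandler` (or the `proxies=` argument if you stay with `requests`).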

How do I get RestClient in ruby to properly format responses with special characters?

I have an array of objects I am sending to a REST API to receive back information on those objects. To do this I am using RestClient with the following lines to send the call and parse the response.
response_raw = RestClient.get "http://#{re_host}:#{re_port}/reachengine/api/inventory/search?rql=fCSAssetNumber=#{fcs_id_num}%20size%20#{size}%20&apiKey=#{api_key}", headers
response_json = Crack::JSON.parse(response_raw)
response_json['results'].each do |result|
For the first 20+ records I perform this action on, everything works fine. Then I start to get a NoMethodError: undefined method `[]' for nil:NilClass
When I run the code step by step in IRB, what I see in the results is very strange
result = response_json['results'][0]
>=> {"name"=>"Publicaciòn_Listin_Diario.png", " id"=>" 294290", " dateCreated"=>" 2015-09-20T20:35:06.000+0000", " dateUpdated"=>" 2015-12-23T19:33:13.000+0000", " systemKeywords"=>" Publicaciòn_Listin_Diario.png Image ", "t humbnailId"=>"4 24725", "m etadata"=>{" sourceFilePath"=>"/ Volumes/ONLINE_DAM/MEDIA/RAW_GRAPHICS/1307001_August_2013_KOS/Publicaciòn_Listin_Diario.png", "pa MdCustAgency_picklist_sortable"=>"nul l", "th umbnailAssetFlag"=>"fal se", "re storeKey"=>"nul l", "ar chiveStatus_picklist_sortable"=>"nul l", "fC SAssetNumber"=>"18 2725", "fC SMetadataSet"=>"Ra w Graphic", "cu stKeywords"=>"Do minican Republic Cycling Team, 1307001 August 2013 KOS Kickoff show", "cu stAssetStatus_picklist_sortable"=>"nul l", "se archableFlag"=>"fal se", "as setType"=>"Im age", "pa MdCustHerbalifeJobNumber"=>"13 07001", "da teCreated"=>"20 15-09-20T20:35:06", "da teLocked"=>"nul l", "uu id"=>"30 9d9bb3-6935-4ab6-a04a-ef7264132bc6", "ve rsionFlag"=>"nul l", "ag ency_picklist_sortable"=>"nul l", "pr oducer_picklist_sortable"=>"nul l", "tr uncatedFlag"=>"fal se", "cu stDescription"=>"*R aw Graphics for 1307001_August_2013_KOS"}, "in ventoryKey"=>"im age"}
Usually, with this response, I can run
result['metadata']['fCSAssetNumber']
However; because of the random spaces, this fails with a "NoMethodError: undefined method `[]' for nil:NilClass" because instead of the string being 'metadata' it is actually 'm etadata'
What's really strange about all of this (and why this looks like a Ruby issue rather than the API's issue) is that the same exact call made via the Postman REST client in Chrome returns this result:
>{
"results": [
{
"name": "Publicaciòn_Listin_Diario.png",
"id": "294290",
"dateCreated": "2015-09-20T20:35:06.000+0000",
"dateUpdated": "2015-12-23T19:33:13.000+0000",
"systemKeywords": "Publicaciòn_Listin_Diario.png Image ",
"thumbnailId": "424725",
"metadata": {
"sourceFilePath": "/Volumes/ONLINE_DAM/MEDIA/RAW_GRAPHICS/1307001_August_2013_KOS/Publicaciòn_Listin_Diario.png",
"paMdCustAgency_picklist_sortable": null,
"thumbnailAssetFlag": false,
"restoreKey": null,
"archiveStatus_picklist_sortable": null,
"fCSAssetNumber": "182725",
"fCSMetadataSet": "Raw Graphic",
"custKeywords": "Dominican Republic Cycling Team, 1307001 August 2013 KOS Kickoff show",
"custAssetStatus_picklist_sortable": null,
"searchableFlag": false,
"assetType": "Image",
"paMdCustHerbalifeJobNumber": "1307001",
"dateCreated": "2015-09-20T20:35:06",
"dateLocked": null,
"uuid": "309d9bb3-6935-4ab6-a04a-ef7264132bc6",
"versionFlag": null,
"agency_picklist_sortable": null,
"producer_picklist_sortable": null,
"truncatedFlag": false,
"custDescription": "*Raw Graphics for 1307001_August_2013_KOS"
},
"inventoryKey": "image"
}
],
"total": "1"
}
As you can see above, when Postman runs the same exact call there is no issue with the response, but when Ruby runs the call there is. Also note that this doesn't happen all of the time.
Below is a sample response from the same exact ruby call that actually worked.
result = response_json['results'][0]
=> {"name"=>"Marco_1er_dia.png", "id"=>"294284", "dateCreated"=>"2015-09-20T20:34:54.000+0000", "dateUpdated"=>"2015-12-23T19:33:10.000+0000", "systemKeywords"=>"Marco_1er_dia.png Image ", "thumbnailId"=>"424716", "metadata"=>{"sourceFilePath"=>"/Volumes/ONLINE_DAM/MEDIA/RAW_GRAPHICS/1307001_August_2013_KOS/Marco_1er_dia.png", "paMdCustAgency_picklist_sortable"=>nil, "collectionMemberships"=>"320 321", "thumbnailAssetFlag"=>false, "restoreKey"=>nil, "fCSMetadataSet"=>"Raw Graphic", "fCSAssetNumber"=>"182722", "archiveStatus_picklist_sortable"=>nil, "custAssetStatus_picklist_sortable"=>nil, "custKeywords"=>"1307001 August 2013 KOS Kickoff show, Dominican Republic Cycling Team", "searchableFlag"=>false, "assetType"=>"Image", "paMdCustHerbalifeJobNumber"=>"1307001", "dateCreated"=>"2015-09-20T20:34:54", "dateLocked"=>nil, "uuid"=>"b5e55c14-b94e-4629-9e2a-61a2dc0876f6", "versionFlag"=>nil, "fCSProductionStatus_picklist_sortable"=>nil, "agency_picklist_sortable"=>nil, "producer_picklist_sortable"=>nil, "truncatedFlag"=>false, "custDescription"=>"*Raw Graphics for 1307001_August_2013_KOS"}, "inventoryKey"=>"image"}
Notice how the response above has no spacing issue? The only glaring difference I can see here is that there is a special character in use in the filename: ò
Is there something specific I need to do for RESTClient to work with this?
Anyone have any idea how this can be fixed?
The issue was related to the 'crack' gem. I've used this for a long time. Apparently, as of Ruby 1.9, the parse method has been available on the standard JSON class. When I switched to using this by changing
response_json = Crack::JSON.parse(response_raw)
to
response_json = JSON.parse(response_raw)
The issue went away.
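For reference, the stdlib call is a drop-in replacement. A minimal sketch, with an inline sample body standing in for the live RestClient response:

```ruby
require 'json'

# Inline sample standing in for response_raw from RestClient (abbreviated)
response_raw = '{"results":[{"name":"Publicaciòn_Listin_Diario.png","metadata":{"fCSAssetNumber":"182725"}}]}'

response_json = JSON.parse(response_raw)
puts response_json['results'][0]['metadata']['fCSAssetNumber']  # => 182725
```

`JSON.parse` handles UTF-8 bodies (such as the "ò" in the filename) without mangling keys.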

Append multiple attribute values inside csv

I have stored the data in the JSON shown below. I am looking to store the values of des_facet, org_facet, per_facet and geo_facet in my CSV as arrays. At the moment my hash map stores these values in separate columns.
hash = article.attributes.select {|k,v| !["author","images","guid","link"].include?(k) }
hash_new = []
hash.values.map do |v|
  hash_new.push("\"" + v.to_s + "\"")
end
hash_new.map(&:to_s).join(", ")
Sample JSON:
{
"articles": [{
"results": [{
"title": "Ad Blockers and the Nuisance at the Heart of the Modern Web",
"summary": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"source": "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html",
"date": "2015-08-20T00:00:00-5:00",
"section": "Technology",
"item_type": "Article",
"updated_date": "2015-08-19T16:05:01-5:00",
"created_date": "2015-08-19T05:00:06-5:00",
"material_type_facet": "News",
"abstract": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"byline": "By FARHAD MANJOO",
"kicker": "",
"des_facet": ["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"],
"org_facet": ["Adblock Plus"],
"per_facet": "",
"geo_facet": ""
}]
}]
}
I want the respective CSV for the same format. Currently below is what I get.
"Ad Blockers and the Nuisance at the Heart of the Modern Web", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html", "2015-08-20T00:00:00-5:00", "Technology", "Article", "2015-08-19T16:05:01-5:00", "2015-08-19T05:00:06-5:00", "News", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "By FARHAD MANJOO", "", "["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"]", "["Adblock Plus"]", "", ""
I am not sure how to do this; I am quite new to Ruby. I have thought of using grep to look for values matching /[\]]/.
You should avoid writing CSV yourself; Ruby ships with a CSV class that does all the escaping for you.
unwanted_attributes = ["author", "images", "guid", "link"]
sanitized_attributes = article.attributes.select { |attribute_name, _|
  !unwanted_attributes.include?(attribute_name)
}
csv_string = CSV.generate do |csv|
  csv << sanitized_attributes.values
end
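To keep each facet array in a single column rather than letting it spill across several, you could join array values before handing the row to CSV. A sketch, with an inline sample hash standing in for the article's sanitized attributes:

```ruby
require 'csv'

# Inline sample standing in for one article's sanitized attributes
attributes = {
  "title"     => "Ad Blockers and the Nuisance at the Heart of the Modern Web",
  "des_facet" => ["Online Advertising", "Privacy"],
  "per_facet" => ""
}

# Join array values with a semicolon so each facet list stays in one cell
row = attributes.values.map { |v| v.is_a?(Array) ? v.join("; ") : v }

csv_string = CSV.generate { |csv| csv << row }
puts csv_string
```

The semicolon separator is an arbitrary choice; anything that won't appear inside the facet values works, since CSV quoting handles commas for you.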

JSON to CSV via FasterCSV

I'm new to Ruby and had a question. I'm trying to create a .rb file that converts JSON to CSV.
I came across some disparate sources that got me to make:
require "rubygems"
require 'fastercsv'
require 'json'
csv_string = FasterCSV.generate({}) do |csv|
  JSON.parse(File.open("small.json").read).each do |hash|
    csv << hash
  end
end
puts csv_string
Now, it does in fact output text, but everything is squashed together without spaces or commas. How do I produce a clean, properly delimited CSV file that I can export?
The JSON would look like:
{
"results": [
{
"reportingId": "s",
"listingType": "Business",
"hasExposureProducts": false,
"name": "Medeco Medical Centre World Square",
"primaryAddress": {
"geoCodeGranularity": "PROPERTY",
"addressLine": "Shop 9.01 World Sq Shopng Cntr 644 George St",
"longitude": "151.206172",
"suburb": "Sydney",
"state": "NSW",
"postcode": "2000",
"latitude": "-33.876416",
"type": "VANITY"
},
"primaryContacts": [
{
"type": "PHONE",
"value": "(02) 9264 8500"
}
]
},xxx
}
The CSV to just have something like:
reportingId, s, listingType, Business, name, Medeco Medical...., addressLine, xxxxx, longitude, xxxx, latitude, xxxx, state, NSW, postcode, 2000, type, phone, value, (02) 92648544
Since your JSON structure is a mix of hashes and lists, with nesting of varying depth, it is not as trivial as the code you show. However (assuming your input files always look the same) it shouldn't be hard to write an appropriate converter. At the lowest level, you can transform a hash to a CSV row with
hash.to_a.flatten
E.g.
input = JSON.parse(File.open("small_file.json").read)
writer = FasterCSV.open("out.csv", "w")
writer << input["results"][0]["primaryAddress"].to_a.flatten
will give you
type,VANITY,latitude,-33.876416,postcode,2000,state,NSW,suburb,Sydney,longitude,151.206172,addressLine,Shop 9.01 World Sq Shopng Cntr 644 George St,geoCodeGranularity,PROPERTY
Hope that points you in the right direction.
Btw, your JSON looks invalid. You should change the },xxx line to }].
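For the deeper levels, a small recursive flattener can turn the whole nested record into one key/value row. A sketch under the assumption that leaf values can all be stringified; the `flatten_entry` name is my own, not from any library:

```ruby
# Recursively flatten a nested hash/array structure into [key, value, ...] pairs
def flatten_entry(value, key = nil)
  case value
  when Hash
    value.flat_map { |k, v| flatten_entry(v, k.to_s) }
  when Array
    value.flat_map { |v| flatten_entry(v, key) }
  else
    [key, value.to_s]
  end
end

record = {
  "reportingId" => "s",
  "primaryAddress" => { "state" => "NSW", "postcode" => "2000" },
  "primaryContacts" => [{ "type" => "PHONE", "value" => "(02) 9264 8500" }]
}

p flatten_entry(record)
# ["reportingId", "s", "state", "NSW", "postcode", "2000", "type", "PHONE", "value", "(02) 9264 8500"]
```

The resulting array can be fed straight to `writer << flatten_entry(record)` in place of the single-level `to_a.flatten` shown above.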