I am trying to extract each product's description. The first loop runs through each product, and a nested loop opens each product page and grabs the description.
import json
import time
import requests
from bs4 import BeautifulSoup
for page in range(1, 2):
    guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
    soup = BeautifulSoup(guitarPage, 'lxml')
    guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
This is the loop for each product:
    for guitar in guitars:
        title_text = guitar.h3.text.strip()
        print('Guitar Name: ', title_text)
        price = guitar.find(class_='price bold small').text.strip()
        print('Guitar Price: ', price)
        priceSave = guitar.find('span', {'class': 'price save'})
        if priceSave is not None:
            priceOf = priceSave.text
            print(priceOf)
        else:
            print("No discount!")
        image = guitar.img.get('src')
        print('Guitar Image: ', image)
        productLink = guitar.find('a').get('href')
        linkProd = url + productLink
        print('Link of product', linkProd)
Here I am adding the collected links to an array:
        productsPage.append(linkProd)
Here is my attempt at entering each product page and extracting the description:
        for products in productsPage:
            response = requests.get(products)
            soup = BeautifulSoup(response.content, "lxml")
            productsDetails = soup.find("div", {"class":"description-preview"})
            if productsDetails is not None:
                description = productsDetails.text
                # print('product detail: ', description)
            else:
                print('none')
            time.sleep(0.2)
        if None not in(title_text,price,image,linkProd, description):
            products = {
                'title': title_text,
                'price': price,
                'discount': priceOf,
                'image': image,
                'link': linkProd,
                'description': description,
            }
            result.append(products)
        with open('datas.json', 'w') as outfile:
            json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
        # print(result)
        print('--------------------------')
        time.sleep(0.5)
The outcome for each product should look like this:
{
"title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
"price": "£399.00",
"discount": null,
"image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
"description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
},
But the description is only correct for the first product and does not change for the later ones:
[
{
"title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha APX600FM Flame Maple Amber",
"price": "£239.00",
"discount": "Save £160.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
},
{
"title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
"price": "£399.00",
"discount": "Save £267.00",
"image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
"link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
"description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
}
]
This is the result I am getting. It changes from run to run; sometimes a product shows the description of the previous product.
It does loop, but it seems there are protective measures in place server-side, and the set of pages that fail changes from run to run. I checked the pages that failed, and they did contain the searched-for content. No single measure seemed to suffice in my testing (I didn't try sleeps over 2 seconds, but I did try some IP and User-Agent changes with sleeps <= 2 seconds).
You could try alternating IPs and User-Agents, retrying with back-off, and varying the time between requests.
Changing proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
Changing User-Agent: https://pypi.org/project/fake-useragent/
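The back-off retry idea can be sketched as below. This is only a minimal illustration, not a tested fix for this site: fetch_with_backoff and its parameters are made-up names, and the fetch callable stands in for whatever code requests a product page and returns the description text (or None when the element is missing).

```python
import random
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[str]],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns a non-None value or attempts run out.

    `fetch` is a hypothetical callable that downloads one product page and
    returns its description, or None if the description was not found.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential back-off plus a little random jitter between tries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    # Every attempt failed: return None instead of reusing an older value.
    return None
```

Inside the product loop, assigning the returned value to description for every product (including None on failure) avoids carrying the previous product's description over when a page fails.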
I'm learning Scrapy. As an exercise I want to get the product title in this web page https://scrapingclub.com/exercise/detail_json/ using this code:
scrapy shell "https://scrapingclub.com/exercise/detail_json/"
response.xpath("//h3[1]/text()")
[]
but all I get back is an empty list.
Try this:
response.xpath("//script[contains(., 'title')]/text()")
If you press Ctrl+U to view the page source, you can see this information near the bottom of the page:
var obj = {
"title": "Short Sweatshirt",
"price": "$24.99",
"description": "Short sweatshirt with long sleeves and ribbing at neckline, cuffs, and hem. 57% cotton, 43% polyester. Machine wash cold.",
"img_path": "/static/img/" + "96230-C" + ".jpg" };
Is the data below in a well-known format, or is this a custom format invented by the generator?
[{
"tmsId": "MV006574730000",
"rootId": "11214341",
"subType": "Feature Film",
"title": "Doctor Strange 3D",
"releaseYear": 2016,
"releaseDate": "2016-11-04",
"titleLang": "en",
"descriptionLang": "en",
"entityType": "Movie",
"genres": ["Action", "Adventure", "Fantasy"],
"longDescription": "Dr. Stephen Strange's (Benedict Cumberbatch) life changes after a car accident robs him of the use of his hands. When traditional medicine fails him, he looks for healing, and hope, in a mysterious enclave. He quickly learns that the enclave is at the front line of a battle against unseen dark forces bent on destroying reality. Before long, Strange is forced to choose between his life of fortune and status or leave it all behind to defend the world as the most powerful sorcerer in existence.",
"shortDescription": "Dr. Stephen Strange discovers the world of magic after meeting the Ancient One.",
"topCast": ["Benedict Cumberbatch", "Chiwetel Ejiofor", "Rachel McAdams"],
"directors": ["Scott Derrickson"],
"officialUrl": "http://marvel.com/doctorstrange",
"ratings": [{
"body": "Motion Picture Association of America",
"code": "PG-13"
}],
This is indeed JSON. I suppose the chunk you posted is not the complete data, because some closing brackets are missing. If you delete the trailing comma "," and put "}]" in its place, it passes validation in JSONLint.
You can try it here: jsonlint.com
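The same check can be done locally with Python's standard json module. The snippet below is a small sketch: the sample string is a heavily shortened stand-in for the data in the question, and is_valid_json is a made-up helper name.

```python
import json

# Shortened stand-in for the truncated data: it ends with a trailing comma
# and is missing the closing "}]".
truncated = '[{"title": "Doctor Strange 3D", "ratings": [{"code": "PG-13"}],'

# Apply the repair described above: drop the trailing comma, then close
# the object and the array.
repaired = truncated.rstrip(",") + "}]"


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(truncated), is_valid_json(repaired))
```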
I have a very large code block in my .rst file, and I would like to highlight just a small portion of it in bold. Consider the following rst:
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.wall of text. wall of text. wall of text.
**Example 1: Explain showing a table scan operation**::

  EXPLAIN FORMAT=JSON
  SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
  {
    "query_block": {
      "select_id": 1,
      "cost_info": {
        "query_cost": "53.80" # This query costs 53.80 cost units
      },
      "table": {
        "table_name": "Country",
        "access_type": "ALL", # ALL is a table scan
        "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
        "rows_produced_per_join": 11,
        "filtered": "4.76",
        "cost_info": {
          "read_cost": "51.52",
          "eval_cost": "2.28",
          "prefix_cost": "53.80",
          "data_read_per_join": "2K"
        },
        "used_columns": [
          "Code",
          "Name",
          "Continent",
          "Region",
          "SurfaceArea",
          "IndepYear",
          "Population",
          "LifeExpectancy",
          "GNP",
          "GNPOld",
          "LocalName",
          "GovernmentForm",
          "HeadOfState",
          "Capital",
          "Code2"
        ],
        "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
      }
    }
  }
When it is converted to HTML, it is syntax highlighted by default (good), but I also want to specify a few lines that should be bold (the ones with comments on them, but possibly others too).
I was thinking of adding a trailing character sequence to those lines (e.g. ###) and then writing a post-processing script to modify the generated HTML files. Is there a better way?
The code-block directive has an emphasize-lines option. The following highlights the lines with comments in your code.
**Example 1: Explain showing a table scan operation**

.. code-block:: python
   :emphasize-lines: 7, 11, 12

   EXPLAIN FORMAT=JSON
   SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
   {
     "query_block": {
       "select_id": 1,
       "cost_info": {
         "query_cost": "53.80" # This query costs 53.80 cost units
       },
       "table": {
         "table_name": "Country",
         "access_type": "ALL", # ALL is a table scan
         "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
         "rows_produced_per_join": 11,
         "filtered": "4.76",
         "cost_info": {
           "read_cost": "51.52",
           "eval_cost": "2.28",
           "prefix_cost": "53.80",
           "data_read_per_join": "2K"
         },
         "used_columns": [
           "Code",
           "Name",
           "Continent",
           "Region",
           "SurfaceArea",
           "IndepYear",
           "Population",
           "LifeExpectancy",
           "GNP",
           "GNPOld",
           "LocalName",
           "GovernmentForm",
           "HeadOfState",
           "Capital",
           "Code2"
         ],
         "attached_condition": "((`world`.`Country`.`Continent` = 'Asia') and (`world`.`Country`.`Population` > 5000000))"
       }
     }
   }
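If counting lines by hand becomes tedious for a large block, the numbers for emphasize-lines can be computed. The helper below is only a convenience sketch with a made-up name; it assumes that every line containing the comment marker should be emphasized, which also matches a "#" that happens to sit inside a string.

```python
from typing import List


def commented_line_numbers(code: str, marker: str = "#") -> List[int]:
    """Return the 1-based numbers of the lines that contain the marker."""
    return [
        number
        for number, line in enumerate(code.splitlines(), start=1)
        if marker in line
    ]


# First twelve lines of the code block from the example above.
SNIPPET = """EXPLAIN FORMAT=JSON
SELECT * FROM Country WHERE continent='Asia' and population > 5000000;
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "53.80" # This query costs 53.80 cost units
    },
    "table": {
      "table_name": "Country",
      "access_type": "ALL", # ALL is a table scan
      "rows_examined_per_scan": 239, # Accessing all 239 rows in the table
"""

# Prints a value that can be pasted after ":emphasize-lines:".
print(", ".join(str(n) for n in commented_line_numbers(SNIPPET)))
```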
I have stored the data in the JSON shown below. I want to store the values of des_facet, org_facet, per_facet and geo_facet as arrays in my CSV. At the moment, these values from my hash end up in separate columns.
hash = article.attributes.select {|k,v| !["author","images","guid","link"].include?(k) }
hash_new = []
hash.values.map do |v|
  hash_new.push("\""+v.to_s+"\"")
end
hash_new.map(&:to_s).join(", ")
Sample JSON:
{
"articles": [{
"results": [{
"title": "Ad Blockers and the Nuisance at the Heart of the Modern Web",
"summary": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"source": "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html",
"date": "2015-08-20T00:00:00-5:00",
"section": "Technology",
"item_type": "Article",
"updated_date": "2015-08-19T16:05:01-5:00",
"created_date": "2015-08-19T05:00:06-5:00",
"material_type_facet": "News",
"abstract": "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.",
"byline": "By FARHAD MANJOO",
"kicker": "",
"des_facet": ["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"],
"org_facet": ["Adblock Plus"],
"per_facet": "",
"geo_facet": ""
}]
}]
}
I want the CSV to keep the same format. Currently, this is what I get:
"Ad Blockers and the Nuisance at the Heart of the Modern Web", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "http://www.nytimes.com/2015/08/20/technology/personaltech/ad-blockers-and-the-nuisance-at-the-heart-of-the-modern-web.html", "2015-08-20T00:00:00-5:00", "Technology", "Article", "2015-08-19T16:05:01-5:00", "2015-08-19T05:00:06-5:00", "News", "The adoption of ad-blocking technology is rising steeply. Some see an existential threat to online content as we know it, but others see a new business niche.", "By FARHAD MANJOO", "", "["Online Advertising", "Computers and the Internet", "Data-Mining and Database Marketing", "Privacy", "Advertising and Marketing", "Mobile Applications"]", "["Adblock Plus"]", "", ""
I am not sure how to do this, as I am quite new to Ruby. I have thought of using grep to look for values matching /[\]]/.
You should avoid writing CSV yourself; Ruby ships with a CSV class that does all the escaping for you.
require 'csv'
unwanted_attributes = ["author", "images", "guid", "link"]
sanitized_attributes = article.attributes.select { |attribute_name, _|
  !unwanted_attributes.include?(attribute_name)
}
csv_string = CSV.generate do |csv|
  csv << sanitized_attributes.values
end
I'm new to Ruby and had a question. I'm trying to create a .rb file that converts JSON to CSV.
I pieced together the following from several different sources:
require "rubygems"
require 'fastercsv'
require 'json'
csv_string = FasterCSV.generate({}) do |csv|
  JSON.parse(File.open("small.json").read).each do |hash|
    csv << hash
  end
end
puts csv_string
Now, it does in fact output text, but everything is squashed together without spaces, commas, etc. How do I customise the output so that it is a clean CSV file I can export?
The JSON would look like:
{
"results": [
{
"reportingId": "s",
"listingType": "Business",
"hasExposureProducts": false,
"name": "Medeco Medical Centre World Square",
"primaryAddress": {
"geoCodeGranularity": "PROPERTY",
"addressLine": "Shop 9.01 World Sq Shopng Cntr 644 George St",
"longitude": "151.206172",
"suburb": "Sydney",
"state": "NSW",
"postcode": "2000",
"latitude": "-33.876416",
"type": "VANITY"
},
"primaryContacts": [
{
"type": "PHONE",
"value": "(02) 9264 8500"
}
]
},xxx
}
The CSV should just have something like:
reportingId, s, listingType, Business, name, Medeco Medical...., addressLine, xxxxx, longitude, xxxx, latitude, xxxx, state, NSW, postcode, 2000, type, phone, value, (02) 92648544
Since your JSON structure is a mix of hashes and lists, nested to different depths, it is not as trivial as the code you show. However (assuming your input files always look the same), it shouldn't be hard to write an appropriate converter. At the lowest level, you can transform a hash into a CSV row with
hash.to_a.flatten
E.g.
input = JSON.parse(File.open("small_file.json").read)
writer = FasterCSV.open("out.csv", "w")
writer << input["results"][0]["primaryAddress"].to_a.flatten
will give you
type,VANITY,latitude,-33.876416,postcode,2000,state,NSW,suburb,Sydney,longitude,151.206172,addressLine,Shop 9.01 World Sq Shopng Cntr 644 George St,geoCodeGranularity,PROPERTY
Hope that points you in the right direction.
Btw, your JSON looks invalid. You should change the },xxx line to }].
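For the mixed nesting of hashes and lists, a small recursive flattener is one way to build the whole row. The sketch below shows the idea in Python rather than in the Ruby used above, purely as an illustration: flatten is a made-up helper, the sample record is a shortened version of the JSON in the question, and dropping the keys of nested containers (such as primaryAddress) is an assumption based on the desired output.

```python
import csv
import io
from typing import Any, List


def flatten(value: Any) -> List[Any]:
    """Flatten nested dicts and lists into one key, value, key, value... row.

    Keys whose value is itself a dict or list are dropped; only the leaf
    key/value pairs are kept (an assumption based on the desired CSV).
    """
    if isinstance(value, dict):
        row: List[Any] = []
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                row.extend(flatten(item))
            else:
                row.extend([key, item])
        return row
    if isinstance(value, list):
        row = []
        for item in value:
            row.extend(flatten(item))
        return row
    return [value]


# Shortened version of one entry of "results" from the question.
record = {
    "reportingId": "s",
    "listingType": "Business",
    "primaryAddress": {"state": "NSW", "postcode": "2000"},
    "primaryContacts": [{"type": "PHONE", "value": "(02) 9264 8500"}],
}

buffer = io.StringIO()
csv.writer(buffer).writerow(flatten(record))
print(buffer.getvalue(), end="")
```

The csv module takes care of quoting any value that contains a comma, so the row stays parseable.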