Using beautiful soup to clean up scraped HTML from scrapy - xpath

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath
Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
which gives me the scrapy shell, inside which I do:
>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
[
u'<b>Python </b>Paradigms for XML',
u'NCClient: A <b>Python </b>Library for NETCONF Clients',
u'PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments',
u'<b>Python </b>and XML',
u'drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero',
u'XML and <b>Python </b>Tutorial',
u'Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b>',
u'XML Processing with Perl, <b>Python</b>, and PHP',
u'<b>Python </b>& XML',
u'A <b>Python </b>Module for NETCONF Clients'
]
As you can see, this output is raw HTML that needs cleaning. I now have a good sense of how to clean this HTML up. The simplest way is probably to just use BeautifulSoup and try something like:
t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
soup = BeautifulSoup(t)
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)
This is based on an earlier SO question. A regexp version has been suggested, but I am guessing that BeautifulSoup will be more robust.
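For instance, one way I imagine this working (an untested sketch, assuming the same scrapy shell session as above) is to clean each extracted fragment individually, since BeautifulSoup expects a string rather than a list:

from bs4 import BeautifulSoup

# Clean each raw HTML fragment separately and keep only its text
fragments = sel.xpath('//h3[@class="gs_rt"]/a').extract()
titles = [BeautifulSoup(fragment, 'html.parser').get_text() for fragment in fragments]
# titles[0] should then be u'Python Paradigms for XML'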
I'm a scrapy n00b and can't figure out how to embed this in my spider. I tried
from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()
        t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
        soup = BeautifulSoup(t)
        text_parts = soup.findAll(text=True)
        text = ''.join(text_parts)
        item['title'] = text
        return(item)
But that didn't quite work. Any suggestions would be helpful.
Edit 3: Based on suggestions, I have modified my spider file to:
from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "dmoz"
    allowed_domains = ["sholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Anine+facts+about+top+journals+in+economics"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()
        titles = sel.xpath('//h3[@class="gs_rt"]/a')
        for title in titles:
            title = item.xpath('.//text()').extract()
            print "".join(title)
However, I get the following output:
2014-02-17 15:11:12-0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: scholarscrape)
2014-02-17 15:11:12-0800 [scrapy] INFO: Optional features available: ssl, http11
2014-02-17 15:11:12-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scholarscrape.spiders', 'SPIDER_MODULES': ['scholarscrape.spiders'], 'BOT_NAME': 'scholarscrape'}
2014-02-17 15:11:12-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider opened
2014-02-17 15:11:13-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-17 15:11:13-0800 [dmoz] DEBUG: Crawled (200) <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml> (referer: None)
2014-02-17 15:11:13-0800 [dmoz] ERROR: Spider error processing <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
    self._startRunCallbacks(result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/krishnan/work/research/journals/code/scholarscrape/scholarscrape/spiders/scholar_spider.py", line 20, in parse
    title = item.xpath('.//text()').extract()
  File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__
    raise AttributeError(name)
exceptions.AttributeError: xpath
2014-02-17 15:11:13-0800 [dmoz] INFO: Closing spider (finished)
2014-02-17 15:11:13-0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 247,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 108851,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 196648),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 21701)}
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider closed (finished)
Edit 2: My original question was quite different, but I am now convinced that this is the right way to proceed. Original question (and first edit below):
I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example the following link:
http://scholar.google.com/scholar?q=intitle%3Apython+xpath
Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
which gives me the scrapy shell, inside which I do:
>>> sel.xpath('string(//h3[@class="gs_rt"]/a)').extract()
[u'Python Paradigms for XML']
As you can see, this only selects the first title, and none of the others on the page. I can't figure out what I should modify my XPath to, so that I select all such elements on the page. Any help is greatly appreciated.
Edit 1: My first approach was to try
>>> sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
[u'Paradigms for XML', u'NCClient: A ', u'Library for NETCONF Clients',
u'PALSE: ', u'Analysis of Large Scale (Computer) Experiments', u'and XML',
u'drx: ', u'Programming Language [Computers: Programming: Languages: ',
u']-loadaverageZero', u'XML and ', u'Tutorial',
u'Zato\u2014agile ESB, SOA, REST and cloud integrations in ',
u'XML Processing with Perl, ', u', and PHP', u'& XML', u'A ',
u'Module for NETCONF Clients']
The problem with this approach is that if you look at the actual Google Scholar page, you will see that the first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as scrapy returns. My guess is that 'Python' is wrapped inside <b> tags, which is why text() is not doing what we want it to do.

This is a really interesting and rather difficult question. The problem you're facing is that "Python" in the title is in bold, so it sits in its own <b> node while the rest of the title is plain text; text() therefore extracts only the top-level text content and not the content of the <b> node.
Here's my solution. First get all the links:
titles = sel.xpath('//h3[@class="gs_rt"]/a')
then iterate over them and select all of the textual content of each node; in other words, join the <b> node with the plain text nodes among each link's children:
for item in titles:
    title = item.xpath('.//text()').extract()
    print "".join(title)
This works because inside the for loop you are dealing with the textual content of each link's children, so you can join the matching pieces. On each iteration, title will be a list such as [u'Python ', u'Paradigms for XML'] or [u'NCClient: A ', u'Python ', u'Library for NETCONF Clients'].
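If you want to plug this back into your spider rather than just printing, a minimal sketch (assuming the same ScholarscrapeItem with a title field, as in your question) could look like:

def parse(self, response):
    sel = Selector(response)
    for link in sel.xpath('//h3[@class="gs_rt"]/a'):
        item = ScholarscrapeItem()
        # Join the <b> and plain text pieces of this link into one title
        item['title'] = u''.join(link.xpath('.//text()').extract())
        yield item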

The XPath string() function only returns the string value of the first node in the node-set you pass to it.
Just extract nodes normally, don't use string().
sel.xpath('//h3[@class="gs_rt"]/a').extract()
or
sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
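If you do want to stick with string(), a small shell sketch (same page as above) is to apply it per node, so each link gets its own string value including the bold part:

# string(.) evaluated on each <a> node returns its full string value
>>> [a.xpath('string(.)').extract()[0] for a in sel.xpath('//h3[@class="gs_rt"]/a')]
[u'Python Paradigms for XML', u'NCClient: A Python Library for NETCONF Clients', ...]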

The first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as Scrapy returns.
You need to use normalize-space(), which returns the whole string value of the node with whitespace collapsed, whereas text() only returns the individual text nodes and so drops the part of the title that sits inside the <b> element. So your initial XPath would look like this:
sel.xpath('//h3[@class="gs_rt"]/a').xpath("normalize-space()").extract()
Example:
# HTML has been simplified
from parsel import Selector

html = '''
<span class=gs_ctg2>[PDF]</span> erdincuzun.com
<div class="gs_ri">
  <h3 class="gs_rt">
    <span class="gs_ctc"><span class="gs_ct1">[PDF]</span><span class="gs_ct2">[PDF]</span></span>
    <a href="https://erdincuzun.com/wp-content/uploads/download/plovdiv_2018_01.pdf">Comparison of <b>Python </b>libraries used for Web
data extraction</a>
  </h3>
</div>
'''

selector = Selector(text=html)

# get() to get textual data
print("Without normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a/text()').get())
print("\nWith normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a').xpath("normalize-space()").get())

"""
Without normalize-space:
 Comparison of 

With normalize-space:
 Comparison of Python libraries used for Web data extraction
"""
Actual code to get titles from Google Scholar organic results:
import scrapy

class ScholarSpider(scrapy.Spider):
    name = "scholar_titles"
    allowed_domains = ["scholar.google.com"]
    start_urls = ["https://scholar.google.com/scholar?q=intitle%3Apython+xpath"]

    def parse(self, response):
        for quote in response.xpath('//*[@class="gs_rt"]/a'):
            yield {
                "title": quote.xpath("normalize-space()").get()
            }
Run it:
$ scrapy runspider -O <file_name>.jl <file_name>.py
-O overwrites the output file; -o appends to it.
.jl is the JSON Lines file format.
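Once the spider has run, the .jl file can be read back line by line, one JSON object per line. A quick sketch (assuming the output file is named titles.jl):

import json

# Each line of a JSON Lines file is an independent JSON document
with open("titles.jl") as f:
    for line in f:
        print(json.loads(line)["title"])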
Output with normalize-space():
{"title": "Comparison of Python libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with python"}
{"title": "News crawling based on Python crawler"}
{"title": "A survey on python libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on Python"}
{"title": "DECEPTIVE SECURITY USING PYTHON"}
{"title": "Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others"}
{"title": "Python Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy"}
{"title": "XML processing with Python"}
Output without normalize-space():
{"title": "Comparison of "}
{"title": "libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with "}
{"title": "News crawling based on "}
{"title": "crawler"}
{"title": "A survey on "}
{"title": "libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on "}
{"title": "DECEPTIVE SECURITY USING "}
{"title": "Hands-On Web Scraping with "}
{"title": ": Perform advanced scraping operations using various "}
{"title": "libraries and tools such as Selenium, Regex, and others"}
{"title": "Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using "}
{"title": "And Scrapy"}
{"title": "XML processing with "}
Alternatively, you can achieve this with the Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan. You don't have to figure out the extraction, maintain it over time, scale it, or bypass blocks from the search engine, since that is already handled for the end user.
Example code to integrate:
from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpApi API key",  # API key
    "engine": "google_scholar",         # parsing engine
    "q": "intitle:python XPath",        # search query
    "hl": "en"                          # language
}

search = GoogleScholarSearch(params)  # where extraction happens on the SerpApi back-end
results = search.get_dict()           # JSON -> Python dictionary

for result in results["organic_results"]:
    title = result["title"]
    print(title)
Output:
Comparison of Python libraries used for Web data extraction
Approaching the largest 'API': extracting information from the internet with python
News crawling based on Python crawler
A survey on python libraries used for social media content scraping
Design and Implementation of Crawler Program Based on Python
DECEPTIVE SECURITY USING PYTHON
Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Python Paradigms for XML
Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy
XML processing with Python
Disclaimer, I work for SerpApi.

Related

Specifying parameters in yml file for Quarto

I am creating a Quarto book project in RStudio to render an HTML document.
I need to specify some parameters in the YAML file, but the qmd file returns
"object 'params' not found". I am using knitr.
I use the default _quarto.yml file, where I have added params under the book tag:
project:
  type: book

book:
  title: "Params_TEst"
  author: "Jane Doe"
  date: "15/07/2022"
  params:
    pcn: 0.1
  chapters:
    - index.qmd
    - intro.qmd
    - summary.qmd
    - references.qmd

bibliography: references.bib

format:
  html:
    theme: cosmo
  pdf:
    documentclass: scrreprt

editor: visual
and the qmd file looks like this
# Preface {.unnumbered}
This is a Quarto book.
To learn more about Quarto books visit <https://quarto.org/docs/books>.
```{r}
1 + 1
params$pcn
```
When I render the book, or preview the book in Rstudio the error I receive is:
Quitting from lines 8-10 (index.qmd)
Error in eval(expr, envir, enclos) : object 'params' not found
Calls: .main ... withVisible -> eval_with_user_handlers -> eval -> eval
I have experimented with placing the params entry in different places in the YAML, but nothing has worked so far.
Could anybody help?
For multi-page renders, e.g. Quarto books, you need to add the YAML to each page, not to the _quarto.yml file.
So in your case, each of the chapters that calls a parameter needs a YAML header, like index.qmd, intro.qmd, and summary.qmd, but perhaps not references.qmd.
The YAML header should look just like it does in a standard Rmd. So for example, your index.qmd would look like this:
---
params:
  pcn: 0.1
---

# Preface {.unnumbered}

This is a Quarto book.

To learn more about Quarto books visit <https://quarto.org/docs/books>.

```{r}
1 + 1
params$pcn
```
But, what if you need to change the parameter and re-render?
Then simply pass new parameters to the quarto_render function
quarto::quarto_render(input = here::here("quarto"), # expecting a dir to render
                      output_format = "html",       # output dir is set in _quarto.yml
                      cache_refresh = TRUE,
                      execute_params = list(pcn = 0.2))
For now, this only seems to work if you add the parameters to each individual page front-matter YAML.
If you have a large number of pages and need to keep parameters centralized, a workaround is to run a preprocessing script that replaces the parameters in all pages. To add a preprocessing script, add the key pre-render to your _quarto.yml file. The Quarto website has detailed instructions.
For example, if you have N pages named index<N>.qmd, you could have a placeholder in the YAML of each page:
---
title: This is chapter N
yourparamplaceholder
---
Your pre-render script could replace yourparamplaceholder with the desired parameters. Here's an example Python script:
import os

for filename in os.listdir("."):  # directory containing the .qmd files
    if filename.endswith(".qmd"):
        with open(filename, "r") as f:
            txt = f.read()
        # Replace the placeholder with the actual params block (YAML needs spaces, not tabs)
        txt = txt.replace('yourparamplaceholder', 'params:\n  pcn: 0.1\n  other: 20\n')
        with open(filename, "w") as ff:
            ff.write(txt)
I agree with you that being able to set parameters centrally would be a good idea.

Google Cloud Translation API: Creating glossary error

I tried to test the Cloud Translation API using a glossary.
So I created a sample glossary file (.csv) and uploaded it to Cloud Storage.
However, when I ran my test code (copied from the sample code in the official documentation), an error occurred. It seems that there is a problem with my sample glossary file, but I cannot find it.
I attached my code, the error message, and a screenshot of the glossary file.
Could you please tell me how to fix it?
And can I use the glossary so that the original language is kept when translated into another language?
Ex) Translation English to Korean
I want to visit California. >>> 나는 California에 방문하고 싶다.
Sample Code)
from google.cloud import translate_v3 as translate
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "my_service_account_json_file_path"

def create_glossary(
    project_id="YOUR_PROJECT_ID",
    input_uri="YOUR_INPUT_URI",
    glossary_id="YOUR_GLOSSARY_ID",
):
    """
    Create an equivalent term sets glossary. Glossary can be words or
    short phrases (usually fewer than five words).
    https://cloud.google.com/translate/docs/advanced/glossary#format-glossary
    """
    client = translate.TranslationServiceClient()

    # Supported language codes: https://cloud.google.com/translate/docs/languages
    source_lang_code = "ko"
    target_lang_code = "en"
    location = "us-central1"  # The location of the glossary

    name = client.glossary_path(project_id, location, glossary_id)

    language_codes_set = translate.types.Glossary.LanguageCodesSet(
        language_codes=[source_lang_code, target_lang_code]
    )

    gcs_source = translate.types.GcsSource(input_uri=input_uri)
    input_config = translate.types.GlossaryInputConfig(gcs_source=gcs_source)

    glossary = translate.types.Glossary(
        name=name, language_codes_set=language_codes_set, input_config=input_config
    )

    parent = client.location_path(project_id, location)

    # glossary is a custom dictionary Translation API uses
    # to translate the domain-specific terminology.
    operation = client.create_glossary(parent=parent, glossary=glossary)

    result = operation.result(timeout=90)
    print("Created: {}".format(result.name))
    print("Input Uri: {}".format(result.input_config.gcs_source.input_uri))

create_glossary("my_project_id", "file_path_on_my_cloud_storage_bucket", "test_glossary")
Error Message)
Traceback (most recent call last):
  File "C:/Users/ME/py-test/translation_api_test.py", line 120, in <module>
    create_glossary("my_project_id", "file_path_on_my_cloud_storage_bucket", "test_glossary")
  File "C:/Users/ME/py-test/translation_api_test.py", line 44, in create_glossary
    result = operation.result(timeout=90)
  File "C:\Users\ME\py-test\venv\lib\site-packages\google\api_core\future\polling.py", line 127, in result
    raise self._exception
google.api_core.exceptions.GoogleAPICallError: None No glossary entries found in input files. Check your files are not empty. stats = {total_examples = 0, total_successful_examples = 0, total_errors = 3, total_ignored_errors = 3, total_source_text_bytes = 0, total_target_text_bytes = 0, total_text_bytes = 0, text_bytes_by_language_map = []}
Glossary File)
https://drive.google.com/file/d/1RaladmLjgygai3XsZv3Ez4ij5uDH5EdE/view?usp=sharing
I solved my problem by changing the encoding of the glossary file to UTF-8.
And I also found that I can use the glossary so that the original language is kept when translated into another language.
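In case it helps anyone else, the re-encoding itself can be done in a couple of lines of Python (a sketch; the source encoding cp949 below is only an assumption for a Korean-locale CSV and should be replaced with whatever your file actually uses):

# Hypothetical re-encode to UTF-8; adjust 'cp949' to the real source encoding
with open("glossary.csv", "r", encoding="cp949") as src:
    data = src.read()
with open("glossary_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)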

Scrapy return nothing when crawling AWS blogs site

Here's my attempt to crawl the list of URLs on the first page of the AWS blogs site.
But it returns nothing. I think there may be something wrong with my XPath, but I'm not sure how to fix it.
import scrapy

class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//li[@class="m-card"]')
        for blog in blogs:
            url = blog.xpath('.//div[@class="m-card-title"]/a/@href').extract()
            print(url)
Attempt 2
import scrapy

class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//div[@class="aws-directories-container"]')
        for blog in blogs:
            url = blog.xpath('//li[@class="m-card"]/div[@class="m-card-title"]/a/@href').extract_first()
            print(url)
Log output:
2019-11-06 10:38:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-06 10:38:30 [scrapy.core.engine] INFO: Spider opened
2019-11-06 10:38:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-06 10:38:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/robots.txt> from <GET http://aws.amazon.com/robots.txt>
2019-11-06 10:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/robots.txt> (referer: None)
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/blogs/> from <GET http://aws.amazon.com/blogs/>
2019-11-06 10:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/blogs/> (referer: None)
2019-11-06 10:38:32 [scrapy.core.engine] INFO: Closing spider (finished)
Any help would be greatly appreciated!
You are using the wrong approach: the site loads the blog details dynamically via JavaScript. Check the page source and you will see that the blog content is not in the initial HTML.
To fetch the data, you should use a technique that renders the page first, such as one of the following (see the sketch below):
1. Scrapy-Splash
2. Selenium
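For example, a minimal scrapy-splash sketch (assumes a Splash instance is running and that the scrapy-splash middlewares and SPLASH_URL are configured in settings.py; the XPath is reused from the question):

import scrapy
from scrapy_splash import SplashRequest

class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'

    def start_requests(self):
        # Ask Splash to render the JavaScript before returning the HTML
        yield SplashRequest('https://aws.amazon.com/blogs/', self.parse, args={'wait': 2})

    def parse(self, response):
        for href in response.xpath('//div[@class="m-card-title"]/a/@href').getall():
            print(href)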

Scrapy view redirect to other page and get <400> error

I am trying to do scrapy view or fetch on https://www.watsons.com.sg, and the page gets redirected and returns a <400> error. I wonder if there is any way to work around it. The log shows something like this:
2018-11-15 22:54:15 [scrapy.core.engine] INFO: Spider opened
2018-11-15 22:54:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-15 22:54:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-15 22:54:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F> from <GET https://www.watsons.com.sg>
2018-11-15 22:54:16 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F> (referer: None)
2018-11-15 22:54:17 [scrapy.core.engine] INFO: Closing spider (finished)
If I use requests.get("https://www.watsons.com.sg") it's fine. Any idea or comment is much appreciated. Thanks.
Okay, so this is one of the weird behaviors of scrapy.
If you look at the location header in the HTTP response (with Firefox developer tools for example), you can see:
location: https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F
Note that there is no / between the .com.sg and the ?.
Looking at how Firefox behaves, it adds the missing / on the next request.
However, somehow Scrapy does not do this!
If you look at your logs, when the HTTP 400 error is received, you can see that the / is missing.
This is being discussed in this issue: https://github.com/scrapy/scrapy/issues/1133
For now, the way I work around it is to have my own downloader middleware that normalizes the location header before the response is passed to the redirect middleware.
It looks like this:
from scrapy.spiders import Spider
from w3lib.url import safe_download_url

class MySpider(Spider):
    name = 'watsons.com.sg'
    start_urls = ['https://www.watsons.com.sg/']

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'spiders.myspider.FixLocationHeaderMiddleWare': 650
        }
    }

    def parse(self, response):
        pass

class FixLocationHeaderMiddleWare:
    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            response.headers['location'] = safe_download_url(response.headers['location'].decode())
        return response
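If you prefer to enable it for the whole project instead of a single spider, the same setting can go in settings.py (the dotted path below is a placeholder; use wherever your middleware class actually lives):

# settings.py -- adjust the dotted path to the module containing the middleware
DOWNLOADER_MIDDLEWARES = {
    'spiders.myspider.FixLocationHeaderMiddleWare': 650,
}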

Vision API: How to get JSON-output

I'm having trouble saving the output given by the Google Vision API. I'm using Python and testing with a demo image. I get the following error:
TypeError: [mid:...] is not JSON serializable
Code that I executed:
import io
import os
import json

# Imports the Google Cloud client library
from google.cloud import vision
from google.cloud.vision import types

# Instantiates a client
vision_client = vision.ImageAnnotatorClient()

# The name of the image file to annotate
file_name = os.path.join(
    os.path.dirname(__file__),
    'demo-image.jpg')  # Your image path from current directory

# Loads the image into memory
with io.open(file_name, 'rb') as image_file:
    content = image_file.read()
    image = types.Image(content=content)

# Performs label detection on the image file
response = vision_client.label_detection(image=image)
labels = response.label_annotations

print('Labels:')
for label in labels:
    print(label.description, label.score, label.mid)

with open('labels.json', 'w') as fp:
    json.dump(labels, fp)
The output appears on the screen; however, I do not know exactly how I can save it. Does anyone have any suggestions?
FYI to anyone seeing this in the future, google-cloud-vision 2.0.0 has switched to using proto-plus which uses different serialization/deserialization code. A possible error you can get if upgrading to 2.0.0 without changing the code is:
object has no attribute 'DESCRIPTOR'
Using google-cloud-vision 2.0.0, protobuf 3.13.0, here is an example of how to serialize and de-serialize (example includes json and protobuf)
import io, json
from google.cloud import vision_v1
from google.cloud.vision_v1 import AnnotateImageResponse

with io.open('000048.jpg', 'rb') as image_file:
    content = image_file.read()

image = vision_v1.Image(content=content)

client = vision_v1.ImageAnnotatorClient()
response = client.document_text_detection(image=image)

# serialize / deserialize proto (binary)
serialized_proto_plus = AnnotateImageResponse.serialize(response)
response = AnnotateImageResponse.deserialize(serialized_proto_plus)
print(response.full_text_annotation.text)

# serialize / deserialize json
response_json = AnnotateImageResponse.to_json(response)
response = json.loads(response_json)
print(response['fullTextAnnotation']['text'])
Note 1: proto-plus doesn't support converting to snake_case names, which is supported in protobuf with preserving_proto_field_name=True. So currently there is no way around the field names being converted from response['full_text_annotation'] to response['fullTextAnnotation']
There is a (now closed) feature request for this: googleapis/proto-plus-python#109
Note 2: The Google Vision API doesn't return an x coordinate if x=0. If x doesn't exist, the protobuf will default to x=0. In python vision 1.0.0, using MessageToJson(), these x values weren't included in the JSON, but now with python vision 2.0.0 and .to_json() these values are included as x:0.
Maybe you were already able to find a solution to your issue (if that is the case, I invite you to share it as an answer to your own post too), but in any case, let me share some notes that may be useful for other users with a similar issue:
As you can check using the type() function in Python, response is an object of google.cloud.vision_v1.types.AnnotateImageResponse type, while labels[i] is an object of google.cloud.vision_v1.types.EntityAnnotation type. Neither seems to have any out-of-the-box implementation to transform it to JSON, as you are trying to do, so I believe the easiest way to transform each EntityAnnotation in labels would be to turn them into Python dictionaries, then group them all into an array, and transform this into JSON.
To do so, I have added some simple lines of code to your snippet:
[...]

label_dicts = []  # Array that will contain all the EntityAnnotation dictionaries

print('Labels:')
for label in labels:
    # Write each label (EntityAnnotation) into a dictionary
    dict = {'description': label.description, 'score': label.score, 'mid': label.mid}
    # Populate the array
    label_dicts.append(dict)

with open('labels.json', 'w') as fp:
    json.dump(label_dicts, fp)
There is a library released by Google
from google.protobuf.json_format import MessageToJson
webdetect = vision_client.web_detection(blob_source)
jsonObj = MessageToJson(webdetect)
I was able to save the output with the following function:
# Save output as JSON
def store_json(json_input):
    with open(json_file_name, 'a') as f:
        f.write(json_input + '\n')
And as @dsesto mentioned, I had to define a dictionary. In this dictionary I have defined what types of information I would like to save in my output.
with open(photo_file, 'rb') as image:
    image_content = base64.b64encode(image.read())
    service_request = service.images().annotate(
        body={
            'requests': [{
                'image': {
                    'content': image_content
                },
                'features': [{
                    'type': 'LABEL_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'TEXT_DETECTION',
                    'maxResults': 20,
                }, {
                    'type': 'WEB_DETECTION',
                    'maxResults': 20,
                }]
            }]
        })
The objects in the current Vision library lack serialization functions (although this is a good idea).
It is worth noting that they are about to release a substantially different library for Vision (it is on master of vision's repo now, although not released to PyPI yet) where this will be possible. Note that it is a backwards-incompatible upgrade, so there will be some (hopefully not too much) conversion effort.
That library returns plain protobuf objects, which can be serialized to JSON using:
from google.protobuf.json_format import MessageToJson
serialized = MessageToJson(original)
You can also use something like protobuf3-to-dict
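For example, a small sketch assuming the protobuf3-to-dict package is installed and that response is a plain protobuf message as described above:

# pip install protobuf3-to-dict
import json
from protobuf_to_dict import protobuf_to_dict

# Convert the protobuf response to a plain dict, then dump it as JSON
response_dict = protobuf_to_dict(response)
with open('labels.json', 'w') as fp:
    json.dump(response_dict, fp)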
