I am using BeautifulSoup for scraping website info. Specifically, I want to gather information on patents from a google search (title, inventors, abstract, etc). I have a list of URLs for each patent, but BeautifulSoup is having trouble with certain sites, giving me the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
Below is the error traceback:
Traceback (most recent call last):
soup = BeautifulSoup(the_page,from_encoding='utf-8')
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
self._feed()
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
self.parser.close()
File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597)
File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383)
File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
I checked the encoding of the site, and it claims to be 'utf-8'. I specified this as an input to BeautifulSoup as well. Below is my code:
import urllib, urllib2
from bs4 import BeautifulSoup
#url = 'https://www.google.com/patents/WO2001019016A1?cl=en' # This one works
url = 'https://www.google.com/patents/WO2006016929A2?cl=en' # This one doesn't work
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Somebody',
'location' : 'Somewhere',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
print response.headers['content-type']
print response.headers.getencoding()
soup = BeautifulSoup(the_page,from_encoding='utf-8')
I included two urls. One results in an error, the other works fine (labeled as such in the comments). In both cases, I could print the html to the terminal fine, but BeautifulSoup consistently crashed.
Any recommendations? This is my first usage of BeautifulSoup.
You should encode the string in UTF-8:
soup = BeautifulSoup(the_page.encode('UTF-8'))
Related
i'm following a tutorial course i'm trying to make a web application in python 3.7 that see if there is any bad words in a text file
the text file contains the following text :
-- Houston, we have a problem. (Apollo 13)
-- Mama always said, life is like a box of chocolates. You never know what you are going to get. (Forrest Gump)
-- You cant handle the truth. (A Few Good Men)
-- I believe everything and I believe nothing. (A Shot in the Dark)
import urllib.request
def read_text():
quotes = open (r'C:\Users\M\Desktop\TEMP FILES\movie_quotes.txt')
content_of_file = quotes.read()
print(content_of_file)
quotes.close()
check_profanity(content_of_file)
def check_profanity(text):
connection = urllib.request.urlopen( 'http://www.wdylike.appspot.com/?q='+text)
output = connection.read()
print(output)
connection.close()
read_text()
and i have that error :
Traceback (most recent call last):
File "C:\Users\Mosa Abbas\Desktop\TEMP FILES\ipnd-starter-code-master\ipnd-starter-code-master\stage_3\lesson_3.3_classes\c_profanity_editor\check_profanity.py", line 32, in <module>
read_text()
File "C:\Users\Mosa Abbas\Desktop\TEMP FILES\ipnd-starter-code-master\ipnd-starter-code-master\stage_3\lesson_3.3_classes\c_profanity_editor\check_profanity.py", line 23, in read_text
check_profanity(content_of_file)
File "C:\Users\Mosa Abbas\Desktop\TEMP FILES\ipnd-starter-code-master\ipnd-starter-code-master\stage_3\lesson_3.3_classes\c_profanity_editor\check_profanity.py", line 27, in check_profanity
connection = urllib.request.urlopen( 'http://www.wdylike.appspot.com/?q='+text)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\Mosa Abbas\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
what is that error and what i should do to get rid of it
urllib.error.HTTPError: HTTP Error 400: Bad Request
i'v tried to add this line of code :
url = 'http://www.wdylike.appspot.com/?q='+urllib.parse.quote(text, safe = '/')
and then:
connection = urllib.request.urlopen(url)
and i got
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
i'm a beginner in python and programming web applications so please help :(
When sending a HTTP request through a GET method, it is necessary to escape the GET vars.
So you should probably escape your text before appending it to the URL.
Have a look to question #1695183
i found the solution
I think that happens be because i'm not encoding the string before appending it to the url.
in python3 we should do the following to 'text' before appending it to the url:
text_to_check = urllib.parse.quote_plus(text)
Python2 would be something like this
(urllib was broken into smaller components in python3):
text_to_check = urllib.quote_plus(text_to_check)
This means that, when appending a string with whitespace to the url it will appear as something like "Am+I+cursing%3F" instead of "Am I cursing?".
Full check_profanity() example:
def check_profanity(text_to_check):
text_to_check = urllib.parse.quote_plus(text_to_check)
connection = urlopen(
"http://www.wdylike.appspot.com/?q=" + text_to_check)
output = connection.read()
print(output)
connection.close()
I am trying to use gspread, but I need the library to mesh well with another async library I am using.
After digging through the docs for gspread, I found this function that I can use:
class gspread.Client(auth, session=None)
An instance of this class communicates with Google API.
Parameters:
auth – An OAuth2 credential object. Credential objects are those created by the oauth2client library. https://github.com/google/oauth2client
session – (optional) A session object capable of making HTTP requests while persisting some parameters across requests. Defaults to requests.Session.
Which gives me an optional session parameter. How would I specify the session to use aiohttp?
I wrote a bit of test code, which compiles fine, but running the code crashes.
import aiohttp
import gspread
import random
from oauth2client.service_account import ServiceAccountCredentials
scope = ['https://spreadsheets.google.com/feeds',
'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
c = gspread.authorize(creds)
client = gspread.Client(auth=c, session=aiohttp.ClientSession)
sheet = client.open_by_key('1Hkwo9gSpk3NjgPLPkG8kh0zBNw2nxsYWRw0cVdn0JA0')
ws = sheet.get_worksheet(0)
rcount = ws.row_count
msg = ws.cell(random.randint(1,rcount),1).value
print(msg)
The error message I get is below:
Traceback (most recent call last):
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd_launcher.py", line 118, in <module>
vspd.debug(filename, port_num, '', '', run_as)
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\debugger.py", line 37, in debug
run(address, filename, *args, **kwargs)
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_local.py", line 79, in run_file
run(argv, addr, **kwargs)
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_local.py", line 140, in _run
_pydevd.main()
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_vendored\pydevd\pydevd.py", line 1751, in main
debugger.connect(host, port)
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_vendored\pydevd\pydevd.py", line 1107, in run
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_vendored\pydevd\pydevd.py", line 1114, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "c:\Users\xxxxx\.vscode\extensions\ms-python.python-2018.9.0\pythonFiles\experimental\ptvsd\ptvsd\_vendored\pydevd\_pydev_imps\_pydev_execfile.py", line 25, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "c:\Users\xxxxx\Desktop\Coding\Discord Bot\Testing\test.py", line 14, in <module>
ws = sheet.get_worksheet(0)
File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\site-packages\gspread\models.py", line 141, in get_worksheet
sheet_data = self.fetch_sheet_metadata()
File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\site-packages\gspread\models.py", line 123, in fetch_sheet_metadata
r = self.client.request('get', url, params=params)
File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\site-packages\gspread\client.py", line 73, in request
headers=headers
TypeError: get() missing 1 required positional argument: 'url'
PS C:\Users\xxxxx\Desktop\Coding\Discord Bot>
Any ideas?
It's impossible to use aiohttp with current version of gspread (3.0.1). the gspread library uses synchronous calls and aiohttp uses asynchronous calls.
Please, reconsider to use compatible library like requests or httplib2.
For anyone wondering, once I posted this question, I found that someone made an async wrapper for gspread. Check out the library here and show this guy your appreciation. I sure am!
I'm trying to receive notifications on my (Android) mobile device from an ESP8266 MCU running MicroPython. For this reason I subscribed to a couple of online services exposing some APIs for this task, Pushbullet, and Pushed, and I installed the matching apps on my device.
This is what I'm trying:
Pushbullet:
import json
import urequests
body = "Test Notification"
title = "Pushbullet"
data_sent = {"type": "note", "title": title, "body": body}
API_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxx'
pb_headers = {
'Authorization': 'Bearer ' + API_KEY,
'Content-Type': 'application/json'
}
r = urequests.post(
'https://api.pushbullet.com/v2/pushes',
data=json.dumps(data_sent),
headers=pb_headers
)
print(r)
Error:
ssl_handshake_status: -256
Traceback (most recent call last):
File "<stdin>", line 11, in <module>
File "urequests.py", line 104, in post
File "urequests.py", line 56, in request
OSError: [Errno 5] EIO
Pushed:
import json
import urequests
payload = {
"app_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxx",
"app_secret": "xxxxxxxxxxxxxxxxxxxxxxxxxxx",
"target_type": "app",
"content": "Remote Mic MCU test from ESP8266"
}
r = urequests.post("https://api.pushed.co/1/push", data=payload)
print(r)
Error:
Traceback (most recent call last):
File "<stdin>", line 8, in <module>
File "urequests.py", line 104, in post
File "urequests.py", line 74, in request
TypeError: object with buffer protocol required
Searching for these errors, doesn't get me anywhere useful.
The exact same code snippets work OK on my Linux box (using requests instead of urequests), but I understand that urequests may have some limitations.
Do you have any hint on how to fix this?
The exception message suggests that you pass the type of data which urequests doesn't expect. From my knowledge of how HTTP POST works (see HTTP standard), I know that it accepts octet stream, which in Python would be represented by str or bytes type. Whereas you pass a dictionary.
`
problem is with urequests library, if you can switchback to usocket and close the socket before timeout then works fine. For urequests you need to have a board with spiram.
I am currently trying to overwrite and google spreadsheet with new data using gspread api (version 0.4.1) with sheet.update_cells but it keeps giving me 502 with err msg as follows:
The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds. <ins>That’s all we know.
It seems to be http session problem from the stacktrace:
File "/usr/local/lib/python2.7/dist-packages/gspread/models.py", line 476, in update_cells
self.client.post_cells(self, ElementTree.tostring(feed))
File "/usr/local/lib/python2.7/dist-packages/gspread/client.py", line 303, in post_cells
r = self.session.post(url, data, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/gspread/httpsession.py", line 81, in post
return self.request('POST', url, data=data, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/gspread/httpsession.py", line 67, in request
response = func(url, data=data, headers=request_headers)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 111, in post
return request('post', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 57, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 475, in request
I did a little bit investigation but it seems that the answer sort of varies, so i think i should better just my version here.
The code snippet is really nothing special like something as follows:
sheet = gspread.authorize(credentials).open_by_key(spreadsheet_key).worksheet(worksheet_title)
if not sheet:
return
if not len(new_rows):
return
sheet.resize(len(new_rows), sheet.col_count)
active_range = 'A1:{0}{1}'.format(last_col, len(new_rows))
cell_list = sheet.range(active_range)
k = 0
for row in new_rows:
for field in row:
cell_list[k].value = field
k+=1
sheet.update_cells(cell_list)
where my new_rows are just the new cell value i want to overwrite the sheet with. I don't think it is an authentication issue, as the same code snippet used to work but somehow at some point it keeps giving the 502.
I have the following code, reads oauth2 token form file, then try's to perform a doc's list query to find a specific spreadsheet that I want to copy, however no matter what I try the code either errors out or returns with an object containing no document data.
I am using gdata.docs.client.DocsClient which as far as I can tell is version 3 of the API
def CreateClient():
"""Create a Documents List Client."""
client = gdata.docs.client.DocsClient(source=config.APP_NAME)
client.http_client.debug = config.DEBUG
# Authenticate the user with CLientLogin, OAuth, or AuthSub.
if os.path.exists(config.CONFIG_FILE):
f = open(config.CONFIG_FILE)
tok = pickle.load(f)
f.close()
client.auth_token = tok.auth_token
return client
1st query attempt
def get_doc():
new_api_query = gdata.docs.client.DocsQuery(title='RichSheet', title_exact=True, show_collections=True)
d = client.GetResources(q = new_api_query)
this fails with the following stack trace
Traceback (most recent call last):
File "/Users/richard/PycharmProjects/reportone/make_my_report.py", line 83, in <module>
get_doc()
File "/Users/richard/PycharmProjects/reportone/make_my_report.py", line 57, in get_doc
d = client.GetResources(q = new_api_query)
File "/Users/richard/PycharmProjects/reportone/gdata/docs/client.py", line 151, in get_resources
**kwargs)
File "/Users/richard/PycharmProjects/reportone/gdata/client.py", line 640, in get_feed
**kwargs)
File "/Users/richard/PycharmProjects/reportone/gdata/docs/client.py", line 66, in request
return super(DocsClient, self).request(method=method, uri=uri, **kwargs)
File "/Users/richard/PycharmProjects/reportone/gdata/client.py", line 267, in request
uri=uri, auth_token=auth_token, http_request=http_request, **kwargs)
File "/Users/richard/PycharmProjects/reportone/atom/client.py", line 115, in request
self.auth_token.modify_request(http_request)
File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 1047, in modify_request
token_secret=self.token_secret, verifier=self.verifier)
File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 668, in generate_hmac_signature
next, token, verifier=verifier)
File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 629, in build_oauth_base_string
urllib.quote(params[key], safe='~')))
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1266, in quote
if not s.rstrip(safe):
AttributeError: 'bool' object has no attribute 'rstrip'
Process finished with exit code 1
then my second attempt
def get_doc():
other = gdata.docs.service.DocumentQuery(text_query='RichSheet')
d = client.GetResources(q = other)
this returns an ResourceFeed object, but has no content. I have been through the source code for these function but thing are not any obvious.
Have i missed something ? or should i go back to version 2 of the api ?