Unique Key Error when exporting Scrapy data to Elasticsearch

I'm trying to use the scrapy-elasticsearch pipeline (https://github.com/knockrentals/scrapy-elasticsearch) to put data into Elasticsearch, but I get the error below. I know it is related to the ELASTICSEARCH_UNIQ_KEY setting, which is currently set to 'url', but I have no idea what it should be set to.
Similar posts on here recommend solutions that involve creating a field for the unique key, but I don't understand what this means.
Here's my error message:
    2015-08-05 11:34:40 [scrapy] ERROR: Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/192357732/'],
     'title': [u'Suchen in der vernetzten Welt']}
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
        self.index_item(item)
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
        local_id = hashlib.sha1(item[uniq_key]).hexdigest()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/item.py", line 56, in __getitem__
        return self._values[key]
    KeyError: 'url'

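"Similar posts on here recommend solutions that involve creating a field for the unique key but I don't understand what this means."
Declare a field in your Item with the name you configured in ELASTICSEARCH_UNIQ_KEY. The pipeline looks up item['url'] to build the document id, and the item in your log only has 'link' and 'title' fields, hence the KeyError.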
    import scrapy


    class DemoItem(scrapy.Item):
        url = scrapy.Field()  # ELASTICSEARCH_UNIQ_KEY


    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            demoItem = DemoItem()
            demoItem['url'] = response.url
            yield demoItem
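
To make the failure concrete, here is a minimal sketch of the lookup shown in the traceback (`hashlib.sha1(item[uniq_key])`), using a plain dict in place of the Scrapy item. The `local_id` helper, and the choice to take the first element of a list-valued field, are assumptions made for illustration; this is not the pipeline's actual code.

```python
import hashlib


def local_id(item, uniq_key):
    """Derive a document id from the item's unique-key field (illustrative)."""
    value = item[uniq_key]  # KeyError here if the item lacks that field
    if isinstance(value, (list, tuple)):
        # Assumption: extracted fields may be lists; use the first value.
        value = value[0]
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


# Same shape as the item in the error log: no 'url' field.
item = {
    "link": ["http://www.meetup.com/Search-Meetup-Karlsruhe/events/192357732/"],
    "title": ["Suchen in der vernetzten Welt"],
}

try:
    local_id(item, "url")
except KeyError as exc:
    print("missing unique key:", exc)

# A field that exists on the item works as the unique key.
print(local_id(item, "link"))
```

With an item shaped like this, either add a url field as shown above, or point ELASTICSEARCH_UNIQ_KEY at a field the item already has, such as 'link' (which is a list in the log above, so it may need to be reduced to a single string first).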

Related

How to provide the "result.has_errors()" and "result.has_validation_errors()" attributes when using import_data method for Django import_export

I need to modify a CSV file as it is imported using the django-import-export module. I overrode the import_data method for this, but I get the error 'Dataset' object has no attribute 'has_errors':
    Traceback (most recent call last):
      File "C:\Program Files\Python36\lib\site-packages\django\core\handlers\exception.py", line 34, in inner
        response = get_response(request)
      File "C:\Program Files\Python36\lib\site-packages\django\core\handlers\base.py", line 115, in _get_response
        response = self.process_exception_by_middleware(e, request)
      File "C:\Program Files\Python36\lib\site-packages\django\core\handlers\base.py", line 113, in _get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "C:\Program Files\Python36\lib\site-packages\django\utils\decorators.py", line 130, in _wrapped_view
        response = view_func(request, *args, **kwargs)
      File "C:\Program Files\Python36\lib\site-packages\django\views\decorators\cache.py", line 44, in _wrapped_view_func
        response = view_func(request, *args, **kwargs)
      File "C:\Program Files\Python36\lib\site-packages\django\contrib\admin\sites.py", line 231, in inner
        return view(request, *args, **kwargs)
      File "C:\Program Files\Python36\lib\site-packages\import_export\admin.py", line 316, in import_action
        if not result.has_errors() and not result.has_validation_errors():
    Exception Type: AttributeError at /admin/engine/mtg/import/
    Exception Value: 'Dataset' object has no attribute 'has_errors'
How can I provide result.has_errors() and result.has_validation_errors() when I return my customized dataset, to avoid this error? Here is where I implement the method (in the admin module):
    class ModelResource(resources.ModelResource):
        def import_data(self, dataset, dry_run=False, raise_errors=True,
                        use_transactions=None, collect_failed_rows=False, **kwargs):
            new_dataset = do_stuff(dataset)
            return new_dataset
The documentation states that the first thing the import_data method does is create a "result" instance containing error information. I assume this is what I need, but I don't know how to access it or return it along with my new dataset. [Import Data workflow][1]
[1]: https://django-import-export.readthedocs.io/en/latest/import_workflow.html?highlight=before)import

You shouldn't need to override has_errors() and has_validation_errors(), because that logic is handled for you. The error occurs because the admin calls these methods on whatever import_data() returns (see the last frame of your traceback), and your override returns a Dataset instead of a Result.
Generally you shouldn't override import_data() at all, because this is where the import logic happens. Pass a valid Dataset object into import_data() as the first argument. I suggest having a quick look at the source, because this will clarify what is going on.
If you need to modify the imported data, there are several hooks you can use: subclass the base Resource and add your own logic in them.
This example is based on the django-import-export example app:
    from import_export import resources


    class BookResource(resources.ModelResource):

        def before_import_row(self, row, row_number=None, **kwargs):
            """
            Override to add additional logic.
            """
            pass

        def before_save_instance(self, instance, using_transactions, dry_run):
            """
            Override to add additional logic.
            """
            pass

        class Meta:
            model = Book  # Book is the Django model from the example app
            fields = ('id', 'author_email', 'price')
Then you can call this with:
    import tablib

    rows = [
        ('book1', 'email@example.com', '10.25'),
        ('book2', 'email@example.com', '10.25'),
        ('book1', 'email@example.com', '10.25'),
    ]
    dataset = tablib.Dataset(*rows, headers=['name', 'author_email', 'price'])
    book_resource = BookResource()
    result = book_resource.import_data(dataset)
    print(result.totals)
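
If overriding import_data() really is unavoidable, the contract to preserve is that the caller gets back a result object, not the dataset. The following is a stdlib-only sketch of that contract: `FakeResult` and `FakeModelResource` are hypothetical stand-ins for the library's Result and resources.ModelResource, and `do_stuff` is a placeholder for the asker's transformation. In a real project the same shape would presumably be "transform the dataset, then return super().import_data(...)".

```python
class FakeResult:
    """Hypothetical stand-in for the library's Result object."""

    def __init__(self, rows):
        self.rows = rows

    def has_errors(self):
        return False

    def has_validation_errors(self):
        return False


class FakeModelResource:
    """Hypothetical stand-in for resources.ModelResource."""

    def import_data(self, dataset, dry_run=False, **kwargs):
        # The real method runs the import workflow and returns a Result.
        return FakeResult(list(dataset))


def do_stuff(dataset):
    """Placeholder transformation: strip whitespace from every cell."""
    return [tuple(cell.strip() for cell in row) for row in dataset]


class MtgResource(FakeModelResource):
    def import_data(self, dataset, dry_run=False, **kwargs):
        new_dataset = do_stuff(dataset)
        # Delegate to the parent so the caller still receives a result
        # object with has_errors() / has_validation_errors().
        return super().import_data(new_dataset, dry_run=dry_run, **kwargs)


result = MtgResource().import_data([(" book1 ", "10.25")], dry_run=True)
print(result.has_errors(), result.rows)
```

The hooks shown above remain the simpler option when the change can be made row by row.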

Get metadata for a column with google Sheets API

I have a Google spreadsheet that I connect to and interact with using the google-api-python-client package. Following this description of metadata search, and the links in it for the request body, I have written a function to get the metadata for a range:
    from typing import Union


    def get_metadata_by_range(range_: Union[dict, str]) -> dict:
        if isinstance(range_, str):
            print("String range: ", range_)
            request_body = {"dataFilters":
                            {"a1Range": range_}}
        elif isinstance(range_, dict):
            print("Dict range: ", range_)
            request_body = {"dataFilters":
                            [{"gridRange": range_}]}
        else:
            return None
        request = service.spreadsheets().developerMetadata().search(
            spreadsheetId=SPREADSHEET_ID, body=request_body)
        return request.execute()
However, calling this with a range, either in A1 notation or as a gridRange, causes an error. For example, calling get_metadata_by_range("Metadata!A:A") produces the following traceback:
    String range: Metadata!A:A
    Traceback (most recent call last):
      File "oqc_server/fab/gapc.py", line 82, in <module>
        get_metadata_by_range("Metadata!A:A")
      File "oqc_server/fab/gapc.py", line 69, in get_metadata_by_range
        return request.execute()
      File "/media/kajsa/Storage/Projects/oqc_server/venv/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
        return wrapped(*args, **kwargs)
      File "/media/kajsa/Storage/Projects/oqc_server/venv/lib/python3.7/site-packages/googleapiclient/http.py", line 856, in execute
        raise HttpError(resp, content, uri=self.uri)
    googleapiclient.errors.HttpError: <HttpError 500 when requesting https://sheets.googleapis.com/v4/spreadsheets/1RhheCsI3kHrm8yK2Yio2kAOU4VOzYdz-eK0vjiMY7co/developerMetadata:search?alt=json returned "Internal error encountered."
Any ideas on what is causing this and how to solve it?

As I understand it:
- You want to search for and retrieve the developer metadata for a range using the spreadsheets.developerMetadata.search method of the Sheets API.
- You want to do this with google-api-python-client in Python.
- You can already read and write values in the spreadsheet with the Sheets API.
If my understanding is correct, how about this answer? Please think of it as just one of several possible answers.
Modification points:
- To search the developer metadata by range, set the range in dataFilters[].developerMetadataLookup.metadataLocation.dimensionRange.
- When the range is set in dataFilters[].a1Range or dataFilters[].gridRange, I could confirm that the same error occurs.
Sample script:
The sample script for retrieving the developer metadata from the range is as follows. Before you use it, set the variables spreadsheet_id and sheet_id.
    from googleapiclient.discovery import build

    # creds: authorized credentials, obtained as in your existing script.
    service = build('sheets', 'v4', credentials=creds)

    spreadsheet_id = '###'  # Please set the Spreadsheet ID.
    sheet_id = 0  # Please set the sheet ID (0 here is only a placeholder).

    search_developer_metadata_request_body = {
        "dataFilters": [
            {
                "developerMetadataLookup": {
                    "metadataLocation": {
                        "dimensionRange": {
                            "sheetId": sheet_id,
                            "dimension": "COLUMNS",
                            "startIndex": 0,
                            "endIndex": 1
                        }
                    }
                }
            }
        ]
    }
    request = service.spreadsheets().developerMetadata().search(
        spreadsheetId=spreadsheet_id, body=search_developer_metadata_request_body)
    response = request.execute()
    print(response)
The script above retrieves the developer metadata from column "A" of the sheet identified by sheet_id.
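
Since the question started from an A1-style column range such as "Metadata!A:A", a small helper that builds the same request body from column letters may be convenient. This is only a sketch: `column_index` and `column_metadata_search_body` are hypothetical names, and the sheet ID still has to be looked up separately (the sheet name is not resolved here). It relies on dimensionRange indexes being zero-based with an exclusive endIndex.

```python
def column_index(letters):
    """Convert A1-style column letters ("A", "Z", "AA") to a 0-based index."""
    if not letters:
        raise ValueError("column letters must not be empty")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letters: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def column_metadata_search_body(sheet_id, first_col, last_col=None):
    """Build a developerMetadata.search body for a run of columns."""
    start = column_index(first_col)
    end = column_index(last_col or first_col) + 1  # endIndex is exclusive
    return {
        "dataFilters": [
            {
                "developerMetadataLookup": {
                    "metadataLocation": {
                        "dimensionRange": {
                            "sheetId": sheet_id,
                            "dimension": "COLUMNS",
                            "startIndex": start,
                            "endIndex": end,
                        }
                    }
                }
            }
        ]
    }


# Column "A" of a sheet whose ID is assumed to be 0.
body = column_metadata_search_body(0, "A")
print(body)
```

The resulting dict can be passed as the body argument of the search call in the sample script.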
Note:
- Please adapt the script above to your actual script.
- At the current stage, developer metadata can be attached to the spreadsheet, to each sheet in the spreadsheet, and to rows and columns. Please be careful about this.
References:
- Method: spreadsheets.developerMetadata.search
- Adding Developer Metadata
- DeveloperMetadataLookup
If I misunderstood your question and this was not the direction you wanted, I apologize.

There is a bug with developerMetadata related to a1Range objects being passed as filters.
Edit: I've checked the bug again and a fix has been implemented.

Django REST API: Queryset with arguments

Folks, I am trying to get the details of an entry in a database table through Django REST Framework using URL arguments. I am using the following code to get the value of the "symbol" parameter; however, I get a runtime error on the command line related to base_name. Here is my code:
    class ticker_detail_full_view(viewsets.ModelViewSet):
        serializer_class = ticker_detail_full

        def get_queryset(self):
            ticker = self.kwargs['symbol']
            return ticker.objects.filter(symbol = ticker)
url: http://localhost:8000/qres/ticker_detail_full/?symbol=AMZN
I am getting the following error on the Django command-line:
        router.register('ticker_detail_full', views.ticker_detail_full_view)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rest_framework/routers.py", line 82, in register
        base_name = self.get_default_base_name(viewset)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rest_framework/routers.py", line 161, in get_default_base_name
        assert queryset is not None, 'base_name argument not specified, and could ' \
    AssertionError: base_name argument not specified, and could not automatically determine the name from the viewset, as it does not have a .queryset attribute.

How can I use scrapy to fetch the data through proxy, at the same time I do not use middlewares

I am using Scrapy to fetch data, and I bought some proxies. According to the Scrapy docs, I should write a middleware if I want to use a proxy. My question is: how can I use a proxy without middlewares? I tried this:
    def start_requests(self):
        name = 'my_name'        # placeholder
        password = 'password'   # placeholder
        proxy = 'my proxy'      # placeholder
        return [
            FormRequest(url, formdata={'account': name, 'password': password},
                        meta={'proxy': proxy},
                        callback=self.after_login)
        ]

    def after_login(self, response):
        response.xpath(.....)
I got TypeError: argument of type 'NoneType' is not iterable.
Here is the traceback:
    Traceback (most recent call last):
      File "/home/think/sjs-project/sjs/venv/local/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 542, in connect
        timeout=self._timeout, bindAddress=self._bindAddress)
      File "/home/think/sjs-project/sjs/venv/local/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 482, in connectTCP
        c = tcp.Connector(host, port, factory, timeout, bindAddress, self)
      File "/home/think/sjs-project/sjs/venv/local/lib/python2.7/site-packages/twisted/internet/tcp.py", line 1165, in __init__
        if abstract.isIPv6Address(host):
      File "/home/think/sjs-project/sjs/venv/local/lib/python2.7/site-packages/twisted/internet/abstract.py", line 522, in isIPv6Address
        if '%' in addr:
    TypeError: argument of type 'NoneType' is not iterable
What am I doing wrong? Thanks very much.
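
One possible cause worth checking, judging from the traceback (Twisted is asked to connect to a host of None), is a proxy string without a scheme. Whether that is the actual cause here is an assumption. The stdlib-only sketch below shows how such a string parses and how it could be normalized; `normalize_proxy` and the example address are hypothetical.

```python
from urllib.parse import urlparse


def normalize_proxy(proxy, default_scheme="http"):
    """Return the proxy as a URL with a scheme, e.g. 'http://host:port'."""
    if "://" not in proxy:
        proxy = f"{default_scheme}://{proxy}"
    if urlparse(proxy).hostname is None:
        raise ValueError(f"cannot find a host in proxy {proxy!r}")
    return proxy


# Without a scheme the parser finds no host at all.
print(urlparse("203.0.113.10:8080").hostname)

# With a scheme the host and port are recovered.
fixed = normalize_proxy("203.0.113.10:8080")
print(fixed, urlparse(fixed).hostname, urlparse(fixed).port)
```

The normalized value is what would go into meta={'proxy': ...}. Scrapy's built-in HttpProxyMiddleware is enabled by default and reads that key, so no custom middleware should be needed for this.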

gdata.docs.client.DocsClient

I have the following code, which reads an OAuth2 token from a file and then tries to run a document list query to find a specific spreadsheet that I want to copy. However, no matter what I try, the code either errors out or returns an object containing no document data.
I am using gdata.docs.client.DocsClient, which as far as I can tell is version 3 of the API.
    def CreateClient():
        """Create a Documents List Client."""
        client = gdata.docs.client.DocsClient(source=config.APP_NAME)
        client.http_client.debug = config.DEBUG
        # Authenticate the user with ClientLogin, OAuth, or AuthSub.
        if os.path.exists(config.CONFIG_FILE):
            f = open(config.CONFIG_FILE)
            tok = pickle.load(f)
            f.close()
            client.auth_token = tok.auth_token
        return client
First query attempt:
    def get_doc():
        new_api_query = gdata.docs.client.DocsQuery(title='RichSheet', title_exact=True, show_collections=True)
        d = client.GetResources(q = new_api_query)
This fails with the following stack trace:
    Traceback (most recent call last):
      File "/Users/richard/PycharmProjects/reportone/make_my_report.py", line 83, in <module>
        get_doc()
      File "/Users/richard/PycharmProjects/reportone/make_my_report.py", line 57, in get_doc
        d = client.GetResources(q = new_api_query)
      File "/Users/richard/PycharmProjects/reportone/gdata/docs/client.py", line 151, in get_resources
        **kwargs)
      File "/Users/richard/PycharmProjects/reportone/gdata/client.py", line 640, in get_feed
        **kwargs)
      File "/Users/richard/PycharmProjects/reportone/gdata/docs/client.py", line 66, in request
        return super(DocsClient, self).request(method=method, uri=uri, **kwargs)
      File "/Users/richard/PycharmProjects/reportone/gdata/client.py", line 267, in request
        uri=uri, auth_token=auth_token, http_request=http_request, **kwargs)
      File "/Users/richard/PycharmProjects/reportone/atom/client.py", line 115, in request
        self.auth_token.modify_request(http_request)
      File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 1047, in modify_request
        token_secret=self.token_secret, verifier=self.verifier)
      File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 668, in generate_hmac_signature
        next, token, verifier=verifier)
      File "/Users/richard/PycharmProjects/reportone/gdata/gauth.py", line 629, in build_oauth_base_string
        urllib.quote(params[key], safe='~')))
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1266, in quote
        if not s.rstrip(safe):
    AttributeError: 'bool' object has no attribute 'rstrip'
    Process finished with exit code 1
Then my second attempt:
    def get_doc():
        other = gdata.docs.service.DocumentQuery(text_query='RichSheet')
        d = client.GetResources(q = other)
This returns a ResourceFeed object, but it has no content. I have been through the source code for these functions, but nothing obvious stands out.
Have I missed something? Or should I go back to version 2 of the API?
