HTML scraping td content via XPath yields no data - xpath

I'm trying to extract some data from my college website for a project. This is my code, but the item's fields contain no data.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import scrapy
from vasavi.items import VasaviItem

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domainsite']
    login_page = 'domainsite/index.aspx'
    start_urls = ['domainsite/My_Info.aspx']

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'txtLoginID': 'srichakra', 'txtPWD': '12345'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        if "SRI CHAKRA GOUD" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()

    def parse(self, response):
        print "Parsing"
        item = VasaviItem()
        ur = response.url
        print ur
        item['rollno'] = response.xpath('//*[@id="divStudInfo"]/table/tbody/tr[2]/td[1]/text()').extract()
        item['name'] = response.css('#divStudInfo > table > tbody > tr:nth-child(3) > td:nth-child(2)::text').extract()
        item['Marks'] = response.xpath('//*[@id="divStudySummary"]/table/tbody/tr[3]/td[9]/a/text()').extract()
        yield item
I was not allowed to post more than two URLs here, so I replaced all http://www.domain.com URLs with domainsite.
Output:
2015-01-03 18:45:06+0530 [myspider] INFO: Spider opened
2015-01-03 18:45:06+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-03 18:45:06+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-03 18:45:06+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-03 18:45:07+0530 [myspider] DEBUG: Crawled (200) <GET domainsite> (referer: None)
2015-01-03 18:45:09+0530 [myspider] DEBUG: Redirecting (302) to <GET domainsite/My_Info.aspx> from <POST domainsite/index.aspx>
2015-01-03 18:45:15+0530 [myspider] DEBUG: Crawled (200) <GET domainsite/My_Info.aspx> (referer: domainsite/index.aspx)
2015-01-03 18:45:15+0530 [myspider] DEBUG: Successfully logged in. Let's start crawling!
2015-01-03 18:45:21+0530 [myspider] DEBUG: Crawled (200) <GET domainsite/My_Info.aspx>(referer: domainsite/My_Info.aspx)
Parsing
domainsite/My_Info.aspx
2015-01-03 18:45:21+0530 [myspider] DEBUG: Scraped from <200 domainsite/My_Info.aspx>
{'rollno': [], 'Marks': [], 'name': []}
2015-01-03 18:45:21+0530 [myspider] INFO: Closing spider (finished)
2015-01-03 18:45:21+0530 [myspider] INFO: Stored json feed (1 items) in: vce.json
2015-01-03 18:45:21+0530 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1370,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 92491,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 3, 13, 15, 21, 528000),
'item_scraped_count': 1,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 1, 3, 13, 15, 6, 518000)}
2015-01-03 18:45:21+0530 [myspider] INFO: Spider closed (finished)

As other commenters have noted, you really need to show the HTML input. If I had to guess, though, I'd say tbody is not really present in the page source (browsers add it to the DOM, which is where copied selectors come from) - see e.g. this question or this question. tbody appears in both XPath expressions you show and also in the CSS selector.
To test this hypothesis, skip the tbody element in the expressions:
item['rollno'] = response.xpath('//*[@id="divStudInfo"]/table//tr[2]/td[1]/text()').extract()
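If that turns out to be the cause, the same adjustment would apply to the other two fields as well; a sketch keeping the question's field names (the CSS selector uses a descendant combinator so it matches with or without tbody):
item['name'] = response.css('#divStudInfo > table tr:nth-child(3) > td:nth-child(2)::text').extract()
item['Marks'] = response.xpath('//*[@id="divStudySummary"]/table//tr[3]/td[9]/a/text()').extract()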

Related

Looking for a string inside a register

I'm running an API call in order to get the indexes in Elastic; the output is saved in a registered variable (index_list).
I want to run an additional task only if the policy name can be found in the registered output. The solution I tried is not working (I'm guessing that I need to loop over specific fields inside the registered variable).
In the example below I have a policy ID named what.
API Call
- name: Get ISM_retention policy
  uri:
    url: http://elastic:9200/_opendistro/_ism/policies
    method: GET
    timeout: 180
    body_format: json
  register: index_list
The debug condition that I'm trying to run (in the register output you can see u'policy_id': u'what'):
- debug:
    msg: hello
  when: "'what' in index_list"
The register output - The policy ID is marked
MSG:
{u'status': 200, u'content_length': u'1919', u'cookies': {}, u'url': u'http://elastic:9200/_opendistro/_ism/policies', u'changed': False, u'elapsed': 0, u'failed': False, u'json': {u'total_policies': 3, u'policies': [{u'policy': {u'default_state': u'hot', u'description': u'policy for delete index', u'last_updated_time': 1667984798760, u'error_notification': None, u'states': [{u'transitions': [{u'conditions': {u'min_index_age': u'1d'}, u'state_name': u'delete'}], u'name': u'hot', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'open': {}}]}, {u'transitions': [], u'name': u'delete', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'delete': {}}]}], u'ism_template': [{u'index_patterns': [u'audit-'], u'priority': 100, u'last_updated_time': 1667984798760}], u'schema_version': 15, u'policy_id': u'policy_1'}, u'_id': u'policy_1', u'_seq_no': 104149, u'_primary_term': 1}, {u'policy': {u'default_state': u'hot', u'description': u'kuku index retenation flow', u'last_updated_time': 1668061803458, u'error_notification': None, u'states': [{u'transitions': [{u'conditions': {u'min_index_age': u'1d'}, u'state_name': u'delete'}], u'name': u'hot', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'open': {}}]}, {u'transitions': [], u'name': u'delete', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'delete': {}}]}], u'ism_template': [{u'index_patterns': [u'kuku-'], u'priority': 100, u'last_updated_time': 1668061803458}], u'schema_version': 15, u'policy_id': u'policy_kuku'}, u'_id': u'policy_kuku', u'_seq_no': 143284, u'_primary_term': 1}, {u'policy': {u'default_state': u'hot', u'description': u'what index retenation flow', u'last_updated_time': 1668074528411, u'error_notification': None, u'states': [{u'transitions': [{u'conditions': {u'min_index_age': u'1d'}, u'state_name': u'delete'}], u'name': u'hot', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'open': {}}]}, {u'transitions': [], u'name': u'delete', u'actions': [{u'retry': {u'count': 3, u'delay': u'1m', u'backoff': u'exponential'}, u'delete': {}}]}], u'ism_template': [{u'index_patterns': [u'what-*'], u'priority': 100, u'last_updated_time': 1668074528411}], u'schema_version': 15, u'policy_id': u'what'}, u'_id': u'what', u'_seq_no': 150078, u'_primary_term': 1}]}, u'content_type': u'application/json; charset=UTF-8', u'msg': u'OK (1919 bytes)', u'redirected': False, u'cookies_string': u''}
Given the UTF-8 data in the file for testing
shell> cat index_list.json
{status: 200, content_length: 1919, cookies: {}, url: http://elastic:9200/_opendistro/_ism/policies, changed: False, elapsed: 0, failed: False, json: {total_policies: 3, policies: [{policy: {default_state: hot, description: policy for delete index, last_updated_time: 1667984798760, error_notification: None, states: [{transitions: [{conditions: {min_index_age: 1d}, state_name: delete}], name: hot, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, open: {}}]}, {transitions: [], name: delete, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, delete: {}}]}], ism_template: [{index_patterns: [audit-], priority: 100, last_updated_time: 1667984798760}], schema_version: 15, policy_id: policy_1}, _id: policy_1, _seq_no: 104149, _primary_term: 1}, {policy: {default_state: hot, description: kuku index retenation flow, last_updated_time: 1668061803458, error_notification: None, states: [{transitions: [{conditions: {min_index_age: 1d}, state_name: delete}], name: hot, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, open: {}}]}, {transitions: [], name: delete, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, delete: {}}]}], ism_template: [{index_patterns: [kuku-], priority: 100, last_updated_time: 1668061803458}], schema_version: 15, policy_id: policy_kuk}, _id: policy_kuk, _seq_no: 143284, _primary_term: 1}, {policy: {default_state: hot, description: what index retenation flow, last_updated_time: 1668074528411, error_notification: None, states: [{transitions: [{conditions: {min_index_age: 1d}, state_name: delete}], name: hot, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, open: {}}]}, {transitions: [], name: delete, actions: [{retry: {count: 3, delay: 1m, backoff: exponential}, delete: {}}]}], ism_template: [{index_patterns: [what-*], priority: 100, last_updated_time: 1668074528411}], schema_version: 15, policy_id: what}, _id: what, _seq_no: 150078, _primary_term: 1}]}, content_type: application/json; charset=UTF-8, msg: OK (1919 bytes), redirected: False, cookies_string: }
Read the file and create the dictionary index_list
- include_vars:
    file: index_list.json
    name: index_list
gives
index_list:
  changed: false
  content_length: 1919
  content_type: application/json; charset=UTF-8
  cookies: {}
  cookies_string: null
  elapsed: 0
  failed: false
  json:
    policies:
    - _id: policy_1
      _primary_term: 1
      _seq_no: 104149
      policy:
        default_state: hot
        description: policy for delete index
        error_notification: None
        ism_template:
        - index_patterns:
          - audit-
          last_updated_time: 1667984798760
          priority: 100
        last_updated_time: 1667984798760
        policy_id: policy_1
        schema_version: 15
        states:
        - actions:
          - open: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: hot
          transitions:
          - conditions:
              min_index_age: 1d
            state_name: delete
        - actions:
          - delete: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: delete
          transitions: []
    - _id: policy_kuk
      _primary_term: 1
      _seq_no: 143284
      policy:
        default_state: hot
        description: kuku index retenation flow
        error_notification: None
        ism_template:
        - index_patterns:
          - kuku-
          last_updated_time: 1668061803458
          priority: 100
        last_updated_time: 1668061803458
        policy_id: policy_kuk
        schema_version: 15
        states:
        - actions:
          - open: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: hot
          transitions:
          - conditions:
              min_index_age: 1d
            state_name: delete
        - actions:
          - delete: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: delete
          transitions: []
    - _id: what
      _primary_term: 1
      _seq_no: 150078
      policy:
        default_state: hot
        description: what index retenation flow
        error_notification: None
        ism_template:
        - index_patterns:
          - what-*
          last_updated_time: 1668074528411
          priority: 100
        last_updated_time: 1668074528411
        policy_id: what
        schema_version: 15
        states:
        - actions:
          - open: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: hot
          transitions:
          - conditions:
              min_index_age: 1d
            state_name: delete
        - actions:
          - delete: {}
            retry:
              backoff: exponential
              count: 3
              delay: 1m
          name: delete
          transitions: []
    total_policies: 3
  msg: OK (1919 bytes)
  redirected: false
  status: 200
  url: http://elastic:9200/_opendistro/_ism/policies
Q: "Run task if the policy name can be found in the register output."
A: Declare the list
policies: "{{ index_list.json.policies|map(attribute='_id')|list }}"
gives
policies:
- policy_1
- policy_kuk
- what
Test the list
- debug:
    msg: hello
  when: "'what' in policies"
Example of a complete playbook for testing
- hosts: localhost
  vars:
    policies: "{{ index_list.json.policies|map(attribute='_id')|list }}"
  tasks:
    - include_vars:
        file: index_list.json
        name: index_list
    - debug:
        var: index_list
    - debug:
        var: policies
    - debug:
        msg: hello
      when: "'what' in policies"
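The include_vars step above is only needed to reproduce the data from a file. With the live uri result from the question, the same filter can be applied directly in the condition; a minimal sketch, assuming the register is named index_list as in the question:
- debug:
    msg: hello
  when: "'what' in index_list.json.policies | map(attribute='_id') | list"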

How to groupby with custom function in python cuDF?

I am new to using a GPU for data manipulation and have been struggling to replicate some of the functions in cuDF. For instance, I want to get the mode value for each group in the dataset. In pandas it is easily done with a custom function:
df = pd.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                   'value': [10, 10, 30, 20, 20, 10, 30]})
| group | value |
| ----- | ----- |
| 1 | 10 |
| 2 | 10 |
| 2 | 30 |
| 1 | 20 |
| 3 | 20 |
| 1 | 10 |
| 2 | 30 |
def get_mode(customer):
    freq = {}
    for category in customer:
        freq[category] = freq.get(category, 0) + 1
    key = max(freq, key=freq.get)
    return [key, freq[key]]

df.groupby('group').agg(get_mode)
| group | value |
| ----- | ----- |
| 1 | 10 |
| 2 | 30 |
| 3 | 20 |
However, I just can't seem to replicate the same functionality in cuDF. There seems to be a way to do it, and I have found some examples, but it doesn't work in my case. The following is the function I tried to use with cuDF:
def get_mode(group, mode):
    print(group)
    freq = {}
    for i in range(cuda.threadIdx.x, len(group), cuda.blockDim.x):
        category = group[i]
        freq[category] = freq.get(category, 0) + 1
    mode = max(freq, key=freq.get)
    max_freq = freq[mode]

df.groupby('group').apply_grouped(get_mode, incols=['group'],
                                  outcols=dict(mode=np.float64))
Can someone please help me understand what is going wrong here, and how to fix it? Attempting to run the code above throws the following error:
Error code
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
- Resolution failure for literal arguments:
No implementation of function Function(<function impl_get at 0x7fa8f0500710>) found for signature:
>>> impl_get(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))
There are 2 candidate implementations:
- Of which 1 did not match due to:
Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
With argument(s): '(DictType[undefined,undefined]<iv=None>, int64, int64)':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type DictType[undefined,undefined]<iv=None>
During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
def impl(dct, key, default=None):
castedkey = _cast(key, keyty)
^
raised from /opt/conda/lib/python3.7/site-packages/numba/core/typeinfer.py:1086
- Of which 1 did not match due to:
Overload in function 'impl_get': File: numba/typed/dictobject.py: Line 710.
With argument(s): '(DictType[undefined,undefined]<iv={}>, int64, Literal[int](0))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type DictType[undefined,undefined]<iv={}>
During: typing of argument at /opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py (719)
File "../../opt/conda/lib/python3.7/site-packages/numba/typed/dictobject.py", line 719:
def impl(dct, key, default=None):
castedkey = _cast(key, keyty)
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.DictType'>, 'get') for DictType[undefined,undefined]<iv={}>)
During: typing of call at /tmp/ipykernel_33/2595976848.py (6)
File "../../tmp/ipykernel_33/2595976848.py", line 6:
<source missing, REPL/exec in use?>
During: resolving callee type: type(<numba.cuda.compiler.Dispatcher object at 0x7fa8afe49520>)
During: typing of call at <string> (10)
File "<string>", line 10:
<source missing, REPL/exec in use?>
cuDF builds on top of Numba's CUDA target to enable UDFs. This doesn't support using a dictionary in the UDF, but your use case can be expressed with built-in operations in pandas or cuDF by combining value_counts and drop_duplicates.
import pandas as pd

df = pd.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    df
    .value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
)
print(out[["group", "value"]])
group value
4 3 20
1 2 30
0 1 10
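The same chain is intended to work on a cuDF DataFrame as well (as noted above); a sketch, assuming a cuDF version that provides DataFrame.value_counts, sort_values and drop_duplicates with matching behaviour:
import cudf

# Same data as above, but held on the GPU
gdf = cudf.DataFrame(
    {
        'group': [1, 2, 2, 1, 3, 1, 2],
        'value': [10, 10, 30, 20, 20, 10, 30]
    }
)

out = (
    gdf
    .value_counts()
    .reset_index(name="count")
    .sort_values(["group", "count"], ascending=False)
    .drop_duplicates(subset="group", keep="first")
)
print(out[["group", "value"]])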
This is probably not the answer you are looking for, but I found a workaround for mode. It isn't the best way, doesn't use the GPU, and can be quite slow.
import pandas as pd
import cudf

df = cudf.DataFrame({'group': [1, 2, 2, 1, 3, 1, 2],
                     'value': [10, 10, 30, 20, 20, 10, 30]})
df.to_pandas().groupby('group').agg({'value': pd.Series.mode})

How should I format my dataset to avoid this? "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers"

I'm training DialoGPT on my own dataset, following this tutorial.
When I follow the tutorial exactly with the provided dataset I have no issues; I only changed the example dataset. The only difference between the example and my code is that my dataset is 256,397 lines long compared to the tutorial's 1,906 lines.
I am not sure whether the error pertains to the column labels in my dataset, to an issue in one of the text values on a particular row, or to the size of my data.
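One way to check the row-value possibility, assuming the data is in the pandas DataFrame trn_df that is later passed to main(trn_df, val_df), is to look for cells that are not plain strings (empty cells read in as NaN floats would trigger exactly this ValueError from tokenizer.encode()); a rough sketch:
# Hypothetical sanity check: find rows containing non-string cells
bad_rows = trn_df[~trn_df.applymap(lambda x: isinstance(x, str)).all(axis=1)]
print(len(bad_rows))
print(bad_rows.head())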
06/12/2020 09:23:08 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.tokenization_utils - Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json from cache at cached/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt from cache at cached/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/added_tokens.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/special_tokens_map.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/tokenizer_config.json from cache at None
06/12/2020 09:23:19 - INFO - filelock - Lock 140392381680496 acquired on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:19 - INFO - transformers.file_utils - https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin not found in cache or force_download set to True, downloading to /content/drive/My Drive/Colab Notebooks/cached/tmpj1dveq14
Downloading: 100%
351M/351M [00:34<00:00, 10.2MB/s]
06/12/2020 09:23:32 - INFO - transformers.file_utils - storing https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin in cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:32 - INFO - transformers.file_utils - creating metadata file for cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:33 - INFO - filelock - Lock 140392381680496 released on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:33 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin from cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:39 - INFO - transformers.modeling_utils - Weights of GPT2LMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias']
06/12/2020 09:23:54 - INFO - __main__ - Training/evaluation parameters <__main__.Args object at 0x7fafa60a00f0>
06/12/2020 09:23:54 - INFO - __main__ - Creating features from dataset file at cached
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-523c0d2a27d3> in <module>()
----> 1 main(trn_df, val_df)
7 frames
<ipython-input-11-d6dfa312b1f5> in main(df_trn, df_val)
59 # Training
60 if args.do_train:
---> 61 train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)
62
63 global_step, tr_loss = train(args, train_dataset, model, tokenizer)
<ipython-input-9-3c4f1599e14e> in load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate)
40
41 def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
---> 42 return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)
43
44 def set_seed(args):
<ipython-input-9-3c4f1599e14e> in __init__(self, tokenizer, args, df, block_size)
24 self.examples = []
25 for _, row in df.iterrows():
---> 26 conv = construct_conv(row, tokenizer)
27 self.examples.append(conv)
28
<ipython-input-9-3c4f1599e14e> in construct_conv(row, tokenizer, eos)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
<ipython-input-9-3c4f1599e14e> in <listcomp>(.0)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, **kwargs)
1432 pad_to_max_length=pad_to_max_length,
1433 return_tensors=return_tensors,
-> 1434 **kwargs,
1435 )
1436
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1574 )
1575
-> 1576 first_ids = get_input_ids(text)
1577 second_ids = get_input_ids(text_pair) if text_pair is not None else None
1578
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
1554 else:
1555 raise ValueError(
-> 1556 "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
1557 )
1558
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

ElastAlert no hits

The following are my config.yaml and frequency.yaml
config.yaml
rules_folder: rules_folder
run_every:
  seconds: 15
buffer_time:
  minutes: 1
es_host: localhost
es_port: 9200
writeback_index: elastalert_status
alert_time_limit:
  days: 2
frequency.yaml
es_host: localhost
es_port: 9200
name: Error rule
type: any
index: logstash-*
num_events: 5
timeframe:
  hours: 4
timestamp_field: "@timestamp"
filter:
  term:
    log: "ERROR"
alert:
  "email"
email:
  - "my@email.com"
I am getting no hits
INFO:elastalert:Queried rule Error rule from 2016-09-02 09:33 MDT to 2016-09-02 09:34 MDT: 0 / 0 hits
INFO:elastalert:Ran Error rule from 2016-09-02 09:33 MDT to 2016-09-02 09:34 MDT: 0 query hits, 0 matches, 0 alerts sent
Output of elastalert-test-rule rules_folder/frequency.yaml
INFO:elastalert:Queried rule Error rule from 2016-09-02 09:47 MDT to 2016-09-02 10:32 MDT: 0 / 0 hits
Would have written the following documents to elastalert_status:
elastalert_status - {'hits': 0, 'matches': 0, '@timestamp':
datetime.datetime(2016, 9, 2, 16, 32, 32, 200330, tzinfo=tzutc()), 'rule_name':
'Error rule', 'starttime': datetime.datetime(2016, 9, 1, 16, 32, 32, 123856,
tzinfo=tzutc()), 'endtime': datetime.datetime(2016, 9, 2, 16, 32, 32, 123856,
tzinfo=tzutc()), 'time_taken': 0.07315492630004883}
OK, I was able to resolve the issue by changing the index from index: logstash-* to index: filebeat-*, since that is what I was using to index. Hope this helps someone.
In the output log you can see 2016-09-02 09:33 MDT to 2016-09-02 09:34 MDT: 0 / 0 hits, i.e. the query covers only 1 minute.
Try setting your buffer_time to more than 4 hours (buffer_time > timeframe).
You can reference https://github.com/Yelp/elastalert/issues/668 (reply by Qmando).
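Putting the two answers together, the relevant changes would look something like this (a sketch; the index pattern and buffer_time values are illustrative and depend on your setup):
# frequency.yaml - query the index your shipper actually writes to
index: filebeat-*

# config.yaml - keep buffer_time larger than the rule's 4-hour timeframe
buffer_time:
  hours: 5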

Implementing a custom MIB in a PySNMP agent

I'm having difficulty implementing a custom MIB in a PySNMP agent.
I've started with:
http://pysnmp.sourceforge.net/examples/4.x/v3arch/agent/cmdrsp.html
created my own MIB file, used build-pysnmp-mib to make a Python module and successfully imported the symbol.
I can't see where to go next. I need to somehow mount the imported symbol on the list of served MIBs and provide an implementation. (It's currently a MIB with one read-only INTEGER property.)
The MIB file passes smilint without warnings, but I've had to manually add a missing MibScalar import to the generated module.
MIB:
TRS-MIB DEFINITIONS ::= BEGIN

internet    OBJECT IDENTIFIER ::= { iso(1) org(3) dod(6) 1 }
enterprises OBJECT IDENTIFIER ::= { internet private(4) 1 }
thorcom     OBJECT IDENTIFIER ::= { enterprises 27817 }
trs         OBJECT IDENTIFIER ::= { thorcom 2 }
trsEntry    OBJECT IDENTIFIER ::= { trs 1 }

trsDeliveryTime OBJECT-TYPE
    SYNTAX Integer32
    ACCESS not-accessible
    STATUS mandatory
    DESCRIPTION "Average message delivery time in milliseconds."
    ::= { trsEntry 1 }

END
Code:
#!/usr/bin/env python
# Command Responder
from pysnmp.entity import engine, config
from pysnmp.carrier.asynsock.dgram import udp
#from pysnmp.carrier.asynsock.dgram import udp6
from pysnmp.entity.rfc3413 import cmdrsp, context
from pysnmp.proto.rfc1902 import OctetString
from pysnmp.smi import builder
from pysnmp import debug

debug.setLogger(debug.Debug('all'))

# Create SNMP engine with autogenerated engineID and pre-bound
# to socket transport dispatcher
snmpEngine = engine.SnmpEngine()

# Setup UDP over IPv4 transport endpoint
config.addSocketTransport(
    snmpEngine,
    udp.domainName,
    udp.UdpSocketTransport().openServerMode(('127.0.0.1', 161))
)

# Start of new code
mibBuilder = snmpEngine.msgAndPduDsp.mibInstrumController.mibBuilder
mibSources = mibBuilder.getMibSources() + (
    builder.DirMibSource('.'),
)
mibBuilder.setMibSources(*mibSources)

# Create and put on-line my managed object
deliveryTime, = mibBuilder.importSymbols('TRS-MIB', 'trsDeliveryTime')
Integer32, = snmpEngine.msgAndPduDsp.mibInstrumController.mibBuilder.importSymbols('SNMPv2-SMI', 'Integer32')
MibScalarInstance, = mibBuilder.importSymbols('SNMPv2-SMI', 'MibScalarInstance')

class MyDeliveryTime(Integer32):
    def readGet(self, name, val, idx, (acFun, acCtx)):
        return name, self.syntax.clone(42)

deliveryTimeInstance = MibScalarInstance(
    deliveryTime.name, (0,), deliveryTime.syntax
)
mibBuilder.exportSymbols('TRS-MIB', deliveryTimeInstance=deliveryTimeInstance)  # creating MIB
# End of new code

# v1/2 setup
config.addV1System(snmpEngine, 'test-agent', 'public')

# v3 setup
config.addV3User(
    snmpEngine, 'test-user',
    config.usmHMACMD5AuthProtocol, 'authkey1',
    config.usmDESPrivProtocol, 'privkey1'
)

# VACM setup
config.addContext(snmpEngine, '')
config.addRwUser(snmpEngine, 1, 'test-agent', 'noAuthNoPriv', (1,3,6))  # v1
config.addRwUser(snmpEngine, 2, 'test-agent', 'noAuthNoPriv', (1,3,6))  # v2c
config.addRwUser(snmpEngine, 3, 'test-user', 'authPriv', (1,3,6))  # v3

# SNMP context
snmpContext = context.SnmpContext(snmpEngine)

# Apps registration
cmdrsp.GetCommandResponder(snmpEngine, snmpContext)
cmdrsp.SetCommandResponder(snmpEngine, snmpContext)
cmdrsp.NextCommandResponder(snmpEngine, snmpContext)
cmdrsp.BulkCommandResponder(snmpEngine, snmpContext)

snmpEngine.transportDispatcher.jobStarted(1)  # this job would never finish
snmpEngine.transportDispatcher.runDispatcher()
Generated and amended TRS-MIB.py:
# PySNMP SMI module. Autogenerated from smidump -f python TRS-MIB
# by libsmi2pysnmp-0.1.1 at Fri Aug 31 13:56:45 2012,
# Python version (2, 6, 6, 'final', 0)
# Imported just in case new ASN.1 types would be created
from pyasn1.type import constraint, namedval
# Imports
( Integer, ObjectIdentifier, OctetString, ) = mibBuilder.importSymbols("ASN1", "Integer", "ObjectIdentifier", "OctetString")
( Bits, Integer32, MibIdentifier, MibScalar, TimeTicks, ) = mibBuilder.importSymbols("SNMPv2-SMI", "Bits", "Integer32", "MibIdentifier", "MibScalar", "TimeTicks")
# Objects
internet = MibIdentifier((1, 3, 6, 1))
enterprises = MibIdentifier((1, 3, 6, 1, 4, 1))
thorcom = MibIdentifier((1, 3, 6, 1, 4, 1, 27817))
trs = MibIdentifier((1, 3, 6, 1, 4, 1, 27817, 2))
trsEntry = MibIdentifier((1, 3, 6, 1, 4, 1, 27817, 2, 1))
trsDeliveryTime = MibScalar((1, 3, 6, 1, 4, 1, 27817, 2, 1, 1), Integer32()).setMaxAccess("noaccess")
if mibBuilder.loadTexts: trsDeliveryTime.setDescription("Average message delivery time in milliseconds.")
# Augmentions
# Exports
# Objects
mibBuilder.exportSymbols("TRS-MIB", internet=internet, enterprises=enterprises, thorcom=thorcom, trs=trs, trsEntry=trsEntry, trsDeliveryTime=trsDeliveryTime)
Update:
I now have one error left:
$ snmpget -v2c -c public localhost .1.3.6.1.4.1.27817.2.1.1
Error in packet
Reason: noAccess
Failed object: iso.3.6.1.4.1.27817.2.1.1
The debug is:
DBG: handle_read: transportAddress ('127.0.0.1', 48191) incomingMessage '0,\x02\x01\x01\x04\x06public\xa0\x1f\x02\x04>9\xc4\xa0\x02\x01\x00\x02\x01\x000\x110\x0f\x06\x0b+\x06\x01\x04\x01\x81\xd9)\x02\x01\x01\x05\x00'
DBG: receiveMessage: msgVersion 1, msg decoded
DBG: prepareDataElements: Message:
version='version-2'
community=public
data=PDUs:
get-request=GetRequestPDU:
request-id=1043973280
error-status='noError'
error-index=0
variable-bindings=VarBindList:
VarBind:
name=1.3.6.1.4.1.27817.2.1.1
=_BindValue:
unSpecified=
DBG: value index rebuilt at (1, 3, 6, 1, 6, 3, 18, 1, 1, 1, 2), 1 entries
DBG: processIncomingMsg: looked up securityName MibScalarInstance((1, 3, 6, 1, 6, 3, 18, 1, 1, 1, 3, 116, 101, 115, 116, 45, 97, 103, 101, 110, 116), test-agent) contextEngineId MibScalarInstance((1, 3, 6, 1, 6, 3, 18, 1, 1, 1, 4, 116, 101, 115, 116, 45, 97, 103, 101, 110, 116), �O�c�#��) contextName MibScalarInstance((1, 3, 6, 1, 6, 3, 18, 1, 1, 1, 5, 116, 101, 115, 116, 45, 97, 103, 101, 110, 116), ) by communityName MibScalarInstance((1, 3, 6, 1, 6, 3, 18, 1, 1, 1, 2, 116, 101, 115, 116, 45, 97, 103, 101, 110, 116), public)
DBG: processIncomingMsg: generated maxSizeResponseScopedPDU 65379 securityStateReference 12831470
DBG: prepareDataElements: SM returned securityEngineID SnmpEngineID(hexValue='8004fb857f00163de40e2b7') securityName test-agent
DBG: prepareDataElements: cached by new stateReference 2662033
DBG: receiveMessage: MP succeded
DBG: receiveMessage: PDU GetRequestPDU:
request-id=1043973280
error-status='noError'
error-index=0
variable-bindings=VarBindList:
VarBind:
name=1.3.6.1.4.1.27817.2.1.1
=_BindValue:
unSpecified=
DBG: receiveMessage: pduType TagSet(Tag(tagClass=128, tagFormat=32, tagId=0))
DBG: processPdu: stateReference 2662033, varBinds [(ObjectName(1.3.6.1.4.1.27817.2.1.1), Null(''))]
DBG: getMibInstrum: contextName "", mibInstum <pysnmp.smi.instrum.MibInstrumController instance at 0x7fcbfe3d5e60>
DBG: flipFlopFsm: inputNameVals [(ObjectName(1.3.6.1.4.1.27817.2.1.1), Null(''))]
DBG: flipFlopFsm: state start status ok -> fsmState readTest
DBG: flipFlopFsm: fun <bound method MibTree.readTest of MibTree((1,), None)> failed NoAccessError({'name': (1, 3, 6, 1, 4, 1, 27817, 2, 1, 1), 'idx': 0}) for 1.3.6.1.4.1.27817.2.1.1=Null('')
DBG: flipFlopFsm: state readTest status err -> fsmState stop
DBG: sendRsp: stateReference 2662033, errorStatus noAccess, errorIndex 1, varBinds [(ObjectName(1.3.6.1.4.1.27817.2.1.1), Null(''))]
DBG: returnResponsePdu: PDU ResponsePDU:
request-id=1043973280
error-status='noAccess'
error-index=1
variable-bindings=VarBindList:
VarBind:
name=1.3.6.1.4.1.27817.2.1.1
=_BindValue:
unSpecified=
DBG: prepareResponseMessage: cache read msgID 1043973280 transportDomain (1, 3, 6, 1, 6, 1, 1) transportAddress ('127.0.0.1', 48191) by stateReference 2662033
DBG: prepareResponseMessage: using contextEngineId SnmpEngineID(hexValue='8004fb857f00163de40e2b7') contextName
DBG: generateResponseMsg: recovered community public by securityStateReference 12831470
DBG: generateResponseMsg: Message:
version='version-2'
community=public
data=PDUs:
response=ResponsePDU:
request-id=1043973280
error-status='noAccess'
error-index=1
variable-bindings=VarBindList:
VarBind:
name=1.3.6.1.4.1.27817.2.1.1
=_BindValue:
unSpecified=
DBG: returnResponsePdu: MP suceeded
DBG: receiveMessage: processPdu succeeded
DBG: handle_write: transportAddress ('127.0.0.1', 48191) outgoingMessage '0,\x02\x01\x01\x04\x06public\xa2\x1f\x02\x04>9\xc4\xa0\x02\x01\x06\x02\x01\x010\x110\x0f\x06\x0b+\x06\x01\x04\x01\x81\xd9)\x02\x01\x01\x05\x00'
As for implementing a Managed Object Instance, you have two choices:
1. Load and subclass the MibScalarInstance class, then override its readGet() method to make it return your live value. Then instantiate your new class (make sure to pass it the appropriate OID that identifies it) and pass it to exportSymbols(), so its OID gets registered with the pysnmp Agent.
2. Load the Integer32 class, subclass it and override its clone() method to make it return your live value. Then load the MibScalarInstance class, instantiate it passing the appropriate OID and an instance of your Integer32 subclass, then pass the MibScalarInstance object to exportSymbols(), so its OID gets registered with the pysnmp Agent.
It may make sense to keep all your code in your own MIB module. Take a look at pysnmp/smi/mibs/instances/*.py to get an idea.
From within your Agent app, invoke mibBuilder.loadModules('TRS-MIB') to load your MIB module into the Agent.
In your code you seem to somehow combine the above two approaches: MyDeliveryTime.readGet() will not work, however MyDeliveryTime.clone() or deliveryTimeInstance.readGet() will.
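For example, the second approach could look roughly like the sketch below; the getDeliveryTime() helper is hypothetical and stands in for whatever produces the live value, and the exact clone() signature may differ across pysnmp/pyasn1 versions:
# Rough sketch of approach 2: an Integer32 subclass whose clone() returns a
# live value, wrapped in a MibScalarInstance and exported to the Agent.
MibScalarInstance, = mibBuilder.importSymbols('SNMPv2-SMI', 'MibScalarInstance')
Integer32, = mibBuilder.importSymbols('SNMPv2-SMI', 'Integer32')
deliveryTime, = mibBuilder.importSymbols('TRS-MIB', 'trsDeliveryTime')

class LiveDeliveryTime(Integer32):
    def clone(self, *args, **kwargs):
        # getDeliveryTime() is a hypothetical helper returning the current
        # average delivery time in milliseconds
        return Integer32.clone(self, getDeliveryTime())

deliveryTimeInstance = MibScalarInstance(
    deliveryTime.name, (0,), LiveDeliveryTime(0)
)
mibBuilder.exportSymbols('TRS-MIB', deliveryTimeInstance=deliveryTimeInstance)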
I also faced a similar issue:
I tried putting the MY-MIB.py[c] files in the ~/.pysnmp/mibs folder, but I think what really worked was putting these files in the folder /usr/lib/pytho*/site/pysnmp/smi/mibs/.
I am working on pysnmp-4.3.9.

Resources