I am about to be bumped into the next Mixpanel pricing plan because I have too many people profiles, and I would like to delete some old users first.
Is there a simple way/script/api to bulk delete old users?
I've written two scripts that may come in handy: mixpanel-engage-query and mixpanel-engage-post.
Using the first script (query) you can query your People Data and get a list of profiles, e.g. all users who have $last_seen set to a date older than X months.
Using the second script (post) you can perform actions in batch on those profiles, for example deleting them. See the README for an example of how to perform a batch delete.
Yes there is. Looking at the HTTP spec you'll find the following.
$delete
string Permanently delete the profile from Mixpanel, along with all of
its properties. The value is ignored - the profile is determined by
the $distinct_id from the request itself.
// This removes the user 13793 from Mixpanel
{
"$token": "36ada5b10da39a1347559321baf13063",
"$distinct_id": "13793",
"$delete": ""
}
Batch requests
Both the events endpoint at http://api.mixpanel.com/track/ and the profile update endpoint at http://api.mixpanel.com/engage/ accept batched updates. To send a batch of messages to an endpoint, you should use a POST instead of a GET request. Instead of sending a single JSON object as the data query parameter, send a JSON list of objects, base64 encoded, as the data parameter of an application/x-www-form-urlencoded POST request body.
// Here's a list of events
[
{
"event": "Signed Up",
"properties": {
"distinct_id": "13793",
"token": "e3bc4100330c35722740fb8c6f5abddc",
"Referred By": "Friend",
"time": 1371002000
}
},
{
"event": "Uploaded Photo",
"properties": {
"distinct_id": "13793",
"token": "e3bc4100330c35722740fb8c6f5abddc",
"Topic": "Vacation",
"time": 1371002104
}
}
]
Base64 encoded, the list becomes:
Ww0KICAgIHsNCiAgICAgICAgImV2ZW50IjogIlNpZ25lZCBVcCIsDQogICAgICAgICJwcm9wZXJ0aWVzIjogew0KICAgICAgICAgICAgImRpc3RpbmN0X2lkIjogIjEzNzkzIiwNCiAgICAgICAgICAgICJ0b2tlbiI6ICJlM2JjNDEwMDMzMGMzNTcyMjc0MGZiOGM2ZjVhYmRkYyIsDQogICAgICAgICAgICAiUmVmZXJyZWQgQnkiOiAiRnJpZW5kIiwNCiAgICAgICAgICAgICJ0aW1lIjogMTM3MTAwMjAwMA0KICAgICAgICB9DQogICAgfSwNCiAgICB7DQogICAgICAgICAiZXZlbnQiOiAiVXBsb2FkZWQgUGhvdG8iLA0KICAgICAgICAgICJwcm9wZXJ0aWVzIjogew0KICAgICAgICAgICAgICAiZGlzdGluY3RfaWQiOiAiMTM3OTMiLA0KICAgICAgICAgICAgICAidG9rZW4iOiAiZTNiYzQxMDAzMzBjMzU3MjI3NDBmYjhjNmY1YWJkZGMiLA0KICAgICAgICAgICAgICAiVG9waWMiOiAiVmFjYXRpb24iLA0KICAgICAgICAgICAgICAidGltZSI6IDEzNzEwMDIxMDQNCiAgICAgICAgICB9DQogICAgfQ0KXQ==
So the body of a POST request to send the events as a batch is:
data=Ww0KICAgIHsNCiAgICAgICAgImV2ZW50IjogIlNpZ25lZCBVcCIsDQogICAgICAgICJwcm9wZXJ0aWVzIjogew0KICAgICAgICAgICAgImRpc3RpbmN0X2lkIjogIjEzNzkzIiwNCiAgICAgICAgICAgICJ0b2tlbiI6ICJlM2JjNDEwMDMzMGMzNTcyMjc0MGZiOGM2ZjVhYmRkYyIsDQogICAgICAgICAgICAiUmVmZXJyZWQgQnkiOiAiRnJpZW5kIiwNCiAgICAgICAgICAgICJ0aW1lIjogMTM3MTAwMjAwMA0KICAgICAgICB9DQogICAgfSwNCiAgICB7DQogICAgICAgICAiZXZlbnQiOiAiVXBsb2FkZWQgUGhvdG8iLA0KICAgICAgICAgICJwcm9wZXJ0aWVzIjogew0KICAgICAgICAgICAgICAiZGlzdGluY3RfaWQiOiAiMTM3OTMiLA0KICAgICAgICAgICAgICAidG9rZW4iOiAiZTNiYzQxMDAzMzBjMzU3MjI3NDBmYjhjNmY1YWJkZGMiLA0KICAgICAgICAgICAgICAiVG9waWMiOiAiVmFjYXRpb24iLA0KICAgICAgICAgICAgICAidGltZSI6IDEzNzEwMDIxMDQNCiAgICAgICAgICB9DQogICAgfQ0KXQ==
Both endpoints will accept up to 50 messages in a single batch. Usually, batch requests will have a "time" property associated with events, or a "$time" attribute associated with profile updates.
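To make the batching rules above concrete, here is a minimal Python sketch that turns a list of distinct ids into POST bodies of $delete messages, at most 50 per body. The function names are my own, and actually sending each body to http://api.mixpanel.com/engage/ is left out; treat it as an illustration of the encoding steps rather than a finished client.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode

BATCH_LIMIT = 50  # maximum messages per batch, per the documentation quoted above


def delete_message(token, distinct_id):
    """Build one $delete profile update; the value of "$delete" is ignored."""
    return {"$token": token, "$distinct_id": distinct_id, "$delete": ""}


def build_delete_bodies(token, distinct_ids, batch_size=BATCH_LIMIT):
    """Return one form-encoded POST body per batch of at most batch_size ids."""
    bodies = []
    for start in range(0, len(distinct_ids), batch_size):
        chunk = distinct_ids[start:start + batch_size]
        messages = [delete_message(token, d) for d in chunk]
        encoded = base64.b64encode(json.dumps(messages).encode("utf-8"))
        bodies.append(urlencode({"data": encoded.decode("ascii")}))
    return bodies


def decode_body(body):
    """Inverse of the encoding step, handy for checking a body before sending."""
    return json.loads(base64.b64decode(parse_qs(body)["data"][0]))
```

Each returned string is meant to be the body of one application/x-www-form-urlencoded POST. Running decode_body on a body before sending it is a cheap way to confirm which profiles a batch would delete.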
Using the Mixpanel-api python Module
pip install mixpanel-api
This script will delete any profile that hasn't been seen since January 1st, 2019:
from mixpanel_api import Mixpanel
mixpanel = Mixpanel('MIXPANEL_SECRET', token='MIXPANEL_TOKEN')
deleted_count = mixpanel.people_delete(query_params={ 'selector' : 'user["$last_seen"]<"2019-01-01T00:00:00"'})
print(deleted_count)
Replace MIXPANEL_SECRET and MIXPANEL_TOKEN with your own project's API secret and token.
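If you would rather delete everyone not seen in the last N days than hard-code a date, the selector string can be built from a cutoff. This is a small sketch; the helper names are hypothetical and the selector format is simply copied from the script above.

```python
from datetime import datetime, timedelta


def cutoff_days_ago(days, now=None):
    """Return the datetime `days` days before `now` (default: the current time)."""
    now = now or datetime.now()
    return now - timedelta(days=days)


def last_seen_selector(cutoff):
    """Build a selector matching profiles whose $last_seen is before `cutoff`."""
    return 'user["$last_seen"]<"{}"'.format(cutoff.strftime("%Y-%m-%dT%H:%M:%S"))
```

The resulting string would be passed as the 'selector' value in query_params, in place of the literal date used above.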
Install the Mixpanel Python API module:
pip install mixpanel-api
Create a Python file named delete_people.py, paste in the code below, and change it to match your project configuration, i.e. secret, token, filter params, etc.
from mixpanel_api import Mixpanel
from datetime import datetime

now = datetime.now()
current_time = now.strftime("%Y_%m_%d_%H_%M_%S")

if __name__ == '__main__':
    # Mixpanel project credentials
    credentials = {
        'API_secret': '<Your API Secret>',
        'token': '<Your API Token>',
    }
    # first we are going to make a Mixpanel object instance
    mlive = Mixpanel(credentials['API_secret'])
    # Mixpanel object with token to delete people
    ilive = Mixpanel(credentials['API_secret'], credentials['token'])
    # Prepare parameters for the delete condition
    # <filter_by_cohort_here> - get it from the Mixpanel explore UI, from the engage API XHR call (https://mixpanel.com/api/2.0/engage)
    parameters = {'filter_by_cohort': '<filter_by_cohort_here>', 'include_all_users': 'true', 'limit': 0}
    # Back up data before deleting
    print("\n Creating backup of data\n")
    mlive.export_people('backup_people_' + current_time + '.json', parameters)
    # Delete people using the parameters filter
    print("\n Backup Completed! Deleting Data\n")
    ilive.people_delete(query_params=parameters)
    print("\n Data Deleted Successfully\n")
Run the command below from a terminal:
python delete_people.py
Note: the people_delete method of mixpanel-api will automatically create a backup_timestamp.json file in the same directory as this script.
Does AWS apigateway change http body? How can I stop it from doing this?
My application:
(1) A front end "UI" that sends a "http request" using "POST method" that contains a "zip file" in "body" through "form-data".
(2) AWS "apigateway" receives this request and forward it to "Lambda Proxy"
(3) AWS "Lambda" implemented by python coding receives this request and decompresses this zip file to a temporary folder.
The problem I'm facing:
(1) and (2) work fine, but in (3) the Python program in Lambda fails to decompress the file.
My finding:
It seems that when sending from the "UI" the body contains the binary data of the zip file
like below:
"PK\x03\x04\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00x2.txtPK\x03\x04\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00x1.txtPK\x01\x02\x14\x00\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00x2.txtPK\x01\x02\x14\x00\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00$\x00\x00\x00x1.txtPK\x05\x06\x00\x00\x00\x00\x02\x00\x02\x00h\x00\x00\x00H\x00\x00\x00\x00\x00"
But in (3), if the Python code in Lambda simply returns a response like the one below:
response = {
"statusCode": 200,
"headers": {
"lambda-response": str(event["body"])
},
"body": "",
"isBase64Encoded": False
}
return response
we find that the binary data in the body is no longer the same; it seems that API Gateway has changed the content
from:
"PK\x03\x04\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00x2.txtPK\x03\x04\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00x1.txtPK\x01\x02\x14\x00\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00x2.txtPK\x01\x02\x14\x00\n\x00\x00\x00\x00\x00\xd6;TO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00$\x00\x00\x00x1.txtPK\x05\x06\x00\x00\x00\x00\x02\x00\x02\x00h\x00\x00\x00H\x00\x00\x00\x00\x00"
into:
"PK\u0003\u0004\n\u0000\u0000\u0000\u0000\u0000\ufffd;TO\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0006\u0000\u0000\u0000x2.txtPK\u0003\u0004\n\u0000\u0000\u0000\u0000\u0000\ufffd;TO\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0006\u0000\u0000\u0000x1.txtPK\u0001\u0002\u0014\u0000\n\u0000\u0000\u0000\u0000\u0000\ufffd;TO\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000
\u0000\u0000\u0000\u0000\u0000\u0000\u0000x2.txtPK\u0001\u0002\u0014\u0000\n\u0000\u0000\u0000\u0000\u0000\ufffd;TO\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0006\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000
\u0000\u0000\u0000$\u0000\u0000\u0000x1.txtPK\u0005\u0006\u0000\u0000\u0000\u0000\u0002\u0000\u0002\u0000h\u0000\u0000\u0000H\u0000\u0000\u0000\u0000\u0000\r\n"
This is weird. What can I do to stop it?
(2019/12/17 update) Below is the Lambda code I'm using.
import json     # to decode json
import os       # file IO
import shutil   # file IO (use this to recursively force remove a directory)
import zipfile  # to open and extract zip archives

print('Loading function')


def decompress_zip_file(src_file_path, dest_dir_path):
    '''
    Decompress a zip file into a directory.
    Args:
        src_file_path (String): source zip file's path.
        dest_dir_path (String): the destination of the output directory.
    Returns:
        isSuccess (bool): the operation is successful or not.
    '''
    error_msg = "Nothing."
    try:
        if(os.path.isdir(dest_dir_path)):
            shutil.rmtree(dest_dir_path)
        with zipfile.ZipFile(src_file_path, 'r') as zip_ref:
            zip_ref.extractall(dest_dir_path)
    except Exception as ep:
        error_msg = "Error in decompress_zip_file(), ep={}:{}".format(type(ep).__name__, str(ep))
        print(error_msg)
        return (False, error_msg)
    return (True, error_msg)


def decompress_zip_file_from_content_in_binary(src_file_in_binary, dest_dir_path):
    '''
    Decompress a zip file content into a directory.
    Args:
        src_file_in_binary (byte array): source zip file's content in binary format.
        dest_dir_path (String): the destination of the output directory.
    Returns:
        isSuccess (bool): the operation is successful or not.
    '''
    # write the obtained binary data into a tmp zip file
    tmp_file_path = "/tmp/tmp.zip"
    if(os.path.isfile(tmp_file_path)):
        os.remove(tmp_file_path)
    output_file = open(tmp_file_path, 'wb')
    output_file.write(src_file_in_binary)
    output_file.close()
    (isSuccess, error_msg) = decompress_zip_file(tmp_file_path, dest_dir_path)
    return (isSuccess, error_msg)


def convert_from_http_body_encoding_to_local_binary(extracted_file_from_http_body_str):
    '''
    Convert the file (in binary string format) from event['body'] encoding to local binary encoding.
    Args:
        extracted_file_from_http_body_str (string): the event['body'] file (in binary string format).
    Returns:
        zipfile_binary1 (binary array): the conversion result.
    '''
    zipfile_binary1 = bytes(extracted_file_from_http_body_str, encoding = "ascii")  # convert into a zipfile in binary format
    return zipfile_binary1


def extract_zipfile_binary_from_body(body_str):
    '''
    Extract the zipfile (in binary format) from event['body'] string.
    Args:
        body_str (string): the event['body'] string.
    Returns:
        (binary array): the conversion result.
    '''
    retValue = ""
    tmpArray = body_str.split("application/zip")  # split the content based on MIME part field data; cut the head
    if(len(tmpArray) > 1):
        retValue += "entered-Lv1."
        tmpArray = tmpArray[1].split("PK")  # split the content based on zip file header.
        if(len(tmpArray) > 1):
            retValue += "entered-Lv2."
            zipfile_str = "PK" + 'PK'.join(tmpArray[1:])  # add back the zip file header
            tmpArray = zipfile_str.split("------WebKitFormBoundary")  # split the content based on MIME part field data; cut the tail
            if(len(tmpArray) > 1):
                zipfile_str = tmpArray[0]
                zipfile_binary = convert_from_http_body_encoding_to_local_binary(zipfile_str)
                retValue = zipfile_binary
    return retValue


def handler(event, context):
    '''Provide an event that contains the following keys:
    - operation: one of the operations in the operations dict below
    - payload: a parameter to pass to the operation being performed
    '''
    # set the mapping table for "operation" x "return value"
    operations = {
        'unzip': lambda x: decompress_zip_file_from_content_in_binary(**x),  # unzip an uploaded file
        'ping': lambda x: 'pong'  # respond to ping req.
    }
    # because we use "Lambda Proxy", API Gateway forwards the whole packet to Lambda without resolving it.
    event_headers = event["headers"]
    operation = event_headers['operation']
    event_body = event["body"]
    if(operation == 'unzip'):
        src_file_in_binary = extract_zipfile_binary_from_body(event_body)
        payload_json = {}
        payload_json['src_file_in_binary'] = src_file_in_binary
        payload_json['dest_dir_path'] = "/tmp/tmp_zipfile_output"
        event_headers["payload"] = payload_json
    if operation in operations:
        responseBody = operations[operation](event_headers.get('payload'))
        response = {
            "statusCode": 200,
            "headers": {
                "lambda-response": str(responseBody)  # API Gateway will forward the header to the front end.
            },
            "body": "",
            "isBase64Encoded": False
        }
        return response
    else:
        raise ValueError('Unrecognized operation "{}"'.format(operation))
Below is a response from AWS support, which looks good to me. I'm leaving it here so that people can find the solution to this issue in the future.
=====================Below is the response from AWS support ==================
Hi,
Thank you for contacting AWS Premium Support. I am Jyoti, and I will assist you with this case today.
From the case correspondence, I understand that you are concerned that API Gateway modifies
the binary data payload before proxying to your Lambda function. Please correct me if my understanding is wrong.
Expected Behaviour:
API gateway does modify the binary data payload into UTF-8 encoded JSON strings if
the API is configured at its default settings. Hence this is an expected behaviour.
Kindly note, as per [1], we must configure the API to support binary payloads for
our API in API Gateway. API Gateway can not send binary as is, since it has to send
a JSON body to the lambda proxy. Hence, it encodes the data/payload in UTF-8 by default.
Solution:
In order to overcome the aforementioned challenge, we need to add the desired
binary media types (application/zip in this case) to the binaryMediaTypes list
on the RestApi resource's settings page. For further information on how to achieve
this, please refer here --> [2]. If this property is not defined, the payloads
are handled as UTF-8 encoded JSON strings as mentioned in [1].
This is why the file in your request looks UTF-8 encoded. After configuring the API,
the event received by the Lambda would be a Base64-encoded string.
If you want to conduct operations on this object (the encoded request body or 'event["body"]'),
then you may decode the base64-encoded string to its original binary form by following
the below lines (in case of python runtime) :
import base64
coded_string = str(event["body"])
base64.b64decode(coded_string)
Troubleshooting:
I tried to replicate your setup in my environment. Instead of the frontend 'UI' of the application,
I used Postman as a client, while the rest of the setup (API Gateway and Lambda) are similar.
I am making a POST request to my API from Postman, with the request headers 'Content-Type' and 'Accept',
both set to the value 'application/zip', which is the binary media type that is being sent and
also being expected in the response. My API has been configured to support binary media types being
passed in the request body. I have added 'application/zip' in the binaryMediaTypes list for the API.
Finally, in the Lambda function I am decoding the base64-encoded request body (i.e. event["body"])
to its original binary form by using the base64 library (in python).
If you still want to confirm the consistency of your request's form-data out by returning the binary
data in your response, you can refer to the following snippet:
response = {
    'isBase64Encoded': True,  # Ensure the body is Base64-encoded
    'statusCode': 200,
    'headers': { "Content-Type": "application/zip" },  # Define the Content-Type
    'body': event["body"]  # Response body returns the Base64-encoded value
}
We set the isBase64Encoded parameter to True and API Gateway automatically decodes the
response body depending on the Content-Type (i.e. the binary data/media type) that the
client (in my case Postman), is set to receive (i.e. application/zip). Kindly note, the 'Accept'
header that I had sent in my header, is to validate that the response body contains the binary
data type, the request was made for.
The above response body was the same as the request body binary data that was first sent
through the API, in my setup.
Hope I have addressed your concerns. However, if you still need help with the implementation,
please contact us again and I will be happy to assist you.
References:
=-=-=-=--=-=-=-=-=-=
[1] Support Binary Payloads in API Gateway: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings.html
[2] Enable Binary Support Using the API Gateway Console: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-payload-encodings-configure-with-console.html
Best regards,
Jyoti Prakash P.
Amazon Web Services
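For anyone who wants to see the two behaviours described in the response above side by side, here is a small Python sketch. The helper names are mine, and the event dict only mimics the two fields of a Lambda proxy event that matter here ("body" and "isBase64Encoded"). It shows why a lossy UTF-8 decode turns \xd6 into \ufffd, and how the zip can be read straight from memory once the body arrives Base64-encoded.

```python
import base64
import io
import zipfile

# A few bytes from the zip in the question; \xd6 is not valid UTF-8 on its own.
original = b"PK\x03\x04\n\x00\xd6;TO"
# Decoding with replacement is lossy: the invalid byte becomes U+FFFD.
mangled = original.decode("utf-8", errors="replace")


def body_bytes(event):
    """Return the request body as bytes, honouring isBase64Encoded."""
    body = event["body"]
    if event.get("isBase64Encoded"):
        return base64.b64decode(body)
    return body.encode("utf-8")


def zip_member_names(event):
    """Open the uploaded zip from memory and list its members."""
    with zipfile.ZipFile(io.BytesIO(body_bytes(event))) as archive:
        return archive.namelist()
```

Once a byte has been replaced by U+FFFD there is no way to recover it in the Lambda code, which is why the fix has to be made in the API Gateway settings and not in the function.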
2019/12/20 update
I realized that my content type is actually multipart rather than application/zip, so I changed the setting again and then it worked.
Below is the help from AWS support. Many thanks to them.
Hi,
Thanks a lot for elaborating your application flow and the logs. I understand now that your HTTP Request header 'content-type' is set to 'multipart/form-data'. I agree that for a web form to upload a file it is quite common to set content type as form-data and AWS API Gateway does support it. You would like to know if you could prevent UTF-8 encoding without changing the front end code. Please correct me if my understanding is wrong.
For the ease of discussion, I would like to separate the approach of troubleshooting for the HTTP request and response.
For the request to the API:
Please add 'multipart/form-data' as one of the values in the binaryMediaTypes list in your API settings page in the API Gateway console. You would not have to alter your code or HTTP request or any of its headers. Kindly note that to handle binary media/data in API Gateway, the HTTP Request Content-Type header must match one of the values in the binaryMediaTypes list.
In your use case, if you want to send the binary media back in a response for your request, the HTTP Request 'Content-Type' and 'Accept' headers, the binaryMediaTypes value of the API and the HTTP Response 'Content-Type' must all be set to 'multipart/form-data'. I tried the above and it worked for me with Postman Client. The 'boundary' directive is set up by Postman automatically if the HTTP Request 'Content-Type' is set to 'multipart/form-data'. Hence, you would only have to add 'multipart/form-data' to the binaryMediaTypes list. Please have a look at my HTTP request, below:
POST /stg-with-logs HTTP/1.1
Host: <some-api-id>.execute-api.us-east-1.amazonaws.com
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW
Accept: multipart/form-data
Cache-Control: no-cache
Postman-Token: 123b64f9-5669-f794-b9df-34a7561e9708
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="File"; filename="archive.zip"
Content-Type: application/zip
------WebKitFormBoundary7MA4YWxkTrZu0gW--
For the response from the API:
I noticed while going through your API Gateway Logs, the header 'isBase64Encoded' was not set. Kindly set that to true. API Gateway automatically decodes any base64-encoded string in the body of your HTTP response, if 'isBase64Encoded' is set to true. Please have a look at the HTTP Response from my lambda below:
(a6729f56-b245-45a4-9ac4-7e00b23c8957) Endpoint response body before transformations:
{
"isBase64Encoded": true,
"statusCode": 200,
"headers": {
"Content-Type": "multipart/form-data",
"Accpet": "multipart/form-data"
},
"body": "LS0tLS0tV2ViS2l0Rm9ybUJvdW5kYXJ5SmxkSW1aV1lHczlSTndPWQ0KQ29udGVudC1EaXNwb3NpdGlvbjogZm9ybS1kYXRhOyBuYW1lPSJGaWxlIjsgZmlsZW5hbWU9ImFyY2hpdmUuemlwIg0KQ29udGVudC1UeXBlOiBhcHBsaWNhdGlvbi96aXANCg0KUEsDBBQAAAAIAKF4kE9/Mo7/XgAAAJcAAAAaABwASGVsbG8tV29ybGQtNjY3MzMxNTI4MS50eHRVVAkAA8ZP910SUPdddXgLAAEEHZHreQTMewNxNY1BDgIxDAPvvIVPOY3SEC+9WCrfJ13EZWTNHKwKkzMmxIp5dpsnFMlqrjzBF/SKxCW2/8dl3ttGGjTqnkdMG+Wwj96jA3/YJsC2QF9iesuLUXPfv80KrpaVYeDjC1BLAQIeAxQAAAAIAKF4kE9/Mo7/XgAAAJcAAAAaABgAAAAAAAEAAACkgQAAAABIZWxsby1Xb3JsZC02NjczMzE1MjgxLnR4dFVUBQADxk/3XXV4CwABBB2R63kEzHsDcVBLBQYAAAAAAQABAGAAAACyAAAAAAANCi0tLS0tLVdlYktpdEZvcm1Cb3VuZGFyeUpsZEltWldZR3M5Uk53T1kNCkNvbnRlbnQtRGlzcG9zaXRpb246IGZvcm0tZGF0YTsgbmFtZT0iVGVzdCBEYXRhIg0KDQpUZXN0aW5nIEJvdW5kYXJ5IGluIG11bHRpcGFydC9mb3JtLWRhdGENCi0tLS0tLVdlYktpdEZvcm1Cb3VuZGFyeUpsZEltWldZR3M5Uk53T1ktLQ0K"
}
Along with this correspondence I am attaching my API Gateway Swagger file and Lambda function code for your reference. The setup worked fine for me and I was able to return the binary payload upon making an HTTP Request. If you want to test it out in your environment, please set the appropriate credentials and lambda uri in the Swagger file.
Hope this addresses your concern. However, if the issue still persists or you have any further questions, please contact us again and I will be happy to assist you.
To see the file named 'binaryPost-stg-with-logs-oas30-apigateway.yaml,python-binary-response.py' included with this correspondence, please use the case link given below the signature.
Best regards,
Jyoti Prakash P.
Amazon Web Services
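With 'multipart/form-data' registered as a binary media type, the Lambda receives the whole multipart body Base64-encoded, so the string splitting on "PK" and "WebKitFormBoundary" in my original code is no longer needed. Below is a minimal sketch of one way to pull the uploaded file out using only the standard library's email parser. The function names are hypothetical, and it assumes the proxy event carries the original Content-Type header (including the boundary) under event["headers"].

```python
import base64
from email.parser import BytesParser
from email.policy import default


def request_body_bytes(event):
    """Return the request body as bytes, honouring isBase64Encoded."""
    body = event["body"]
    if event.get("isBase64Encoded"):
        return base64.b64decode(body)
    return body.encode("utf-8")


def extract_uploaded_file(event):
    """Return (filename, content) of the first file part, or (None, None)."""
    # Header names may arrive in any case, so normalise them first.
    headers = {k.lower(): v for k, v in event["headers"].items()}
    content_type = headers["content-type"]
    # Prepend the Content-Type header so the body parses as a MIME document.
    mime_document = (
        b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
        + request_body_bytes(event)
    )
    message = BytesParser(policy=default).parsebytes(mime_document)
    for part in message.iter_parts():
        filename = part.get_filename()
        if filename:
            return filename, part.get_payload(decode=True)
    return None, None
```

The returned bytes can be written to /tmp or wrapped in io.BytesIO and handed to zipfile.ZipFile, which avoids depending on the exact boundary string the browser generates.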
I'm trying to access a URL and then parse its contents based on tags.
My code:
page = requests.get('https://support.apple.com/downloads/')
self.tree = html.fromstring(page.content)
names = self.tree.xpath("//span[@class='truncate_name']//text()")
Problem: the variable page contains the data of the url 'https://support.apple.com/' instead.
I'm new to Python 2.7 and to its encoding issues. I'm using unicode-escape as my default encoding. The resource at https://support.apple.com/downloads/ is encoded as utf-8, whereas the encoding of the resource at https://support.apple.com/ varies. Does this have something to do with the problem? Please suggest a solution.
It has nothing to do with encoding. What you are looking for is created dynamically, so it is not in the source you get back. A series of ajax calls populates the data. To get the product names etc. from the carousel where you see the span.truncate_name in your browser:
params = {"page": "products",
"locale": "en_US",
"doctype": "DOWNLOADS",
}
js = requests.get("https://km.support.apple.com/kb/index", params=params).content
Normally we could call .json() on the response object but in this case we need to use "unicode_escape" then call loads:
from json import loads, dumps
js2 = loads(js.decode("unicode_escape"))
print(js2)
Which gives you a huge dict of data like:
{u'products': [{u'name': u'Servers and Enterprise', u'urlpath': u'serversandenterprise', u'order': u'', u'products': .............
You can see the request in Chrome's developer tools.
We leave off callback:ACDownloadSearch.customCallBack as we want to get back valid json.
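As an offline illustration of why the extra decode step can be needed, here is a small sketch. It assumes, and this is only my guess about that endpoint, that the payload contains \x escapes, which Python accepts but the JSON grammar does not. The sample payload is made up for the example.

```python
import json

# Hypothetical payload: "\x26" is a valid Python escape but not a valid JSON one.
raw = b'{"name": "Servers \\x26 Enterprise"}'


def parse_escaped_json(payload):
    """Resolve backslash escapes first, then parse the result as JSON."""
    return json.loads(payload.decode("unicode_escape"))


def parses_as_plain_json(payload):
    """Report whether the payload is already valid JSON without the extra step."""
    try:
        json.loads(payload.decode("utf-8"))
        return True
    except ValueError:
        return False
```

Keep in mind that "unicode_escape" treats the input as Latin-1 and also resolves escapes such as \", so it can corrupt payloads that contain non-ASCII text or escaped quotes. It is a workaround for this kind of response, not a general replacement for .json().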
I am working on implementing Sensu monitoring (working with Graphite plus e-mail alerts). Everything is OK except the e-mail alert part. I managed to get the e-mail system to send out the alert, but it arrives in the format below:
{"id":"a1c608aa-e207-49fe-905d-6037f6db01f2","client":
{"name":"ABC","address":"0.0.0.0","subscriptions":["abc"],"version":"0.23.3","timestamp":1464499552},"check":{"command":"/etc/sensu/plugins/check_load
-w 8.00,5.00,2.00 -c
10.00,8.00,3.00","subscribers":["ABC","adef","xyz"],"handlers":["default","email"],"interval":60,"name":"check_CPU_usage","issued":1464499558,"executed":1464499558,"duration":0.005,"output":"CRITICAL
- load average: 5.54, 5.44, 4.09|load1=5.540;8.000;10.000;0;
load5=5.440;5.000;8.000;0; load15=4.090;2.000;3.000;0;
\n","status":2,"history":["1","1","1","1","1","1","1","1","1","1","1","0","1","2","2","2","2","2","2","2","2"],"total_state_change":15},"occurrences":8,"action":"create","timestamp":1464499558}
But both the support team and my teammates would like a user-friendly format in the first half of the e-mail alert, and either the raw log or only the "output" attribute in the second half.
My e-mail.json is currently as below. I did try to add "output" here, but it still does not work:
{
  "handlers": {
    "email_devops": {
      "type": "pipe",
      "command": "mail -s \"Development environment sensu alert\" myemail@company.com",
      "severities": ["warning","critical"],
      "output": " Warning : the process of ::name:: had reached to warning threshold"
    }
  }
}
I found an article about this at: http://search-devops.com/m/wbRqS5nPvh2WnZfj1&subj=Sensu+alert+in+Html+format
But I am still stuck on how to put it all together.
Please kindly help.
Thanks in advance.
Miss Sumana W.
I suggest you use an additional handler to parse the output, make it user-friendly and then send it by email.
"email_devops": {
"type": "pipe",
"command": "my_mail_handler",
"severities": ["warning","critical"]
}
And then, in "my_mail_handler", which can be a script in any language, but let's say is a ruby script:
#!/opt/sensu/embedded/bin/ruby
require 'json'
require 'net/smtp'

# Sensu passes the event to a pipe handler as JSON on STDIN
@event = JSON.parse(STDIN.read, :symbolize_names => true)

message = <<MESSAGE_END
From: myemail@company.com
To: myemail@company.com
Subject: Sensu Check Event. Check '#{@event[:check][:name]}' in Node: '#{@event[:client][:name]}' #{@event[:occurrences]} times

Output:
#{@event[:check][:output]}
MESSAGE_END

Net::SMTP.start('localhost') do |smtp|
  smtp.send_message message, 'myemail@company.com', 'myemail@company.com'
end
Naturally, you can explore, use and modify the @event variable to create your subject and body.
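Since the handler can be a script in any language, here is a rough Python sketch of just the formatting step, which puts a friendly summary first and the check's "output" attribute second. The function name and layout are my own, and the field names are taken from the event JSON in the question; reading the event from STDIN and sending the mail are left out.

```python
# Conventional check exit statuses; anything else is reported as UNKNOWN.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_alert(event):
    """Return (subject, body) for a Sensu event given as a dict."""
    check = event["check"]
    client = event["client"]
    status = STATUS_NAMES.get(check["status"], "UNKNOWN")
    subject = "[{}] {} on {}".format(status, check["name"], client["name"])
    body = "\n".join([
        "Check:       {}".format(check["name"]),
        "Host:        {} ({})".format(client["name"], client["address"]),
        "Status:      {}".format(status),
        "Occurrences: {}".format(event["occurrences"]),
        "",
        "Output:",
        check["output"].strip(),
    ])
    return subject, body
```

In a real handler the event would come from json.loads(sys.stdin.read()), and the subject and body would then be handed to whatever mail mechanism you already use.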
I want to skip links to certain file types (.exe, .zip, .pdf) while crawling with Scrapy, but I don't want to use a Rule with a specific URL regular expression. How?
Update:
Because it's hard to decide whether to follow a link just from the Content-Type of the response when the body hasn't been downloaded yet, I changed to dropping the URL in a downloader middleware. Thanks, Peter and Leo.
If you go to linkextractor.py within the Scrapy root directory, you will see the following:
"""
Common code and definitions used by Link extractors (located in
scrapy.contrib.linkextractor).
"""
# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
# images
'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',
# audio
'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
# video
'3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
'm4a',
# other
'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
]
However, since this applies to the link extractor (and you said you don't want to use Rules), I am not sure that this will solve your problem. (I had first read your question as asking how to change the file-extension restrictions without specifying them directly in a rule.)
The good news is, you can also build your own downloader middleware and drop any/all requests to urls which have an undesirable extension. See Downloader Middleware
You can get the requested url by accessing the request object's url attribute as follows: request.url
Basically, search the end of the string for '.exe' or whatever extension you want to drop, and if it ends with one of those extensions, raise an IgnoreRequest exception, and the request will immediately be dropped.
UPDATE
In order to process the request prior to it being downloaded, you need to make sure you define the 'process_request' method within your custom downloader middleware.
According to the Scrapy documentation
process_request
This method is called for each request that goes through the download
middleware.
process_request() should return either None, a Response object, or a
Request object.
If it returns None, Scrapy will continue processing this request,
executing all other middlewares until, finally, the appropriate
downloader handler is called the request performed (and its response
downloaded).
If it returns a Response object, Scrapy won’t bother calling ANY other
request or exception middleware, or the appropriate download
function; it’ll return that Response. Response middleware is always
called on every Response.
If it returns a Request object, the returned request will be
rescheduled (in the Scheduler) to be downloaded in the future. The
callback of the original request will always be called. If the new
request has a callback it will be called with the response
downloaded, and the output of that callback will then be passed to the
original callback. If the new request doesn’t have a callback, the
response downloaded will be just passed to the original request
callback.
If it returns an IgnoreRequest exception, the entire request will be
dropped completely and its callback never called.
So essentially, just create a downloader middleware class and add a method process_request, which takes a request object and a spider object as parameters. Then raise the IgnoreRequest exception if the url contains unwanted extensions.
This should all occur prior to the page being downloaded. However, if you want to process the response headers instead, then a request will have to be made to the webpage.
You could always implement both a process_request and a process_response method in the downloader middleware, with the idea being that obvious extensions are dropped immediately, and then, if for some reason the url did not contain the file extension, the request would be processed and caught in the process_response method (since you could verify the content type in the headers).
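A minimal sketch of the process_request half of that idea might look like the following. The class name, helper name and extension list are just examples of mine. The try/except around the import is only there so the snippet can be run without Scrapy installed; in a real project you would import IgnoreRequest directly.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:  # stand-in so the sketch runs without Scrapy installed
    class IgnoreRequest(Exception):
        pass

# Example list; adjust to the extensions you want to skip.
UNWANTED_EXTENSIONS = (".exe", ".zip", ".pdf")


def has_unwanted_extension(url, extensions=UNWANTED_EXTENSIONS):
    """Check only the path, so query strings do not hide the extension."""
    path = urlparse(url).path.lower()
    return path.endswith(tuple(extensions))


class DropByExtensionMiddleware:
    """Downloader middleware that drops requests before they are downloaded."""

    def process_request(self, request, spider):
        if has_unwanted_extension(request.url):
            raise IgnoreRequest("Skipping {}".format(request.url))
        return None  # let Scrapy continue processing the request
```

Like any downloader middleware, it would still need to be enabled through DOWNLOADER_MIDDLEWARES in settings.py.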
.zip and .pdf are ignored by scrapy by default.
As a general rule you can either configure a rule to include only urls that match your regexp (.htm* in this case):
rules = (Rule(SgmlLinkExtractor(allow=('\.htm')), callback='parse_page', follow=True, ), )
or exclude the ones that match a regexp:
rules = (Rule(SgmlLinkExtractor(allow=('.*'), deny=('\.pdf', '\.zip')), callback='parse_page', follow=True, ), )
Read the documentation for more information.
I built this Middleware to exclude any response type that isn't in a whitelist of regular expressions:
from scrapy.http.response.html import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy import log
import re
class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes
        """
        # to specify on a per-spider basis
        # type_whitelist = getattr(spider, "response_type_whitelist", None)
        type_whitelist = (r'text', )
        content_type_header = response.headers.get('content-type', None)
        if not content_type_header or not type_whitelist:
            return response
        if self.is_valid_response(type_whitelist, content_type_header):
            return response
        else:
            msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
            log.msg(msg, level=log.INFO)
            raise IgnoreRequest()
To use it, add it to settings.py:
DOWNLOADER_MIDDLEWARES = {
'[project_name].middlewares.FilterResponses': 999,
}
I have the following Business Process defined within a Production on an Intersystems Cache Installation
/// Makes a call to Merlin based on the message sent to it from the pre-processor
Class sgh.Process.MerlinProcessor Extends Ens.BusinessProcess [ ClassType = persistent, ProcedureBlock ]
{
Property WorkingDirectory As %String;
Property WebServer As %String;
Property CacheServer As %String;
Property Port As %String;
Property Location As %String;
Parameter SETTINGS = "WorkingDirectory,WebServer,Location,Port,CacheServer";
Method OnRequest(pRequest As sgh.Message.MerlinTransmissionRequest, Output pResponse As Ens.Response) As %Status
{
Set tSC=$$$OK
Do ##class(sgh.Utils.Debug).LogDebugMsg("Packaging an HTTP request for Saved form "_pRequest.DateTimeSaved)
Set dateTimeSaved = pRequest.DateTimeSaved
Set patientId = pRequest.PatientId
Set latestDateTimeSaved = pRequest.LatestDateTimeSaved
Set formName = pRequest.FormName
Set formId = pRequest.FormId
Set episodeNumber = pRequest.EpisodeNumber
Set sentElectronically = pRequest.SentElectronically
Set styleSheet = pRequest.PrintName
Do ##class(sgh.Utils.Debug).LogDebugMsg("Creating HTTP Request Class")
set HTTPReq = ##class(%Net.HttpRequest).%New()
Set HTTPReq.Server = ..WebServer
Set HTTPReq.Port = ..Port
do HTTPReq.InsertParam("DateTimeSaved",dateTimeSaved)
do HTTPReq.InsertParam("HospitalNumber",patientId)
do HTTPReq.InsertParam("Episode",episodeNumber)
do HTTPReq.InsertParam("Stylesheet",styleSheet)
do HTTPReq.InsertParam("Server",..CacheServer)
Set Status = HTTPReq.Post(..Location,0) Quit:$$$ISERR(tSC)
Do ##class(sgh.Utils.Debug).LogDebugMsg("Sent the following request: "_Status)
Quit tSC
}
}
The thing is when I check the debug value (which is defined as a global) all I get is the number '1' - I have no idea therefore if the request has succeeded or even what is wrong (if it has not)
What do I need to do to find out
A) What is the actual web call being made?
B) What the response is?
There is a really slick way to get the answer to the two questions you've asked, regardless of where you're using the code. Check out the documentation on the %Net.HttpRequest object here: http://docs.intersystems.com/ens20102/csp/docbook/DocBook.UI.Page.cls?KEY=GNET_http and the class reference here: http://docs.intersystems.com/ens20102/csp/documatic/%25CSP.Documatic.cls?APP=1&LIBRARY=ENSLIB&CLASSNAME=%25Net.HttpRequest
The class reference for the Post method has a parameter called test, that will do what you're looking for. Here's the excerpt:
method Post(location As %String = "", test As %Integer = 0, reset As %Boolean = 1) as %Status
Issue the Http 'post' request, this is used to send data to the web server such as the results of a form, or upload a file. If this completes correctly the response to this request will be in the HttpResponse. The location is the url to request, e.g. '/test.html'. This can contain parameters which are assumed to be already URL escaped, e.g. '/test.html?PARAM=%25VALUE' sets PARAM to %VALUE. If test is 1 then instead of connecting to a remote machine it will just output what it would have send to the web server to the current device, if test is 2 then it will output the response to the current device after the Post. This can be used to check that it will send what you are expecting. This calls Reset automatically after reading the response, except in test=1 mode or if reset=0.
I recommend moving this code to a test routine to view the output properly in terminal. It would look something like this:
// To view the REQUEST you are sending
Set sc = request.Post("/someserver/servlet/webmethod",1)
// To view the RESPONSE you are receiving
Set sc = request.Post("/someserver/servlet/webmethod",2)
// You could also do something like this to parse your RESPONSE stream
Write request.HttpResponse.Data.Read()
I believe the answer you want to A) is in the Server and Location properties of your %Net.HttpRequest object (e.g., HTTPReq.Server and HTTPReq.Location).
For B), the response information should be in the %Net.HttpResponse object stored in the HttpResponse property (e.g. HTTPReq.HttpResponse) after your call is completed.
I hope this helps!
-Derek
From that code sample it looks like you're using Ensemble, not straight-up Cache.
In that case you should be doing this HTTP call in a Business Operation that uses the HTTP Outbound Adapter, not in your Business Process.
See this link for more info on HTTP Adapters:
http://docs.intersystems.com/ens20102/csp/docbook/DocBook.UI.Page.cls?KEY=EHTP
You should also look into how to use the Ensemble Message Browser. That should help with your logging needs.