Anomaly detection on Azure Databricks Diagnostic audit logs - elasticsearch

I have a lot of audit logs coming from the Azure Databricks clusters I am managing. The logs are simple application audit logs in JSON format. They contain information about jobs, clusters, notebooks, etc., and you can see a sample record here:
{
  "TenantId": "<your tenant id>",
  "SourceSystem": "|Databricks|",
  "TimeGenerated": "2019-05-01T00:18:58Z",
  "ResourceId": "/SUBSCRIPTIONS/SUBSCRIPTION_ID/RESOURCEGROUPS/RESOURCE_GROUP/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/PAID-VNET-ADB-PORTAL",
  "OperationName": "Microsoft.Databricks/jobs/create",
  "OperationVersion": "1.0.0",
  "Category": "jobs",
  "Identity": {
    "email": "mail@contoso.com",
    "subjectName": null
  },
  "SourceIPAddress": "131.0.0.0",
  "LogId": "201b6d83-396a-4f3c-9dee-65c971ddeb2b",
  "ServiceName": "jobs",
  "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
  "SessionId": "webapp-cons-webapp-01exaj6u94682b1an89u7g166c",
  "ActionName": "create",
  "RequestId": "ServiceMain-206b2474f0620002",
  "Response": {
    "statusCode": 200,
    "result": "{\"job_id\":1}"
  },
  "RequestParams": {
    "name": "Untitled",
    "new_cluster": "{\"node_type_id\":\"Standard_DS3_v2\",\"spark_version\":\"5.2.x-scala2.11\",\"num_workers\":8,\"spark_conf\":{\"spark.databricks.delta.preview.enabled\":\"true\"},\"cluster_creator\":\"JOB_LAUNCHER\",\"spark_env_vars\":{\"PYSPARK_PYTHON\":\"/databricks/python3/bin/python3\"},\"enable_elastic_disk\":true}"
  },
  "Type": "DatabricksJobs"
}
At the moment, I am storing the logs in Elasticsearch, and I was planning to use its Anomaly Detection tool on this type of log. Therefore, I do not need to implement any algorithm, but rather choose the right attribute, perform the right aggregation, or maybe combine several attributes in a multivariate analysis. However, I am not familiar with this topic, nor do I have a background in it. I have read Anomaly Detection: A Survey by Chandola et al., which was quite useful in pointing me to the right sub-field.
So, I have understood that I am dealing with time series, and depending on the kind of aggregation I perform I might face collective anomalies on sequence data (e.g. the ActionName field of these logs) or contextual anomalies on sequence data.
I was wondering whether you could point me in the right direction, since I haven't managed to find any related work on anomaly detection for audit logs. More specifically, what kinds of anomalies should I investigate, and which kind of aggregation would be beneficial?
Please keep in mind that I have quite a large amount of data. I would also appreciate any kind of feedback, even if it doesn't involve Elasticsearch; feel free to propose a full unsupervised machine learning approach for this kind of anomaly detection scenario rather than a simpler Elasticsearch-based use case.
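A common starting point is to aggregate the raw events into per-entity count series (events per ActionName, per user, per hour) and detect anomalies on those counts rather than on individual records; Elastic's ML count/rare detectors consume exactly this kind of signal. Below is a minimal sketch of such an hourly count-per-action aggregation, assuming Elasticsearch 7.x with the matching Python client; the index name, host, and keyword sub-field are assumptions on my part:
from elasticsearch import Elasticsearch

# Assumed host and index pattern -- adjust to your deployment.
es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="databricks-audit-*",
    body={
        "size": 0,  # we only want the aggregation buckets, not the hits
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "TimeGenerated", "fixed_interval": "1h"},
                "aggs": {
                    # Assumes ActionName is indexed with a .keyword sub-field.
                    "by_action": {"terms": {"field": "ActionName.keyword", "size": 20}}
                },
            }
        },
    },
)

# Each hourly bucket becomes one point of a multivariate count series.
for bucket in resp["aggregations"]["per_hour"]["buckets"]:
    counts = {b["key"]: b["doc_count"] for b in bucket["by_action"]["buckets"]}
    print(bucket["key_as_string"], counts)
With this shape of aggregation, the anomalies that become easy to surface are spikes or drops in the overall event rate, rare ActionName values for a given Identity.email, or unusual SourceIPAddress values for a user.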

Related

enableAutoTierToHotFromCool Does not move from cool to hot

I have some Azure Storage Accounts (StorageV2) located in West Europe. All uploaded blobs are in the Hot tier by default, and I have this lifecycle rule defined on them:
{
  "rules": [
    {
      "enabled": true,
      "name": "moveToCool",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "enableAutoTierToHotFromCool": true,
            "tierToCool": {
              "daysAfterLastAccessTimeGreaterThan": 1
            }
          }
        },
        "filters": {
          "blobTypes": [
            "blockBlob"
          ]
        }
      }
    }
  ]
}
Somehow the uploaded blobs are moved to Cool, but after I access them again they still appear under the Cool tier in the portal. Any idea why? (I have waited more than 24 hours for the rule to take effect.)
Some more questions about "enableAutoTierToHotFromCool": true:
Does it depend on the blob size? (For example, if some blobs were moved to Cool and are then accessed simultaneously, is the time until a 1 GiB blob is moved back to Hot the same as for a 10 KiB blob?)
Does it depend on the number of blobs that are accessed? (Is there a queue, and if multiple Cool blobs are accessed at the same time, are the requests served in queue order?)
The enableAutoTierToHotFromCool property is a Boolean value that indicates whether a blob should automatically be tiered from Cool back to Hot if it is accessed again after being tiered to Cool.
A new policy can take up to 48 hours to take effect, and "enableAutoTierToHotFromCool": true depends neither on the size of the blob nor on the number of blobs accessed.
If you enable firewall rules for your storage account, lifecycle management requests may be blocked. You can unblock these requests by providing exceptions for trusted Microsoft services. For more information, see the Exceptions section in Configure firewalls and virtual networks.
A lifecycle management policy must be read or written in full; partial updates are not supported. So try adding a prefix filter, for example:
"prefixMatch": [
"containerName/log"
]
For more details, refer to the lifecycle management documentation.
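Since the policy must be written in full, one way to push the updated rule (with the prefixMatch filter included) is through the azure-mgmt-storage Python SDK. This is only a rough sketch: it assumes that package accepts the policy as a plain dict, and the subscription, resource group, and account names are placeholders:
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# The whole policy is written in one call -- partial updates are not supported.
policy = {
    "policy": {
        "rules": [
            {
                "enabled": True,
                "name": "moveToCool",
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            "enableAutoTierToHotFromCool": True,
                            "tierToCool": {"daysAfterLastAccessTimeGreaterThan": 1},
                        }
                    },
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": ["containerName/log"],
                    },
                },
            }
        ]
    }
}

client.management_policies.create_or_update(
    resource_group_name="<resource-group>",   # placeholder
    account_name="<storage-account>",          # placeholder
    management_policy_name="default",          # the policy name is always "default"
    properties=policy,
)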

JSON-LD multiple types

I am currently doing some JSON-LD. I am quite new to this (and to coding in general). I am trying to figure out how I could use different types in one script, as you can see below. I can't work out what I am doing wrong and what I should change to make it work. Thanks
<script type="application/ld+json">
{
"#context": "https://schema.org",
"#type": "Course",
"name": "MSc in IT- Web Communication Design",
"coursePrerequisites": "The following bachelor degree programmes from the University of Southern Denmark and from other universities provide access to the Master’s degree in Web Communication Design: A relevant professional bachelor's degree, e.g. web developer, software developer, business language and IT-based marketing communication, school teacher, nurse, educator, social worker.",
"occupationalCredentialAwarded": "As a student of the MSc in IT – Web Communication Design you will gain specialised skills in web-based communication and knowledge management. Your choice of elective courses, your projects, your thesis as well as your bachelor background qualify you to work with: Web development, digitalisation, web design, digital skills development, social media, etc.",
"description":"Master of Science in IT Web Communication Design. A multi-disciplinary graduate programme that combines IT, communication and organisation. We emphasise the interaction between humans and information technology and combine research-based knowledge with challenges from practice."
},
"provider": {
"#type": "Organization",
"name": "University of Southern Denmark",
"department": "Institute for Design and Communication",
"address": "Universitetsparken 1, 6000 Kolding, Denmark",
"telephone": "+45 65 50 10 00"
},
{
"#context": "http://schema.org",
"#type": "EducationalOccupationalCredential",
"programPrerequisites": "You are expected to have basic knowledge of HTML and CSS before you commence the programme. This may be from courses in your Bachelor's, but it is also possible to obtain this knowledge through online tutorials, e.g. w3schools.com."
}
</script>
Here's a version that validates:
<script type="application/ld+json">{
"#context": "https://schema.org",
"#type": "Course",
"name": "MSc in IT- Web Communication Design",
"coursePrerequisites": "You are expected to have basic knowledge of HTML and CSS before you commence the programme. This may be from courses in your Bachelor's, but it is also possible to obtain this knowledge through online tutorials, e.g. w3schools.com.",
"occupationalCredentialAwarded": "As a student of the MSc in IT – Web Communication Design you will gain specialised skills in web-based communication and knowledge management. Your choice of elective courses, your projects, your thesis as well as your bachelor background qualify you to work with: Web development, digitalisation, web design, digital skills development, social media, etc.",
"description": "Master of Science in IT Web Communication Design. A multi-disciplinary graduate programme that combines IT, communication and organisation. We emphasise the interaction between humans and information technology and combine research-based knowledge with challenges from practice.",
"provider": {
"#type": "Organization",
"name": "University of Southern Denmark",
"department": "Institute for Design and Communication",
"address": "Universitetsparken 1, 6000 Kolding, Denmark",
"telephone": "+45 65 50 10 00"
}
}</script>
The script expects one top-level object, not a list of objects. To get around that you can use @graph, but with my changes there is only one top-level object anyway.
This is because you want to connect your information: the organization is the provider of the course, so that information should be inside the course object.
I wasn't sure about your EducationalOccupationalCredential; I'm guessing coursePrerequisites is closer to what you want.
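If you really do need several unrelated top-level items, here is a hedged sketch of the @graph approach: it builds the structure as a Python dict and serializes it with json.dumps, producing the JSON that would sit inside the <script type="application/ld+json"> tag. The property values are shortened placeholders, not the original text.
import json

# Shortened placeholder objects -- only the structure matters here.
course = {
    "@type": "Course",
    "name": "MSc in IT- Web Communication Design",
    "provider": {
        "@type": "Organization",
        "name": "University of Southern Denmark",
    },
}

credential = {
    "@type": "EducationalOccupationalCredential",
    "programPrerequisites": "Basic knowledge of HTML and CSS.",
}

# A single top-level object whose @graph array holds both items.
doc = {"@context": "https://schema.org", "@graph": [course, credential]}

# This output is what would go between the script tags.
print(json.dumps(doc, indent=2))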

Getting 500 error consistently on Google Sheets API

I am using the Google Sheets API and consistently getting the following error. I am only getting it for a specific sheet with a specific service key; my other credential is working just fine. Also, the load is relatively low from what I can tell. I'm not hammering the API or anything.
{
  "error": {
    "code": 500,
    "message": "Internal error encountered.",
    "errors": [
      {
        "message": "Internal error encountered.",
        "domain": "global",
        "reason": "backendError"
      }
    ],
    "status": "INTERNAL"
  }
}
I have found the culprit here. It looks like I had to remove two sheets with pivot tables on them that referenced the sheet I was trying to query. Once I did that, all was well.
This has to do with their internal timeout: if an operation you are trying to complete takes too long, it will bail. Until they fix this, a solution may be to reduce the size of the data to make the operation quicker. In my case, I update the spreadsheet in smaller chunks, as in the sketch below.
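To illustrate the chunking idea, here is a rough sketch using the google-api-python-client; the spreadsheet ID, sheet name, chunk size, and service-account file are placeholders, not values from the original question:
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build

# Placeholder credentials and identifiers.
creds = Credentials.from_service_account_file(
    "service-key.json",
    scopes=["https://www.googleapis.com/auth/spreadsheets"],
)
service = build("sheets", "v4", credentials=creds)

SPREADSHEET_ID = "<spreadsheet-id>"
SHEET_NAME = "Sheet1"
rows = [[f"row {i}", i] for i in range(10_000)]  # example payload
CHUNK = 1_000

# Write the data in several small update calls instead of one large one,
# so each request stays well under the backend's internal time limit.
for start in range(0, len(rows), CHUNK):
    chunk = rows[start:start + CHUNK]
    rng = f"{SHEET_NAME}!A{start + 1}"  # A1 notation is 1-indexed
    service.spreadsheets().values().update(
        spreadsheetId=SPREADSHEET_ID,
        range=rng,
        valueInputOption="RAW",
        body={"values": chunk},
    ).execute()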
Another workaround is to have a central sheet that the Google Sheets API updates, and then reference that data range with a formula like IMPORTRANGE on the sheet where you need your charts and analysis. That way, the sheet accessed by the API won't have any charts and shouldn't run into this issue.

Why isn't the Google QPX Express API returning results for all airlines?

I enabled access to the Google QPX Express API to do some analytics on the prices of Delta's tickets and Fare Classes. But the response seems to only include flights from a limited set of airlines.
For example, the following request
{
  "request": {
    "passengers": {
      "adultCount": 1
    },
    "slice": [
      {
        "origin": "JFK",
        "destination": "SFO",
        "date": "2015-02-15",
        "maxStops": 0
      }
    ],
    "solutions": 500
  }
}
only returns flights for AS (Alaska Airlines), US (US Air), VX (Virgin America), B6 (JetBlue), and UA (United Airlines).
If I add "permittedCarriers": [DL], then I get an empty response. Likewise, I get an empty response if I leave out permittedCarriers and look for flights between Delta hubs (e.g., "origin": "ATL", "destination": "MSP").
The documentation suggests that QPX Express is supposed to have most airline tickets available. Is there something wrong with my request? Why am I not seeing any results for Delta?
I received a response from Google's QPX Express help team about the missing data for Delta. The response was that
Delta's data, as well as American Airlines' data, is not included in QPX Express search results by default. Access to their data requires approval by those carriers.
After informing him that I planned to use the data for research purposes, he responded,
American and Delta restrict access to their pricing and availability to companies which they approve, which are primarily organizations driving the sale of airline tickets. Unfortunately, requests for access are only being reviewed for companies that plan to use the API for commercial purposes.

MongoDB Performance on MongoLab for Geospatial Queries

I have a collection of places with more than 400,000 documents. I am trying to do geospatial queries, but they always seem to time out.
From the MongoLab interface I do a search:
{ "location": {"$near": [ 38, -122 ] } }
And the page just times out.
I also ran this command through my console:
db.runCommand({geoNear: "places", near: [50,50], num:10})
And it did succeed but took something like 5 minutes to complete.
I do have a geospatial index on location:
{ "location" : "2d" }
Is it just impossible to do geospatial queries on collections this big (which are actually quite small by MongoDB standards)?
EDIT: MongoLab personally contacted me regarding this problem. It seems there are some issues with my db, such as many places not having any coordinates yet. I also discovered that using maxDistance speeds up the queries dramatically, which brings me back to my question from this morning.
MongoLab's techs have pointed out to me that having a lot of longitude/latitude values set to 0,0 and NOT using maxDistance was what was slowing things down. Adding maxDistance worked like a charm.
So thanks again to the guys at MongoLab.
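For reference, here is a minimal pymongo sketch of the query shape that ended up working, i.e. $near combined with $maxDistance on a legacy 2d index. The connection string, database, and collection names are placeholders; with a 2d index on coordinate pairs, $maxDistance is expressed in degrees:
from pymongo import MongoClient

# Placeholder connection details.
client = MongoClient("mongodb://localhost:27017")
places = client["mydb"]["places"]

# Bounding the search radius lets the 2d index prune distant (and bogus 0,0)
# points instead of scanning and sorting the whole collection by distance.
cursor = places.find(
    {"location": {"$near": [38, -122], "$maxDistance": 5}}
).limit(10)

for doc in cursor:
    print(doc.get("location"))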
