Loading JSON files from S3 into a SparkR dataframe - sparkr

I have JSON files saved in an S3 bucket. I am trying to load them as a dataframe in SparkR and I am getting errors. Below is my code. Where am I going wrong?
devtools::install_github('apache/spark#v2.2.0',subdir='R/pkg',force=TRUE)
library(SparkR)
sc=sparkR.session(master='local')
Sys.setenv("AWS_ACCESS_KEY_ID"="xxxx",
"AWS_SECRET_ACCESS_KEY"= "yyyy",
"AWS_DEFAULT_REGION"="us-west-2")
movie_reviews <- SparkR::read.df(path = "s3a://bucketname/reviews_Movies_and_TV_5.json", sep = "", source = "json")
I have tried s3a, s3n, and s3, and none of them seems to work.
I get the following error log in my SparkR console:
17/12/09 06:56:06 WARN FileStreamSink: Error while looking for metadata directory.
17/12/09 06:56:06 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.reflect.InvocationTargetException

For me this works:
read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")

What @Ankit said should work, but if you are trying to get something that looks more like a dataframe, you need to use a select statement, i.e.
rdd <- read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
Then do a printSchema(rdd) to see the structure of the data.
If you see root followed by your fields with no further indentation, you can probably go ahead and select using the names of the "columns" you want. If you see branching further down your schema tree, you may have to put a headers.blah or a payload.blah in your select statement, like this:
sdf<- SparkR::select(rdd, "headers.something", "headers.somethingElse", "payload.somethingInPayload", "payload.somethingElse")
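If it helps to cross-check the same flow outside of R, here is a rough PySpark equivalent (just a sketch; the bucket path and the nested field names are the placeholders used above, not real columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the JSON file (this assumes S3 credentials and the s3a connector are
# already configured for this Spark installation, as in the question).
df = spark.read.json("s3a://bucketname/reviews_Movies_and_TV_5.json")

# Inspect the structure first, exactly as suggested above.
df.printSchema()

# If the schema is nested, select leaf fields with their parent prefix.
flat = df.select("headers.something", "payload.somethingInPayload")
flat.show(5)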

Related

Google Cloud DLP - CSV inspection

I'm trying to inspect a CSV file and there are no findings being returned (I'm using the EMAIL_ADDRESS info type and the addresses I'm using are coming up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
  type: :CSV,
  data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV, does that mean one can't just send in the file, and needs to use the table attribute instead of byte_item as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about XLSX files, for example? They're an unspecified file type, so I tried with a configuration like so, but it still returned no findings:
byte_item: {
  type: :BYTES_TYPE_UNSPECIFIED,
  data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans@gmail.com,anotehu,steve@example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)
The CSV file I was testing with was something I created using Google Sheets and exported as a CSV. However, the file showed locally as "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a mime type of "text/csv; charset=utf-8". This is the one that worked. So it looks like my issue was specifically due to the file having an incorrect mime type.
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)
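For the structured-parsing route the question asks about, the file contents can also be sent as a table item instead of a byte_item. Here is a minimal sketch using the Python client (the question uses Ruby, but the request shape is the same); the project ID and column names are placeholders, and the row values are taken from the sample CSV above:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/your-project-id/locations/global"  # placeholder project ID

# Represent the CSV as a table: one header per column, one row per CSV line.
item = {
    "table": {
        "headers": [{"name": "col1"}, {"name": "col2"}, {"name": "col3"}],
        "rows": [
            {"values": [{"string_value": "dylans@gmail.com"},
                        {"string_value": "anotehu"},
                        {"string_value": "steve@example.com"}]},
        ],
    }
}

inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}],
    "min_likelihood": "POSSIBLE",
    "include_quote": True,
}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.quote)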

Individually update a large number of documents with the Python DSL Elasticsearch UpdateByQuery

I'm trying to use UpdateByQuery to update a property of a large number of documents. But as each document will have a different value, I need to execute the update one by one. I'm traversing a large collection of documents, and for each document I call this function:
def update_references(self, query, script_source):
    try:
        ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source)
        ubq.execute()
    except Exception as err:
        return False
    return True
Some example values are:
query = {'query': {'match': {'_id': 'VpKI1msBNuDimFsyxxm4'}}}
script_source = 'ctx._source.refs = [\'python\', \'java\']'
The problem is that when I do that, I get an error: "Too many dynamic script compilations within, max: [75/5m]; please use indexed, or scripts with parameters instead; this limit can be changed by the [script.max_compilations_rate] setting".
If I change the max_compilations_rate using Kibana, it has no effect:
PUT _cluster/settings
{
  "transient": {
    "script.max_compilations_rate": "1500/1m"
  }
}
Anyway, it would be better to use a parametrized script. I tried:
def update_references(self, query, script_source, script_params):
    try:
        ubq = UpdateByQuery(using=self.client, index=self.index).update_from_dict(query).script(source=script_source, params=script_params)
        ubq.execute()
    except Exception as err:
        return False
    return True
So, this time:
script_source = 'ctx._source.refs = params.value'
script_params = {'value': ['python', 'java']}
But as I have to update the query and the parameters each time, I need to create a new instance of the UpdateByQuery for each document in the large collection, and the result is the same error.
I also tried to traverse and update the large collection with:
es.update(
    index=kwargs["index"],
    doc_type="paper",
    id=paper["_id"],
    body={"doc": {
        "refs": paper["refs"]  # e.g. ['python', 'java']
    }}
)
But I'm getting the following error: "Failed to establish a new connection: [Errno 99] Cannot assign requested address juil. 10 18:07:14 bib gunicorn[20891]: POST http://localhost:9200/papers/paper/OZKI1msBNuDimFsy0SM9/_update [status:N/A request:0.005s"
So, please, if you have any idea on how to solve this it will be really appreciated.
Best,
You can try it like this.
PUT _cluster/settings
{
  "persistent": {
    "script.max_compilations_rate": "1500/1m"
  }
}
The version update is causing these errors.
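If you would rather apply that setting from Python than from Kibana, a minimal sketch with the low-level elasticsearch-py client looks like this (the localhost URL is just an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Persistently raise the script compilation limit (same request as above).
es.cluster.put_settings(
    body={"persistent": {"script.max_compilations_rate": "1500/1m"}}
)

Separately, the error message itself points at "indexed" (stored) scripts: a stored script is compiled once and then referenced by id, so per-document calls with different parameters no longer count against the compilation limit. A rough sketch of that route (the script id "set-refs" is made up; the index name "papers" is taken from the URL in the question):

# Register the script once; it is compiled a single time on the cluster.
es.put_script(
    id="set-refs",
    body={"script": {"lang": "painless", "source": "ctx._source.refs = params.value"}},
)

# Each per-document update then only passes different parameters.
es.update_by_query(
    index="papers",
    body={
        "query": {"match": {"_id": "VpKI1msBNuDimFsyxxm4"}},
        "script": {"id": "set-refs", "params": {"value": ["python", "java"]}},
    },
)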

Cloudwatch to Elasticsearch parse/tokenize log event before push to ES

Appreciate your help in advance.
In my scenario, CloudWatch multiline logs need to be shipped to the Elasticsearch service.
ECS --awslogs--> CloudWatch --using Lambda--> ES Domain
(Basic flow, though I am very open to changing how data is shipped from CW to ES.)
I was able to solve the multi-line issue using multi_line_start_pattern, BUT
the main issue I am experiencing now is that my logs are in ODL format (the following format):
[yyyy-mm-ddThh:mm:ss.SSS-Z][ProductName-Version][Log Level]
[Message ID][LoggerName][Key Value Pairs][[
Message]]
AND I would like to parse and tokenize log events before storing them in ES (vs. the complete log line).
For example:
[2018-05-31T11:08:49.148-0400] [glassfish 4.1] [INFO] [] [] [tid: _ThreadID=43 _ThreadName=Thread-8] [timeMillis: 1527692929148] [levelValue: 800] [[
[] INFO : (DummyApplicationFunctionJPADAO) EntityManagerFactory located under resource lookup name [null], resource name=AuthorizationPU]]
This needs to be parsed and tokenized using the format:
timestamp 2018-05-31T11:08:49.148-0400
ProductName-Version glassfish 4.1
LogLevel INFO
MessageID
LoggerName
KeyValuePairs tid: _ThreadID=43 _ThreadName=Thread-8
Message [] INFO : (DummyApplicationFunctionJPADAO) EntityManagerFactory located under resource lookup name [null], resource name=AuthorizationPU
In the above, the Key Value pairs repeat and are variable; for simplicity I can store them all as one long string.
From what I have gathered about CloudWatch, it seems the Subscription Filter Pattern regex support is very limited, and I am really not sure how to fit the above pattern. As for the Lambda function that pushes the data to ES, I have not seen AWS docs or examples that use Lambda as a means to parse the data before pushing it to ES.
I would appreciate it if someone could advise on what/where the best option is to parse CW logs before they get into ES: the Subscription Filter Pattern, the Lambda function, or any other way.
Thank you.
From what I can see, your best bet is what you're suggesting: a CloudWatch-log-triggered Lambda that reformats the logged data into your preferred ES format and then posts it into ES.
You'll need to subscribe this Lambda to your CloudWatch logs. You can do this on the Lambda console or the CloudWatch console (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html).
The lambda's event payload will be: { "awslogs": { "data": "encoded-logs" } }. Where encoded-logs is a Base64 encoding of a gzipped JSON.
For example, the sample event (https://docs.aws.amazon.com/lambda/latest/dg/eventsources.html#eventsources-cloudwatch-logs) can be decoded in Node using:
const zlib = require('zlib');
const data = event.awslogs.data;
const gzipped = Buffer.from(data, 'base64');
const json = zlib.gunzipSync(gzipped);
const logs = JSON.parse(json);
console.log(logs);
/*
{ messageType: 'DATA_MESSAGE',
owner: '123456789123',
logGroup: 'testLogGroup',
logStream: 'testLogStream',
subscriptionFilters: [ 'testFilter' ],
logEvents:
[ { id: 'eventId1',
timestamp: 1440442987000,
message: '[ERROR] First test message' },
{ id: 'eventId2',
timestamp: 1440442987001,
message: '[ERROR] Second test message' } ] }
*/
From what you've outlined, you'll want to extract the logEvents array and parse it into an array of strings. I'm happy to give some help on this too if you need it (but I'll need to know what language you're writing your lambda in; there are libraries for tokenizing ODL, so hopefully it's not too hard).
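If you end up writing the lambda in Python (as the guide linked below does), the same decode-and-extract step is roughly this (a sketch, assuming the standard awslogs event shape shown above):

import base64
import gzip
import json

def handler(event, context):
    # The awslogs payload is base64-encoded, gzipped JSON.
    payload = base64.b64decode(event["awslogs"]["data"])
    logs = json.loads(gzip.decompress(payload))

    # Each record's raw text is in logEvents[*].message; tokenize it here
    # (e.g. with an ODL parser) before posting the results to Elasticsearch.
    messages = [e["message"] for e in logs["logEvents"]]
    return messages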
At this point you can then POST these new records directly into your AWS ES Domain. Somewhat cryptically, the S3-to-ES guide gives a good outline of how to do this in Python: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-aws-integrations.html#es-aws-integrations-s3-lambda-es
You can find a full example for a lambda that does all this (by someone else) here: https://github.com/blueimp/aws-lambda/tree/master/cloudwatch-logs-to-elastic-cloud

HbaseStorageHandler plugin in Drill

I am able to query Hive and HBase individually using Drill. Now I am trying to query HBaseStorageHandler-type tables in Hive. For this, in Drill's Hive storage plugin I added these properties:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://trinitybdClusterM02.trinitymobility.local:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "hdfs://trinitybdClusterM02.trinitymobility.local:9000",
    "hive.metastore.sasl.enabled": "false",
    "hbase.zookeeper.quorum": "localhost",
    "hbase.zookeeper.property.clientPort": "2181"
  }
}
I tried to query it like this:
0: jdbc:drill:zk=localhost> use hive.test;
0: jdbc:drill:zk=localhost> select * from twitter_test_nlp limit 1;
It gives the following error:
Error: SYSTEM ERROR: NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setAttribute(Ljava/lang/String;[B)V
Fragment 0:0
[Error Id: fc3994f4-7d7e-475e-870b-259ac91ea81a on trinitybdClusterM02.trinitymobility.local:31010] (state=,code=0)
If anybody is using this kind of setup, please share which properties I have to add to query HBaseStorageHandler tables of Hive.
In Drill 1.9 this problem has been resolved. Drill 1.9 directly supports HBaseStorageHandler tables (Hive and HBase integrated tables) through the Hive storage plugin, and it also directly supports spatial queries like st_contains(). So if you have these kinds of requirements, use Drill 1.9.0.

SparkR 2.0 read.df throws "path does not exist" error

My SparkR 1.6 code does not work in Spark 2.0. I made the necessary changes, like creating a session with sparkR.session() instead of sparkR.init() and not passing the sqlContext parameter, etc.
In the code below I am loading data from a couple of folders into a dataframe.
read.df in Spark 1.6 that works:
sales <- read.df(sqlContext, path = "gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.com/myData/2015/20*", source = "com.databricks.spark.csv", delimiter = "\t")
read.df in Spark 2.0 that does not work:
sales <- read.df("gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.c
om/myData/2015/20*", source = "com.databricks.spark.csv", delimiter="\t")
The above line throws the following error:
16/09/25 19:28:52 ERROR org.apache.spark.api.r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed Error in invokeJava(isStatic = TRUE, className, methodName, ...) : org.apache.spark.sql.AnalysisException: Path does not exist: gs://dev.appspot.com/myData/2014/20*,gs://dev.appspot.com/myData/2015/20*;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
Calls: read.df -> dispatchFunc -> f -> callJStatic -> invokeJava
Execution halted
16/09/25 19:28:53 INFO org.spark_project.jetty.server.ServerConnector: Stopped ServerConnector#148bd6fd{HTTP/1.1}{0.0.0.0:4040}
Spark 2.0 read.df is failing to read files that have a "," (comma) in the file name.
The data files I generated have commas in their names, something like these: 201448-0,004 201448-0,005 201448-0,006
After painful hours of debugging the issue, it finally started reading the data when I removed the "," from the file names.
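If there are many such objects, stripping the commas can be scripted rather than done by hand; here is a rough sketch with the google-cloud-storage Python client (the bucket name and prefix are taken from the paths above, and the replacement character is an arbitrary choice):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("dev.appspot.com")

# Rename (copy + delete under the hood) every object whose name contains a comma.
for blob in client.list_blobs(bucket, prefix="myData/"):
    if "," in blob.name:
        bucket.rename_blob(blob, blob.name.replace(",", "-"))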
