I'm trying to inspect a CSV file and there are no findings being returned (I'm using the EMAIL_ADDRESS info type and the addresses I'm using are coming up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV, does that mean one can't just send in the file, and needs to use the table attribute instead of byte_item as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about XLSX files, for example? They're an unspecified file type, so I tried a configuration like this, but it still returned no findings:
byte_item: {
type: :BYTES_TYPE_UNSPECIFIED,
data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
}
I'm able to do inspection and redaction with images and text fine, but having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans@gmail.com,anotehu,steve@example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
parent = "projects/xxxxxxxx/global"
inspect_config = {
info_types: [{name: "EMAIL_ADDRESS"}],
min_likelihood: :POSSIBLE,
limits: { max_findings_per_request: 0 },
include_quote: true
}
request = {
parent: parent,
inspect_config: inspect_config,
item: {
byte_item: {
type: :CSV,
data: File.open('/xxxxx/dlptest.csv', 'r').read
}
}
}
dlp = Google::Cloud::Dlp.dlp_service
response = dlp.inspect_content(request)
The CSV file I was testing with was one I created in Google Sheets and exported as CSV; locally, however, it was detected as "text/plain; charset=us-ascii". I then downloaded a CSV from the internet that was detected as "text/csv; charset=utf-8", and that one worked. So it looks like my issue was specifically due to the file having an incorrect MIME type.
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)
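On the table question raised above: a minimal Ruby sketch of turning CSV text into the `table` form of a content item, following the structured-text page linked in the question. The helper name `csv_to_dlp_table` and the sample data are made up for illustration, and the sketch assumes the first row holds column names and every row has the same number of cells. It only builds the hash; nothing here calls the API.

```ruby
require "csv"

# Hypothetical helper (not part of the DLP client library): converts CSV text
# into the hash shape of a `table` content item.
# Assumption: the first row is a header row and all rows have equal length.
def csv_to_dlp_table(csv_text)
  header, *body = CSV.parse(csv_text)
  {
    headers: header.map { |name| { name: name } },
    rows: body.map do |cells|
      # Empty cells parse as nil, so to_s turns them into empty strings.
      { values: cells.map { |cell| { string_value: cell.to_s } } }
    end
  }
end

# Made-up sample data for illustration only.
sample = "email,note\njane@example.com,first\njohn@example.com,second\n"
item = { table: csv_to_dlp_table(sample) }
```

The resulting `item` would take the place of the `byte_item` hash in the request shown in the question.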
I want to store my data without losing the header row.
This is my Pig script:
CRE_GM05 = LOAD '$input1' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,T32_001:chararray,TEC_013:chararray,TEC_014:chararray,DAT_001_X:chararray,DAT_002_X:chararray,TEC_001:chararray);
CRE_GM11 = LOAD '$input2' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,DAT_001_X:chararray,DAT_002_X:chararray,D08_001:chararray,PSE_001:chararray,PSE_002:chararray,PSE_003:chararray,RUB_001:chararray,RUB_002:chararray,RUB_003:chararray,RUB_004:chararray,RUB_005:chararray,RUB_006:chararray,RUB_007:chararray,RUB_008:chararray,RUB_009:chararray,RUB_010:chararray,TEC_001:chararray,TEC_002:chararray,TEC_003:chararray,TX_001_VLR:chararray,TX_001_DCM:chararray,D08_004:chararray,D11_004:chararray,RUB_016:chararray,T03_001:chararray);
-- Join the two tables
JOINED_TABLES = JOIN CRE_GM05 BY TEC_001, CRE_GM11 BY TEC_001;
-- Generate the columns
DATA_GM05 = FOREACH JOINED_TABLES GENERATE
CRE_GM05::MGM_COMPTEUR AS MGM_COMPTEUR,
CRE_GM05::CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
CRE_GM05::CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
CRE_GM05::CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
CRE_GM05::CIA_IDC_EXTR_RDJ AS CIA_IDC_EXTR_RDJ,
CRE_GM05::CIA_VLR_IDT_CRV_LOQ AS CIA_VLR_IDT_CRV_LOQ,
CRE_GM05::CIA_VLR_REF_CRV AS CIA_VLR_REF_CRV,
CRE_GM05::CIA_VLR_LG_ZON_RTG AS CIA_VLR_LG_ZON_RTG,
CRE_GM05::CIA_HEU_CIA AS CIA_HEU_CIA,
CRE_GM05::CIA_TM_STP_CRE AS CIA_TM_STP_CRE,
CRE_GM05::CIA_VLR_1 AS CIA_VLR_1,
CRE_GM05::CIA_DA_ARR_FIC AS CIA_DA_ARR_FIC,
CRE_GM05::CIA_TY_ENR AS CIA_TY_ENR,
CRE_GM05::CIA_CD_BTE AS CIA_CD_BTE,
CRE_GM05::CIA_CD_PER AS CIA_CD_PER,
CRE_GM05::CIA_CD_EFS AS CIA_CD_EFS,
CRE_GM05::CIA_CD_ETA_VAL_CRV AS CIA_CD_ETA_VAL_CRV,
CRE_GM05::CIA_CD_EVE_CPR AS CIA_CD_EVE_CPR,
CRE_GM05::CIA_CD_APLI_TDU AS CIA_CD_APLI_TDU,
CRE_GM05::CIA_CD_STE_RTG AS CIA_CD_STE_RTG,
CRE_GM05::CIA_DA_TT_RTG AS CIA_DA_TT_RTG,
CRE_GM05::CIA_NO_ENR_RTG AS CIA_NO_ENR_RTG,
CRE_GM05::CIA_DA_VAL_EVE AS CIA_DA_VAL_EVE,
CRE_GM05::T32_001 AS T32_001,
CRE_GM05::TEC_013 AS TEC_013,
CRE_GM05::TEC_014 AS TEC_014,
CRE_GM05::DAT_001_X AS DAT_001_X,
CRE_GM05::DAT_002_X AS DAT_002_X,
CRE_GM05::TEC_001 AS TEC_001;
STORE DATA_GM05 INTO '$OUTPUT_FILE' USING PigStorage(';');
It returns data, but I lose the first line, which contains the headers!
Note that my $input1 and $input2 variables are CSV files.
I tried using CSVLoader, but it doesn't work either.
I need the output to be stored with headers, please.
By default, Pig does not write headers to its final output. Adding a header there also makes little sense, because the order of rows in Pig output is not fixed.
If you want a header in the final output, either merge all the part files into a single file on the local file system, where you can add the header explicitly, or store the output of this Pig script in a Hive table. HCatalog's HCatStorer can be used for that.
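The first option (merging the part files locally and adding the header explicitly) could be sketched like this in Ruby. The helper name `merge_parts_with_header` and its arguments are hypothetical, and the sketch assumes the part files have already been copied from HDFS to a local directory and are named `part-*`.

```ruby
# Hypothetical helper: concatenates Pig part files from a local directory into
# one output file, writing an explicit header line first.
# Assumption: part files were already copied locally and match "part-*".
def merge_parts_with_header(part_dir, out_path, columns, sep = ";")
  File.open(out_path, "w") do |out|
    # Header line built from the column names, using the same separator as the data.
    out.puts(columns.join(sep))
    # Sort so the part files are appended in a stable order.
    Dir.glob(File.join(part_dir, "part-*")).sort.each do |part|
      File.foreach(part) { |line| out.puts(line.chomp) }
    end
  end
end

# Example call (paths are placeholders):
# merge_parts_with_header("/tmp/pig_out", "/tmp/data_gm05.csv", %w[MGM_COMPTEUR TEC_001])
```

The column list passed in should match the fields generated in the final FOREACH, in the same order.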
I have JSON files saved in an S3 bucket. I am trying to load them as a DataFrame in SparkR and I am getting errors in the log. The following is my code. Where am I going wrong?
devtools::install_github('apache/spark@v2.2.0',subdir='R/pkg',force=TRUE)
library(SparkR)
sc=sparkR.session(master='local')
Sys.setenv("AWS_ACCESS_KEY_ID"="xxxx",
"AWS_SECRET_ACCESS_KEY"= "yyyy",
"AWS_DEFAULT_REGION"="us-west-2")
movie_reviews <-SparkR::read.df(path="s3a://bucketname/reviews_Movies_and_TV_5.json",sep = "",source="json")
I have tried all combinations of s3a, s3n, and s3, and none seems to work.
I get the following error log in my SparkR console:
17/12/09 06:56:06 WARN FileStreamSink: Error while looking for metadata directory.
17/12/09 06:56:06 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.reflect.InvocationTargetException
This works for me:
read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
What @Ankit said should work, but if you are trying to get something that looks more like a data frame, you need to use a select statement, i.e.:
rdd<- read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
Then do a printSchema(rdd) to see the structure of the data.
If you see root followed by your fields with no further indentation, you can probably go ahead and select using the names of the "columns" you want. If you see branching further down your schema tree, you may have to put a headers.blah or a payload.blah in your select statement. Like this:
sdf<- SparkR::select(rdd, "headers.something", "headers.somethingElse", "payload.somethingInPayload", "payload.somethingElse")
Controller page, JSON output from API
I'm trying to display posts from a user's Tumblr account on my view page using Ruby. I have never done anything with APIs before. I'm trying to use hashes. My controller code is as follows:
@Posts = client.posts("zombieprocess1.tumblr.com")
On my view page, in the HTML, I have:
<%=Posts%>
The response looks like this:
{"blog"=>{"title"=>"Untitled", "name"=>"zombieprocess1", "total_posts"=>1, "posts"=>1, "url"=>"URL", "updated"=>1478191052, "description"=>"", "is_nsfw"=>false, "ask"=>false, "ask_page_title"=>"Ask me anything", "ask_anon"=>false, "followed"=>false, "can_send_fan_mail"=>true, "is_blocked_from_primary"=>false, "share_likes"=>true, "likes"=>1, "twitter_enabled"=>false, "twitter_send"=>false, "facebook_opengraph_enabled"=>"N", "tweet"=>"N", "facebook"=>"N", "followers"=>0, "primary"=>true, "admin"=>true, "messages"=>0, "queue"=>0, "drafts"=>0, "type"=>"public", "reply_conditions"=>3, "subscribed"=>false, "can_subscribe"=>false}, "posts"=>[{"blog_name"=>"zombieprocess1", "id"=>152689921093, "post_url"=>"URL", "slug"=>"", "type"=>"photo", "date"=>"2016-11-03 16:37:32 GMT", "timestamp"=>1478191052, "state"=>"published", "format"=>"html", "reblog_key"=>"NCDqGTzW", "tags"=>[], "short_url"=>"URL", "summary"=>"", "recommended_source"=>nil, "recommended_color"=>nil, "followed"=>false, "liked"=>true, "note_count"=>1, "caption"=>"", "reblog"=>{"tree_html"=>"", "comment"=>""}, "trail"=>[], "image_permalink"=>"url", "photos"=>[{"caption"=>"", "alt_sizes"=>[{"url"=>"URL", "width"=>400, "height"=>544}, {"url"=>"URL", "width"=>250, "height"=>340}, {"url"=>"URL", "width"=>100, "height"=>136}, {"url"=>"URL", "width"=>75, "height"=>75}], "original_size"=>{"url"=>"URL", "width"=>400, "height"=>544}}], "can_like"=>false, "can_reblog"=>true, "can_send_in_message"=>true, "can_reply"=>true, "display_avatar"=>true}], "total_posts"=>1}
I have tried many different formats and can't seem to get just the post URL. My thought is to get the post_url of each post and embed it so it shows on my webpage as it would on Tumblr. Can anyone help me?
Try this:
@posts['posts'].first['post_url']
Or, if your response contains more than one post, you can return them all like this:
(0...@posts['total_posts']).map { |i| @posts['posts'][i]['post_url'] }
EDIT: Fixed capitalization in the second line of code. I assume you are using a lowercase variable name, '@posts', as per Ruby convention. If not, you should switch to a lowercase name, as the capitalized one may be causing confusion.
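As a hedged alternative to indexing up to total_posts, here is a small sketch that maps directly over the posts array. The `response` hash is a trimmed-down stand-in for the output shown in the question (placeholder values only), not real API output. Iterating over the posts actually present avoids an error if total_posts is ever larger than the number of posts in a single response.

```ruby
# Trimmed-down stand-in for the hash returned by client.posts; the values are
# placeholders copied from the shape shown in the question.
response = {
  "blog" => { "name" => "zombieprocess1", "total_posts" => 1 },
  "posts" => [
    { "id" => 152689921093, "type" => "photo", "post_url" => "URL" }
  ],
  "total_posts" => 1
}

# Collect the post_url of every post present in the response.
# fetch with a default keeps this from failing if "posts" is missing.
post_urls = response.fetch("posts", []).map { |post| post["post_url"] }
```

In the controller, the same `map` call would be applied to the instance variable holding the response, and the view can then loop over the resulting array.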
In my project, I want to insert this data:
"\n<p>sadasdasdsad</p>"
I use:
CSParameterCollection parameters = new CSParameterCollection();
parameters.Add("@title", f.Title.ToString());
parameters.Add("@sumary", f.Summary.ToString());
parameters.Add("@link[",f.Links[0].Uri.ToString());
parameters.Add("@datetime", f.PublishDate.ToString());
parameters.Add("@cid", idCate.ToString());
CSDatabase.ExecuteNonQuery("INSERT INTO Acticle (title,sumary,link,datetime,cid) VALUES ('@title','@sumary','@link','@datetime','@cid')",parameters);
but it never completes.
Please help me!
From a quick look at that, you have:
parameters.Add("@link[",f.Links[0].Uri.ToString());
Do you need that [ in there, after @link, or is it a typo?