How to access each row of a DataFrame in SparkR

I am running R on Spark using SparkR. I have created a DataFrame from a CSV file. Now I need to access each row, as well as the data in that row. Is there any method to do that?

In SparkR there is no direct way to index into an individual row of a distributed DataFrame. The practical approach is to convert the SparkR DataFrame into a local R data frame with collect():
> R_people <- collect(people)
> head(R_people)
##   age    name
## 1  NA Michael
## 2  30    Andy
## 3  19  Justin
> R_people$age[3]
## 19
# You can then use such a value to filter rows in the SparkR DataFrame `people`:
> showDF(filter(people, people$age == R_people$age[3]))
##  age   name
## 1  19 Justin
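For completeness, here is a minimal end-to-end sketch, assuming a Spark 2.x SparkR session and a hypothetical people.csv; read.df() and collect() are standard SparkR calls, and everything after collect() is plain base R:

```r
library(SparkR)
sparkR.session()   # start the SparkR session (Spark 2.x API)

# Read the CSV into a distributed SparkR DataFrame (the file name is hypothetical)
people <- read.df("people.csv", source = "csv", header = "true", inferSchema = "true")

# Bring the data back to the driver as a local R data.frame
R_people <- collect(people)

# Ordinary R indexing now works, row by row
for (i in seq_len(nrow(R_people))) {
  row <- R_people[i, ]
  print(row$name)
}
```

Keep in mind that collect() pulls the whole dataset to the driver, so this only makes sense when the data fits in local memory.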

Related

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0) while fine-tuning with gpt2.finetune

Hope you are all doing well.
I am working on fine-tuning a GPT-2 model to generate a title based on the content. While working on it, I created a simple CSV file containing only the titles to train the model, but when I feed this file to GPT-2 for fine-tuning I get the following error:
JSONDecodeError                           Traceback (most recent call last)
in ()
     10     steps=1000,
     11     save_every=200,
---> 12     sample_every=25)  # steps is max number of training steps
     13
     14 # gpt2.generate(sess)

3 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336         if s.startswith('\ufeff'):
    337             s = s.encode('utf8')[3:].decode('utf8')
--> 338             # raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
    339             #                       s, 0)
    340         else:

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Below is my code for the above:
import gpt_2_simple as gpt2

model_name = "120M"  # "355M" for the larger model (it's 1.4 GB)
gpt2.download_gpt2(model_name=model_name)  # model is saved into the current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'titles.csv',
              model_name=model_name,
              steps=1000,
              save_every=200,
              sample_every=25)  # steps is max number of training steps
I have tried all the basic mechanisms for handling a UTF-8 BOM but have not had any luck, hence I am requesting your help. It would be greatly appreciated.
Try changing the model name: I see you passed "120M", but the GPT-2 model is called "124M".
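A minimal sketch of the corrected call, assuming the gpt-2-simple package and the titles.csv file from the question; the only change is using a model name that the library actually ships ("124M", "355M", "774M" or "1558M"):

```python
import gpt_2_simple as gpt2

model_name = "124M"  # valid GPT-2 sizes are "124M", "355M", "774M" and "1558M"
gpt2.download_gpt2(model_name=model_name)   # downloads into ./models/124M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'titles.csv',           # training file from the question
              model_name=model_name,
              steps=1000,
              save_every=200,
              sample_every=25)
```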

Can someone explain the output of orcfiledump?

My table test_orc contains (for one partition):
col1 col2 part1
abc def 1
ghi jkl 1
mno pqr 1
koi hai 1
jo pgl 1
hai tre 1
By running
hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0
I get the following:
Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .
Rows: 6 .
Compression: ZLIB .
Compression size: 262144 .
Type: struct<_col0:string,_col1:string> .
Stripe Statistics:
Stripe 1:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
File Statistics:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
Stripes:
Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .
Stream: column 0 section ROW_INDEX start: 3 length 9 .
Stream: column 1 section ROW_INDEX start: 12 length 29 .
Stream: column 2 section ROW_INDEX start: 41 length 29 .
Stream: column 1 section DATA start: 70 length 20 .
Stream: column 1 section LENGTH start: 90 length 12 .
Stream: column 2 section DATA start: 102 length 21 .
Stream: column 2 section LENGTH start: 123 length 5 .
Encoding column 0: DIRECT .
Encoding column 1: DIRECT_V2 .
Encoding column 2: DIRECT_V2 .
What does the part about stripes mean?
First, let's look at how an ORC file is laid out. Here are the key terms from the ORC file structure that also appear in your question:
Stripe - a chunk of data stored in the ORC file. Every ORC file is divided into these chunks, called stripes, each roughly 250 MB by default, holding index data, the actual row data, and some metadata for the row data stored in that stripe.
Compression - the compression codec used to compress the stored data. ZLIB is the default for ORC.
Index Data - includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups, not for answering queries.
Row Data - the actual data; used in table scans.
Stripe Footer - contains the encoding of each column and the directory of the streams, including their locations. To describe each stream, ORC stores the kind of stream, the column id, and the stream's size in bytes. The details of what is stored in each stream depend on the type and encoding of the column.
Postscript - holds compression parameters and the size of the compressed footer.
File Footer - contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.
Now, about your output from orcfiledump.
First comes general information about your file: the name, location, compression codec, compression size, and so on.
The Stripe Statistics section lists every stripe in your ORC file along with its statistics: counts plus min, max, and sum where applicable (for string columns, sum is the total length of the values).
File Statistics is the same information, but aggregated over the complete file rather than per stripe.
The last part, the Stripes section, lists the streams written for each column and the corresponding index and encoding information.
Also, you can use various options with orcfiledump to get the output you need. Here is a handy guide:
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
[--backup-path <new-path>] <location-of-orc-file-or-directory>
And here is a quick guide to the options used in the commands above:
Specifying -d in the command will cause it to dump the ORC file data
rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma separated list of column ids will
cause it to print row indexes for the specified columns, where 0 is
the top level struct containing all of the columns and 1 is the first
column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the
writer.
Specifying -j in the command will print the ORC file metadata in JSON
format. To pretty print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file
generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery
without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool
move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
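For example, a few invocations against the file from the question, using only the flags documented above (subject to the Hive version notes given there):

```
# Dump the ORC metadata as pretty-printed JSON (Hive 1.3.0 and later)
hive --orcfiledump -j -p /hive/user.db/test_orc/part1=1/000000_0

# Print row indexes for columns 1 and 2 (Hive 1.1.0 and later)
hive --orcfiledump --rowindex 1,2 /hive/user.db/test_orc/part1=1/000000_0

# Dump the actual rows instead of the metadata (Hive 1.1.0 and later)
hive --orcfiledump -d /hive/user.db/test_orc/part1=1/000000_0
```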
Hope that helps!

data.frame with Date column outputs in the RStudio console and preview, but not below the chunk

Using RStudio 3.3.2's notebook:
---
title: "R Notebook"
output: html_notebook
---
When trying to display a data.frame with a Date column, the data.frame is displayed in the Viewer tab, but not below the chunk itself:
```{r}
df <- data.frame(date = c("31/08/2011", "31/07/2011", "30/06/2011"), values = c(0.8378, 0.8457, 0.8147))
# no Date format -> OK, output appears below the chunk
df
df$dateformatted <- as.Date(strptime(df$date, '%d/%m/%Y'))
# with Date format -> not OK, no output below the chunk, only in the Viewer
df
```
RStudio diagnostics:
26 Feb 2017 20:42:00 [rsession-x] ERROR r error 7 (Unexpected data type); OCCURRED AT: rstudio::core::Error rstudio::r::json::{anonymous}::jsonValueFromVectorElement(SEXP, int, rstudio::core::json::Value*) /home/ubuntu/rstudio/src/cpp/r/RJson.cpp:149; LOGGED FROM: void rstudio::session::modules::rmarkdown::notebook::enqueueChunkOutput(const string&, const string&, const string&, unsigned int, ChunkOutputType, const rstudio::core::FilePath&, const Value&) /home/ubuntu/rstudio/src/cpp/session/modules/rmarkdown/NotebookOutput.cpp:449
This relates to this question.
Does anyone know what I did wrong? Thanks a lot in advance.
This is indeed a bug in the current release of RStudio: data.frames containing Date objects are not rendered properly in notebooks. You might try installing the latest daily build of RStudio and confirming the issue is resolved there:
http://dailies.rstudio.com
I appreciate Rigoberta's and Kevin's posts. I'm having the same problem (RStudio 1.0.136).
I'm wary of using the latest daily build, as described at http://dailies.rstudio.com: "Daily builds are intended for testing purposes, and are not recommended for general use. For stable builds, please visit rstudio.com."
As I have never used "unstable" versions of RStudio, rolling back the RStudio version seems the better approach for now, but opinions are appreciated.
While waiting to decide whether to move back to RStudio 1.0.44 or forward to an "unstable" build, I found out that the issue doesn't happen with matrix objects, so, temporarily, I'm using print(as.matrix()):
```{r}
df <- data.frame(date = c("31/08/2011", "31/07/2011", "30/06/2011"), values = c(0.8378, 0.8457, 0.8147))
df$dateformatted <- as.Date(strptime(df$date, '%d/%m/%Y'))
print(as.matrix(df), quote = FALSE)
```
date values dateformatted
[1,] 31/08/2011 0.8378 2011-08-31
[2,] 31/07/2011 0.8457 2011-07-31
[3,] 30/06/2011 0.8147 2011-06-30
To simulate head()'s behaviour:
print(as.matrix(df), quote = FALSE, max = length(df) * 6)
You can use this function (it assumes dplyr is loaded for %>%, ungroup() and mutate_if(), and lubridate for is.Date()):
bf <- function(x) x %>% ungroup() %>% mutate_if(is.Date, as.character)
It converts Date columns to character, so data frames containing dates display as expected:
```{r}
data.frame(date = as.Date(Sys.time()), num = 1:3) %>% bf
```
date num
2017-03-18 1
2017-03-18 2
2017-03-18 3
3 rows
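Applied to the data frame from the original question, a quick sketch (again assuming dplyr and lubridate are loaded, and using the bf helper defined above):

```{r}
library(dplyr)
library(lubridate)

bf <- function(x) x %>% ungroup() %>% mutate_if(is.Date, as.character)

df <- data.frame(date = c("31/08/2011", "31/07/2011", "30/06/2011"),
                 values = c(0.8378, 0.8457, 0.8147))
df$dateformatted <- as.Date(strptime(df$date, '%d/%m/%Y'))

bf(df)   # dateformatted is rendered as character, so the table shows below the chunk
```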

Spotfire Custom Expression: Calculate (Num/Den) Percentages

I am trying to plot Num/Den type percentages using OVER, but my idea does not seem to translate into Spotfire custom-expression syntax.
Sample Input:
RecordID CustomerID DOS Age Gender Marker
9621854 854693 09/22/15 37 M D
9732721 676557 09/18/15 65 M D
9732700 676557 11/18/15 65 M N
9777003 5514882 11/25/15 53 M D
9853242 1753256 09/30/15 62 F D
9826842 1260021 09/30/15 61 M D
9897642 3375185 09/10/15 74 M N
9949185 9076035 10/02/15 52 M D
10088610 3512390 09/16/15 33 M D
10120650 41598 10/11/15 67 F N
9949185 9076035 10/02/15 52 M D
10088610 3512390 09/16/15 33 M D
10120650 41598 09/11/15 67 F N
Expected Output:
Row Labels D Cumulative_D N Cumulative_N Percentage
Sep 6 6 2 2 33.33%
Oct 2 8 1 3 37.50%
Nov 1 9 1 4 44.44%
My counts are working.
I want to take the same Cumulative_N & Cumulative_D count and plot Percentage over [Axis.X] as a line chart.
Here's what I am using:
UniqueCount(If([Marker]="N",[CustomerID])) / UniqueCount(If([Marker]="D",[CustomerID])) THEN SUM([Value]) OVER (AllPrevious([Axis.X])) as [CumulativePercent]
I understand SUM([Value]) is not the way to go, but I don't know what to use instead.
I also tried the expression below, but it did not work either:
UniqueCount(If([Marker]="N",[CustomerID])) OVER (AllPrevious([Axis.X])) / UniqueCount(If([Marker]="D",[CustomerID])) OVER (AllPrevious([Axis.X])) as [CumulativePercent]
Can you have a look?
I found a way to make it work, but it may not fit your overall solution. I should mention I used Count() rather than UniqueCount() so that the results would mirror your desired output.
Add a transformation to your current data table
Insert a calculated column Month([DOS]) as [TheMonth]
Set Row Identifiers = [TheMonth]
Set value columns and aggregation methods to Count([CustomerID])
Set column titles to [Marker]
Leave the column name pattern as %M(%V) for %C
That will give you a new data table. Then, you can do your cumulative functions. I did them in a cross table to replicate your expected results. Insert a new cross table and set the values to:
Sum([D]), Sum([N]), Sum([D]) OVER (AllPrevious([Axis.Rows])) as [Cumulative_D],
Sum([N]) OVER (AllPrevious([Axis.Rows])) as [Cumulative_N],
Sum([N]) OVER (AllPrevious([Axis.Rows])) / Sum([D]) OVER (AllPrevious([Axis.Rows])) as [Percentage]
That should do it.
I don't know whether Spotfire released a fix or whether, thanks to everyone's input, I finally got the syntax right, but here is the solution that worked for me.
For Columns D & N,
COUNT([CustomerID])
For columns Cumulative_D & Cumulative_N,
Count([CustomerID]) OVER (AllPrevious([Axis.X])) where [Axis.X] is DOS(Month), Marker
For column Percentage,
Count(If([Marker]="N",[CustomerID])) OVER (AllPrevious([Axis.X])) / Count(If([Marker]="D",[CustomerID])) OVER (AllPrevious([Axis.X]))
where [Axis.X] is DOS(Month)

Pig 0.12.0 - extracting last two characters from a string

I am using CDH 5.5 with Pig 0.12.0. I have a chararray like this: 25 - 45, and I want to extract 25 and 45 out of this string.
So, I did this:
minValue = (int)SUBSTRING(value,0,2);
maxValue = ((int)SUBSTRING(value,6,2);
I am able to extract minValue, but unable to extract maxValue, i.e. the last two characters of the given string.
I even tried the following, but it does not work either:
maxValue = ((int)SUBSTRING(value,-2,2);
Please let me know how to make this work.
If the delimiter is always a dash (" - "), then we can split and flatten the chararray to extract the min and max values.
A = LOAD 'input.csv' USING PigStorage(',') AS (min_max:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(min_max,' - ',0)) AS (min_val:chararray, max_val:chararray);
DUMP B;
Input :
25 - 45
35 - 65
45 - 85
Output :
(25,45)
(35,65)
(45,85)
You have to pass the indexes of the specific characters to the SUBSTRING function.
Here is what you need:
maxValue = (int)SUBSTRING(value,5,7);
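Putting both answers together, a minimal Pig sketch under the assumption that every value has the exact form "NN - NN" (the file name input.csv and the field name value are placeholders); SUBSTRING(str, startIndex, stopIndex) is the same built-in used above:

```
A = LOAD 'input.csv' USING PigStorage(',') AS (value:chararray);
-- characters 0-1 are the min, characters 5-6 are the max
B = FOREACH A GENERATE (int)SUBSTRING(value, 0, 2) AS min_val,
                       (int)SUBSTRING(value, 5, 7) AS max_val;
DUMP B;
-- expected: (25,45) (35,65) (45,85)
```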
