Unable to access documents containing predicted label - text-classification

I am using RTextTools to apply a model I previously trained, but I am unable to extract each document together with its predicted label. I have tried looking at analytics@document_summary and inspecting the container, but I need some help getting the approximately 25K documents and their predicted labels into a data frame for subsequent processing. Here is the relevant snippet of code:
preddata = read.csv(...)   # read the documents to classify from CSV (path omitted)
predsize = nrow(preddata)
preddata$topic <- rep(NA,predsize)
predmatrix <- create_matrix(preddata, originalMatrix=dtMatrix)
predcontainer <- create_container(predmatrix, preddata$topic, testSize=1:predsize, virgin=TRUE)
predresults <- classify_models(predcontainer, models)

Related

Shiny app - aggregate data set by selection_filter and create new variables

I am quite new to R and have been trying to find a solution to my problem for weeks. I hope someone can help me.
1. I want to develop a Shiny app in a dashboard, where the user can select values via selection filters (e.g. the value "40-49 years" from the variable "age group" and the value "female" from "sex"). Based on these selections, columns (e.g. columns x, y, and z) from the original dataset will be aggregated. I already wrote a function using aggregate().
2. Based on the aggregated columns, new values shall be calculated (e.g. d = (x - y)/(z/2)).
3. The aggregated columns and the newly calculated values shall be displayed in a table to the user.
The function from step 1:
aggreg.function <- function(a, b, c) {
  agg.data <- aggregate(cbind(x, y, z), shared_Cervix, sum,
                        subset = c(!AgeGroup %in% a & !Sex %in% b & !Edition %in% c))
  # Calculate new values
  agg.data$d <- agg.data$x + agg.data$y
  agg.data$f <- (agg.data$x + agg.data$y)/(agg.data$z/2)
  View(m.agg.data)
}

user_data <- reactive({
  aggreg.function(input$AgeGroup, input$Sex, input$Edition)
})
EDIT
Thanks for the recommendations. I changed my code, but now I struggle a bit with adding new columns. In total, I want to insert 17 new columns based on the filtered table (data_step2()). Is there a way to insert multiple columns at the same time? In my example: is it possible to combine data_step3 and data_step4?
ui <- fluidPage(
  selectInput("Age", "Age:", sort(unique(Complete_test$Age))),
  selectInput("Race", "Race:", sort(unique(Complete_test$Race))),
  selectInput("Stage", "Stage:", sort(unique(Complete_test$Stage))),
  selectInput("Grade", "Grade:", sort(unique(Complete_test$Grade))),
  selectInput("Edition", "Edition:", sort(unique(Complete$Edition))),
  DT::dataTableOutput("filtered.result")
)

server = function(input, output) {
  data_step1 <- reactive({
    Complete %>% filter(Age %in% input$Age & Stage %in% input$Stage & Grade %in% input$Grade & Race %in% input$Race & Edition %in% input$Edition)
  })
  data_step2 <- reactive({
    data_step1() %>% group_by(Age, Stage, Grade, Race, Edition, Year) %>% summarise(across(everything(), sum))
  })
  # is it possible to combine data_step3 and data_step4?
  data_step3 <- reactive({
    data_step2() %>% mutate(xy = x + y)
  })
  data_step4 <- reactive({
    data_step3() %>% mutate(w = xy/(x2))
  })
  output$filtered.result <- DT::renderDataTable({
    data_step4()
  })
}

shinyApp(ui, server)
Here's something you might be able to do for step 1 of your question, although it's hard to tell what your data looks like and what your end result should look like. I'm assuming you want to allow a user of your dashboard to select the AgeGroup and Sex variables to view data, so in the UI of the application two selectInput functions are used to provide that functionality. On the server side, two reactive statements filter the data as the user changes the inputs: each one requires an input and then filters the dataset by that input. Notice that in the second reactive statement "data_step1()" is used instead of "data_step1"; this keeps the reactivity of these statements updating as the user changes inputs.
For steps 2 and 3, you'll want to use the "data_step2()" dataset to add the new columns, and then you can use a function such as renderDataTable to display the output on your dashboard.
library(shiny)
library(dplyr)

ui <- fluidPage(
  selectInput("AgeGroup", "AgeGroup:", sort(unique(Data$AgeGroup))),
  selectInput("Sex", "Sex:", sort(unique(Data$Sex)))
)

server <- function(input, output, session) {
  data_step1 <- reactive({
    req(input$AgeGroup)                            # Require input
    Data %>% filter(AgeGroup %in% input$AgeGroup)  # Filter full dataset by AgeGroup input
  })
  data_step2 <- reactive({
    req(input$Sex)                                 # Require input
    data_step1() %>% filter(Sex %in% input$Sex)    # Filter the AgeGroup-filtered data by Sex input
  })
  # Do next steps with the filtered data_step2() dataset...
}

# Run the application
shinyApp(ui = ui, server = server)

h2o H2OAutoEncoderEstimator

I was trying to detect outliers using the H2OAutoEncoderEstimator.
Basically I load 4 KPIs from a CSV file.
For each KPI I have 1 month of data.
The data in the CSV file have been created manually and are the same for each KPI.
The following picture shows the trend of the KPIs:
The first black vertical line (x=4000) indicates the end of the training data.
All the other light black vertical lines indicate the data that I use to detect the outliers each time.
As you can see, the data are very regular (I've copied and pasted the first 1000 rows 17 times).
This is what my code does:
Loads the data from a CSV file (one row represents the values of all KPIs at a specific timestamp)
Trains the model using the first 4000 timestamps
Starting from timestamp 4001, every 250 timestamps, calls model.anomaly to detect the outliers in a specific window (250 timestamps)
My questions are:
Is it normal that the errors returned by model.anomaly increase with every call (from 0.1 to 1.8)?
If I call model.train again, will the training be performed from scratch, replacing the existing model, or will the existing model be updated with the new data provided?
This is my Python code:
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()

data = loadDataFromCsv()
nOfTimestampsForTraining = 4000
frTrain = h2o.H2OFrame(data[:nOfTimestampsForTraining])
colsName = frTrain.names

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[5, 4, 3],
                                l1=1e-5,
                                ignore_const_cols=False,
                                autoencoder=True,
                                epochs=100)
# Train on the first 4000 timestamps
model.train(x=colsName, training_frame=frTrain)

# Init indexes: detection starts right after the training window
nOfTimestampsForWindows = 250
fromIndex = nOfTimestampsForTraining
toIndex = fromIndex + nOfTimestampsForWindows

# Perform the outlier detection every nOfTimestampsForWindows timestamps
while toIndex <= len(data):
    frTest = h2o.H2OFrame(data[fromIndex:toIndex])
    error = model.anomaly(frTest)
    df = error.as_data_frame()
    print(df)
    print(df.describe())
    # Adjust indexes for the next window
    fromIndex = toIndex
    toIndex = fromIndex + nOfTimestampsForWindows

union not happening with Spark transform

I have a Spark stream into which records are flowing, and the interval size is 1 second.
I want to union all the data in the stream. So I have created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I am expecting this empty RDD to have all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
        .transform(rdd -> rdd.union(records));

transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have transformedMessages = messages (leaving aside the flatMap function, which is not relevant to this discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema);, records is still empty. That never changes in the flow of the program, so it remains empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
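A rough Java translation of that Scala line is sketched below (variable names follow the snippet above). Note that a Java lambda can only capture effectively final locals, so records has to live in something mutable; here an AtomicReference is used, but a class field would also work.
import java.util.concurrent.atomic.AtomicReference;

// Mutable holder, because a Java lambda cannot reassign a captured local variable.
AtomicReference<JavaRDD<Row>> records = new AtomicReference<>(ss.emptyDataFrame().toJavaRDD());

JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record));

// foreachRDD runs on the driver, so we can update the holder for every micro-batch.
transformedMessages.foreachRDD(rdd -> {
    records.set(records.get().union(rdd));
    System.out.println("Accumulated count: " + records.get().count());
});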
That said, please note that this approach iteratively adds to the lineage of the records RDD and also accumulates all the data over time. It is not a job that can run stably for a long period: eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach does not seem scalable or sustainable.

How to achieve dimensional charting on large dataset?

I have successfully used a combination of crossfilter, dc, and d3 to build multivariate charts for smaller datasets.
My current system handles 1.5 million txns a day, and I want to use the above combination to show dimensional charts on this large dataset (spanning 6 months). I cannot push data of this size to the frontend for obvious reasons.
The txn data has second-level granularity, but this level of granularity is not required in the visualization. If the txn data can be rolled up to a granularity of a day at the backend and the day-based aggregation pushed to the frontend, it can drastically reduce the I/O traffic and the size of the data given to crossfilter and dc, and dc can then show its visualization magic.
Taking the above idea forward, I decided to reduce the size of the data by reducing the granularity of the time-series data from seconds to days, pre-aggregating the data across various dimensions using the GROUP BY query below (this is similar to what crossfilter does, but at the frontend):
SELECT TRUNC(DATELOGGED) AS DTLOGGED, CODE, ACTION, COUNT(*) AS TXNCOUNT,
       GROUPING_ID(TRUNC(DATELOGGED), CODE, ACTION) AS grouping_id
FROM AAAA
GROUP BY GROUPING SETS(TRUNC(DATELOGGED),
                       (TRUNC(DATELOGGED), CURR_CODE),
                       (TRUNC(DATELOGGED), ACTION));
Sample output rows are shown below.
Rows in which aggregation is done by (TRUNC(DATELOGGED), CODE) share a common grouping_id of 1, and rows aggregated by (TRUNC(DATELOGGED), ACTION) share a common grouping_id of 2.
//group by DTLOGGED, CODE
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":69,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":20,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":254,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":961,"GROUPING_ID":1},
//group by DTLOGGED, ACTION
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":373600,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":48978,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":402311,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":54910,"GROUPING_ID":2},
//group by DTLOGGED
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":460732,"GROUPING_ID":3},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":496060,"GROUPING_ID":3}];
Questions:
These rows are disjoint, i.e. not like usual rows where each row has valid values for both CODE and ACTION.
After a selection is made in one of the graphs, the redrawing effect either removes the other graphs or shows no data on them.
Please give me any troubleshooting help or suggest better ways to solve this.
http://jsfiddle.net/universallocalhost/5qJjT/3/
There are a couple of things going on in this question, so I'll try to separate them:
Crossfilter works with tidy data
http://vita.had.co.nz/papers/tidy-data.pdf
This means that you will need to come up with a naive method of filling in the nulls you're seeing (or, if need be, omit the nulled values in your initial query of the data). If you want to get really fancy, you could even infer the null values based on other data. Whatever your solution, you need to make your data tidy prior to putting it into crossfilter.
Groups and Filtering Operations
txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    if (d.GROUPING_ID === 1) {
        return d.TXNCOUNT;
    } else {
        return 0;
    }
});
This is a filtering operation done in the reduction, and it is something you should separate out. Allow that filtering to occur elsewhere (either in the visual, in crossfilter itself, or in the query on the data).
This means your reduceSum calls become:
var txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    return d.TXNCOUNT;
});
And if you would like the user to select which group to display:
var groupId = cfdata.dimension(function(d) { return d.GROUPING_ID; });
var groupIdGroup = groupId.group(); // this is an interesting name

dc.pieChart("#group-chart")
    .width(250)
    .height(250)
    .radius(125)
    .innerRadius(50)
    .transitionDuration(750)
    .dimension(groupId)
    .group(groupIdGroup)
    .renderLabel(true);
For an example of this working:
http://jsfiddle.net/b67pX/

Efficient way to delete multiple rows in HBase

Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it's not suitable for HBase?
There is a table, say 'chart', which contains items that are in charts. Row keys are in the following format:
chart|date_reversed|ranked_attribute_value_reversed|content_id
Sometimes I want to regenerate the chart for a given date, so I want to delete all rows from 'chart|date_reversed_1' up to 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row found by a Scan? All the rows to be deleted are going to be close to each other.
I need to delete the rows because I don't want one item (one content_id) to have multiple entries, which it will have if its ranked_attribute_value has changed (that change is the reason why the chart needs to be regenerated).
Being an HBase beginner, perhaps I am misusing rows for something that columns would be better for; if you have design suggestions, cool! Or maybe the charts are better generated into a file (e.g. no HBase for output)? I'm using MapReduce.
Firstly, on the point of range deletes: there is no range delete in HBase yet, AFAIK. But there is a way to delete more than one row at a time via the HTableInterface API. Simply form a Delete object for each row key returned by a scan, put them in a List, pass the list to the delete API, and you're done! To make the scan faster, do not include any column family in the scan result, as all you need is the row key for deleting whole rows.
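A minimal sketch of that pattern (the table handle and the start/stop row keys are placeholders for your own 'chart|date_reversed' keys):
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

// 'table' is your HTableInterface; startRowKey/stopRowKey bound the date range to wipe.
Scan scan = new Scan(startRowKey, stopRowKey);
scan.setFilter(new FirstKeyOnlyFilter());   // keep the scan light: only row keys are needed

List<Delete> deletes = new ArrayList<Delete>();
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result result : scanner) {
        deletes.add(new Delete(result.getRow()));
    }
} finally {
    scanner.close();
}
table.delete(deletes);                      // one batched call instead of one RPC per row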
Secondly, about the design. My understanding of the requirement is: there are contents with content ids, charts are generated against each content and stored, there can be multiple charts per content (by date and depending on the rank), and we want the last generated content's chart to show at the top of the table.
For that assumption of the requirement, I would suggest using three tables: auto_id, content_charts and generated_order. The row key for content_charts would be the content id, and the row key for generated_order would be a long, auto-decremented using the HTableInterface API. For decrementing, use '-1' as the amount to offset, and initialize the value to Long.MAX_VALUE in the auto_id table at the first start-up of the app (or manually). Now, if you want to delete the chart data, simply clean the column family using a delete, put back the new data, and then put a row into the generated_order table. This way the latest insertion will also be at the top of the generated_order table, which holds the content id as a cell value. If you want to ensure generated_order has only one entry per content, save the generated_order id first, store that value in content_charts when putting, and, before deleting the column family, first delete the row from generated_order. This way you can look up the charts for a content with at most 2 gets, and no scan is required for the charts.
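A rough sketch of the auto-decrement step described above (the table handles, the column family "f", the qualifiers, and the contentId variable are all hypothetical names):
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// 'autoIdTable' and 'generatedOrderTable' are HTableInterface handles for the auto_id and
// generated_order tables. The counter cell in auto_id was initialized once to Long.MAX_VALUE.
long orderId = autoIdTable.incrementColumnValue(Bytes.toBytes("counter"),
                                                Bytes.toBytes("f"),
                                                Bytes.toBytes("next"),
                                                -1L);   // decrement by 1

// Use the decremented value as the row key: smaller values sort first, so the latest
// insertion ends up at the top of generated_order.
Put put = new Put(Bytes.toBytes(orderId));
put.add(Bytes.toBytes("f"), Bytes.toBytes("content_id"), Bytes.toBytes(contentId));
generatedOrderTable.put(put);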
I hope this is helpful.
You can use the BulkDeleteProtocol which uses a Scan that defines the relevant range (start row, end row, filters).
See here
I ran into the same situation, and this is my code to implement what you want:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("Family"));
scan.setStartRow(structuredKeyMaker.key(startDate));
scan.setStopRow(structuredKeyMaker.key(endDate + 1));

try {
    ResultScanner scanner = table.getScanner(scan);
    // EntityIteratorWrapper is a simple iterator that maps rows to my own entity type, not so important!
    Iterator<Entity> entityIterator = new EntityIteratorWrapper(scanner.iterator(), EntityMapper.create());

    List<Delete> deletes = new ArrayList<Delete>();
    int bufferSize = 10000000; // needed so I don't run out of memory, as I have a huge amount of data; a simple in-memory buffer
    int counter = 0;

    while (entityIterator.hasNext()) {
        if (counter < bufferSize) {
            // the key maker is used to extract the key as byte[] from my entity
            deletes.add(new Delete(KeyMaker.key(entityIterator.next())));
            counter++;
        } else {
            table.delete(deletes);
            deletes.clear();
            counter = 0;
        }
    }

    if (deletes.size() > 0) {
        table.delete(deletes);
        deletes.clear();
    }
} catch (IOException e) {
    e.printStackTrace();
}
