Quanteda: display the actual difference between texts

I managed to calculate the difference between two texts with the cosine method, using the following:
library("quanteda")
dfmat <- corpus_subset(corpusnew) %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("portuguese")) %>%
dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
And I get the following matrix:
text1 text2 text3 text4 text5
text1 1.000 0.801 0.801 0.801 0.798
However, I would like to know the actual words that account for the difference, rather than just how much the texts differ or are alike. Is there a way?
Thanks

How about comparing tokens using setdiff()?
require(quanteda)
toks <- tokens(corpus(c("a b c d", "a e")))
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "d"
#>
#> text2 :
#> [1] "a" "e"
setdiff(toks[[1]], toks[[2]])
#> [1] "b" "c" "d"
setdiff(toks[[2]], toks[[1]])
#> [1] "e"

This question only has pairwise answers, since each computation of similarity occurs between a single pair of documents. It's also not entirely clear what output you want to see, so I'll take my best guess and demonstrate a few possibilities.
So if you wanted to see the features most different between text1 and text2, for instance, you could slice the documents you want to compare from the dfm, and then set margin = "features" to get the similarity of features across those documents.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
dfmat <- tokens(data_corpus_inaugural[1:5], remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm()
library("quanteda.textstats")
sim <- textstat_simil(dfmat[1:2, ], margin = "features", method = "cosine")
Now we can examine the pairwise similarities (greatest and smallest) by converting the similarity matrix to a data.frame, and sorting it.
# most similar features
as.data.frame(sim) %>%
dplyr::arrange(desc(cosine)) %>%
dplyr::filter(cosine < 1) %>%
head(10)
#> feature1 feature2 cosine
#> 1 present may 0.9994801
#> 2 country may 0.9994801
#> 3 may government 0.9991681
#> 4 present citizens 0.9988681
#> 5 country citizens 0.9988681
#> 6 present people 0.9988681
#> 7 country people 0.9988681
#> 8 present united 0.9988681
#> 9 country united 0.9988681
#> 10 present government 0.9973337
# most different features
as.data.frame(sim) %>%
dplyr::arrange(cosine) %>%
head(10)
#> feature1 feature2 cosine
#> 1 government upon 0.1240347
#> 2 government chief 0.1240347
#> 3 government magistrate 0.1240347
#> 4 government proper 0.1240347
#> 5 government arrive 0.1240347
#> 6 government endeavor 0.1240347
#> 7 government express 0.1240347
#> 8 government high 0.1240347
#> 9 government sense 0.1240347
#> 10 government entertain 0.1240347
Created on 2022-03-08 by the reprex package (v2.0.1)
There are other ways to compare the words most different between documents, such as "keyness": for instance, quanteda.textstats::textstat_keyness() between text1 and text2, where the head and tail of the resulting data.frame will tell you the most dissimilar features, as sketched below.
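A minimal keyness sketch, assuming the same dfmat as above (the docnames come from data_corpus_inaugural, so the first document is "1789-Washington"):
# quanteda.textstats is already loaded above
tstat_key <- textstat_keyness(dfmat[1:2, ], target = "1789-Washington")
head(tstat_key) # features most characteristic of the target document
tail(tstat_key) # features most characteristic of the reference document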

Related

How to create means in panel data for specific years?

I need help with a particular issue in Stata. I have a panel dataset by id year from 1996 to 2018.
The panel data combines world countries and regions: yearly observations of area cultivated for 7 different crops.
I would like to create a mean around the years 2000, 2010 and 2018, so that mean(year2000) = mean over (1999, 2000, 2001), mean(year2010) = mean over (2009, 2010, 2011) and mean(year2018) = mean over (2016, 2017, 2018), for every crop in my selection of 7 crops.
The problem is even more complicated when I need to combine some countries to form sub-regions: say I need the sub-region RUS1 = Russia + Ukraine. How can I create another variable that shows, on a yearly basis, the total area cultivated with crop1 in Russia plus the area cultivated with crop1 in Ukraine? That is, another variable that shows these sums for each year, using the above means.
I've tried with by id year: egen area_rus1=total(area) if area=="Russia" & area=="Ukraine"
but nothing works.
Because the names in area are strings, I used encode area, gen(area2), and Stata automatically generates a number.
To create a panel dataset I've used gen id = area2 + itemcode.
The panel data, after sort year, looks like the screenshot I posted (not reproduced here).
Please be aware that the period is 1996-2018; the example shows only year 1996.
This didn't get much of a response, for several reasons:
You didn't show very much code.
You didn't show data in a form that is especially useful. An image can't be copied and pasted easily into someone's Stata to allow experiment. In fact your image shows variables that are irrelevant, and variables that are different versions of each other, so it is much more complicated than we need.
You escalated the question to ask the most complicated version of what you want to know.
There is a problem you should have explained better: area is a string, so totals can't be calculated at all, and area2 is just arbitrary integers, so totals can be calculated but don't make sense. "Nothing works" is not informative as a problem report. The only totals that make sense to me are totals of value.
You need to simplify your problem first and then build up.
The essence seems to be as follows:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 country str6 item float year str1 region float value
"A" "barley" 1999 "X" 1
"B" "barley" 1999 "X" 2
"C" "barley" 1999 "Y" 3
"A" "barley" 2000 "X" 4
"B" "barley" 2000 "X" 5
"C" "barley" 2000 "Y" 6
"A" "barley" 2001 "X" 7
"B" "barley" 2001 "X" 8
"C" "barley" 2001 "Y" 9
end
* means by countries: similar variables for other periods
egen mean_9901_c = mean(cond(inrange(year, 1999, 2001), value, .)), by(country item)
* aggregation to regions, but ensure that you don't double count
egen value_region = total(value), by(region item year)
egen tag = tag(region item year)
* means by regions: similar variables for other periods
egen mean_9901_r = mean(cond(tag == 1 & inrange(year, 1999, 2001), value_region, .)), by(region item)
list, sepby(year)
+---------------------------------------------------------------------------------+
| country item year region value mean_9~c value_~n tag mean_9~r |
|---------------------------------------------------------------------------------|
1. | A barley 1999 X 1 4 3 1 9 |
2. | B barley 1999 X 2 5 3 0 9 |
3. | C barley 1999 Y 3 6 3 1 6 |
|---------------------------------------------------------------------------------|
4. | A barley 2000 X 4 4 9 1 9 |
5. | B barley 2000 X 5 5 9 0 9 |
6. | C barley 2000 Y 6 6 6 1 6 |
|---------------------------------------------------------------------------------|
7. | A barley 2001 X 7 4 15 1 9 |
8. | B barley 2001 X 8 5 15 0 9 |
9. | C barley 2001 Y 9 6 9 1 6 |
+---------------------------------------------------------------------------------+
The example shows just one item, but the code should work for several.
The example shows fake data for just three years, but means for other periods can be constructed similarly.
Results are repeated for all observations to which they apply. To see or use results just once, use if. For example, the means over 1999 to 2001 are shown for each of those years (and others), but adding if year == 1999 would be a way to see the results just once.
See also help collapse, help egen for its tag() function and this paper.
What was wrong with your code
Your problems start with
if area=="Russia" & area=="Ukraine"
which selects observations for which it is true that area is both "Russia" and "Ukraine" in the same observation, which is impossible. You need the | (or) operator there, as in if area=="Russia" | area=="Ukraine", or to approach the problem in another way.
The prefix id is wrong too. Using by id: enforces separate calculations for different values of id, which makes combining across identifiers impossible.

How to create subgroups of a fixed number of subjects out of a large group?

I am analyzing data from an experiment with 60 subjects. Each subject made a numerical decision. I need to create groups of 3 subjects out of the 60, so that at the end I will have 20 groups of three, and then take the decisions of the members of each group and calculate the sum of their values, according to group membership.
I have tried this code which works in reshuffling the group:
students=1:60;
rand_students=sample(students,length(students));
But I need not only to reshuffle them but also to pick out their decisions and then calculate their average by group.
It's unclear what programming language you are using. I will use R, but the general concept should be transferable to other languages as well.
I suggest keeping all the data for your students not in separate vectors but in one compound table. In R this would be a data.frame or one of its derivatives. With such a tabular structure, it is easy to aggregate results based on certain criteria. This can be done using base R, though I prefer packages like dplyr or data.table:
set.seed(42)
library(data.table)
# put ID and decision in one structure
student_data <- data.table(id = 1:60,
decision = runif(60))
# assign random group by adding a new column with group ids
student_data[, group := sample(rep(1:20, 3))]
# aggregating the decision by group
student_data[order(group), mean(decision), by = group]
#> group V1
#> 1: 1 0.3620594
#> 2: 2 0.4908625
#> 3: 3 0.6767199
#> 4: 4 0.2572053
#> 5: 5 0.4437264
#> 6: 6 0.8668130
#> 7: 7 0.4014190
#> 8: 8 0.3104299
#> 9: 9 0.6696232
#> 10: 10 0.4790010
#> 11: 11 0.7365284
#> 12: 12 0.7964717
#> 13: 13 0.7549513
#> 14: 14 0.3553608
#> 15: 15 0.6076658
#> 16: 16 0.8211349
#> 17: 17 0.6680640
#> 18: 18 0.6844966
#> 19: 19 0.4591455
#> 20: 20 0.5640735
Created on 2019-09-23 by the reprex package (v0.3.0)
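For comparison, the same aggregation in dplyr might look like this (a sketch, assuming the same student_data as above; a data.table is also a data.frame, so dplyr verbs work on it):
library(dplyr)
student_data %>%
  group_by(group) %>%
  summarise(mean_decision = mean(decision))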

corpus extraction with changing data type R

I have a corpus of text files that contains just text. I want to extract the ngrams from the texts and save each one under its original file name in matrices of 3 columns.
library(tokenizers)
myTokenizer <- function(x, n, n_min) {
corp<-"this is a full text "
tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
}
corp <- tm_map(corp,content_transformer(function (x) myTokenizer(x, n=3, n_min=1)))
writecorpus(corp)
Since I don't have your corpus, I created one of my own using the crude dataset from tm. There is no need to use tm_map, as that keeps the data in a corpus format; the tokenizers package can handle this.
What I do is store all your desired matrices in a list object via lapply and then use sapply to store the data in the crude directory as separate files.
Do realize that the matrices as specified in your function will be character matrices. This means that columns 1 and 2 will be characters, not numbers.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
myTokenizer <- function(x, n, n_min) {
tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
M[, 3] <- tok
M[, 2] <- lengths(strsplit(M[, 3], "\\W+")) # counts the words
M[, 1] <- 1:length(tok)
return(M)
}
my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)
# make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices),
function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))
outcome of the first file:
"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"
I would recommend creating a document-term matrix (DTM). You will probably need it in your downstream tasks anyway. From it you could also extract the information you want, although it is probably not reasonable to assume that a term (incl. ngrams) comes from only a single document (at least this is what I understood from your question; please correct me if I am wrong). Therefore, I guess that in practice one term will have several documents associated with it; this kind of information is usually stored in a DTM.
An example with text2vec below. If you could elaborate further how you want to use your terms, etc. I could adapt the code according to your needs.
library(text2vec)
# I have set up two texts that do not overlap in any term, just as an example
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
# a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1 1 0 0 0 0 0 1 1 1 0 1
# d2 0 0 1 1 1 1 1 0 0 0 1 0
To write each document's terms to a separate three-column file, you can reuse the same pipeline and count the words per ngram with stringi:
library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
v = dtm[d, ]
v = v[v!=0]
v = data.frame(number = 1:length(v)
,term = names(v))
v$n = stri_count_fixed(v$term, "_")+1
write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
# number term n
# 1 1 a 1
# 2 2 a_text 2
# 3 3 here 1
# 4 4 here_a 2
# 5 5 here_a_text 3
# 6 6 text 1
read.csv("v_d2.csv")
# number term n
# 1 1 and 1
# 2 2 and_another 2
# 3 3 and_another_one 3
# 4 4 another 1
# 5 5 another_one 2
# 6 6 one 1

tidytext words with both positive and negative sentiment

I have been working with the sentiments dataset and found that the bing and nrc datasets contain a few words that have both positive and negative sentiment.
bing – three words with positive and negative sentiment
env_test_bing_raw <- get_sentiments("bing") %>%
filter(word %in% c("envious", "enviously","enviousness"))
# A tibble: 6 x 2
word sentiment
<chr> <chr>
1 envious positive
2 envious negative
3 enviously positive
4 enviously negative
5 enviousness positive
6 enviousness negative
nrc – 81 words with positive and negative sentiment
test_nrc <- as.data.frame(
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")) %>%
group_by(word) %>%
summarize(count = n()) %>%
filter(count > 1))
env_test_nrc <- get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")) %>%
filter(word %in% test_nrc$word)
# A tibble: 162 x 2
word sentiment
<chr> <chr>
1 abundance negative
2 abundance positive
3 armed negative
4 armed positive
5 balm negative
6 balm positive
7 boast negative
8 boast positive
9 boisterous negative
10 boisterous positive
# ... with 152 more rows
I was curious whether I have done something wrong, or how a word can have both negative and positive sentiments in a single source dataset. What are the standard practices for handling these situations?
Thank you!
Nope! You have not done anything wrong.
These lexicons were built in different ways. For example, the NRC lexicon was built via Amazon Mechanical Turk, showing human beings lots of words and asking them if they associated each word with joy, sadness, a positive or negative affect, etc. Then the researchers did a careful job of validation, calibration, etc. There are some English words that we as human language users can associate with both positive and negative feeling, such as "boisterous", and the researchers who built these particular lexicons decided to include these words as both.
If you have a text dataset that has the word "boisterous" in it and use a lexicon like this one, it will contribute in both the positive and negative direction (and also toward anger, anticipation, and joy, in that particular case). If you end up calculating a net sentiment (positive minus negative) for some sentence, section, or document, the effect of that particular word will cancel out.
library(tidytext)
library(dplyr)
get_sentiments("nrc") %>%
filter(word == "boisterous")
#> # A tibble: 5 x 2
#> word sentiment
#> <chr> <chr>
#> 1 boisterous anger
#> 2 boisterous anticipation
#> 3 boisterous joy
#> 4 boisterous negative
#> 5 boisterous positive
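To see the cancellation concretely, here is a minimal sketch using the bing lexicon and the word "envious" from the question; it matches both a positive and a negative row, so a positive-minus-negative score for it nets to zero:
library(tidytext)
library(dplyr)
tibble(word = "envious") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)
#> # A tibble: 2 x 2
#>   sentiment     n
#>   <chr>     <int>
#> 1 negative      1
#> 2 positive      1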

rollapply + specnumber = species richness over sampling intervals that vary in length?

I have a community matrix (samples x species of animals). I sampled the animals weekly over many years (in this example, three years). I want to figure out how sampling timing (start week and duration a.k.a. number of weeks) affects species richness. Here is an example data set:
Data <- data.frame(
Year = rep(c('1996', '1997', '1998'), each = 5),
Week = rep(c('1', '2', '3', '4', '5'), 3),
Species1 =sample(0:5, 15, replace=T),
Species2 =sample(0:5, 15, replace=T),
Species3 =sample(0:5, 15, replace=T)
)
The outcome that I want is something along the lines of:
Year StartWeek Duration(weeks) SpeciesRichness
1996 1 1 2
1996 1 2 3
1996 1 3 1
...
1998 5 1 1
I had tried doing this via a combination of rollapply and vegan's specnumber, but got a sample x species matrix instead of a vector of species richness values. Weird.
For example, I thought that this should give me species richness for sampling windows of two weeks:
test<-rollapply(Data[3:5],width=2,specnumber,align="right")
Thank you for your help!
I figured it out by breaking up the task into two parts:
1. Summing up species abundances using rollapply, inside a dplyr mutate step.
2. Calculating species richness using vegan.
I did this for each sampling-duration window separately.
Here is the bare-bones version (I just did this successively for each sampling duration that I wanted by changing the width argument):
library(zoo)    # rollapply()
library(dplyr)
library(vegan)  # specnumber()
weeksum2 <- function(x) rollapply(x, width = 2, align = 'left', sum, fill = NA)
sum2weeks <- Data %>%
  arrange(Year, Week) %>%
  group_by(Year) %>%
  # roll the two-week sum over every species column; the grouping column Year
  # is excluded automatically (the original used the now-defunct mutate_each())
  mutate(across(-Week, weeksum2)) %>%
  ungroup()
weeklyspecnumber2 <- specnumber(sum2weeks[, 3:ncol(sum2weeks)],
                                groups = interaction(sum2weeks$Week, sum2weeks$Year))
weeklyspecnumber2 <- as.data.frame(weeklyspecnumber2)
weeklyspecnumber2$WeekYear <- as.factor(rownames(weeklyspecnumber2))
weeklyspecnumber2 <- tidyr::separate(weeklyspecnumber2, WeekYear,
                                     into = c('Week', 'Year'), sep = '[.]')
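As a possible generalization (a sketch, not something I ran on the real data): instead of editing width by hand, you could loop over durations and bind the results into the Year / StartWeek / Duration / SpeciesRichness shape from the question. This assumes the Data frame from the question and a recent dplyr (>= 1.1, for pick()):
library(zoo)
library(dplyr)
library(vegan)
richness_by_window <- function(w) {
  Data %>%
    arrange(Year, Week) %>%
    group_by(Year) %>%
    # sum abundances over a w-week window starting at each week
    mutate(across(-Week, ~ rollapply(.x, width = w, align = "left", sum, fill = NA))) %>%
    ungroup() %>%
    filter(!is.na(Species1)) %>% # drop windows that run past the last week
    transmute(Year, StartWeek = Week, Duration = w,
              SpeciesRichness = specnumber(pick(starts_with("Species"))))
}
all_windows <- bind_rows(lapply(1:3, richness_by_window))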
