How to get a sentiment score for a document in Quanteda?

I am new to sentiment analysis. The Quanteda examples show how to output the numbers of positive and negative words. I tested some documents and got the output below:
Case 1
document negative positive
file1 28 28
file2 98 71
file3 28 22
file4 37 39
file5 7 36
or below
Case 2
document negative positive neg_positive neg_negative
file1 28 28 0 1
file2 98 71 0 0
file3 28 22 1 0
file4 37 39 0 1
file5 7 36 0 1
Can you let me know how to compute scores for file1 .. file5 in both cases? In case 1, is the score
(#positive - #negative) / #all, so that for file2 it would be (71 - 98) / (71 + 98) = -27/169 ≈ -0.16?
What about case 2?
Thanks a lot.
A

If you treat neg_positive as negative and neg_negative as positive, then you can create your index by combining the pairs of columns. This is plausible because neg_positive, for instance, captures sequences such as "not good".
(rowSums(object[, c("negative", "neg_positive")]) -
rowSums(object[, c("positive", "neg_negative")])) / rowSums(object) * 100
Another (better) measure is the logit scale described in:
William Lowe, Kenneth Benoit, Slava Mikhaylov, and Michael Laver (2011). "Scaling Policy Preferences from Coded Political Texts." Legislative Studies Quarterly 36(1, Feb): 123-155.
This is log(positive / negative), or:
log( rowSums(object[, c("positive", "neg_negative")]) /
rowSums(object[, c("negative", "neg_positive")]) )

Related

Correlation with covariate in R

I have 2 files. File1 contains gene expression data: 300 samples (rows) and ~50k genes (~50k columns). File2 contains MRI data: the same 300 samples and ~100 columns. I want to correlate the GE data with the MRI data while controlling for a covariate, disease status (File3, status 0 or 1).
I have tried the ppcor package; for 2 variables it worked:
pcor.test(file1$gene1, file2$var1, file3$status, method="pearson")
but I want to run it for all variables, so I merged everything into one file, with the last column being status:
sapply(1:(ncol(test)-1), function(x) sapply(1:(ncol(test)-1), function(y) {
  if (x == y) 1
  else pcor.test(test[,x], test[,y], test[,ncol(test)])$estimate
}))
but I got this error:
Error in solve.default(cvx) :
system is computationally singular: reciprocal condition number = 6.36939e-18
Am I doing this correctly? Is this a good method?
Thank you for suggestions and help
Georg
File1
gene1 gene2 gene3 ... gene50,000
Sample1 12 300 70 4000
Sample2 25 100 53 4500
Sample3 70 30 71 2000
...
Sample300 18 200 97 1765
File2
var1 var2 var3 ... var100
Sample1 5 1 170 200
Sample2 7 3 153 100
Sample3 7 18 130 34
...
Sample300 18 54 197 71
File3-STATUS
status
Sample1 1
Sample2 1
Sample3 0
...
Sample300 1
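One way to structure this, as a sketch only (the object names ge, mri, and status are illustrative, and the loop correlates each gene column against each MRI column rather than every pair in a merged table, which seems to be what is intended):

library(ppcor)

# ge: 300 x ~50k gene expression matrix (File1)
# mri: 300 x ~100 MRI matrix (File2)
# status: length-300 disease-status vector, 0/1 (File3)
partial_cor <- matrix(NA_real_, nrow = ncol(ge), ncol = ncol(mri),
                      dimnames = list(colnames(ge), colnames(mri)))

for (i in seq_len(ncol(ge))) {
  for (j in seq_len(ncol(mri))) {
    partial_cor[i, j] <- pcor.test(ge[, i], mri[, j], status,
                                   method = "pearson")$estimate
  }
}

With ~50k genes this is about 5 million tests, so expect it to take a while; it may be worth filtering genes first or parallelising the outer loop.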

Matching column numbers from two different txt files

I have two text files of different sizes. The first one, example1.txt, has only one column of numbers:
101
102
103
104
111
120
120
125
131
131
131
131
131
131
And the second text file, example2.txt, has two columns:
101 3
102 3
103 3
104 4
104 4
111 5
120 1
120 1
125 2
126 2
127 2
128 2
129 2
130 2
130 2
130 2
131 10
131 10
131 10
131 10
131 10
131 10
132 10
The first column in example1.txt is a subset of column one in example2.txt. The numbers in the second column of example2.txt are the values associated with the first column.
What I want to do is attach the associated second column to example1.txt, following example2.txt. I have tried but couldn't figure it out yet. Any suggestions or solutions in bash or awk would be appreciated.
The result would therefore be:
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
UPDATE:
I have been trying to do the column matching like this:
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' example1.txt example2.txt > output.txt
In both files the first column is in ascending order, but the frequency of a given number may differ: for example, 104 appears once in example1.txt but twice in example2.txt. The important thing is that the associated second-column value should be the same for example1.txt too. See the expected output above.
$ awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' f1 f2
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
This solution doesn't make use of the fact that the first column is in ascending order. Perhaps some optimization can be done based on that.
($1 in a) && b[$1]++ < a[$1] is the main difference from your solution: it checks that the key exists and that its running count does not exceed the count from the first file.
Also, I'm not sure why you set the field separator to | since there is no such character in the sample given.
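For readability, the same command written out with comments (behavior unchanged):

awk '
  NR == FNR { a[$1]++; next }    # first file: count how many times each key occurs
  ($1 in a) && b[$1]++ < a[$1]   # second file: print a line while the running count
                                 # for its key stays below the count from the first file
' example1.txt example2.txt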

Max of all columns based on distinct first column

How do I change the code if I have more than two columns? Let's say the data is like this:
ifile.dat
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
How do I achieve this?
ofile.dat
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I mean the max of each column, grouped by the first column. Thanks.
Here is what I have tried (on a sample file with 13 columns), but the highest value does not come out this way:
cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u
Instead of relying on sort, you could switch over to something more robust like awk (GNU awk 4+, for true multidimensional arrays and PROCINFO["sorted_in"]):
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' input.txt
That's a mouthful. It breaks down like:
Before processing the file, set PROCINFO["sorted_in"] to "@val_num_asc" so that for (... in ...) loops visit array elements in ascending numeric order of their values: BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"}
Loop through each field in the record: for(i=2;i<=NF;++i)
Test whether the current value is greater than the value stored in the array indexed by the first field and the field position; if it is, store it: if (a[$1][i]<$i){a[$1][i]=$i}
When done processing all of the records, sort the keys of a into a new array asorted: END{n=asorti(a, asorted)}
Iterate through the array and print each element: for(col1 in asorted){print col1, a[col1][2], a[col1][3]}
There may be a more elegant way to do this in awk, but this will do the trick.
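Laid out as a script with comments, the same program reads (behavior unchanged; GNU awk assumed):

gawk '
BEGIN {
    # make for (... in ...) loops visit elements in ascending numeric order of value
    PROCINFO["sorted_in"] = "@val_num_asc"
}
{
    # for every field after the first, remember the largest value seen for this key
    for (i = 2; i <= NF; ++i)
        if (a[$1][i] < $i)
            a[$1][i] = $i
}
END {
    n = asorti(a, asorted)              # sort the keys of a into asorted
    for (col1 in asorted)
        print col1, a[col1][2], a[col1][3]
}' input.txt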
Example:
:~$ cat test
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
:~$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' test
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46

Get the average of the selected cells line by line in a file?

I have a single file with multiple columns. I want to select a few of them, compute a value from the selected cells in each line (e.g. their average), and output the results as new columns.
For example:
Month Low.temp Max.temp Pressure Wind Rain
JAN 17 36 120 5 0
FEB 10 34 110 15 3
MAR 13 30 115 25 5
APR 14 33 105 10 4
.......
How do I add average temperature (Avg.temp) and humidity (Hum) as columns?
Avg.temp = (Low.temp+Max.temp)/2
Hum = Wind * Rain
The desired output:
Month Low.temp Max.temp Pressure Wind Rain Avg.temp Hum
JAN 17 36 120 5 0 26.5 0
FEB 10 34 110 15 3 22 45
MAR 13 30 115 25 5 21.5 125
APR 14 33 105 10 4 23.5 40
.......
I don't want to do it in Excel. Is there a simple shell command to do this?
I would use awk like this:
awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {$(NF+1)=($2+$3)/2; $(NF+1)=$5*$6}1' file
or:
awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {print $0, ($2+$3)/2, $5*$6}' file
This does the calculations and appends the results to the original line.
Let's see it in action, piping to column -t for a nice output:
$ awk 'NR==1 {print $0, "Avg.temp", "Hum"; next} {$(NF+1)=($2+$3)/2; $(NF+1)=$5*$6}1' file | column -t
Month Low.temp Max.temp Pressure Wind Rain Avg.temp Hum
JAN 17 36 120 5 0 26.5 0
FEB 10 34 110 15 3 22 45
MAR 13 30 115 25 5 21.5 125
APR 14 33 105 10 4 23.5 40
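For reference, the first command with the fields spelled out (a sketch; the column positions assume the header layout above):

awk '
NR == 1 { print $0, "Avg.temp", "Hum"; next }   # header: just append the new column names
{
    $(NF+1) = ($2 + $3) / 2    # Avg.temp = (Low.temp + Max.temp) / 2
    $(NF+1) = $5 * $6          # Hum = Wind * Rain
}
1                              # the bare 1 prints the (now extended) record
' file | column -t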

adding columns for specified rows & dividing by the number of rows using awk

So I'm really new to using linux and script commands, help would really be appreciated!
I have a file of 1050 rows and 8 columns. Example:
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
hand 8 9 0 101 32 21 somesentenceofwords
hand 9 0 2 101 23 12 somesentenceofwords
What I want to do is: if the first column is the same across a group of rows, output the sum of the 6th column for those rows divided by the number of rows (essentially an average).
So in the example, since the first 4 rows are all anger, I want the average of column 6 over all rows with anger in column 1: (13 + 23 + 35 + 23) / 4. It would then do the same for arch, then hand, and so on.
Example output:
anger 23.5
arch 34
hand 27.5
I tried this just to see if I could do it for a single value of the first column, but couldn't even get that to work:
$ awk '{if($1="anger"){sum+=$6} {print sum}}' filename
Is this possible?
Using awk:
awk '!($1 in s){b[++i]=$1; s[$1]=0} {c[$1]++; s[$1]+=$6}
END{for (k=1; k<=i; k++) printf "%s %.1f\n", b[k], s[b[k]]/c[b[k]]}' file
anger 23.5
arch 34.0
hand 27.5
Pretty straightforward with awk:
$ awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file
hand 27.5
arch 34
anger 23.5
How does this work?
The block {a[$1]+=$6;b[$1]++} is executed for every line read. We build two maps: one storing the running sum of column 6 for each key, and one storing the row count for each key.
The block END{for (i in a) print i,a[i]/b[i]} is executed after all lines have been read. We iterate over the keys of the first map and print each key together with its sum divided by its count (i.e. the mean).
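The same one-liner with comments (behavior unchanged; note that for (... in ...) visits keys in an unspecified order, which is why the output order differs from the first answer):

awk '
{
    a[$1] += $6    # running sum of column 6 for this key
    b[$1]++        # number of rows seen for this key
}
END {
    for (i in a)
        print i, a[i] / b[i]
}' file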
