tidytext words with both positive and negative sentiment - tidytext

I have been working with the sentiments dataset and found that the bing and nrc datasets contain a few words that have both positive and negative sentiment.
** bing – three words with positive and negative sentiment **
env_test_bing_raw <- get_sentiments("bing") %>%
filter(word %in% c("envious", "enviously","enviousness"))
# A tibble: 6 x 2
word sentiment
<chr> <chr>
1 envious positive
2 envious negative
3 enviously positive
4 enviously negative
5 enviousness positive
6 enviousness negative
** nrc – 81 words with positive and negative sentiment **
test_nrc <- as.data.frame(
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")) %>%
group_by(word) %>%
summarize(count = n()) %>%
filter(count > 1))
env_test_nrc <- get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")) %>%
filter(word %in% test_nrc$word)
# A tibble: 162 x 2
word sentiment
<chr> <chr>
1 abundance negative
2 abundance positive
3 armed negative
4 armed positive
5 balm negative
6 balm positive
7 boast negative
8 boast positive
9 boisterous negative
10 boisterous positive
# ... with 152 more rows
I was curious if I have done something wrong or how a word can have both negative and positive sentiments in a single source dataset. What are the standard practices for handling these situations?
Thank you!

Nope! You have not done anything wrong.
These lexicons were built in different ways. For example, the NRC lexicon was built via Amazon Mechanical Turk, showing human beings lots of words and asking them if they associated each word with joy, sadness, a positive or negative affect, etc. Then the researchers did a careful job of validation, calibration, etc. There are some English words that we as human language users can associate with both positive and negative feeling, such as "boisterous", and the researchers who built these particular lexicons decided to include these words as both.
If you have a text dataset that has the word "boisterous" in it and use a lexicon like this one, it will contribute in both the positive and negative direction (and also toward anger, anticipation, and joy, in that particular case). If you end up calculating a net sentiment (positive minus negative) for some sentence, section, or document, the effect of that particular word will cancel out.
library(tidytext)
library(dplyr)
get_sentiments("nrc") %>%
filter(word == "boisterous")
#> # A tibble: 5 x 2
#> word sentiment
#> <chr> <chr>
#> 1 boisterous anger
#> 2 boisterous anticipation
#> 3 boisterous joy
#> 4 boisterous negative
#> 5 boisterous positive
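To make the cancellation concrete, here is a minimal Python sketch (not tidytext itself). It uses only the five NRC rows for "boisterous" shown above; the `lexicon` list and the `net_sentiment` helper are hypothetical names introduced for illustration.

```python
from collections import Counter

# (word, sentiment) rows copied from the NRC output above.
lexicon = [
    ("boisterous", "anger"),
    ("boisterous", "anticipation"),
    ("boisterous", "joy"),
    ("boisterous", "negative"),
    ("boisterous", "positive"),
]

def net_sentiment(words, lexicon):
    """Positive minus negative matches, counting every lexicon row."""
    counts = Counter(s for w in words for (lw, s) in lexicon if lw == w)
    return counts["positive"] - counts["negative"]

# "boisterous" adds one positive and one negative hit, so the net is 0.
print(net_sentiment(["a", "boisterous", "crowd"], lexicon))
```

If the double labelling is unwanted for a particular analysis, such words could instead be filtered out before scoring; that is a modelling choice rather than something the lexicon prescribes.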


Number of occurrences of 2 as a digit in numbers from 0 to n: not getting the O(n) solution?

This is the GFG link.
From the explanation there, I cannot build any intuition for how the number of 2s at a given digit position is calculated.
My doubt is this: if we are counting the 6000 twos in the range, as explained in the description below, why do we simply round the number, divide it by 10, and return it? If anyone can help, please post your answer with examples.
Case digits < 2
Consider the value x = 61523 and the digit at index d = 3 (indexes are counted from the right, and the rightmost index is 0). We observe that x[d] = 1. There are 2s at the 3rd digit in the ranges 2000 – 2999, 12000 – 12999, 22000 – 22999, 32000 – 32999, 42000 – 42999, and 52000 – 52999. So there are 6000 2’s total in the 3rd digit. This is the same amount as if we were just counting all the 2s in the 3rd digit between 1 and 60000.
In other words, we can round down to the nearest multiple of 10^(d+1), and then divide by 10, to compute the number of 2s in the d-th digit.
if x[d] < 2: count2sinRangeAtDigit(x, d) =
    Compute y = round down to nearest 10^(d+1)
    return y / 10
Case digit > 2
Now, let’s look at the case where the d-th digit (from the right) of x is greater than 2 (x[d] > 2). We can apply almost the exact same logic to see that there are the same number of 2s in the 3rd digit in the range 0 – 63525 as there are in the range 0 – 70000. So, rather than rounding down, we round up.
if x[d] > 2: count2sinRangeAtDigit(x, d) =
    Compute y = round up to nearest 10^(d+1)
    return y / 10
Case digit = 2
The final case may be the trickiest, but it follows from the earlier logic. Consider x = 62523 and d = 3. We know that there are the same ranges of 2s from before (that is, the ranges 2000 – 2999, 12000 – 12999, … , 52000 – 52999). How many appear in the 3rd digit in the final, partial range from 62000 – 62523? Well, that should be pretty easy. It’s just 524 (62000, 62001, … , 62523).
if x[d] = 2: count2sinRangeAtDigit(x, d) =
    Compute y = round down to nearest 10^(d+1)
    Compute z = right side of x (i.e., x % 10^d)
    return y / 10 + z + 1   // why are we doing this here? What is the logic behind this approach?
The explanation given above is not completely clear, which is why I am asking here. Thank you.
For me that explanation is strange too. Also note that the true complexity is O(log(n)), because it depends on the number's length (digit count).
Consider the following example: the number 6125.
In the first round we calculate how many 2's occur as the rightmost digit in all numbers from 0 to 6125. We round the number down to 6120 and up to 6130. The last digit is 5 > 2, so we have 613 intervals, and every interval contains one 2 as the last digit. Here we count the trailing 2's in numbers like 2, 12, 22, ..., 1352, ..., 6122.
In the second round we calculate how many 2's occur as the second (from the right) digit in all numbers from 0 to 6125. We round the number down to 6100 and up to 6200. We also have right = 5. The digit is 2, so we have 61 intervals, and every interval contains ten 2's in the second place (20..29, 120..129, ..., 6020..6029). We add 61*10. We also have to add 5 + 1 2's for the values 6120..6125.
In the third round we calculate how many 2's occur as the third (from the right) digit in all numbers from 0 to 6125. We round the number down to 6000 and up to 7000. The digit is 1, so we have 6 intervals, and every interval contains one hundred 2's in the third place (200..299, ..., 5200..5299). So we add 6*100.
I think it is clear now that in the last round we add 1 interval with a thousand 2's (2000..2999) as the leftmost digit (6 > 2).
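The three cases and the round-by-round walk above can be sketched in Python as follows. The function names are my own (modelled on the pseudocode's `count2sinRangeAtDigit`), and the brute-force check at the end is only there to confirm the arithmetic.

```python
def count_2s_at_digit(x, d):
    """Count the 2s that appear at digit index d (0 = rightmost) in 0..x."""
    power = 10 ** d
    next_power = power * 10
    digit = (x // power) % 10
    round_down = x - x % next_power      # nearest multiple of 10^(d+1) below x
    round_up = round_down + next_power   # nearest multiple of 10^(d+1) above x
    right = x % power                    # digits to the right of index d
    if digit < 2:
        return round_down // 10
    if digit > 2:
        return round_up // 10
    # digit == 2: full blocks below, plus the partial block ending at x
    return round_down // 10 + right + 1

def count_2s_in_range(n):
    """Total number of 2s written in all numbers from 0 to n."""
    total, d = 0, 0
    while 10 ** d <= n:
        total += count_2s_at_digit(n, d)
        d += 1
    return total

# Rounds for 6125: 613 + (610 + 6) + 600 + 1000
print(count_2s_in_range(6125))
print(sum(str(i).count("2") for i in range(6126)))  # brute-force check
```

The loop runs once per digit of n, which is where the O(log(n)) bound comes from.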

How do I represent percentages in an evolutionary algorithm?

Suppose I have 4 chromosomes (gi, i = 1 to 4) that represent 4 percentages of different things, so that the 4 percentages sum to 100. How do I represent this efficiently?
I know that it is possible with gi/(g1+g2+g3+g4). However, this is not efficient: with all gi = 0.2 or all gi = 0.1, every gene represents 25% in both cases. It is possible to generate many cases where different genes produce the same percentages. Is there a more efficient way, where each unique combination of genes maps to a unique set of percentages?
Thanks in advance.
I think you're confusing genes and chromosomes. A chromosome encodes a candidate solution to your problem. A gene is part of a chromosome.
Under this setting, why would you want that constraint on the chromosomes? It sounds like you want it on the genes of a chromosome.
In order to do this you can do a number of things: have each gene encode an integer in [0, 100]. If the genes do not add to 100 in the end, penalize the fitness of those chromosomes.
Another way, which might make crossover operators more natural to apply, is to have each gene store 100 bits. If x bits are set, that means the gene will encode x%.
Yet another way is to have the entire chromosome encode 100 set bits. Then each gene will hold a value x, which represents an interval. The number of set bits between two split points is the percentage associated to that gene. For example:
1 2 3 4 5 6 7 8 ... 100
1 1 1 1 1 1 1 1 ... 1
| | | | |
g1 g2 g3 g4
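This can be done by generating split points in [0, 100], sorting them, and taking the differences between consecutive points. For the four percentages to sum to exactly 100, the first and last of the 5 split points have to be fixed at 0 and 100, which leaves 3 random ones.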
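A minimal Python sketch of the split-point encoding, assuming the two outer split points are pinned at 0 and 100 so that the differences always sum to 100. The function names and the `total` parameter are hypothetical.

```python
import random

def percentages_from_cuts(cuts, total=100):
    """Turn interior split points into percentages that sum to `total`."""
    points = [0] + sorted(cuts) + [total]
    return [b - a for a, b in zip(points, points[1:])]

def random_percentages(n_genes=4, total=100, rng=random):
    """Draw n_genes - 1 interior split points at random and decode them."""
    cuts = [rng.randint(0, total) for _ in range(n_genes - 1)]
    return percentages_from_cuts(cuts, total)

print(percentages_from_cuts([10, 40, 75]))  # g1..g4 for these split points
print(random_percentages())
```

Mutation can then nudge a single split point, and every individual stays valid without a penalty term.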
One way to assign X units to N possibilities is to store X * (N-1) bits. Every unit is given (N-1) bits and if k of the (N-1) bits are set then the unit is assigned to k.
This is easy to work with as there are no invalid solutions and no penalties/repairs are necessary. This makes fitness evaluation, crossover and mutation easier to implement.
For example, the problem is to assign 5 units (X) to one of 4 (N) possibilities. Each individual is (4-1)x5=15 bits.
The bit string: 010 100 000 011 111 assigns the first 2 units to possibility 1 because both groups have 1 bit set. The third unit which has no bits set is assigned to 0. The fourth unit is assigned to 2 and the fifth to 3.
partition  units
0          1
1          2
2          1
3          1
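The decoding described in this answer can be sketched in Python like this. The function name `decode_assignment` is hypothetical; the bit string is the one from the example.

```python
from collections import Counter

def decode_assignment(bits, n_possibilities):
    """Decode groups of (N-1) bits: a unit goes to possibility k
    when k bits of its group are set."""
    group = n_possibilities - 1
    bits = bits.replace(" ", "")
    units = [bits[i:i + group].count("1") for i in range(0, len(bits), group)]
    counts = Counter(units)
    per_partition = [counts[k] for k in range(n_possibilities)]
    return units, per_partition

units, per_partition = decode_assignment("010 100 000 011 111", 4)
print(units)          # possibility chosen by each of the 5 units
print(per_partition)  # number of units in partitions 0..3
```

Every 15-bit string decodes to a valid assignment, which is why no repair step is needed after crossover or mutation.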

Find the nearest nice number

Given a base currency of GBP £, and a table of other currencies accepted in a shop:
Currency Symbol Subunits LastToGBPRate
------------------------------------------------------
US Dollars $ 100 0.592662000
Euros € 100 0.810237000
Japanese Yen ¥ 1 0.005834610
Bitcoin ฿ 100000000 301.200000000
We have a working method that converts a given amount in GBP Pence (AKA cents) into Currency X cents. Given a price of 999 (£9.99), for the above currencies it would return:
Currency Amount (subunits)
---------------------
US Dollars 1686
Euros 1233
Japanese Yen 1755
Bitcoin 3482570
This is all working absolutely fine. We then have a Format Currency method which converts them all into nice looking numbers:
Currency Formatted
---------------------
US Dollars $16.86
Euros €12.33
Japanese Yen ¥1755
Bitcoin ฿0.03482570
Now the problem we want to solve, is to round these amounts to the nearest meaningful pretty number in a general purpose algorithm given the information above.
This serves two important benefits:
Prices for most currencies should appear static for visitors over short-medium term time frames
Presents the visitor with a culturally meaningful price point which encourages sales
A meaningful number is one where the smallest unit displayed isn't smaller than the value of, say, £0.10, and a pretty number is one which ends in 49 or 99. Example outputs:
Currency Formatted Meaningful and Pretty
-----------------------------------------------------
US Dollars $16.86 $16.99
Euros €12.33 €12.49
Japanese Yen ¥1755 ¥1749
Bitcoin ฿0.03482570 ฿0.0349
I know it is possible to do this with a single algorithm with all the information given, but I'm struggling to work out even where to start. Can anyone show me how to achieve this, or give pointers?
Please note that storing a general formatting rule for each currency is not adequate: if, for example, the price of Bitcoin rises tenfold, the formatting rule would need updating. I'm looking for a solution that doesn't need any manual maintenance or checking.
For a given decimal value X, you want to find the integer Y such that YA + B is as close as possible to X, for some given A and B. E.g. in the case of dollars, you have A = .5 and B = .49.
In general, for your problem, A and B can be computed via the formula:
V = value of £0.10 in target currency
K = smallest power of ten (10^k) such that 9*10^k >= V
    and k <= -2 (this condition I added based on your examples, but contrary
    to your definition)
  = 10^min(-2, ceil(log10(V / 9)))
A = 50 * K
B = 49 * K
Note that without the extra condition, since 0.09 dollars is less than 0.10 pounds, we would get 14.9 as the result for 16.86 dollars.
With some transformation we get
Y ~ (X - B) / A
And since Y is integer, we have
Y = round((X - B) / A)
The result is then YA + B.
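As a sketch, the formula above translates to Python as follows. The function name and argument names are my own; `v` is the value of £0.10 in the target currency, and the cap of -2 on the exponent is the extra condition from this answer. That cap assumes a currency with two decimal places, so a currency without subunits such as the yen would need a different cap, which this sketch does not handle.

```python
import math
from decimal import Decimal

def pretty_round(x, v):
    """Round Decimal x to the nearest Y*A + B, following the answer's formula."""
    k = min(-2, math.ceil(math.log10(float(v) / 9)))
    K = Decimal(10) ** k
    A, B = 50 * K, 49 * K
    y = round((x - B) / A)   # nearest integer Y
    return y * A + B

# v is approximately 0.10 / LastToGBPRate from the question's table
print(pretty_round(Decimal("16.86"), Decimal("0.1687")))         # US Dollars
print(pretty_round(Decimal("12.33"), Decimal("0.1234")))         # Euros
print(pretty_round(Decimal("0.03482570"), Decimal("0.000332")))  # Bitcoin
```

Using `Decimal` for the amounts avoids binary floating-point noise in the final digits.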
Convert £0.10 to the current currency to determine the smallest displayable digit (SDD)
(bounded by the number of available digits in that currency).
Now we basically have 3 choices of numbers:
... (3rdSDD-1) 9 9 (if 3rdSDD is 0, it will obviously carry from 4thSDD and so on, as subtraction normally works)
We'll pick this when 10*2ndSDD + 1stSDD < 24
... 3rdSDD 4 9
We'll pick this when 24 <= 10*2ndSDD + 1stSDD < 74
... 3rdSDD 9 9
We'll pick this when 74 <= 10*2ndSDD + 1stSDD
It should be trivial to figure it out from here.
Some multiplication and modulus to get you 2ndSDD and 1stSDD.
Basic subtraction to get you ... (3rdSDD-1).
A few if-statements to pick one of the above cases.
Example:
For $16.86, our 3 choices are $15.99, $16.49 and $16.99.
We pick $16.99 since 74 < 86.
For €12.33, our 3 choices are €11.99, €12.49 and €12.99.
We pick €12.49 since 24 <= 33 < 74.
For ¥1755, our 3 choices are ¥1699, ¥1749 and ¥1799.
We pick ¥1749 since 24 <= 55 < 74.
For ฿0.03482570, our 3 choices are ฿0.0299, ฿0.0349 and ฿0.0399.
We pick ฿0.0349 since 24 <= 48 < 74.
And, just to show the carry:
For $100000.23, our 3 choices are $99999.99, $100000.49 and $100000.99.
We pick $99999.99 since 23 < 24.
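The three-way choice can be sketched in Python on an integer count of smallest displayable units (for example 1686 for $16.86, or 348 for ฿0.0348 once truncated to the SDD). The function name is hypothetical, and working on integers is my simplification; the carry falls out of ordinary subtraction.

```python
def nearest_pretty(n):
    """Pick ...99 below, ...49, or ...99 above based on the last two digits."""
    last_two = n % 100
    base = n - last_two
    if last_two < 24:
        return base - 1      # (3rdSDD-1) 9 9, borrowing as needed
    if last_two < 74:
        return base + 49     # 3rdSDD 4 9
    return base + 99         # 3rdSDD 9 9

for units in (1686, 1233, 1755, 348, 10000023):
    print(units, "->", nearest_pretty(units))
```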
Here's an ugly answer:
import decimal

def retail_round(number):
    """takes a decimal.Decimal and retail rounds it"""
    ending_digits = str(number)[-2:]
    if ending_digits not in ("49", "99"):
        rounding_adjust = (99 - int(ending_digits)) % 50
        if rounding_adjust <= 25:
            # close enough: round up to the next 49 or 99
            number = str(number)[:-2] + str(int(ending_digits) + int(rounding_adjust))
        else:
            # otherwise drop to the .99 (or 99) just below the current value
            if str(number)[-3] == '.':
                number = str(int(number) - .01)
            else:
                number = str(int(str(number)[:-2] + "00") - 1)
    return decimal.Decimal(number)
>>> import decimal
>>> retail_round(decimal.Decimal("15.50"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.51"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.75"))
Decimal('15.99')
>>> retail_round(decimal.Decimal("1575"))
Decimal('1599')
>>> retail_round(decimal.Decimal("1550"))
Decimal('1499')
EDIT: this is a bit better solution, using decimal.Decimal
import collections

Currency = collections.namedtuple("Currency", ["name", "symbol", "subunits"])

def retail_round(currency, amount):
    """returns a decimal.Decimal amount of the currency, rounded to
    49 or 99."""
    adjusted = (amount / currency.subunits) % 100  # last two digits
    if adjusted < 24:
        amount -= (adjusted + 1) * currency.subunits  # down to 99
    elif 24 <= adjusted < 74:
        amount -= (adjusted - 49) * currency.subunits  # to 49
    else:
        amount -= (adjusted - 99) * currency.subunits  # up to 99
    return amount
Calculate the maximum length of the price; assume it's something like 0.00001. (You can do that by converting £0.10 to the currency, taking the base-10 log of it, taking the ceiling, and raising 10 to that power.)
Eg: £0.10 = 17.1421309¥
log(17.1421309) = 1.234
ceil(1.234) = 2
10^2 = 100
so
¥174055 will be ¥174900
Adjust the number for the digit, add 1, round to 50, subtract 1:
174055 -> (round((174055/100+1)/50)*50-1)*100 = 174900
Plain and simple.
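The same calculation as a small Python sketch. The function and parameter names are hypothetical; `min_unit_value` stands for £0.10 converted to the target currency, as in the example above.

```python
import math

def round_to_pretty(amount, min_unit_value):
    """Scale by the power of ten above min_unit_value, then snap to ...49 / ...99."""
    unit = 10 ** math.ceil(math.log10(min_unit_value))
    # add 1, round to the nearest 50, subtract 1, then scale back
    return (round((amount / unit + 1) / 50) * 50 - 1) * unit

print(round_to_pretty(174055, 17.1421309))
```

Note that Python's `round` rounds exact halves to the nearest even integer, which only matters when the scaled value lands exactly between two multiples of 50.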

Convert rank-per-candidate format to OpenSTV BLT format

I recently gathered, using a questionnaire, a set of opinions on the importance of various software components. Figuring that some form of Condorcet voting method would be the best way to obtain an overall rank, I opted to use OpenSTV to analyze it.
My data is in tabular format, space delimited, and looks more or less like:
A B C D E F G # Candidates
5 2 4 3 7 6 1 # First ballot. G is ranked first, and E is ranked 7th
4 2 6 5 1 7 3 # Second ballot
etc
In this format, the number indicates the rank and the sequence order indicates the candidate.
Each "candidate" has a rank (required) from 1 to 7, where a 1 means most important and a 7 means least important. No duplicates are allowed.
This format struck me as the most natural way to represent the output, being a direct representation of the ballot format.
The OpenSTV/BLT format uses a different method of representing the same info, conceptually as follows:
G B D C A F E # Again, G is ranked first and E is ranked 7th
E B G A D C F #
etc
The actual numeric file format uses the (1-based) index of the candidate, rather than the label, and so is more like:
7 2 4 3 1 6 5 # Same ballots as before.
5 2 7 1 4 3 6 # A -> 1, G -> 7
In this format, the number indicates the candidate, and the sequence order indicates the rank. The actual, real, BLT format also includes a leading weight and a following zero to indicate the end of each ballot, which I don't care too much about for this.
My question is, what is the most elegant way to convert from the first format to the (numeric) second?
Here's my solution in Python. It works OK but feels a little clumsy. I'm sure there's a cleaner way (perhaps in another language?).
This took me longer than it should have to wrap my head around yesterday afternoon, so maybe somebody else can use this too.
Given:
ballot = '5 2 4 3 7 6 1'
Python one(ish)-liner to convert it:
rank = [i for r, i in sorted((int(r), i + 1) for i, r in enumerate(ballot.split()))]
rank = " ".join(str(candidate) for candidate in rank)
Alternatively, in a slightly more understandable form:
# Split into a list and convert to integers
int_ballot = [int(x) for x in ballot.split()]
# This is the important bit.
# enumerate(int_ballot) yields pairs of (zero-based-candidate-index, rank)
# Use a list comprehension to swap to (rank, one-based-candidate-index)
ranked_ballot = [(rank,index+1) for index,rank in enumerate(int_ballot)]
# Sort by the ranking. Python sorts tuples in lexicographic order
# (ie sorts on first element)
# Use a comprehension to extract the candidate from each pair
# Candidates are integers, so convert them to strings before joining
rank = " ".join(str(candidate) for rank, candidate in sorted(ranked_ballot))
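For completeness, a sketch that also adds the leading weight and trailing zero mentioned in the question. The function name `to_blt_line` and the default weight of 1 are assumptions, and the sketch produces only the ballot lines, not the rest of a BLT file.

```python
def to_blt_line(ballot, weight=1):
    """Convert a rank-per-candidate ballot into 'weight candidates... 0'."""
    ranks = [int(r) for r in ballot.split()]
    # 1-based candidate indices, ordered by the rank each candidate received
    order = sorted(range(1, len(ranks) + 1), key=lambda c: ranks[c - 1])
    return " ".join(str(v) for v in [weight, *order, 0])

for ballot in ("5 2 4 3 7 6 1", "4 2 6 5 1 7 3"):
    print(to_blt_line(ballot))
```

Sorting candidate indices by their rank avoids building the intermediate list of (rank, index) pairs.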

Special scheduling Algorithm (pattern expansion)

Question
Do you think genetic algorithms are worth trying for the problem below, or will I hit local-minima issues?
I think some aspects of the problem are well suited to a generator / fitness-function style setup. (If you've botched a similar project, I would love to hear from you so that I don't repeat it.)
Thank you for any tips on how to structure things and get this right.
The problem
I'm searching for a good scheduling algorithm to use for the following real-world problem.
I have a sequence with 15 slots like this (The digits may vary from 0 to 20) :
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
(And there are in total 10 different sequences of this type)
Each sequence needs to expand into an array, where each slot can take 1 position.
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
The constraints on the matrix are:
[row-wise, i.e. horizontally] Each run of consecutive ones must be either 11 or 111
[row-wise] Two runs of 1s must be separated by at least 00
The sum of each column should match the original array.
The number of rows in the matrix should be minimized.
The array then needs to be allocated to one of 4 different matrices, which may have different numbers of rows:
A, B, C, D
A, B, C and D are real-world departments. The load needs to be placed reasonably fair during the course of a 10-day period, not to interfere with other department goals.
Each of the matrices is compared with the expansions of the 10 different original sequences, so you have:
A1, A2, A3, A4, A5, A6, A7, A8, A9, A10
B1, B2, B3, B4, B5, B6, B7, B8, B9, B10
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Certain spots on these may be reserved (Not sure if I should make it just reserved/not reserved or function-based). The reserved spots might be meetings and other events
The sum of each row (for instance all the A's) should be approximately the same within 2%. i.e. sum(A1 through A10) should be approximately the same as (B1 through B10) etc.
The number of rows can vary, so you have for instance:
A1: 5 rows
A2: 5 rows
A3: 1 row, where that single row could for instance be:
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
etc..
Sub-problem
I'd be very happy to solve only part of the problem. For instance, being able to input:
1 1 2 3 4 2 2 3 4 2 2 3 3 2 3
and get an appropriate array of sequences of 1's and 0's, with the number of rows minimized, following the constraints above.
Sub-problem solution attempt
Well, here's an idea. This solution is not based on using a genetic algorithm, but some ideas could be used in going in that direction.
Basis vectors
First of all, you should generate what I think of as the basis vectors. For instance, if your sequence were 3 numbers long rather than 15, the basis vectors would be:
v1 = [1 1 0]
v2 = [0 1 1]
v3 = [1 1 1]
Any solution for sequence length 3 would be a linear combination of these three vectors using only positive integers. In other words, the general solution would be
a*v1 + b*v2 + c*v3
where a, b and c are non-negative integers. For the sequence [1 2 1], the solution is a = 1, b = 1, c = 0. What you first want to do is find all of the possible basis vectors of length 15. From my rough calculations I think that there are somewhere between 300 and 400 basis vectors of length 15. I can give you some tips towards generating them if you want.
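One way to generate the basis vectors is simple brute force over all bit patterns, keeping those whose runs of 1s have length 2 or 3 and are separated by at least two 0s. This is only a sketch; the regular expression and the function name `valid_rows` are my own.

```python
import re

# optional leading zeros, a run of 2-3 ones, further runs each preceded
# by at least two zeros, optional trailing zeros
ROW = re.compile(r"0*1{2,3}(?:0{2,}1{2,3})*0*")

def valid_rows(n):
    """All length-n rows (as bit strings) that satisfy the row constraints."""
    candidates = (format(i, f"0{n}b") for i in range(2 ** n))
    return [s for s in candidates if ROW.fullmatch(s)]

print(valid_rows(3))         # the three basis vectors from the example
print(len(valid_rows(15)))   # number of basis vectors for the real problem
```

For n = 15 this checks only 32768 patterns, so enumeration is cheap enough without a smarter recursive generator.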
Finding solutions
Now, what you want to do is sort these basis vectors by their sums/magnitudes. Then, in searching for your solution, you start with the basis vectors which have the largest sums. We start with the vectors that have the largest sums because they lead to fewer total rows. We also have an array, veccoefs, which contains an entry for the linear coefficient of each basis vector. At the beginning of the search, all the veccoefs are 0.
So we take the first basis vector (the one with the largest sum/magnitude) and subtract this vector from the sequence until we either create an unsolvable result ( having a 0 1 0 in it for instance) or any of the numbers in the result is negative. We store the number of times we subtract the vector in veccoefs. We use the result after subtracting the basis vector from the sequence as the sequence for the next basis vector. If there are only zeros left in the result, then we stop the loop.
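A simplified Python sketch of this greedy loop follows. It stops subtracting a vector only when an entry would go negative and omits the "unsolvable result" check described above, so it can finish with a non-zero residual. The function name and the tuple representation of rows are assumptions.

```python
def greedy_decompose(sequence, rows):
    """Subtract basis rows, largest sum first, while nothing goes negative.
    Returns (coefficients per row, leftover residual)."""
    residual = list(sequence)
    veccoefs = {}
    for row in sorted(rows, key=lambda r: (-sum(r), r)):
        count = 0
        while all(x - b >= 0 for x, b in zip(residual, row)):
            residual = [x - b for x, b in zip(residual, row)]
            count += 1
        if count:
            veccoefs[row] = count
        if not any(residual):
            break
    return veccoefs, residual

rows = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
print(greedy_decompose([2, 2, 2], rows))  # solved with two copies of 111
print(greedy_decompose([1, 2, 1], rows))  # greedy gets stuck with 0 1 0 left
```

The second call shows why the unsolvable-result check matters: taking 111 first leaves 0 1 0, although 110 + 011 would have solved the sequence.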
I'm not sure of the efficiency/accuracy of this method, but it might at least give you some ideas.
Other possible solutions
Another idea for solving this is to use the basis vectors and form the problem as an optimization/least squares problem. You form a matrix of the basis vectors such that the basic problem will be minimizing Sum[(Ax - b)^2] where A is the matrix of basis vectors, b is the input sequence, and x are the basis vector coefficients. However, you also want to minimize the number of rows, so you can add a term like x^T*x to the minimization function where x^T is the transpose of x. The hard part in my opinion is finding differentiable terms to add that will encourage integer vector coefficients. If you can think of a way to do that, then optimization could very well be a good way to do this.
Also, you might consider a Metropolis-type Monte Carlo solution. You would choose randomly whether to add a vector, remove a vector, or substitute a vector at each step. The vector to be added/removed/substituted would be chosen randomly. The probability of accepting the change would be a ratio of the suitabilities of the solutions before and after the change. The suitability could be equal to the difference between the current solution and the sequence, squared and summed, minus the number of rows/basis vectors involved in the solution. You would need to put in appropriate constants for the various terms to try to get the acceptance rate around 50%. I kind of doubt that this will work very well, but I thought that you should still consider it when looking for possible solutions.
A GA can be applied to this problem, but it won't be a 5-minute task. You need to put several things together without knowing which implementation of each of them is best.
So:
Solution representation: how will you represent a possible solution? Using a matrix seems most straightforward. Using a collection of one-dimensional arrays is also possible.
But you have some constraints, so maybe the SuperGene concept is worth considering?
You must use proper mutation/crossover operators for the given gene representation.
How will you enforce constraints on solutions? By destroying those that are not valid? What if they contain valuable information? Maybe let them stay in the population but add a penalty to their fitness, so they contribute to offspring but don't go into the next generations?
Anyway, I think that a GA can be applied to this problem. Is it worth it? Usually GAs are not the best algorithm, but they are a decent choice if others fail. I would go with a GA, just because it would be the most fun, but I would also look for an alternative solution (just in case).
P.S. Personal insight: I was solving the N Queens Problem for 70 < N < 100 (N x N board, N queens). The algorithm worked fine for lower N (maybe it was trying all combinations?), but with N in this range I couldn't find a proper solution. Fitness quickly jumped to about 90% of the maximum, but in the end there were always two queens in conflict. It was a very naive implementation, though.
