I imported a .txt file using read_fwf, and it seems that col_factor does not support labels as an argument. I was wondering whether there is a way to add labels to specific values in a tibble.
col_types = cols("Income Category" = col_factor(levels = c("01", "02", "03"), labels = c("low", "medium", "high")))
I'm fairly new to R so I would be grateful if anyone can answer my question!
Perhaps this helps:
library(tibble)
library(expss)
# Sample data
nn = 99 # total sample size
# create a tibble data frame with single column named IncomeCategory and
# attribute 1, 2, and 3 evenly (just for the example)
tib <- tibble(IncomeCategory = rep(c(1, 2, 3), each = nn/3))
# Attribute labels "low", "medium", and "high" to values 1, 2, and 3 respectively
val_lab(tib$IncomeCategory) = num_lab("1 low
2 medium
3 high")
The structure output shows that the values are labelled:
> str(tib)
tibble [99 × 1] (S3: tbl_df/tbl/data.frame)
$ IncomeCategory:Class 'labelled' num [1:99] 1 1 1 1 1 1 1 1 1 1 ...
.. .. VALUE LABELS [1:3]: 1=low, 2=medium, 3=high
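For comparison, the same value-to-label mapping can be sketched in Python with pandas (a hedged sketch, not part of the R answer; the codes "01"/"02"/"03" mirror the question's example):

```python
import pandas as pd

# Map coded values to labels via an ordered categorical
income = pd.Series(["01", "02", "03", "01"])
income = pd.Categorical(income, categories=["01", "02", "03"], ordered=True)
income = income.rename_categories(["low", "medium", "high"])
print(list(income))  # ['low', 'medium', 'high', 'low']
```

rename_categories keeps the underlying codes intact and only changes the displayed labels, which is essentially what levels/labels in R's factor() do.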
I'm trying to split a matrix with two columns. The first column is the data that I want to group into vectors, while the second column is information about the group.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
The results that I seek are:
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4=
1
5
17
A5 =
1
6
As an attempt, I used the eval function, but it didn't give the results I wanted.
Assuming that you don't actually need individually named separate variables, the following will put the values into separate cells of a cell array. Each cell can be an arbitrary size, and its contents can then be retrieved using cell index syntax. The code makes use of logical indexing, so that each iteration of the for loop assigns to that cell of B just the values from the first column of A that have the correct number in the second column of A.
num_cells = max(A(:,2));
B = cell(num_cells, 1);
for idx = 1:num_cells
  B{idx} = A(A(:,2)==idx, 1); % curly braces assign the cell's contents
end
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently from normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many cells are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of trues/falses indicating which elements equal 3. You are allowed to mix logical and subscript indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
We get the same result if we do it in a single line, without the A_index3 intermediate variable:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting this in a for loop, with 3 replaced by the loop variable idx and the answer assigned to cell idx of B, we get all of the values separated into different cells.
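The same group-by-logical-mask idea translates directly to NumPy, for readers coming from Python (a sketch for comparison; A is the example matrix from the question):

```python
import numpy as np

A = np.array([[1, 1], [2, 1], [7, 2], [9, 2], [7, 3], [10, 3], [13, 3],
              [1, 4], [5, 4], [17, 4], [1, 5], [6, 5]])

# One boolean mask per group value, just like the logical indexing above;
# B[idx-1] holds the first-column values whose second column equals idx
B = [A[A[:, 1] == idx, 0] for idx in range(1, A[:, 1].max() + 1)]
print(B[2])  # group 3
```

As with the cell array, each entry of the list B can have a different length.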
I'm trying to generate new rows based on values in a certain column. In the current data, as you can see, the 'days_left' column does not have all sequential values.
current = {'assignment': [1,1,1,1,2,2,2,2,2], 'days_left': [1, 2, 5, 9,1, 3, 4, 8, 13]}
dfcurrent = pd.DataFrame(data=current)
dfcurrent
I want to generate rows in that dataframe to create a sequential list of 'days_left' for each 'assignment'. Please see the desired output below:
desired = {'assignment': [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2],
'days_left': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,10,11,12,13]}
dfdesired = pd.DataFrame(data=desired)
dfdesired
Note: The original data is much bigger and has other columns as well but I just simplified it for this question.
Could you please help me how I can solve this?
Thank you very much in advance!
You can iterate through the rows of the current dataframe and create a new dataframe. For each days_left range, copy the current row to the new dataframe and update the days_left column value.
Try this code:
import pandas as pd
current = {'assignment': [1,1,1,1,2,2,2,2,2], 'days_left': [1, 2, 5, 9, 1, 3, 4, 8, 13]}
dfc = pd.DataFrame(data=current)
rows = []  # collect rows here; DataFrame.append is deprecated and removed in pandas 2.x
for r in range(1, len(dfc)):  # start at 2nd row
    for i in range(dfc.iloc[r-1]['days_left'], dfc.iloc[r]['days_left']):  # fill gap of missing numbers
        row = dfc.iloc[r].copy()  # copy row
        row['days_left'] = i  # update column value
        rows.append(row)
    if r == len(dfc)-1 or dfc.iloc[r+1]['assignment'] != dfc.iloc[r]['assignment']:  # last entry in assignment
        rows.append(dfc.iloc[r])  # copy row as-is
dfd = pd.DataFrame(rows).reset_index(drop=True)  # prevent index duplication
dfd = dfd.astype(int)  # convert all data to integers
print(dfd.to_string(index=False))
Output
assignment days_left
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2 1
2 2
2 3
2 4
2 5
2 6
2 7
2 8
2 9
2 10
2 11
2 12
2 13
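A vectorized alternative, as a sketch: it assumes days_left should always run from 1 up to each assignment's maximum, as in the example data (if your real data starts elsewhere, adjust the range accordingly).

```python
import pandas as pd

current = {'assignment': [1, 1, 1, 1, 2, 2, 2, 2, 2],
           'days_left': [1, 2, 5, 9, 1, 3, 4, 8, 13]}
dfc = pd.DataFrame(current)

# Build the full 1..max range per assignment, then explode one value per row
dfd = (dfc.groupby('assignment')['days_left'].max()
          .apply(lambda m: list(range(1, m + 1)))
          .explode()
          .reset_index())
dfd['days_left'] = dfd['days_left'].astype(int)
```

This avoids the row-by-row loop entirely, which matters on the "much bigger" original data the question mentions.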
I have a corpus of text files, containing just text. I want to extract the n-grams from the texts and save each one with its original file name in matrices of 3 columns.
library(tokenizers)
myTokenizer <- function(x, n, n_min) {
corp<-"this is a full text "
tok <- unlist(tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
}
corp <- tm_map(corp,content_transformer(function (x) myTokenizer(x, n=3, n_min=1)))
writeCorpus(corp)
Since I don't have your corpus, I created one of my own using the crude dataset from tm. There is no need to use tm_map, as that keeps the data in a corpus format; the tokenizers package can handle this.
What I do is store all your desired matrices in a list object via lapply and then use sapply to store the data in the crude directory as separate files.
Do realize that the matrices as specified in your function will be character matrices. This means that columns 1 and 2 will be characters, not numbers.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
myTokenizer <- function(x, n, n_min) {
tok <- unlist(tokenizers::tokenize_ngrams(as.character(x), n = n, n_min = n_min))
M <- matrix(nrow=length(tok), ncol=3,
dimnames=list(NULL, c( "gram" , "num.words", "words")))
M[, 3] <- tok
M[, 2] <- lengths(strsplit(M[, 3], "\\W+")) # counts the words
M[, 1] <- 1:length(tok)
return(M)
}
my_matrices <- lapply(crude, myTokenizer, n = 3, n_min = 1)
# make sure directory crude exists as a subfolder in working directory
sapply(names(my_matrices),
function (x) write.table(my_matrices[[x]], file=paste("crude/", x, ".txt", sep=""), row.names = FALSE))
Outcome of the first file:
"gram" "num.words" "words"
"1" "1" "diamond"
"2" "2" "diamond shamrock"
"3" "3" "diamond shamrock corp"
"4" "1" "shamrock"
"5" "2" "shamrock corp"
"6" "3" "shamrock corp said"
I would recommend creating a document-term matrix (DTM). You will probably need one in your downstream tasks anyway, and from it you could also extract the information you want. However, it is probably not reasonable to assume that a term (incl. n-grams) comes from only a single document (at least, this is what I understood from your question; please correct me if I am wrong). Therefore, I guess that in practice one term will have several documents associated with it - this kind of information is usually stored in a DTM.
An example with text2vec below. If you could elaborate further how you want to use your terms, etc. I could adapt the code according to your needs.
library(text2vec)
# I have set up two texts that do not overlap in any term, just as an example
# in practice, this probably never happens
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
as.matrix(dtm)
# a a_text and and_another and_another_one another another_one here here_a here_a_text one text
# d1 1 1 0 0 0 0 0 1 1 1 0 1
# d2 0 0 1 1 1 1 1 0 0 0 1 0
library(stringi)
docs = c(d1 = c("here a text"), d2 = c("and another one"))
it = itoken(docs, tokenizer = word_tokenizer, progressbar = F)
v = create_vocabulary(it, ngram = c(1,3))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
for (d in rownames(dtm)) {
v = dtm[d, ]
v = v[v!=0]
v = data.frame(number = 1:length(v)
,term = names(v))
v$n = stri_count_fixed(v$term, "_")+1
write.csv(v, file = paste0("v_", d, ".csv"), row.names = F)
}
read.csv("v_d1.csv")
# number term n
# 1 1 a 1
# 2 2 a_text 2
# 3 3 here 1
# 4 4 here_a 2
# 5 5 here_a_text 3
# 6 6 text 1
read.csv("v_d2.csv")
# number term n
# 1 1 and 1
# 2 2 and_another 2
# 3 3 and_another_one 3
# 4 4 another 1
# 5 5 another_one 2
# 6 6 one 1
Requirements:
I'm trying to create a bar graph where, for each condition ("label"), I show the mean task time for each manipulation ("pattern"). So, there will be 8 groups of 3 bars, and one group with a single bar.
I need to show error bars (standard error) on each of these bars.
I want the order of each condition/label to be determined from some calculations done using some other metrics. (These I've already extracted into a dict mapping from label ==> index/order)
I'm going to be drawing a few other graphs, and whatever sort order is used in this must be the same across the others too
This is with Python 2.7, Pandas 0.18, and in an IPython Notebook
(The dataframe is loaded from a csv file, and not constructed directly)
Problem:
So, here is what the graph currently looks like:
Current Graph
I've replaced/removed the labels for uploading here, but, just like these labels, the originals were sorted alphabetically.
And therein lies the problem: I don't want each technique sorted alphabetically. Instead, I want them to be sorted based on a sorting order I've got in a separate list (i.e. so that I can get them showing up in a sequential order - shortest to tallest, while maintaining the same order across graphs).
Current Code:
So, I load the full dataset in from a csv file:
p = pd.read_csv("...", sep='\t')
Then, I use groupby to extract the "task_time" data to draw each bar:
tt_all = p.groupby(['label', 'pattern'])[['task_time']]
This is then drawn by doing:
tt_all.mean().unstack().plot(kind='bar', yerr=tt_all.sem().unstack(), figsize=(15, 6), cmap=cmap, edgecolor='None', rot=45)
(Without unstack(), it just shoves everything into a single category and creates a mess)
What I've tried:
After a lot of poking around, I've managed to get the following:
# Create a column to use for sorting things
sort_order_keys = {'I': 8, 'F': 3, 'H': 7, 'G': 1, 'D': 2, 'C': 5, 'E': 6, 'A': 4, 'B': 0}
p['label_sort_key'] = p['label'].apply(lambda x: sort_order_keys[x])
# This sorts all the rows by the sort order
tt_all_raw = p.sort_values(['label_sort_key', 'pattern'])  # .sort() is deprecated; use sort_values
tt_all_raw = tt_all_raw.iloc[tt_all_raw['label_sort_key'].argsort()]
print tt_all_raw # <--- This will be sorted correctly
# Performing grouping....
tt_all = tt_all_raw.groupby(['label', 'pattern'], sort=False)[['task_time']]
print tt_all.mean() # <---- This will also be sorted correctly
print tt_all.mean().unstack() # <--- This however forces everything back to alphabetical order! Argh!
Question(s)
How can I re-sort the unstack() results? OR
Is there an easier way to set up a graph like this, with these requirements?
I think you can use CategoricalIndex with categories in a custom order, which is easily sorted by sort_index:
print (p)
label pattern task_time
0 I 0 3
1 E 0 0
2 B 1 2
3 D 1 1
4 G 1 0
5 F 0 3
6 H 0 0
7 D 1 2
8 A 1 1
9 C 1 0
tt_all = p.groupby(['label', 'pattern'])[['task_time']]
print (tt_all.mean())
task_time
label pattern
A 1 1.0
B 1 2.0
C 1 0.0
D 1 1.5
E 0 0.0
F 0 3.0
G 1 0.0
H 0 0.0
I 0 3.0
df1 = tt_all.mean().unstack()
df1.index = pd.CategoricalIndex(df1.index,
categories=['B', 'G', 'D', 'F', 'A', 'C', 'E', 'H', 'I'],
ordered=True)
df1.sort_index(inplace=True)
print (df1)
task_time
pattern 0 1
B NaN 2.0
G NaN 0.0
D NaN 1.5
F 3.0 NaN
A NaN 1.0
C NaN 0.0
E 0.0 NaN
H 0.0 NaN
I 3.0 NaN
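If you don't need a categorical dtype afterwards, reindex alone gives the same ordering. A simplified sketch under that assumption (single-level columns here, unlike the unstacked frame above):

```python
import pandas as pd

order = ['B', 'G', 'D', 'F', 'A', 'C', 'E', 'H', 'I']
df1 = pd.DataFrame({'task_time': [1.0, 2.0, 0.0, 1.5, 0.0, 3.0, 0.0, 0.0, 3.0]},
                   index=list('ABCDEFGHI'))

# reindex reorders the rows to match the given label order
df1 = df1.reindex(order)
```

The same order list can then be reused for every other graph, satisfying the consistency requirement.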
I had the same issue, and I bypassed it by converting the GroupBy object into a DataFrame and passing the ordered categories as a list to the index parameter.
Would this work for you?
sort_order_keys = ['B', 'G', 'D', 'F', 'A', 'C', 'E', 'H', 'I']
tt_all = pandas.DataFrame(tt_all, index = sort_order_keys)
You should then be able to use the plotting functions on the dataframe.
When using the .count() method on the group, I found it useful to transform the GroupBy object into a dictionary using dict() before passing it to the pandas.DataFrame() function, using the group labels as column labels and naming the index 'count', for example:
tt_allCount = pandas.DataFrame(dict(tt_all.count()),
columns = sort_order_keys,
index=['count'])
I have a 640x480 image img, and I want to replace pixels having values not in this list or array x=[1, 2, 3, 4, 5] with a certain value, 10, so that any pixel in img which doesn't have any of the values in x will be replaced with 10. I already know how to replace only one value using img(img~=1)=10, or multiple values using img(img~=1 & img~=2 & img~=3 & img~=4 & img~=5)=10, but when I tried img(img~=x)=10 it gave an error saying Matrix dimensions must agree. So if anyone could please advise.
You can achieve this very easily with a combination of permute and bsxfun. We can create a 3D vector that consists of the elements of [1,2,3,4,5], then use bsxfun with the not-equals function (@ne) on your image (assuming grayscale), thus creating a 3D matrix of 5 slices. Each slice tells you whether the locations in the image do not match an element in x: the first slice gives you the locations that don't match x = 1, the second slice the locations that don't match x = 2, and so on.
Once you finish this, we can use an all call operating on the third dimension to consolidate the pixel locations that are not equal to all of 1, 2, 3, 4 or 5. The last step would be to take this logical map, which tells you the locations that are none of 1, 2, 3, 4, or 5, and set those locations to 10.
One thing we need to consider is that the image type and the vector x must be the same type. We can ensure this by casting the vector to be the same class as img.
As such, do something like this:
x = permute([1 2 3 4 5], [3 1 2]);
vals = bsxfun(@ne, img, cast(x, class(img)));
ind = all(vals, 3);
img(ind) = 10;
The advantage of the above method is that the list you want to use to check for the elements can be whatever you want. It avoids messy logical indexing syntax, like img(img ~= 1 & img ~= 2 & ...). All you have to do is change the input list at the beginning of the code, and bsxfun, permute and all will do the work for you.
Here's an example 5 x 5 image:
>> rng(123123);
>> img = randi(7, 5, 5)
img =
3 4 3 6 5
7 2 6 5 1
3 1 6 1 7
6 4 4 3 3
6 2 4 1 3
By using the code above, the output we get is:
img =
3 4 3 10 5
10 2 10 5 1
3 1 10 1 10
10 4 4 3 3
10 2 4 1 3
You can most certainly see that those elements that are neither 1, 2, 3, 4 or 5 get set to 10.
Aside
If you don't like the permute and bsxfun approach, one way would be to have a for loop and with an initially all true array, keep logical ANDing the final result with a logical map that consists of those locations which are not equal to each value in x. In the end, we will have a logical map where true are those locations that are neither equal to 1, 2, 3, 4 or 5.
Therefore, do something like this:
ind = true(size(img));
for idx = 1 : 5
ind = ind & img ~= idx;
end
img(ind) = 10;
If you do this instead, you'll see that we get the same answer.
Approach #1
You can use ismember, which according to its official documentation for the case ismember(A,B) outputs a logical array of the same size as A, with 1s where an element of A is present in B and 0s otherwise. Since you are looking to detect "not in the list or array", you need to invert it afterwards, i.e. ~ismember(). In your case, you have img as A and x as B, so ~ismember(img,x) gives you the places where img is not equal to any element in x.
You can then map into img to set all those in it to 10 with this final solution -
img(~ismember(img,x)) = 10
Approach #2
Similar to rayryeng's solution, you can use bsxfun, but keep it in 2D which could be more efficient as it would also avoid permute. The implementation would look something like this -
img(reshape(all(bsxfun(@ne,img(:),x(:).'),2),size(img))) = 10
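Both approaches correspond directly to NumPy's np.isin in Python (a sketch for comparison; the small array below stands in for the 640x480 image):

```python
import numpy as np

img = np.array([[3, 4, 3, 6, 5],
                [7, 2, 6, 5, 1]])
x = [1, 2, 3, 4, 5]

# invert=True marks pixels whose value is NOT in x, like ~ismember(img, x)
img[np.isin(img, x, invert=True)] = 10
print(img)
```

As with ismember, the membership test and the in-place assignment work for a list x of any length.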