I have a training data set and a test data set with the same categorical columns. Currently, I loop over the categorical columns of each data set separately, which gives me one set of countplot subplots per data set, as follows:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20,20))
for i, col in enumerate(cat_features):
    plt.subplot(5,2,i+1)
    sns.countplot(x=col, data=train, order=('A','B','C','D','E','F','G','H','I','J','K','L','N'))
plt.tight_layout()
What I want to be able to do is a side-by-side comparison between Test and Train: one set of subplots where the count plot for cat0 Train sits next to cat0 Test, then the count plot for cat1 Train sits next to cat1 Test, and so on.
Train Data looks like (small subset)
cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8
A B A A B D A E C
B A A A B B A E A
A A A C B D A B C
A A A C B D A E G
A B A A B B A E C
Test Data
cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8
A B A C B D A E E
A B A C B D A E C
A B A C B D A E C
A A B A B D A E E
A B A A B B A E E
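It's hard to know without some sample data, but you can create a grid of axes with one row per categorical column and two columns (train on the left, test on the right), then loop through the rows and the two datasets, plotting each countplot to the relevant axis.
import matplotlib.pyplot as plt
import seaborn as sns

order = ('A','B','C','D','E','F','G','H','I','J','K','L','N')
# One row per categorical column: train on the left, test on the right.
fig, axes = plt.subplots(nrows=len(cat_features), ncols=2, figsize=(20, 20), squeeze=False)
for row_axes, col in zip(axes, cat_features):
    for ax, (name, dataset) in zip(row_axes, [('Train', train), ('Test', test)]):
        sns.countplot(data=dataset, x=col, order=order, ax=ax)
        ax.set_title(f'{col} ({name})')
plt.tight_layout()
plt.show()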
In GraphQL, I can create a union such as the following:
union SearchResult = Book | Movie
Is there a way I can do this for plain strings? Something like this:
union AccountRole = "admin" | "consumer"
I'm afraid you cannot do that, because the specification does not allow it.
According to the union syntax in the specification here, each member you list must follow the Name syntax, whose first character may only be an upper-case letter, a lower-case letter, or _, so a quoted string such as "admin" is not accepted
(i.e. the first character must be one of the following)
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m
n o p q r s t u v w x y z _
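To make the rule concrete, here is a minimal Python sketch that checks a candidate against the Name grammar (a letter or underscore, followed by letters, digits, or underscores). The helper name is_valid_name is made up for illustration. Note that passing this check is necessary but not sufficient: a union member must also refer to a defined object type.

```python
import re

# Name grammar: first character is a letter or underscore,
# remaining characters are letters, digits, or underscores.
NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")

def is_valid_name(text):
    """Hypothetical helper: True if `text` is syntactically a Name."""
    return NAME_RE.fullmatch(text) is not None

for candidate in ["Book", "Movie", '"admin"', '"consumer"']:
    print(candidate, is_valid_name(candidate))
```

The quoted candidates fail because the double quote is not an allowed first character.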
Help me with this please.
I have this input column in a datatable
First Case: Second Case: Third Case:
Operation Operation Operation
C C V
C C V
V C V
V C V
C C V
C C V
V C V
C C V
V C V
And I want to know if the dt has C and V or just C or just V.
First, you need two Boolean variables to record whether C and V exist. Then loop through your dt using a For Each Row activity. Inside it, use an If activity with an Assign activity to compare the row value with "C" or "V" and set the variables accordingly. Finally, use the values of these variables to decide whether your datatable has both C and V, just C, or just V.
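The same two-flag logic, written as a small Python sketch rather than as workflow activities. The function name classify_operations and the returned labels are made up for illustration; the input is just the list of values from the Operation column.

```python
def classify_operations(values):
    """Return which of "C" and "V" occur in the column values."""
    has_c = False
    has_v = False
    for value in values:          # equivalent of the For Each Row loop
        if value == "C":          # equivalent of the If + Assign step
            has_c = True
        elif value == "V":
            has_v = True
    if has_c and has_v:
        return "C and V"
    if has_c:
        return "only C"
    if has_v:
        return "only V"
    return "neither"

first_case = ["C", "C", "V", "V", "C", "C", "V", "C", "V"]
second_case = ["C"] * 9
third_case = ["V"] * 9
for case in (first_case, second_case, third_case):
    print(classify_operations(case))
```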
Problem statement:
I want to check whether the value of a column in relation XYZ is even. If it is, load the first 10 fields (1-10) of file ABC; if not, load the next 10 (11-20).
Relation XYZ
123
Relation ABC
a b c d e f g h i j k l m n o p q r s t
If 123 is even, then
relation PQR should have a-j,
otherwise k-t.
Could somebody help?
You should write a storage function to do that.
See the implementation of CSVExcelStorage http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java for example.
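Whatever form the loader takes, the selection rule it has to apply is small. Here is a plain Python sketch of just that rule (not Pig code, and not the CSVExcelStorage API); the function name select_fields is made up for illustration.

```python
def select_fields(key, fields):
    """Pick fields 1-10 when `key` is even, otherwise fields 11-20."""
    if key % 2 == 0:
        return fields[:10]
    return fields[10:20]

abc = "a b c d e f g h i j k l m n o p q r s t".split()
print(select_fields(123, abc))  # 123 is odd, so this selects k..t
print(select_fields(124, abc))  # an even key selects a..j
```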
I have strings that are randomly generated from a set of special characters (B, C, D, F, X, Z). For example, the generator might produce the following list of strings:
B D Z Z Z C D C Z
B D C
B Z Z Z D X
D B Z F
Z B D C C Z
B D C F Z
..........
I also have a list of patterns. I want to match each generated string against the list, return the best pattern, and extract some substrings from the string.
string pattern
B D C [D must appear before the C, i.e. DC]
B C F
B D C F
B X [if the string contains X, this pattern must be matched]
.......
For example:
B D Z Z Z C D C Z has B and DC, so it can be matched by B D C
D B Z C F has B, C and F, so it can be matched by B C F
D B Z D F has B and F, so it can be matched by B F
.......
Right now, I am thinking about using a suffix array:
1. First, convert the string to a suffix array object.
2. Loop over each pattern and find which ones the suffix array can match.
3. Compare all matched patterns and pick the best one.
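// Pseudocode: build the suffix array once, then test every pattern against it.
var suffix_array = /* convert the string to a suffix array */;
var list = new List();
for (int i = 0; i < patterns.length; i++) {
    if (suffix_array.match(patterns[i]))
        list.Add(patterns[i]);
}
// Pick the best of the matched patterns.
var max = list[0];
for (int i = 1; i < list.length; i++) {
    if (list[i] > max)
        max = list[i];
}
Write(max);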
I think this method is too complex: it needs to build a tree for each pattern and match it against the suffix array. Does anyone have a better idea?
==================== Update
I have a better solution now. I created a new class with one property per character (B, C, D, X, ...), each of array type. Each property stores the positions at which that character appears in the string.
Now, if B does not appear in the string, we can stop processing immediately.
We can also get all the C and D positions and check whether they appear in sequence (DC, DCC, CCC, ...).
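The position-index idea from the update can be sketched in Python roughly as follows. This is only an illustration under assumptions: a dictionary stands in for the class with one property per character, and the helper names index_positions and has_adjacent are made up. It covers only the two checks mentioned above (early exit when B is missing, and D directly followed by C), not the full pattern ranking.

```python
from collections import defaultdict

def index_positions(tokens):
    """Map each character to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return positions

def has_adjacent(positions, first, second):
    """True if `second` occurs directly after `first` somewhere."""
    following = set(positions.get(second, []))
    return any(p + 1 in following for p in positions.get(first, []))

tokens = "B D Z Z Z C D C Z".split()
pos = index_positions(tokens)
if "B" not in pos:
    print("no B, so no pattern can match")
else:
    print("DC present:", has_adjacent(pos, "D", "C"))
```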
I'm not sure what programming language you are using; have you checked its capabilities with regular expressions? If you are not familiar with these, you should be; hit Google.
var suffix_array = /* convert the string to a suffix array */;
var best = /* worst value, presumably zero, pattern */;
for (int i = 0; i < /* pattern list array length */; i++) {
    if (suffix_array.match(pattern[i])) {
        if (pattern[i] > best) {
            best = pattern[i];
        }
        // add pattern[i] to a list here if you still want a list of all matches
    }
}
write best;
Roughly, anyway. If I understand what you're looking for, that's a slight improvement, though I'm sure there may be a better solution.
Say I have 5 collections that contain a bunch of strings (hundreds of lines).
Now I want to extract the minimum number of lines from each of these collections that uniquely identifies that one collection.
So if I have
Collection 1:
A
B
C
Collection 2:
B
B
C
Collection 3:
C
C
C
Then collection 1 would be identified by A.
Collection 2 would be identified by BC or BB.
Collection 3 would be identified by CC.
Is there any algorithm already out there that does this kind of thing? Name?
Thanks,
Wesley
If the order is not important, I would sort all lists (collections).
Then you could check whether all 5 start with the same element. You would group them by the first element:
Start (using characters instead of strings/lines):
T A L U D
N I O S A D
R A B E
T A U C
D A N E B
Sorted internally:
A D U L T
A D O N I S
A B E R
A C U T
A B E N D
Sorted:
A B E N D
A B E R
A C U T
A D U L T
A D O N I S
Grouped (2):
(A B) E N D
(A B) E R
(A C) U T # identified by 2 elements
(A D) U L T
(A D) O N I S
Rest grouped by 3 elements:
(A C) U T # identified by 2 elements
(A B E) N D
(A B E) R
(A D U) L T # only ADU...
(A D O) N I S # only ADO...
Rest grouped by 4 elements:
(A C) U T # AC..
(A D U) L T # ADU...
(A D O) N I S # ADO...
(A B E N) D
(A B E R)
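A rough Python sketch of this sort-then-group idea: sort each collection, then take the shortest leading run that no other sorted collection starts with. The function name shortest_unique_prefixes is made up, and the sketch uses a true alphabetical sort, so its prefixes differ from the hand-arranged words above. It is a heuristic and is not guaranteed to find the overall minimum.

```python
def shortest_unique_prefixes(collections):
    """For each collection, return its shortest sorted prefix that no
    other sorted collection shares, or None if there is no such prefix."""
    sorted_cols = [tuple(sorted(c)) for c in collections]
    result = []
    for i, col in enumerate(sorted_cols):
        others = sorted_cols[:i] + sorted_cols[i + 1:]
        prefix = None
        for k in range(1, len(col) + 1):
            candidate = col[:k]
            # The prefix identifies the collection if nobody else starts with it.
            if all(o[:k] != candidate for o in others):
                prefix = candidate
                break
        result.append(prefix)
    return result

words = ["TALUD", "NIOSAD", "RABE", "TAUC", "DANEB"]
prefixes = shortest_unique_prefixes([list(w) for w in words])
for word, prefix in zip(words, prefixes):
    print(word, prefix)
```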
This is an easy problem to solve. You have one multiset (collection 1) (it is a "multiset" because the same element can occur multiple times), and then a number of other multisets (collections 2..N), and you want to find a minimum-size subset of collection 1 that does not occur in any of the other collections (2..N).
It is an easy problem to solve because it can be solved by simple set theory. I'll explain this first without multisets, i.e. assuming that every line can occur only once in any given set, and then explain how it works with multiset.
Let's call your collection 1 set S and the other collections sets X1 .. XN. Keeping in mind that for now the sets do not contain multiple instances of any item, it is clear that any singleton set { a } such that a ∈ S and a ∉ Xi distinguishes S from Xi. So it is enough to calculate the set differences S - X1, ..., S - XN and then pick a minimum-size set R that shares an element with every one of these difference sets. This is the SET COVER combinatorial optimization problem (in its equivalent hitting-set form), which is NP-complete, but for your small problem (5 collections) it can be handled easily by brute force.
When the sets are actually multisets, the only change is that the distinguishing "singleton" sets become multisets containing one or more copies of the same element, and thus they have different costs. You can still calculate the set differences as above (you subtract element counts), but now the SET COVER part has to take into account the fact that the distinguishing elements can be multisets and not singletons. Here's an illustration of how it works for your problem when we solve for collection 3:
S = {{ c, c, c }}
X1 = {{ a, b, c }}
X2 = {{ b, b, c }}
S - X1 distinguishers: {{ c, c }}
S - X2 distinguishers: {{ c, c }}
Minimum multiset covering a distinguisher for every set: {{ c, c }}
And here is how it works for collection 1:
S = {{ a, b, c }}
X1 = {{ b, b, c }}
X2 = {{ c, c, c }}
S - X1 distinguishers: {{ a }}
S - X2 distinguishers: {{ a }}, {{ b }}
Minimum multiset covering a distinguisher for every set: {{ a }}
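For a handful of collections, the brute-force search can be sketched in Python with Counter as the multiset type. The names is_submultiset and min_distinguisher are made up for illustration. The search tries candidate sub-multisets of the target in increasing size and returns the first one that is not contained in any other collection, so it is only practical when the answer is small.

```python
from collections import Counter
from itertools import combinations

def is_submultiset(r, x):
    """True if every element of r occurs in x at least as many times."""
    return all(x[e] >= n for e, n in r.items())

def min_distinguisher(target, others):
    """Smallest sub-multiset of `target` not contained in any of `others`."""
    items = sorted(target)
    other_counts = [Counter(x) for x in others]
    for size in range(1, len(items) + 1):
        # set() removes duplicate candidates caused by repeated elements
        for combo in sorted(set(combinations(items, size))):
            r = Counter(combo)
            if not any(is_submultiset(r, x) for x in other_counts):
                return list(combo)
    return None  # target is contained in some other collection

c1, c2, c3 = ["a", "b", "c"], ["b", "b", "c"], ["c", "c", "c"]
print(min_distinguisher(c1, [c2, c3]))
print(min_distinguisher(c2, [c1, c3]))
print(min_distinguisher(c3, [c1, c2]))
```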