Creating edge and node tables from a Twitter CSV file for use in Gephi - pandas dataframe code required

I would appreciate it if anyone could help; I'm new to text mining and SNA.
For the first part of my assignment I have two columns: the first holds each user's tweet, the second the hashtags inside that tweet. Some tweets have more than one hashtag, and each hashtag is considered a node. I now want to form an edge table with two columns, source and target, so I can import node and edge tables into Gephi. The logic of edges is: if two or more hashtags appear in one tweet, an edge is formed between those nodes (hashtags). For example:
source  target
1       4
3       8
import pandas as pd

df = pd.read_csv(r'/content/Tweetss.csv', sep='\t')
# The whole row landed in a single column, so alias it as 'tweets'
# and drop the original combined column.
df['tweets'] = df['user_id~username~tweet~hashtags']
df2 = df.drop(columns=['user_id~username~tweet~hashtags'])
df2.head()
# One column per hashtag occurrence:
df2.tweets.str.extractall(r"(#\w+)").unstack()
# Or a list of hashtags per tweet:
df2['Labels'] = df2.tweets.apply(
    lambda t: [w for w in t.split(" ") if w.startswith("#")])
df2.head()
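A sketch of one way to build the edge table from here, assuming the per-tweet hashtag lists are in a column like the `Labels` one above (the sample tags are made up for illustration):

```python
from itertools import combinations

import pandas as pd

# Per-tweet hashtag lists, as produced by the extraction step above.
labels = [
    ["#python", "#gephi"],
    ["#sna"],
    ["#python", "#sna", "#textmining"],
]

# Every unordered pair of hashtags co-occurring in one tweet becomes an edge.
edges = [pair for tags in labels for pair in combinations(tags, 2)]
edge_df = pd.DataFrame(edges, columns=["Source", "Target"])

# Gephi also accepts a node table: one row per distinct hashtag.
node_df = pd.DataFrame(sorted({t for tags in labels for t in tags}),
                       columns=["Id"])

# edge_df.to_csv("edges.csv", index=False)
# node_df.to_csv("nodes.csv", index=False)
```

Tweets with a single hashtag contribute no edges; tweets with n hashtags contribute n·(n-1)/2 edges.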

Related

How to filter a column according to values of another column in Tableau

Suppose that my query is 'A' for the following table. I want to find any value of 'c_index' corresponding to 'A', and then get all the rows of the table which have the corresponding values of 'c_index'.
Node Name  c_index
A          1
B          1
A          2
C          2
B          3
D          3
C          4
E          4
Values of 'c_index' corresponding to 'A' are {1, 2}. So the desired result of the filter is:
Node Name  c_index
A          1
B          1
A          2
C          2
How can I do this filtration in Tableau?
What I tried is:
I defined a filter on 'c_index' (i.e. dragged 'c_index' to the filter shelf) and then tried to set the filter condition to: [Node Name] = 'A'.
But it throws an error: "The formula must be an aggregate calculation or refer only to this field".
First, join the data table with itself on the column whose linked values you want to return — in the example, c_index.
There will then be two copies of the data set in your data pane.
Add Node Name from the first data set to the filter shelf, Node Name from the second to the view, and c_index from either one to the view. You'll get the result you want.
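For illustration only, the same two-step filter expressed in pandas, using the column names from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Node Name": ["A", "B", "A", "C", "B", "D", "C", "E"],
    "c_index":   [1, 1, 2, 2, 3, 3, 4, 4],
})

# Step 1: collect the c_index values linked to the query node 'A'.
wanted = set(df.loc[df["Node Name"] == "A", "c_index"])  # {1, 2}

# Step 2: keep every row whose c_index is in that set.
result = df[df["c_index"].isin(wanted)]
print(result)
```

The self-join in Tableau plays the role of step 1 here: it makes the "c_index values linked to A" available as a filterable field.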

How to understand part and partition of ClickHouse?

I see that clickhouse created multiple directories for each partition key.
Documentation says the directory name format is: partition name, minimum data block number, maximum data block number, and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts are pieces of a table that store rows. One part = one folder with column files.
Partitions are virtual entities: they have no physical representation, but you can say that certain parts belong to the same partition.
SELECT does not care about partitions and is not aware of partitioning keys, because each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of those columns within that part.
These minmax values are also kept in memory, in a (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts, because minmax_A.idx = (1, 1) and this select needed (555, 555).
CH does not store partitioning key values.
So, for example, toYYYYMM(today()) = 202002, but this 202002 is not stored in the part or anywhere else.
minmax_B.idx stores (18302, 18302) (2020-02-10 as days since the epoch: select toInt16(today())).
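The part-pruning logic described above can be mimicked in plain Python; this is an illustration of the idea, not ClickHouse internals:

```python
# Each part keeps a (min, max) pair per partitioning-key column,
# mirroring the minmax_*.idx files.
parts = [
    {"name": "1-202002_1_1_0", "minmax_A": (1, 1)},
    {"name": "2-202002_2_2_0", "minmax_A": (2, 2)},
]

def select_parts(parts, column, value):
    """Keep only the parts whose [min, max] range could contain `value`."""
    key = f"minmax_{column}"
    return [p["name"] for p in parts
            if p[key][0] <= value <= p[key][1]]

print(select_parts(parts, "A", 555))  # no part can contain A = 555
print(select_parts(parts, "A", 1))    # only the first part qualifies
```

This is why the SELECT above reports "Selected 0 parts" without ever consulting the partitioning expression itself.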
In my case, I used groupArray() and arrayEnumerate() for ranking in POPULATE. I thought POPULATE would run the query over the new data per partition (in my case toStartOfDay(Date)); the total sum of the newly inserted data is correct, but groupArray() does not work correctly.
I think this happens because on insert CH applies groupArray() and ranks each part immediately, and only then merges the parts within a partition, so I don't get the final result of groupArray() and arrayEnumerate().
In summary, merging
[groupArray(part_1) + groupArray(part_2)] is different from
groupArray(Partition)
with
Partition = part_1 + part_2
The workaround I tried was to insert the new data as a single block, e.g. using groupArray() to reduce the new data to fewer rows than max_insert_block_size = 1048576. That works correctly, but inserting one day of data as a single part is hard, because populating one day (almost 150M-200M rows) uses too much memory for the query.
Do you have another solution for POPULATE with groupArray() on newly inserted data, such as forcing CH to run POPULATE per partition rather than per part, after all the parts are merged into one partition?
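The per-part vs. per-partition discrepancy can be illustrated in plain Python: ranking inside each part and then merging is not the same as ranking over the merged partition.

```python
def rank(values):
    """1-based ranks by value, descending — a stand-in for
    arrayEnumerate() over a sorted groupArray()."""
    ordered = sorted(values, reverse=True)
    return {v: i + 1 for i, v in enumerate(ordered)}

part_1 = [10, 30]
part_2 = [20, 40]

# Ranking each part separately, then combining the results:
per_part = {**rank(part_1), **rank(part_2)}

# Ranking the whole partition at once:
per_partition = rank(part_1 + part_2)

print(per_part)       # 30 and 40 both get rank 1
print(per_partition)  # only 40 gets rank 1
```

Because the aggregate is computed per inserted block and the blocks are merged afterwards, the per-part results cannot in general be combined into the per-partition answer.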

Referencing from table with mixed cells of different categories

I'm trying to program a Google Sheets for comparing and analyzing logistic costs.
I have the following:
A sheet with a database of numbers, organized like this:
A second sheet with a table in which, using the MIN function, I get the price of the cheapest provider for each model, depending on quantity and destination.
And last, in another sheet, I have what I call "the interface". Using an INDEX MATCH MATCH formula, I let the user choose a destination and quantity for each of the available models, and it returns the cheapest price. (I can't post more images, so basically it has this structure):
MODEL A
DESTINATION: DESTINATION 2
NUM. OBJ: 2
PRICE: 59
PROVIDER:
My problem is that I can't figure out how to make it return the name of the provider with the cheapest price, since I'm referencing the second table, where a single row or column contains prices belonging to different providers.
Using MIN is undesirable in this context, because it doesn't tell you where the minimal value was found, and you need that information.
Here is a formula that returns the minimal cost together with the provider. In my example, the data is in the range A1:E7, as below; destination is in G1 and model is in G2.
=iferror(array_constrain(sort({filter(A1:A7, B1:B7=G2), filter(filter(C1:E7, B1:B7=G2), C1:E1=G1)}, 2, True), 1, 2), "Not found")
The same with linebreaks for readability:
=iferror(
array_constrain(
sort(
{
filter(A1:A7, B1:B7 = G2),
filter(filter(C1:E7, B1:B7 = G2), C1:E1 = G1)
},
2, True),
1, 2),
"Not found")
Explanation:
filtering by B1:B7 = G2 means keeping only the rows with the desired model
filtering by C1:E1 = G1 means keeping only the column with desired destination
{ , } means putting two parts of a filtered table together: column A, and column with destination
sort by the 2nd column (price), in ascending order (True)
array_constrain keeps only the first row of this sort, i.e. the one with the lowest price
iferror covers the case where the destination or model is not in the table; the formula then returns "Not found"
Example: with G1 = Destination 1 and G2 = A, the formula returns
Provider 2 2
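For comparison, the same lookup sketched in pandas (the layout mirrors the A1:E7 range described above; the provider names and prices are illustrative, chosen so the example reproduces "Provider 2, 2"):

```python
import pandas as pd

df = pd.DataFrame({
    "Provider": ["Provider 1", "Provider 2", "Provider 3"],
    "Model":    ["A", "A", "A"],
    "Destination 1": [5, 2, 7],
    "Destination 2": [6, 3, 4],
})

def cheapest(df, model, destination):
    """Return (provider, price) for the lowest price of `model` to `destination`."""
    rows = df[df["Model"] == model]
    if rows.empty or destination not in df.columns:
        return "Not found"
    best = rows.loc[rows[destination].idxmin()]
    return best["Provider"], best[destination]

print(cheapest(df, "A", "Destination 1"))  # ('Provider 2', 2)
```

The structure is the same as the sheet formula: filter to the model, restrict to the destination column, and take the row with the minimum price together with its provider label.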

Core Data: is it possible to create a view like you would with normal SQL?

In the normal SQL world you would use CREATE VIEW ... to define a view over one or more tables, e.g. to get a join and a GROUP BY. Is that also possible somehow in Core Data?
The reason I'm asking: I have a table with details. Each detail record has two keys and an amount. I need to show the sums of the amounts grouped by the two keys in a table view, i.e. the first key as the section and the second as a normal row with the summed amount. I thought an NSFetchedResultsController would work, but it does not group (add up) the detail records. With a normal fetch request I can group and get everything, but handling the sections manually seems like a lot of work. So I thought the best approach would be to put a view on the table and use the FRC to bring it into the table view. Does that make sense? Any help is very much appreciated.
example:
I have three fields:
A X 2
A X 2
A Z 3
B X 2
B Y 2
B Y 1
B Z 8
as a result I need
Section: A
  X 4
  Z 3
Section: B
  X 2
  Y 3
  Z 8
So I am not sure if there is a shorter answer, but here's how you can do it.
I'll assume the first, second and third columns are called firstCol, secondCol and thirdCol.
You can use this predicate to get all objects for "A" and put them in resultArray (loop over the letters A to Z in the same way):
NSPredicate *aPredicate = [NSPredicate predicateWithFormat:@"firstCol = %@", @"A"];
Then find all the second-column letters for the objects that have A in the first column (resultArray):
NSArray *allLetters = [resultArray valueForKeyPath:@"@distinctUnionOfObjects.secondCol"];
In the case of "A", allLetters will contain X and Z. Then loop over allLetters, filter resultArray by each letter, and add up the third column:
for (NSString *letter in allLetters) {
    NSPredicate *p = [NSPredicate predicateWithFormat:@"secondCol = %@", letter];
    NSArray *matching = [resultArray filteredArrayUsingPredicate:p];
    NSNumber *sum = [matching valueForKeyPath:@"@sum.thirdCol"];
    // this sums up each letter, e.g. returns 4 for X in the case of "A"
    // insert the sum into an array and then a dictionary that can back the table view's data source
}
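Stripped of the Core Data specifics, the grouping-and-summing itself is simple; a sketch in Python of the structure the table view's data source needs, using the example rows from the question:

```python
from collections import defaultdict

rows = [
    ("A", "X", 2), ("A", "X", 2), ("A", "Z", 3),
    ("B", "X", 2), ("B", "Y", 2), ("B", "Y", 1), ("B", "Z", 8),
]

# sections[first_key][second_key] -> summed amount
sections = defaultdict(lambda: defaultdict(int))
for first, second, amount in rows:
    sections[first][second] += amount

for section in sorted(sections):
    print("Section:", section)
    for key in sorted(sections[section]):
        print(" ", key, sections[section][key])
```

One pass over the fetched records builds the nested dictionary; the outer keys become the sections and the inner key/sum pairs become the rows.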

Neo4j - adding extra relationships between nodes in a list

I have a list of nodes representing a history of events for users, forming the following pattern:
()-[:s]->()-[:s]->() and so on
Each node of the list belongs to a user (is connected via a relationship).
I'm trying to create individual user histories: add a :succeeds_for_user relationship between all events that happened for a particular user, such that each event has only one consecutive event.
I was trying to do something like this to extract nodes that should be in a relationship:
start u = node:class(_class = "User")
match p = shortestPath(n-[:s*..]->m), n-[:belongs_to]-u-[:belongs_to]-m
where n <> m
with n, MIN(length(p)) as l
match p = n-[:s*1..]->m
where length(p) = l
return n._id, m._id, extract(x IN nodes(p): x._id)
but it is painfully slow.
Does anyone know a better way to do it?
Neo4j is calculating a lot of shortest paths there.
Assuming that you have a history start node (which, for the purpose of my query, has id x), you can get an ordered list of event nodes with the corresponding user id like this:
START n=node(x)                              // history start
MATCH p = n-[:FOLLOWS*1..]->(m)<-[:DID]-u    // match from the start up to the user nodes
RETURN u._id,
       reduce(id=0,
              n in filter(n in nodes(p): n._class != 'User'): n._id)
       // the id of the last node in the path that is not a User
ORDER BY length(p)                           // ordered by path length, thus place in history
You can then iterate over the result in your program and add relationships between nodes belonging to the same user. I don't have a fittingly big dataset to test on, but it should be faster.
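Once the ordered (user, event) rows come back, linking consecutive events per user is simple application-side work. A sketch, generating the CREATE statements as plain strings for illustration (the `_id` property name is taken from the query above; the sample rows are made up):

```python
from collections import defaultdict

# Rows as returned by the query: (user_id, event_id),
# already ordered by the event's place in the global history.
rows = [(1, "e1"), (2, "e2"), (1, "e3"), (1, "e4"), (2, "e5")]

# Group events per user, preserving history order.
per_user = defaultdict(list)
for user_id, event_id in rows:
    per_user[user_id].append(event_id)

# Each consecutive pair becomes one :succeeds_for_user relationship.
statements = [
    f"MATCH (a {{_id: '{a}'}}), (b {{_id: '{b}'}}) "
    f"CREATE (a)-[:succeeds_for_user]->(b)"
    for events in per_user.values()
    for a, b in zip(events, events[1:])
]
for s in statements:
    print(s)
```

Because the pairs are taken from each user's ordered list, every event gets at most one successor, which is exactly the single-consecutive-event constraint the question asks for.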
