I am wondering if there are any reasonable ways to generate client datasets for federated learning simulation using the TFF core code? In the tutorial for the Federated Core, the MNIST database is used, with each client having only one distinct label in its dataset. In this case, there are only 10 different labels available. If I want to have more clients, how can I do that? Thanks in advance.
If you want to create a dataset from scratch, you can use tff.simulation.FromTensorSlicesClientData to convert tensors into a TFF ClientData object. You just need to pass a dictionary with client IDs as keys and datasets as values.
import collections
import tensorflow_federated as tff

# split, image_per_set, x_train and y_train are assumed to be defined earlier
client_train_dataset = collections.OrderedDict()
for i in range(1, split + 1):
    client_name = "client_" + str(i)
    start = image_per_set * (i - 1)
    end = image_per_set * i
    print(f"Adding data from {start} to {end} for client : {client_name}")
    data = collections.OrderedDict((('label', y_train[start:end]), ('pixels', x_train[start:end])))
    client_train_dataset[client_name] = data

train_dataset = tff.simulation.FromTensorSlicesClientData(client_train_dataset)
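As a quick usage note (a minimal sketch, not part of the original answer): once the ClientData object is built, you can list the client IDs and pull back an individual client's tf.data.Dataset:
print(train_dataset.client_ids)  # e.g. ['client_1', 'client_2', ...]
example_dataset = train_dataset.create_tf_dataset_for_client('client_1')
for example in example_dataset.take(1):
    print(example['label'], example['pixels'].shape)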
You can check my complete implementation here, where I have split MNIST into 4 clients.
There are preprocessed simulation datasets in TFF that should serve this purpose quite nicely. Take, for example, EMNIST, where the images are partitioned by writer (corresponding to a user) rather than by label. It can be loaded into a Python runtime rather simply (here creating train data for 100 clients):
import tensorflow as tf
import tensorflow_federated as tff

source, _ = tff.simulation.datasets.emnist.load_data()

def map_fn(example):
    return {'x': tf.reshape(example['pixels'], [-1]), 'y': example['label']}

def client_data(n):
    ds = source.create_tf_dataset_for_client(source.client_ids[n])
    return ds.repeat(10).map(map_fn).shuffle(500).batch(20)

train_data = [client_data(n) for n in range(100)]
There are existing datasets partitioned in a similar way for extended MNIST (i.e., including handwritten characters in addition to digits), Shakespeare plays (partitioned by character), and Stack Overflow posts (partitioned by user). Documentation on these datasets can be found here.
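For instance, the Shakespeare data can be loaded the same way (a minimal sketch; the returned objects follow the same ClientData interface used above):
shakespeare_train, shakespeare_test = tff.simulation.datasets.shakespeare.load_data()
first_client = shakespeare_train.client_ids[0]
snippets = shakespeare_train.create_tf_dataset_for_client(first_client)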
If you wish to create your own custom dataset, please see the answer here.
Intro.
I am processing multiple images in parallel by mapping a set of functions over the images with dask's client.map and returning a pandas dataframe. I cannot predict the size of the output before the computation.
In order to make the code more readable and avoid dragging the images' metadata around in the computation, I have been looking into xarray.
I create a dataset by loading .zarr files using open_mfdataset with parallel=True.
Here is a snippet that creates a mock dataset similar to the one I am working with:
import xarray as xr
from dask import array as da
import numpy as np

# Create data array
def create_mock_dataset():
    data = da.random.random([2, 3, 2, 10, 10])
    mock_array = xr.DataArray(
        data=data,
        coords={
            "fov": np.arange(2),
            "round_num": np.arange(3),
            'z': np.arange(2),
            'r': np.arange(10),
            'c': np.arange(10),
        },
        dims=["fov", "round_num", "z", "r", "c"])
    ds = xr.Dataset({"mock": mock_array})
    chunks_dict = {'fov': 1, 'round_num': 1, 'z': 2, 'r': 10, 'c': 10}
    ds = ds.chunk(chunks_dict)
    return ds

test_dataset = create_mock_dataset()
def chunk_processing_func(xarray_chunk):
    # processing of the chunk
    # reduced to a different shape or
    # data structure, e.g. a pandas dataframe
    mock_output = np.arange(200)  # can be a dataframe or another data structure
    return mock_output
I would like to process each chunk (each corresponding to an image) in parallel and make use of the coords data in the xarray.
I have been testing apply_ufunc (following the instructions from this really clear answer) and map_blocks, but if I understood correctly, the size of the output must be known in advance.
So what would be the best approach for processing xarray datasets in parallel with functions that use the coords info and the data, but return a different type of output that doesn't need to be aligned with the dataset?
Thanks!
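For reference, here is a hypothetical sketch of the kind of per-image scheduling described above, reusing create_mock_dataset from the snippet with a local dask.distributed Client and a made-up process_image function; it only illustrates the setup in the question, not a recommended answer:
import numpy as np
import pandas as pd
from dask.distributed import Client

def process_image(image_values, fov, round_num):
    # placeholder per-image processing; the output shape is not known in advance
    n_rows = int(image_values.mean() * 10) + 1
    return pd.DataFrame({"fov": fov, "round_num": round_num, "value": np.arange(n_rows)})

client = Client(processes=False)  # local client, for illustration only
ds = create_mock_dataset()
futures = []
for fov in ds.fov.values:
    for rnd in ds.round_num.values:
        image = ds["mock"].sel(fov=fov, round_num=rnd)
        # the chunk is materialized here for simplicity; coords travel as plain arguments
        futures.append(client.submit(process_image, image.values, int(fov), int(rnd)))
results = client.gather(futures)  # list of DataFrames of varying length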
I have a table with over 100 columns. I need to remove double quotes from certain columns. I found two ways to do it: using withColumn() and map().
Using withColumn()
from pyspark.sql.functions import regexp_replace

cols_to_fix = ["col1", ..., "col20"]
for col in cols_to_fix:
    df = df.withColumn(col, regexp_replace(df[col], "\"", ""))
Using map()
import re
from pyspark.sql import Row

def remove_quotes(row: Row) -> Row:
    row_as_dict = row.asDict()
    cols_to_fix = ["col1", ..., "col20"]
    for column in cols_to_fix:
        if row_as_dict[column]:
            row_as_dict[column] = re.sub("\"", "", str(row_as_dict[column]))
    return Row(**row_as_dict)

df = df.rdd.map(remove_quotes).toDF(df.schema)
Here is my question: I found that using map() takes about 4 times longer than withColumn() on a table with ~25M records. I would really appreciate it if a fellow Stack Overflow user could explain the reason for the performance difference, so that I can avoid similar pitfalls in the future.
Firstly, one piece of advice: do not convert the DataFrame to an RDD; just do df.map(your function here) directly, as this may save a lot of time.
The following page
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds
will save us a lot of time. Its main conclusion is that RDDs are remarkably slower than DataFrames/Datasets, not to mention the time spent converting a DataFrame to an RDD.
Now let's talk about map and withColumn without any conversion between DataFrame and RDD.
Conclusion first: map is usually about 5x slower than withColumn. The reason is that the map operation always involves deserialization and serialization, while withColumn can operate directly on the column of interest.
To be specific, the map operation has to deserialize each Row into the individual values that the operation will work on.
Here is an example. Assume we have a DataFrame which looks like this:
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| Scala| 3000|
+--------+-----------+
Now say we want to increment all the values in the users_count column by 1. We can do it like this:
df.map(row => {
  val usersCount = row.getInt(1) + 1
  (row.getString(0), usersCount)
}).toDF("language", "users_count_incremented_by_1")
In the code above, we first need to deserialize every row to extract the value in the 2nd column; after that, we output the modified values and save them as a DataFrame (this step requires serializing (a, b) into Row(a, b), since a DataFrame is nothing but a Dataset of Rows).
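For contrast, here is a minimal sketch of the same increment done with withColumn (written in PySpark to match the question's code; df and the column name come from the hypothetical example above):
from pyspark.sql import functions as F

# withColumn builds a column expression that Spark evaluates directly,
# so no whole-row deserialization/serialization round trip is needed
df_incremented = df.withColumn("users_count", F.col("users_count") + 1)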
For a more detailed explanation, check the following excellent article:
https://medium.com/@fqaiser94/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44
map cannot operate on a column itself; it has to operate on the values of the column. Getting the values requires deserialization, and saving the result as a DataFrame requires serialization.
But map is still of great use: with the map method, people can implement very sophisticated operations, whereas only built-in operations are available if we just use withColumn.
To sum it up: map is slower but more flexible; withColumn is surely the most efficient, but its functionality is limited.
Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model in a web application, and if there were a way to use matrix multiplication to get a similar result, then I could use the model in JavaScript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))  # StemmAndLemmatize is defined elsewhere
    return processedList
lda_model = models.LdaModel.load('LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load("ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the probability is the index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LdaModel's get_document_topics() method in your installation, or online at:
https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and its differ, and get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel JavaScript code that, given the right parts of the model's state, can reproduce its results.
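As a small side note (a sketch, not part of the original answer): when comparing the two outputs, it helps to ask gensim for the full dense distribution, since lda_model[bowDoc] hides topics below the default probability cutoff:
# full distribution over all 35 topics, including low-probability ones
dense = lda_model.get_document_topics(bowDoc, minimum_probability=0.0)
print([(topic_id, round(prob, 4)) for topic_id, prob in dense])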
I am hoping to assign each document to one topic using LDA. Now, I realise that what you get from LDA is a distribution over topics. However, as you can see from the last line below, I assign each document to the most probable topic.
My question is this: I have to run lda[corpus] essentially a second time in order to get these topics. Is there some other built-in gensim function that will give me these topic assignment vectors directly? Especially since the LDA algorithm has already passed through the documents, it might have saved these topic assignments?
# Get the Dictionary and BoW of the corpus after some stemming/ cleansing
texts = [[stem(word) for word in document.split() if word not in STOPWORDS] for document in cleanDF.text.values]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.9)
corpus = [dictionary.doc2bow(text) for text in texts]
# The actual LDA component
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30, chunksize=10000, passes=10,workers=4)
# Assign each document to most prevalent topic
lda_topic_assignment = [max(p,key=lambda item: item[1]) for p in lda[corpus]]
There is no other built-in gensim function that will give the topic assignment vectors directly.
Your point is valid that the LDA algorithm has already passed through the documents, but the LDA implementation works by updating the model in chunks (based on the value of the chunksize parameter), so it does not keep the entire corpus in memory.
Hence you have to use lda[corpus] or the lda.get_document_topics() method:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
test = lda[corpus[0]]
print(test)
sorted(test, reverse=True, key=lambda x: x[1])
Topics = ['Topic_' + str(sorted(lda[i], reverse=True, key=lambda x: x[1])[0][0]).zfill(3) for i in corpus]
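Equivalently, a small sketch using the get_document_topics() method mentioned above (variable names follow the question's code):
# (topic_id, probability) of the most probable topic per document,
# without the lda[corpus] shorthand
lda_topic_assignment = [
    max(lda.get_document_topics(bow), key=lambda item: item[1])
    for bow in corpus
]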
I am quite new to image processing and would like to produce an array that stores 10 images. After that, I would like to run a for loop through some code that identifies some properties of the images, specifically the surface area of a biological specimen, and spits out an array containing the 10 areas.
Below is what I have managed to scrape together so far, and this is the ensuing error message:
??? Index exceeds matrix dimensions.
Error in ==> Testing1 at 14
nova(i).img = imread([myDir B(i).name]);
Below is the code I've been working on so far:
my_Dir = 'AC04/';
ext_img='*.jpg';
B = dir([my_Dir ext_img]);
nfile = max(size(B));
nova = zeros(1,nfile);
for i = 1:nfile
nova(i).img = imread([myDir B(i).name]);
end
areaarray = zeros(1,nfile);
for k = 1:nfile
[nova(k), threshold] = edge(nova(k), 'sobel');
.
.
.
.%code in this area is irrelevant to the problem I think%
.
.
.
areaarray(k) = bwarea(BWfinal);
end
areaarray
There are a few ways you could store images in an array-like structure in MATLAB. You could use an array of structs. In that case you could do as you did:
nova(i).img = imread([myDir B(i).name]);
You access the first image with nova(1).img, the second one with nova(2).img, etc.
The other way is to use a cell array (similar to an ordinary array, but more flexible in the sense that its members can be of different types):
nova{i} = imread([myDir B(i).name]);
You access the first image with nova{1}, the second one with nova{2}, etc.
[ IMPORTANT ] In both cases you should remove this line from code:
nova = zeros(1,nfile);
I suppose you were trying to pre-allocate memory for the images, and since you're a beginner I advise you not to be concerned with that. It is an optimization matter to be addressed if you come across performance issues; if you don't, just take advantage of MATLAB's automatic memory (re)allocation.