Need clarity on "padding" parameter in Bert Tokenizer - huggingface-transformers

I have been fine-tuning a BERT model for sentence classification. During training I tokenized with padding="max_length", truncation=True, max_length=150, but at inference time the model still predicts even when the padding="max_length" parameter is not passed.
Surprisingly, the predictions are the same whether or not padding="max_length" is passed, but inference is much faster without it.
So, I need some clarity on the "padding" parameter of the BERT tokenizer. Can someone help me understand how BERT is able to predict even without padding, given that sentence lengths differ, and whether there are any negative consequences if padding="max_length" is not passed at inference time? Any help would be highly appreciated.
Thanks

When passing a list of sentences to a tokenizer, each sentence might have a different length. Hence the output of the tokenizer for each sentence will have a different length. Padding is a strategy for ensuring tensors are rectangular by adding a special padding token to shorter sentences.
Consider the following example where padding="max_length", max_length=10.
batch_sentences = ["Hello World", "Hugging Face Library"]
encoded_input = tokenizer(batch_sentences, padding="max_length", max_length=10)
print(encoded_input)
{'input_ids': [[101, 8667, 1291, 102, 0, 0, 0, 0, 0, 0], [101, 20164, 10932, 10289, 3371, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}
Notice that the output of the tokenizer for each sentence is padded to max_length (10) with the special padding token ID 0. Similarly, if we set padding=True, the output of the tokenizer for each sentence will be padded to the length of the longest sequence in the batch.
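For example, a minimal sketch reusing the batch above (the printed output is omitted here because the exact token IDs depend on the tokenizer):
encoded_input = tokenizer(batch_sentences, padding=True)
# Each 'input_ids' list is now only as long as the longest tokenized sentence
# in the batch, and 'attention_mask' still marks the padded positions with 0.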
Coming back to your question, padding has no effect if you pass a list of just one sentence to the tokenizer. If you have set batch_size = 1 during training or inference, your model will be processing your data one sentence at a time. This could be one reason why padding is not making a difference in your case.
Another possible, yet very unlikely, reason padding does not make a difference in your case is that all your sentences have the same length. Lastly, if you have not converted the output of the tokenizer to a PyTorch or TensorFlow tensor, varying sentence lengths would not be a problem. This again is unlikely in your case, given that you used your model for training and testing.
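As an illustrative sketch of the single-sentence case (assuming a PyTorch classification model and its tokenizer are already loaded as model and tokenizer, which the question does not show), you can compare the two inference paths directly:
import torch

sentence = "The movie was surprisingly good."  # hypothetical example input

# Without padding: the tensors are exactly as long as the tokenized sentence.
inputs_short = tokenizer(sentence, truncation=True, max_length=150,
                         return_tensors="pt")

# With padding="max_length": the tensors are padded to 150 tokens, and the
# attention_mask tells the model to ignore the padded positions.
inputs_padded = tokenizer(sentence, padding="max_length", truncation=True,
                          max_length=150, return_tensors="pt")

with torch.no_grad():
    logits_short = model(**inputs_short).logits
    logits_padded = model(**inputs_padded).logits

# Because padded positions are masked out, the two predictions should match,
# but the unpadded call processes far fewer tokens and is therefore faster.
print(logits_short.argmax(-1), logits_padded.argmax(-1))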

Related

Why is positional encoding needed when the input ids already represent the order of words in BERT?

For example, in Huggingface's example:
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)
{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The input_ids vector already encodes the order of each token in the original sentence. Why is positional encoding needed again, with an extra vector to represent it?
The reason is the design of the neural architecture. BERT consists of self-attention and feedforward sub-layers, and neither of them is sequential.
The feedforward layers process each token independently of others.
The self-attention views the input states as an unordered set of states. Attention can be interpreted as soft probabilistic retrieval from a set of values according to some keys. The position embeddings are there so the keys can contain information about their relative order.
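A minimal sketch of this, with toy dimensions rather than BERT's real configuration: the model's input is the sum of token embeddings and position embeddings, so without the position table the two occurrences of the same token below would be indistinguishable.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 100, 16, 8  # toy sizes, not BERT's real config

token_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_len, hidden)

input_ids = torch.tensor([[5, 7, 7, 3]])            # note the repeated token 7
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# Token embeddings alone give both occurrences of token 7 identical vectors;
# adding the position embeddings makes their representations position-aware.
hidden_states = token_emb(input_ids) + pos_emb(positions)
print(hidden_states.shape)  # torch.Size([1, 4, 8])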

Hexagon map: algorithm for finding whether some cells are surrounded

I am creating a simple math game with hexagon geometry in Unity, though the question is not really about Unity.
I borrowed the image from https://catlikecoding.com/unity/tutorials/ to illustrate the problem; it is quite large, so I only link to it.
Background
As in the linked tutorial, I store the data in an array. Simplified, it looks like:
[ 0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0,
0, 0, 0, 0 ]
My aim is
Check whether one cell, or a group of cells, is surrounded by cells of the other type.
Definition
Surrounded means that, for a cell or a group of connected cells, every neighbor carries the other flag.
For example,
[ 0, 1, 1, 0,
1, 0, 1, 0,
0, 1, 1, 0,
0, 0, 0, 0,
0, 0, 0, 0 ]
//Should become
[ 0, 1, 1, 0,
1, 1, 1, 0,
0, 1, 1, 0,
0, 0, 0, 0,
0, 0, 0, 0 ]
This case is easy; I don't even need an algorithm, since I create each cell with references to its neighbors, like:
class grid {
    grid[] neighbor;
    int flag; // 0 or 1
}
So, when I need to check whether a cell is surrounded, I just loop over its neighbors.
Problem
However, this method becomes tedious in cases like the following:
[ 0, 1, 1, 1,
1, 0, 0, 1,
0, 1, 1, 1,
0, 0, 0, 0,
0, 0, 0, 0 ]
So now I also need to check the neighbors' neighbors, like:
bool is_surrounded = true;
foreach (grid i in neighbor) {
    if (i.flag == 1) {
        // Good, this neighbor already carries the surrounding flag.
    } else {
        // Check i's neighbors: if every neighbor except the current cell is 1,
        // the pair may still be surrounded.
    }
}
This works fine for two blank cells, but what if there are three? Recursion is not OK either, because when a cell is not surrounded, like
[ 0, 1, 1, 1,
1, 0, 0, 1,
0, 1, 0, 1,
0, 0, 0, 0,
0, 0, 0, 0 ]
I would end up looping over the whole map, checking on the order of 8^n times.
Question
I think there is a cleverer method I haven't realized. I welcome an answer in any form or language, or even just an idea. I will certainly bounty a working answer with an explanation.
Thanks.
First you have to make a strict definition of what counts as a "surrounded" region. One possible approach: the cells that have no free path to the outer edge of the map.
To check this, use any simple traversal algorithm, for example DFS (pathfinding algorithms look like overkill here, since they need a goal point).
Concerning recursion: you need to mark seen cells to avoid rechecking them. There are flood-fill algorithms without recursion and with good complexity.
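As an illustrative sketch (in Python for brevity, since the question welcomes any language), a non-recursive flood fill from the map border marks every 0-cell that can reach the edge; any 0-cell left unmarked is surrounded and gets flipped. The neighbors function is a placeholder for however your hex grid computes adjacency.
from collections import deque

def fill_surrounded(flags, neighbors, border_cells):
    # flags: dict cell -> 0 or 1
    # neighbors(cell): iterable of adjacent on-map cells
    # border_cells: the cells on the outer edge of the map
    reachable = set(c for c in border_cells if flags[c] == 0)
    queue = deque(reachable)
    # Breadth-first flood fill: spread through connected 0-cells from the border.
    while queue:
        cell = queue.popleft()
        for n in neighbors(cell):
            if flags[n] == 0 and n not in reachable:
                reachable.add(n)
                queue.append(n)
    # A 0-cell the fill never reached has no free path to the edge: it is surrounded.
    for cell, flag in flags.items():
        if flag == 0 and cell not in reachable:
            flags[cell] = 1
    return flags
Each cell is visited at most once, so the whole pass is linear in the number of cells instead of the 8^n blow-up of re-checking neighbors recursively.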
You are going about this backwards. Your coding logic looks fine, but reverse it.
For each 1, check around it for other 1's. If you can return to that first 1 via any path of 1's, you have found a closed loop. Find the direction the returning path came from to get back to that first 1; that is where the inside of the loop is. Then, if you are not interested in any deeper loops, mark everything inside that loop (both 1's and 0's) as removed from further searching. Complete the search, and only after the search is done, mark everything inside the loops as 1's (if that is what you want). That way, if you have loops beside loops, you will not be restarting over and over again.
For sub-loops: consider every 1 as the potential start of a loop. If you return to any previous 1, find which direction the return path came from and consider that the inside of a loop.
When all this is done and you have found the loops, make your changes. Do not be concerned if a loop has zero inner positions, since 0 is a valid size; simply change everything inside the loops as you decide.

clpfd coverage algorithm speed improvements?

This is a follow-up question to Can you use clpfd to implement a coverage algorithm?
I have put the code here: http://swish.swi-prolog.org/p/Coverage%20using%20Constraints%20%20.pl
There are two search procedures.
start_exhaustive_search(Positives,Negatives,r(Features, Value,cm(TP,FP)))
And a heuristic search:
start_search(Ps,Ns,Result).
The heuristic search refines a rule until it no longer covers any negatives. cm stands for confusion matrix.
There are three ways to test the predicates: one with a small database accessible with pos(Ps) and negs(Ns), and another with a larger database accessible with approved(Ps) and notapproved(Ns). The latter also has some predicates, such as binary_to_features(Binary,Features), for turning the binary representation of used features into a list of named features.
You can also generate a random matrix of examples using random_binary_matrix_x_y(X,Y,R) (with X as 9, the result will be compatible with the larger approved/notapproved example).
Example exhaustive query:
?-approved(Ps),notapproved(Ns),start_exhaustive_search(Ps,Ns,Result).
Result = r([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 21, cm(6, 1)).
Example heuristic query:
?-approved(Ps),notapproved(Ns),start_search(Ps,Ns,Result).
Result = [r([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], 21, cm(6, 1)), r([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], 20, cm(4, 0))]
Neither method seems to be as fast as I would imagine is possible using constraints. Is there a way to improve the speed?
Also, I am curious why I can't use dif/2 but have to use \== on line 98 of the linked code.
I am using card/2 to count the number of examples covered; I can't see another way to do this?

How to make one-dimensional k-means clustering using Ruby?

My question:
I have searched through available Ruby gems to find one that performs k-means clustering. I've found quite a few: kmeans, kmeans-clustering, reddavis-k_means and k_means_pp. My problem is that none of the gems deals with one-dimensional k-means clustering. They all expect input like this:
[[1, 2], [3, 4], [5, 6]]
My input looks like this:
[1, 2, 3, 4, 5, 6]
Hence my question: How do I perform a one-dimensional k-means clustering using Ruby?
The context (my task):
I have 100 input values:
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 8, 8, 10, 16, 18, 22, 22, 35, 50, 50
Each value represents a response time, i.e. the number of minutes it took for some customer service agent to respond to an email from a customer. So the first value 0 indicates that the customer only waited 0 minutes for a response.
I need to find out how many fast, medium-fast, and slow response-time instances there are. In other words, I want to cut my input values up into 3 pools and then count how many there are in each pool.
The complicating factor is that I have to figure out where to make the cuts based on the overall steepness of the slope. There is no fixed definition of fast, medium-fast, and slow. The first cut (between fast and medium-fast) should occur where the steepness of the slope starts to increase more drastically than before. The second cut (between medium-fast and slow) should occur when an even more dramatic steepness increase occurs.
Here is a graphical representation of the input values.
In the above example, common sense would probably define fast as 0-3, because there are many instances of 0, 1, 2, and 3. 4-8 or 4-10 look like common-sense choices for medium-fast. But how can something like this be determined mathematically? If the response times were generally faster, the customers would be expecting that, so an even smaller increase towards the end should trigger the cut.
Finishing notes:
I did find the gem davidrichards-kmeans that deals with one-dimensional k-means clustering, but it doesn't seem to work properly (the example code raises a syntax error).
k-means is the wrong tool for this job anyway.
It's not designed for fitting an exponential curve.
Here is a much more sound proposal for you:
Look at the plot, mark the three points, and then you have your three groups.
Or look at quantiles... Report the median response time, the 90% quantile, and the 99% quantile...
Clustering is about structure discovery in multivariate data. It's probably not what you want it to be, sorry.
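A minimal sketch of the quantile suggestion above (in Python/NumPy purely for illustration; the equivalent is straightforward in Ruby):
import numpy as np

# Shortened sample of the response times from the question.
response_times = np.array([0, 0, 1, 1, 2, 3, 3, 4, 5, 8, 10, 16, 22, 50])

print("median:", np.median(response_times))
print("90% quantile:", np.percentile(response_times, 90))
print("99% quantile:", np.percentile(response_times, 99))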
If you insist on trying k-means, try encoding the data as
[[1], [2], [3], [4], [5]]
and check whether the results are at least roughly what you want them to be (also remember that k-means is randomized; running it multiple times may yield very different results).
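For instance, a sketch of that encoding with scikit-learn (again in Python, only to show the reshaping idea; the same column-vector layout applies to the Ruby gems listed in the question):
import numpy as np
from sklearn.cluster import KMeans

# Shortened sample of the response times from the question.
response_times = np.array([0, 0, 1, 1, 2, 3, 3, 4, 5, 8, 10, 16, 22, 50])

# Turn the flat list into single-element rows: [[0], [0], [1], ...]
X = response_times.reshape(-1, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster in range(3):
    print(cluster, response_times[labels == cluster])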

Optimizing list processing in Scala

Right now I am processing a large amount of JSON data coming from a Mixpanel API. With a small dataset it's a breeze, and the code below runs just fine. However, a large dataset takes a rather long time to process, and we're starting to see timeouts because of it.
My Scala optimization skills are rather poor, so I am hoping someone can show me a faster way to process the following with large datasets. Please do explain why, since it will help my own understanding of Scala.
val people = parse[mp.data.Segmentation](o)
val list = people.data.values.map(b =>
    b._2.map(p =>
      Map(
        "id" -> p._1,
        "activity" -> p._2.foldLeft(0)(_ + _._2)
      )
    )
  )
  .flatten
  .filter { behavior => behavior("activity") != 0 }
  .groupBy(o => o("id"))
  .map { case (k, v) =>
    Map("id" -> k, "activity" -> v.map(o => o("activity").asInstanceOf[Int]).sum)
  }
And that Segmentation class:
case class Segmentation(
  val legend_size: Int,
  val data: Data
)

case class Data(
  val series: List[String],
  val values: Map[String, Map[String, Map[String, Int]]]
)
Thanks for your help!
Edit: sample data as requested
{"legend_size": 4, "data": {"series": ["2013-12-17", "2013-12-18", "2013-12-19", "2013-12-20", "2013-12-21", "2013-12-22", "2013-12-23", "2013-12-24", "2013-12-25", "2013-12-26", "2013-12-27", "2013-12-28", "2013-12-29", "2013-12-30", "2013-12-31", "2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04", "2014-01-05", "2014-01-06"], "values": {"afef4ac12a21d5c4ef679c6507fe65cd": {"id:twitter.com:194436690": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 1, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:330103796": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 1, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:216664121": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 0, "2013-12-23": 1, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}, "id:twitter.com:414117608": {"2013-12-20": 0, "2013-12-29": 0, "2013-12-28": 1, "2013-12-23": 0, "2013-12-22": 0, "2013-12-21": 0, "2013-12-25": 0, "2013-12-27": 0, "2013-12-26": 0, "2013-12-24": 0, "2013-12-31": 0, "2014-01-06": 0, "2014-01-04": 0, "2014-01-05": 0, "2014-01-02": 0, "2014-01-03": 0, "2014-01-01": 0, "2013-12-30": 0, "2013-12-17": 0, "2013-12-18": 0, "2013-12-19": 0}}}}}
To answer Millhouse's question, the intention is to sum up each date to provide a number that describes total volume of "activity" for each ID. The "ID" is formatted as id:twitter.com:923842.
I don't know the full extent of your processing, what pipelines you have going on, what stress your server is under, or what sort of threading profile you've set up to receive the information. However, assuming that you've correctly separated I/O from CPU-bound tasks and that what you've shown us is strictly CPU-bound, try simply adding .par to the very first Map:
people.data.values.par.map(b =>
as a first pass to see if you can get some performance gains. I don't see any specific ordering required of the processing, which tells me it's ripe for parallelization.
Edit
After playing around with parallelization, I would add that modifying the TaskSupport is helpful for this case. You can modify a parallelized collection's tasksupport as such:
import scala.collection.parallel._
val pc = mutable.ParArray(1, 2, 3)
pc.tasksupport = new ForkJoinTaskSupport(
new scala.concurrent.forkjoin.ForkJoinPool(2))
See http://www.scala-lang.org/api/2.10.3/index.html#scala.collection.parallel.TaskSupport
I have some suggestions that might help.
I would try to move the filter command as early in the program as possible. Since your data contains many dates with 0 activity, you would see improvements doing this. The best solution might be to test for this while parsing the JSON data; if that is not possible, make it the first statement.
The way I understand it, you would like to end up with a way to look up an aggregate of the sums for a given id. I would suggest you represent this with a map from the id to the aggregate. Also, the Scala List class has a sum function.
I came up with this code:
// Pairs of (id, summed activity across all dates).
val originalList_IdToAggregate =
  people.data.values.toList.flatMap(_._2).map(p => (p._1, p._2.values.sum))
It might not match your project directly, but I think it is almost what you need.
If you need to turn this into a map, just append toMap to the end.
If this doesn't give you enough speed, you could create your own parser that aggregates and filters while parsing only this kind of JSON. Writing parsers is quite easy in Scala if you use the parser combinators. Just keep in mind to throw away what you don't need as early as possible and not to make too many deep branches; this should give you a fast solution with a low memory footprint.
As for going parallel, this can be a good idea. I don't know enough about your application to tell you what the best way is, but it might be possible to hide the computational cost of processing the data under the cost of transporting the data. Try to balance parsing and I/O over multiple threads and see if you can achieve this.
