Bert only trains for positive labels and not for negative labels - sentiment-analysis

I am fine-tuning BertModel for sentiment analysis with cross-entropy loss.
It trains fine when the labels are non-negative, but it throws an error as soon as any label is negative.
For example, if the labels are 0, 1, 2 it trains perfectly.
But if the labels are 0, 1, -1 it throws an error:
cross-entropy IndexError: Target -1 is out of bounds.
Everything else is the same.
So do I have to use a different loss function?
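For reference, a minimal sketch of what is going on, assuming the loss is PyTorch's nn.CrossEntropyLoss (which raises exactly this IndexError): its targets must be class indices in the range [0, num_classes - 1], so remapping the labels rather than switching loss functions avoids the error.

```python
import torch
import torch.nn as nn

# nn.CrossEntropyLoss expects class indices in [0, num_classes - 1].
# A label of -1 is out of bounds for a 3-class output, hence the IndexError.
logits = torch.randn(4, 3)                 # batch of 4, 3 classes
loss_fn = nn.CrossEntropyLoss()

raw_labels = torch.tensor([0, 1, -1, 1])   # -1 raises "Target -1 is out of bounds"

# Remap {-1, 0, 1} -> {0, 1, 2} before computing the loss
label_map = {-1: 0, 0: 1, 1: 2}
labels = torch.tensor([label_map[int(l)] for l in raw_labels])

loss = loss_fn(logits, labels)
print(loss)
```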

Related

multiple ROC curve in R with a matrix of prediction values and labels

I want to plot multiple ROC curves from a matrix of predictions and labels. I have > 100 samples, each with a matrix of predictions and labels, and the samples have different lengths. How could I design a single matrix for all the samples and get multiple ROC curves in a single plot? I would appreciate any suggestions. Thanks
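Not an R answer, but as an illustration of a data layout that avoids padding everything into a single matrix: keep one (labels, scores) pair per sample in a list and compute each ROC curve separately. A minimal sketch with scikit-learn and matplotlib; the `samples` list here is made-up placeholder data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Hypothetical data: one (labels, scores) pair per sample, lengths may differ.
rng = np.random.default_rng(0)
samples = [(rng.integers(0, 2, n), rng.random(n)) for n in (50, 80, 120)]

for i, (labels, scores) in enumerate(samples):
    fpr, tpr, _ = roc_curve(labels, scores)
    plt.plot(fpr, tpr, label=f"sample {i} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], "k--", lw=0.5)    # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```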

Elasticsearch Geoshape query false results

I have two geo_shapes in ES. What I need to figure out is the best way to understand if one of the shapes (Green) contains or intersects with another (Red).
Please see below a visual representation of three different cases:
Case I: easy to detect - using the Green shape coordinates, make a Geoshape query with "relation" = "within"
Case II: also not a problem - using the Green shape coordinates, make a Geoshape query with "relation" = "INTERSECTS"
Case III: the real problem - using the Green shape coordinates, I make a Geoshape query with "relation" = "INTERSECTS" and the Red shape is returned as a result... that is a false positive - these shapes do not intersect each other (I think), even though one of their sides is touching the other...
Is there any way to avoid the false positive results here? Any other suggestions on how to solve this task?
P.S. the coordinates are precise (example: 13.335594692338). There are no additional mapping options like tree_levels or precision...
Every polygon stored in Elasticsearch as a geoshape gets transformed into a list of strings.
To narrow down this explanation a bit, I'm going to assume that the polygon you're storing in Elasticsearch uses geohash storage (which is the default for the geoshape type).
I don't want to go into great detail, but take a look at this image and this description taken from the Elasticsearch docs (the details don't match exactly, but you get the big picture):
Geohashes divide the world into a grid of 32 cells (4 rows and 8 columns), each represented by a letter or number. The g cell covers half of Greenland, all of Iceland, and most of Great Britain. Each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on. The gc cell covers Ireland and England, gcp covers most of London and part of Southern England, and gcpuuz94k is the entrance to Buckingham Palace, accurate to about 5 meters.
Your polygon is projected onto a list of rectangles, each represented by a string (geohash). The precision of this projection depends on the tree level. I don't know what the default tree level is in Elasticsearch, but if you're getting false positives it seems to be too low for your needs.
A tree level of 8 splits the world into rectangles of size 38.2m x 19.1m. If the edge of your polygon goes through the middle of such a rectangle, the rectangle's geohash may or may not (depending on the implementation) be assigned to your polygon.
To solve your problem you need to increase the tree level to match your needs (more on the mapping here). Beware, though, that the size of the index will increase greatly (it also depends on the size and complexity of the shapes). As an example, storing 1000 district-sized polygons (some having hundreds of points) with a tree level of 8 gives an index size of about 600-700MB.
Bear in mind that whatever tree level you choose, you always risk getting some false positives, as a geohash will never be a 100% precise representation of your shape. It's a precision vs. performance trade-off, and geohash is the performance-wise choice.
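If you go the route of increasing the precision, the change looks roughly like the sketch below, using the elasticsearch-py client. The index and field names are made up, and the tree/precision parameters only apply to the older prefix-tree (geohash) based geo_shape indexing discussed above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Hypothetical index/field names. On prefix-tree based geo_shape mappings,
# "precision" (or "tree_levels") controls the geohash resolution:
# finer precision means fewer false positives but a much larger index.
es.indices.create(index="shapes", body={
    "mappings": {
        "properties": {
            "geometry": {
                "type": "geo_shape",
                "tree": "geohash",
                "precision": "1m"      # instead of the coarse default
            }
        }
    }
})

# Query: which stored (Red) shapes intersect the Green polygon?
green_polygon = {
    "type": "polygon",
    "coordinates": [[[13.0, 52.0], [13.1, 52.0], [13.1, 52.1],
                     [13.0, 52.1], [13.0, 52.0]]]
}
res = es.search(index="shapes", body={
    "query": {
        "geo_shape": {
            "geometry": {
                "shape": green_polygon,
                "relation": "intersects"
            }
        }
    }
})
```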

Understanding Gradient Descent Algorithm

I'm learning machine learning. I was reading a topic called Linear Regression with One Variable and I got confused while trying to understand the Gradient Descent algorithm.
Suppose we are given a problem with a training set in which each pair $(x^{(i)},y^{(i)})$ represents (feature/input variable, target/output variable). Our goal is to create a hypothesis function for this training set which can make predictions.
Hypothesis Function:
$$h_{\theta}(x)=\theta_0 + \theta_1 x$$
Our target is to choose $(\theta_0,\theta_1)$ so that $h_{\theta}(x)$ best predicts the values in the training set.
Cost Function:
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum\limits_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2$$
$$J(\theta_0,\theta_1)=\frac{1}{2}\times \text{Mean Squared Error}$$
We have to minimize $J(\theta_0,\theta_1)$ to get the values of $(\theta_0,\theta_1)$ that we can put in our hypothesis function. We can do that by applying the Gradient Descent algorithm on the surface $(\theta_0,\theta_1,J(\theta_0,\theta_1))$.
My question is: how do we choose $(\theta_0,\theta_1)$ and plot the surface $(\theta_0,\theta_1,J(\theta_0,\theta_1))$? In the online lecture I was watching, the instructor explained everything else but didn't mention where the plot comes from.
At each iteration you have some $h_{\theta}$, and you calculate $\frac{1}{2n}\sum_{i=1}^{n} (h_{\theta}(x^{(i)})-y^{(i)})^2$ over the training set.
At each iteration $h_{\theta}$ is known, and the pairs $(x^{(i)},y^{(i)})$ for each training sample are known, so the expression above is easy to compute.
For each iteration you have a new value of $\theta$, so you can calculate the new MSE.
The plot itself has the iteration number on the x-axis and the MSE on the y-axis.
As a side note, while you can use gradient descent, there is no need: this cost function is convex and has a unique, well-known minimum, $\theta = (X^TX)^{-1}X^Ty$, where $y$ is the vector of training targets ($n \times 1$ for a training set of size $n$) and $X$ is the $n \times 2$ matrix whose rows are $X_i=(1,x_i)$.
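In case it helps, a minimal NumPy sketch of both routes (gradient descent with the cost $J$ recorded at every iteration, and the closed-form normal equation); the toy data are made up for illustration.

```python
import numpy as np

# Toy training set (made up): y is roughly 4 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 4 + 3 * x + rng.normal(0, 1, 50)

X = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix, rows (1, x_i)
theta = np.zeros(2)                          # (theta_0, theta_1)
alpha, iters, n = 0.01, 2000, len(y)

cost_history = []
for _ in range(iters):
    residual = X @ theta - y                               # h_theta(x^(i)) - y^(i)
    cost_history.append((residual @ residual) / (2 * n))   # J(theta_0, theta_1)
    theta -= alpha * (X.T @ residual) / n                  # gradient step

# Plotting cost_history against range(iters) gives the iteration-vs-MSE plot
# described above.

# Closed-form solution for comparison: theta = (X^T X)^{-1} X^T y
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(theta, theta_exact, cost_history[-1])
```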

K Nearest Neighbors classification Special Case with Identical Points

The question is about KNN algorithm for classification - the class labels of training samples are discrete.
Suppose that the training set has n points that are identical to the new pattern we are about to classify, i.e. the distances from these points to the new observation are zero (or < epsilon). It may happen that these identical training points have different class labels. Now suppose that n < K, so there are some other training points that are part of the nearest-neighbor collection but have non-zero distances to the new observation. How do we assign the class label to the new point in this case?
There are a few possibilities, such as:
consider all K (or more if there are ties with the worst nearest neighbor) neighbors and do majority voting
ignore the neighbors with non-zero distances if there are "clones" of the new point in the training data, and take the majority vote only over the clones
same as 2., but assign the class with the highest prior probability in the training data (among the clones)
...
Any ideas? (references would be appreciated as well)
Each of the proposed methods will work in some problems, and in others it won't. In general, there is no need to think hard about such border cases; simply use the default behaviour (option "1" from your question). In fact, if border cases of any classification algorithm become a problem, it is a signal of at least one of:
bad problem definition,
bad data representation,
bad data preprocessing,
bad model used.
From the theoretical point of view, nothing changes if some points fall exactly on your training data. The only difference would be this: if you have a consistent training set (in the sense that duplicates with different labels do not occur in the training data) that is 100% correct (each label is a perfect labeling for its point), then it would be reasonable to add an if clause that answers according to the label of the matching point. But in reality that is rarely the case.
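To make the difference between option 1 (default behaviour) and the "if clause" concrete, a small sketch with scikit-learn; the exact-match override and the epsilon threshold are my own additions, not part of any standard kNN implementation.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def predict_with_clone_check(X_train, y_train, x_new, k=5, eps=1e-12):
    """Default kNN vote (option 1), unless exact 'clones' of x_new exist,
    in which case take the majority vote among the clones only (option 2)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    clone_labels = y_train[dists < eps]
    if len(clone_labels) > 0:
        return Counter(clone_labels).most_common(1)[0][0]
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return knn.predict([x_new])[0]

# Tiny made-up example: two exact copies of the query with conflicting labels
X_train = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.1]])
y_train = np.array([0, 1, 1, 1, 1])
print(predict_with_clone_check(X_train, y_train, np.array([0.0, 0.0]), k=3))
```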

von Karman curve fitting to field measured wind spectrum

So for this wind monitoring project I'm getting data from a couple of 3D sonic anemometers, specifically two R.M. Young 81000 units. The data are output digitally at a sampling frequency of 10 Hz in periods of 10 min. After all the pre-processing (coordinate rotation, trend removal...) I get 3 orthogonal time series of the turbulent data. Right now I'm using 2 hours of stationary measurements with windows of 4096 points and 50% overlap to obtain the frequency spectra in all three directions. After obtaining the spectrum I apply a logarithmic frequency smoothing algorithm, which averages the obtained spectrum over logarithmically spaced intervals.
I have two questions:
The spectra I obtain from the measured data show a clear downward trend at the highest frequencies, as seen in the attached figure. I wonder if this loss of energy could have anything to do with an internal filter in the sonic anemometer? Or what else could cause it? Is there a way to compensate for this loss, or is it better to just consider the spectrum up to the "break frequency"?
http://i.stack.imgur.com/B11uP.png
When applying the curve-fitting algorithm to determine the integral length scales according to the von Karman equation, what is the correct procedure: curve fitting the original data, which gives more weight to higher-frequency data points, or using the logarithmically frequency-smoothed data to approximate the von Karman equation, giving equal weight to data on the logarithmic scale? In some cases I obtain very different estimates for the integral length scales using the two approaches (e.g. original -> Lu=113.16, Lv=42.68, Lw=9.23; freq. smoothed -> Lu=148.60, Lv=30.91, Lw=14.13).
Curve fitting with Logarithmic frequency smoothing and with Original data:
http://i.imgur.com/VL2cf.png
Let me know if something is not clear. I'm relatively new to this field and I might be making some mistakes in my approach, so any advice or tips would be amazing.
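To make the second fitting route concrete, a sketch with SciPy of the Welch estimate, the log-frequency smoothing, and a one-parameter fit. The spectral form used below, $f S_u(f)/\sigma_u^2 = 4 n_L/(1+70.8\,n_L^2)^{5/6}$ with $n_L = f L_u/U$, is one common parameterization of the longitudinal von Karman spectrum, and `u`, `U_mean`, and `fs` are placeholders to be replaced by the measured data.

```python
import numpy as np
from scipy.signal import welch
from scipy.optimize import curve_fit

fs = 10.0                                    # sampling frequency [Hz]
U_mean = 8.0                                 # placeholder mean wind speed [m/s]
rng = np.random.default_rng(0)
u = rng.normal(0, 1, 2 * 3600 * int(fs))     # stand-in for 2 h of detrended u'

# Welch PSD: 4096-point windows, 50% overlap (as in the question)
f, S_u = welch(u, fs=fs, nperseg=4096, noverlap=2048)
f, S_u = f[1:], S_u[1:]                      # drop the zero-frequency bin
sigma2 = np.var(u)

# Logarithmic frequency smoothing: average the PSD in log-spaced bins
edges = np.logspace(np.log10(f[0]), np.log10(f[-1]), 40)
idx = np.digitize(f, edges)
f_s = np.array([f[idx == i].mean() for i in np.unique(idx)])
S_s = np.array([S_u[idx == i].mean() for i in np.unique(idx)])

# One common form of the longitudinal von Karman spectrum, S_u(f; L_u)
def von_karman(f, L):
    n_L = f * L / U_mean
    return sigma2 * 4 * (L / U_mean) / (1 + 70.8 * n_L**2) ** (5 / 6)

L_fit, _ = curve_fit(von_karman, f_s, S_s, p0=[50.0], bounds=(0, np.inf))
print("Lu estimate:", L_fit[0])
```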
