pyhf: Support for Variable Bin Width Histograms

I would like to obtain the expected limits for my analysis using pyhf. The previous iteration of this analysis used variable bin width histograms, and I am wondering whether pyhf can handle this correctly.
I have heard that HistFactory does not support variable bin widths, although I couldn't find any obvious statement of that in the HistFactory paper (I may be missing something obvious, though). The only documentation I could find referring to variable bin widths in HistFactory was an old JIRA ticket, along with a related ROOT Forum post.
Naively, I would assume that if HistFactory does not support variable binning, then pyhf wouldn't either. However, pyhf doesn't seem to use the bin edges in any way (at least, they are not passed to pyhf at all). Also, I obtained what looked like reasonable results when running a hypotest on variable-binned distributions.
I couldn't find anything in the pyhf documentation saying not to use variably binned inputs, nor did I find anything in the pyhf GitHub issues or tagged here on Stack Overflow. If one should only use uniformly-binned histograms, then this might be good to add to the documentation somewhere (unless it's indeed already there, and I just completely missed it).

Ultimately the likelihood boils down to multiple joint counting experiments, where each bin has its own Poisson term, so variable bin widths are not a fundamental issue.
There is a bit of an issue in the RooFit implementation, since it doesn't follow the Poisson structure directly (rather, it uses the RooHistFunc), but in pyhf it should not be an issue.
PS: we started migrating Q&A re: pyhf to
https://github.com/scikit-hep/pyhf/discussions/categories/q-a
so feel free to continue the discussion there if you have additional questions.

To echo and expand a bit on Lukas's answer: as seen in the pyhf docs section on HistFactory, the main part of the probability model is a product of Poissons over all bins in all channels. From this alone we can see that bin width is not considered in the probability model, and as pyhf is an implementation of the full probability model, the bin width is not used in pyhf either. This can be further seen in the "Likelihood Specification" section of the pyhf docs, as there is no metadata on bin width to complement the data field.
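As a minimal sketch of this point (assuming pyhf >= 0.6 and its simplemodels helper; the yields below are made up), the model specification and observed data carry only per-bin counts, so pyhf never sees whether those counts came from uniform or variable-width bins:

```python
import pyhf

# Two bins specified purely by their yields; nothing records the bin edges,
# so these could just as well be a 10 GeV bin next to a 500 GeV bin.
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0, 10.0],
    bkg=[50.0, 60.0],
    bkg_uncertainty=[7.0, 8.0],
)

# Observed counts per bin, plus the auxiliary data for the constraint terms.
data = [52.0, 63.0] + model.config.auxdata

# Standard hypothesis test for a signal strength of 1; it runs identically
# regardless of how the underlying histogram was binned.
cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
print(cls_obs)
```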

Related

Theory, idea for finding copied shapes on an image

The description of my problem is simple; I fear the problem itself isn't. I would like to find the copied, duplicated part of an image. Which part of the image was copied and pasted back into the same image at another position (for example, using Photoshop)?
Please check the attached image. The red rectangle containing the value 20 was moved from the price field to the validity field. Please note that the rectangle's size and position aren't fixed or known; they could vary. Only the image is given, no other information.
Could you help me naming a theoretical method, idea, paper, people who are working on the problem above?
I posted my question here (Stack Overflow) instead of the Computer Vision site to reach as many people as I can, because the problem can perhaps be transformed. One solution I could think of is looking for the two largest rectangles that contain the same values inside a huge matrix (the image).
Thanks for your help and time.
Note: I don't want to use the metadata to detect the forgery.
If you have access to the digital version of the forgery, and the forger (or the author of the forgery-creation software) is a complete idiot, it can be as simple as looking at the image metadata for signs of 'shopping.
If the digital file has been "washed" to remove said signs, or the forgery has been printed and then scanned back to you, it is a MUCH harder problem, again unless the forgers are complete idiots.
In the latter case you can only hope for making the forger's work harder, but there is no way to make it impossible - after all, banknotes can be forged, and they are much better protected than train tickets.
I'd start reading from here: http://www.cs.dartmouth.edu/farid/downloads/publications/spm09.pdf
SIFT features can be used to identify "similar regions" that might have been copied from a different part of the image. A starting point could be to use OpenCV's SIFT demo (included in the library) and feed it parts of the image as input, to see where a rough match is available. Detailed matching can follow to see if the region actually is a copy.
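A rough sketch of that first matching step (my own illustration, assuming opencv-python with SIFT available; the file name is hypothetical): match the image's SIFT descriptors against themselves and keep pairs of distinct, well-separated keypoints that look alike, which are candidate copy-move regions.

```python
import cv2
import numpy as np

img = cv2.imread("suspect.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Match the descriptors against themselves; k=3 so that after skipping the
# trivial self-match we still have a best and second-best candidate.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(descriptors, descriptors, k=3)

candidates = []
for m in matches:
    if len(m) < 3:
        continue
    # m[0] is the keypoint matched to itself; m[1] and m[2] are the next best.
    best, second = m[1], m[2]
    # Lowe-style ratio test to keep only distinctive matches.
    if best.distance < 0.5 * second.distance:
        p1 = keypoints[best.queryIdx].pt
        p2 = keypoints[best.trainIdx].pt
        # Require some spatial separation so a point is not matched to its own neighbourhood.
        if np.hypot(p1[0] - p2[0], p1[1] - p2[1]) > 10:
            candidates.append((p1, p2))

print(f"{len(candidates)} candidate copy-move point pairs")
```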

Methods to identify duplicate questions on Twitter?

As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not a question.
Extract the features from the training set. The features are traditionally words, but every time I tried it, using bi-grams significantly improved the results (3-grams were not helpful in my cases).
Build a classifier from the data. I usually found that SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or KNN (though you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify a tweet. (1) A small sketch of these steps follows below.
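Here is a minimal sketch of steps 1-4 (the library choice, scikit-learn, and the tiny labelled set are mine, not the answer's): a bag-of-words model with unigrams and bigrams feeding a linear SVM.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder labelled set: 1 = question, 0 = not a question.
tweets = ["how do I reset my password?", "loving the weather today",
          "anyone know a good pizza place?", "just finished a great run"]
labels = [1, 0, 1, 0]

# Unigram + bigram bag-of-words features feeding a linear SVM.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(tweets, labels)

# Cross-validation gives an estimate of the expected accuracy (footnote 1).
print(cross_val_score(clf, tweets, labels, cv=2))

# Classify a new tweet.
print(clf.predict(["is this a question?"]))
```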
(B)
This issue is referred to in the Information Retrieval world as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build a vector that represents the semantics of each term: for a term t, the entry vector_t[i] is the tf-idf score relating term t to the i-th Wikipedia concept. The idea is described in detail in the article; reading the first 3-4 pages is enough to understand it, no need to read it all. (2)
For each tweet, construct a vector which is a function of the vectors of its terms. Compare two such vectors, and you can identify whether two questions are discussing the same issues.
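The comparison step looks roughly like this (a sketch with scikit-learn; plain tf-idf vectors stand in here for the Wikipedia-based concept vectors described in the article, since building those requires the full Wikipedia pipeline):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two tweets to compare; plain tf-idf stands in for the semantic vectors.
tweets = ["how do I install python on windows?",
          "what is the easiest way to set up python on a windows machine?"]

vectors = TfidfVectorizer().fit_transform(tweets)

# Cosine similarity close to 1 suggests the two questions discuss the same issue.
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"similarity: {similarity:.2f}")
```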
EDIT:
On second thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing to extract features (for example, for each term, also denote whether it is pre-subject or post-subject, as determined using NLP processing), combining this with machine learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the algorithm they created as an open-source implementation; you just need to look for it.

Preventing generation of swastika-like images when generating identicons

I am using this PHP script to generate identicons. It uses Don Park's original identicon algorithm.
The script works great and I have adapted it to my own application to generate identicons. The problem is that sometimes swastikas are generated. While swastikas have peaceful origins, people do take offence when seeing those symbols.
What I would like to do is to alter the algorithm so that swastikas are never generated. I have done a bit of digging and found this thread on Microsoft's website where an employee states that they have added a tweak to prevent generation of swastikas, but nothing more.
Has anyone identified what the tweak would be and how to prevent swastikas from being generated?
Identicons appear to me (on a quick glance) always to have four-fold rotational symmetry. Swastikas certainly do. How about just repeating the quarter-block in a different way? If you take a quarter-block that would produce a swastika in the current pattern, and reflect two diagonally-opposite quarters, then you get a sort of space invader.
Basically, nothing with reflectional symmetry can look very much like a swastika. I suppose if there's a small swastika entirely contained within the quarter, then you still have a problem.
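A small sketch of the difference between the two tilings (my own illustration with NumPy, not Don Park's actual algorithm; the quarter block and seed are placeholders): rotating the quarter gives four-fold rotational symmetry, while mirroring it gives reflectional symmetry, which cannot form a swastika larger than one quarter.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # in practice the seed comes from the identicon hash
quarter = rng.integers(0, 2, size=(4, 4))  # hypothetical 4x4 quarter block

# Rotational tiling: the kind of layout that can produce swastika-like shapes.
rotational = np.block([
    [quarter,              np.rot90(quarter, 1)],
    [np.rot90(quarter, 3), np.rot90(quarter, 2)],
])

# Mirrored tiling: reflecting two diagonally opposite quarters gives the icon
# reflectional symmetry, ruling out swastika-like figures spanning the quarters.
mirrored = np.block([
    [quarter,             np.fliplr(quarter)],
    [np.flipud(quarter),  np.flipud(np.fliplr(quarter))],
])

print(rotational)
print(mirrored)
```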
On Jeff Atwood's thread introducing identicons, Don Park suggested:
Re Swastika comments, that can be addressed by applying a specialized OCR-like visual analysis to identify all offending codes then crunch them into an effective bloom filter using genetic algorithm. When the filter returns true, a second type of identicon (i.e. 4-block quilt) can be used.
Alternatively, you could avoid the issue entirely by replacing identicons with unicorns.
My original suggestion involving visual analysis was in the context of the particular algorithm in use, namely the 9-block quilt.
If you want to try another algorithm without the swastika problem, try introducing symmetry like that seen in inkblots to the popular 16-block quilt identicons.

Methods of reducing the number of features when doing machine learning on images

I'm performing machine learning on a set of 25 x 125 images. After extracting the RGB components this becomes 9375 features per example (and I have about 675 examples). I was trying fminunc and fminsearch and thought that there was something wrong with my method, because it was 'freezing', but when I decreased the number of features by a factor of 10, it took a while but worked. How can I reduce the number of features while maintaining the information relevant in the picture? I tried k-means, but I don't see how that helps, as I still have the same number of features, just with a lot of redundancy.
You're looking for feature reduction or selection methods. For example see this library:
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
or see this question
Feature Selection in MATLAB
If you google "feature selection/reduction matlab" you will find many relevant articles and tools. Or you could look up some commonly used methods like PCA (principal component analysis).
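The question is framed in MATLAB/Octave, but the PCA idea is the same in any language. A sketch with scikit-learn, using random data in place of the real 675 x 9375 feature matrix from the question:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the real data: 675 examples, 25*125*3 = 9375 RGB features each.
X = np.random.rand(675, 9375)

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Far fewer columns than 9375; feed this reduced matrix to the optimizer instead.
print(X_reduced.shape)
```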

OpenCV: Fingerprint Image and Compare Against Database

I have a database of images. When I take a new picture, I want to compare it against the images in this database and receive a similarity score (using OpenCV). This way I want to detect whether I already have an image that is very similar to the fresh picture.
Is it possible to create a fingerprint/hash of my database images and match new ones against it?
I'm searching for an algorithm, code snippet, or technical demo, not a commercial solution.
Best,
Stefan
As Paul R has commented, this "fingerprint/hash" is usually a set of feature vectors or feature descriptors. But most feature vectors used in computer vision are too computationally expensive for searching against a database. So this task needs a special kind of feature descriptor, because descriptors such as SURF and SIFT will take too much time for searching, even with various optimizations.
The only thing that OpenCV has for your task (object categorization) is an implementation of Bag of Visual Words (BOW).
It can compute a special kind of image feature and train a visual-word vocabulary. You can then use this vocabulary to find similar images in your database and compute a similarity score.
Here is the OpenCV documentation for bag of words. OpenCV also has a sample named bagofwords_classification.cpp. It is really big, but it might be helpful.
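A rough sketch of that pipeline in Python (my own illustration of OpenCV's BOW classes; the file names and vocabulary size are placeholders): build a visual-word vocabulary from the database, describe each image as a histogram over those words, and compare histograms as a crude similarity score.

```python
import cv2

# Hypothetical file names; in practice loop over the whole image database.
database_files = ["db_0.png", "db_1.png"]
query_file = "query.png"

sift = cv2.SIFT_create()
bow_trainer = cv2.BOWKMeansTrainer(100)  # 100-word visual vocabulary

# 1) Collect SIFT descriptors from the database and cluster them into a vocabulary.
for name in database_files:
    img = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(img, None)
    bow_trainer.add(descriptors)
vocabulary = bow_trainer.cluster()

# 2) Describe each image as a histogram over the visual words.
bow_extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
bow_extractor.setVocabulary(vocabulary)

def bow_histogram(name):
    img = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    return bow_extractor.compute(img, sift.detect(img, None))

# 3) Crude similarity score: distance between histograms (smaller = more similar).
query_hist = bow_histogram(query_file)
for name in database_files:
    dist = cv2.norm(query_hist, bow_histogram(name), cv2.NORM_L2)
    print(name, dist)
```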
Content-based image retrieval systems are still a field of active research: http://citeseerx.ist.psu.edu/search?q=content-based+image+retrieval
First you have to be clear about what constitutes "similar" in your context:
Similar color distribution: Use something like color descriptors for subdivisions of the image; you should get some fairly satisfying results.
Similar objects: Since the computer does not know what an object is, you will not get very far unless you have extensive domain knowledge about the objects (or only a few object classes). A good overview of the current state of research can be seen here (results) and soon here.
There is no "serves all needs" algorithm for the problem you described. The more you can share about the specifics of your problem, the better answers you might get. Posting some representative images (if possible) and describing the desired outcome is also very helpful.
This would be a good question for computer-vision.stackexchange.com, if it already existed.
You can use the pHash algorithm, store the pHash value in the database, and then use this code:
double const mismatch = algo->compare(image1Hash, image2Hash);
Here the 'mismatch' value tells you how different the two images are (for pHash, a smaller value means the images are more alike).
Available hash functions:
AverageHash
PHash
MarrHildrethHash
RadialVarianceHash
BlockMeanHash
ColorMomentHash
These functions are good enough to evaluate image similarity from most angles.
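For reference, the same compare() call in Python looks roughly like this (a sketch using OpenCV's img_hash module from opencv-contrib; the file names are hypothetical):

```python
import cv2

# Perceptual hasher; AverageHash, BlockMeanHash, etc. can be created the same way.
hasher = cv2.img_hash.PHash_create()

db_image = cv2.imread("db_image.png")    # hypothetical stored image
new_image = cv2.imread("new_image.png")  # freshly taken picture

db_hash = hasher.compute(db_image)   # this small hash is what you store in the database
new_hash = hasher.compute(new_image)

# For PHash the result is a Hamming distance: 0 = identical, larger = more different.
mismatch = hasher.compare(db_hash, new_hash)
print(mismatch)
```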
