How to select one specific frame - image

I am detecting vehicles in a video/camera stream, and the detection itself works fine. However, suppose a 6-second video contains 2 vehicles and each vehicle is visible for about 2 seconds: the program then extracts 35+ frames for each vehicle. In other words, it extracts every frame that contains a vehicle. My requirement is to extract only one frame per vehicle: if the 6-second video contains 2 vehicles, I should get the 2 frames that show each vehicle in full and ignore all other frames. I have already applied an entropy technique, which improves things, but I still get too many frames of the same vehicle. Which technique would let me extract only the one frame that contains the whole vehicle and ignore all other frames of that same vehicle?

Assuming that you not only get a binary detection result ("there is a car") but also some kind of spatial information ("there is a car, and its bounding box is ...") then you can simply keep the frame that shows the most.
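Something like this:
    best_frame = None
    best_frame_score = 0.0
    for frame in video:
        # detect_car returns (car detected?, score for this frame)
        has_car, score = detect_car(frame)
        # keep the frame only if it beats the best score seen so far
        if has_car and score > best_frame_score:
            best_frame = frame
            best_frame_score = score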
This assumes that the function detect_car returns a binary detection result and some score. The score could for example be the size of the bounding box.
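The loop above keeps a single best frame for the whole video. To get one frame per vehicle, as the question asks, the same idea can be applied to each run of consecutive detections. The following is only a sketch under assumptions: detect_car is the same hypothetical function as above, vehicles pass one at a time, and max_gap (a name made up here) is the number of missed detections tolerated before a pass is considered finished. Overlapping vehicles would need a real tracker.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def best_frame_per_vehicle(
    frames: Iterable[Frame],
    detect_car: Callable[[Frame], Tuple[bool, float]],
    max_gap: int = 0,
) -> List[Tuple[int, Frame, float]]:
    """Return (frame_index, frame, score) for the best frame of each pass.

    A "pass" is a run of frames with a detection; more than `max_gap`
    consecutive frames without a detection ends the current pass.
    """
    results = []
    best = None  # best (index, frame, score) in the current pass
    misses = 0   # consecutive frames without a detection
    for index, frame in enumerate(frames):
        has_car, score = detect_car(frame)
        if has_car:
            misses = 0
            if best is None or score > best[2]:
                best = (index, frame, score)
        elif best is not None:
            misses += 1
            if misses > max_gap:
                results.append(best)  # the pass is over: keep its best frame
                best = None
                misses = 0
    if best is not None:  # video ended while a vehicle was still visible
        results.append(best)
    return results
```

With bounding-box area as the score, each pass yields the frame in which the vehicle appears largest, which is usually the one where it is fully in view.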

Related

Detecting individual images in an array of images

I'm building a photographic film scanner. The electronic hardware is done; now I have to finish the mechanical advance mechanism and then I'm almost done.
I'm using a line scan sensor, so each scan line is one pixel wide by 2000 pixels high. The data stream I will be sending to the PC over USB with an FTDI FIFO bridge will be just 1-byte pixel values. The scanner will pull through an entire strip of 36 frames, so I will end up with a scan of the entire strip. To begin with I'm willing to split the frames up manually in Photoshop, but I would like to implement something in my program to do this for me. I'm using C++ in VS. So, basically, I need a way for the PC to detect the near-black strips between the images on the film, isolate the images, and save them as individual files.
Could someone give me some advice for this?
That sounds pretty simple compared to the things you've already implemented; you could:
1) Calculate an average pixel value per row, and call the resulting signal s(n) (n being the row number).
2) Set a threshold for s(n), setting everything below that threshold to 0 and everything above to 1.
3) Assuming you don't know the exact pixel height of the black bars and the negatives, search for periodicities in s(n). What I describe in the following is total overkill, but that's how I roll:
   a) Use FFTW to calculate a discrete Fourier transform of s(n), call it S(f) (f being the frequency, i.e. 1/period).
   b) Find f_max = argmax(abs(S(f))) for f > 0 (skip f = 0, which only reflects the mean of s(n)); that f represents the distance between two black bars: number of rows / f_max is the bar distance.
   c) S(f) is complex, and thus has an argument; the phase atan2(imag(S(f_max)), real(S(f_max))) tells you where the periodic pattern sits within one period. -phase / (2*pi) * bar distance is the offset of the pattern's maximum (roughly the image centers, since images are 1 after thresholding), and the bars sit about half a bar distance away from that.
4) To calculate the width of the bars, you could do the same with the second highest peak of abs(S(f)), but it'll probably be easier to just count the average length of the runs of 0 around the calculated center positions of the black bars.
5) To get the exact width of the image strip, only take the pixels in which the image border may lie: r_left(x) would be the signal representing the few pixels in which the actual image might border on the filmstrip material (x being the coordinate along that row). Now, use a simplistic high-pass filter (e.g. f(x) := r_left(x) - r_left(x-1)) to find the sharpest edge in that region (argmax(abs(f(x)))). Use the average of these edges as the border location.
By the way, if you want to write a source block that takes your scanned image as input and outputs a stream of pixel row vectors, using GNU Radio would offer you a nice method of having a flow graph of connected signal processing blocks that does exactly what you want, without you having to care about getting data from A to B.
I forgot to add: use the resulting coordinates with something like OpenCV, or any other library capable of reading images, specifying sub-images by coordinates, and saving them to new images.
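As a rough illustration of the simple route (row averages, a threshold, then counting runs of dark rows, without the FFT), here is a minimal Python sketch. It assumes the scan is stored with one scan line per row as a list of rows of 8-bit values; the function names and the min_length parameter are made up for this example, and the threshold has to be tuned to the actual film.

```python
from typing import List, Sequence, Tuple


def row_means(image: Sequence[Sequence[int]]) -> List[float]:
    """Average pixel value of each row: the signal s(n)."""
    return [sum(row) / len(row) for row in image]


def dark_runs(
    s: Sequence[float], threshold: float, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of runs where s(n) < threshold."""
    runs = []
    start = None  # index where the current dark run began, if any
    for n, value in enumerate(s):
        if value < threshold:
            if start is None:
                start = n
        elif start is not None:
            if n - start >= min_length:  # ignore runs too short to be a bar
                runs.append((start, n))
            start = None
    if start is not None and len(s) - start >= min_length:
        runs.append((start, len(s)))
    return runs


def frames_between(
    runs: Sequence[Tuple[int, int]], total_rows: int
) -> List[Tuple[int, int]]:
    """Row ranges of the images: the gaps between consecutive dark bars."""
    frames = []
    previous_end = 0
    for start, end in runs:
        if start > previous_end:
            frames.append((previous_end, start))
        previous_end = end
    if previous_end < total_rows:
        frames.append((previous_end, total_rows))
    return frames
```

Each (start, end) pair from frames_between can then be used as the crop range for one image. Setting min_length to a sensible fraction of the expected bar height keeps dark regions inside an image from being mistaken for a bar.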

Exactly how do we compute timestamp differentials?

We get timestamps as double values for pose, picture, and point data, and they aren't always aligned. How do I calculate the temporal distance between two timestamps? Yes, I know how to subtract two doubles, but I'm not at all sure how the delta corresponds to time.
I have some interesting timestamp data that sheds light on your question, without exactly answering it. I have been trying to match up depth frames with image frames - just as a lot of people posting under this Tango tag. My data did not match exactly and I thought there were problems with my projection matrices and point reprojection. Then I checked the timestamps on my depth frames and image frames and found that they were off by as much as 130 milliseconds. A lot! Even though I was getting the most recent image whenever a depth frame was available. So I went back to test just the timestamp data.
I am working in Native with code based on the point-cloud-jni-example. For each of onXYZijAvailable(), onFrameAvailable(), and onPoseAvailable() I am dumping out time information. In the XYZ and Frame cases I am copying the returned data to a static buffer for later use. For this test I am ignoring the buffered image frame, and the XYZ depth data is displayed in the normal OpenGL display loop of the example code. The data captured looks like this:
callback type : systime : timestamp : last pose
I/tango_jni_example( 3247): TM CLK Img 5.420798 110.914437 110.845522
I/tango_jni_example( 3247): TM CLK XYZ 5.448181 110.792470 110.845522
I/tango_jni_example( 3247): TM CLK Pose 5.454577 110.878850
I/tango_jni_example( 3247): TM CLK Img 5.458924 110.947708 110.878850
I/tango_jni_example( 3247): TM CLK Pose 5.468766 110.912178
The system time is from std::chrono::system_clock::now() run inside of each callback. (Offset by a start time at app start.) The timestamp is the actual timestamp data from the XYZij, image, or pose struct. For depth and image I also list the most recent pose timestamp (from start-of-service to device, with given time of 0.0). A quick analysis of about 2 minutes of sample data leads to the following initial conclusions:
Pose data is captured at VERY regular intervals of 0.033328 seconds.
Depth data is captured at pretty regular intervals of 0.2 seconds.
Image data is captured at odd intervals: 3 or 4 frames at 0.033 seconds, then 1 frame at about 0.100 seconds, often followed by a second frame with the same timestamp (even though it is not reported until the next onFrameAvailable()?).
That is the actual timestamp data in the returned structs. The "real?" elapsed time between callbacks is much more variable. The pose callback fires anywhere from 0.010 to 0.079 seconds, even though the pose timestamps are rock solid at 0.033. The image (frame) callback fires 4 times at between 0.025 and 0.040 and then gives one long pause of around 0.065. That is where two images with the same timestamp are returned in successive calls. It appears that the camera is skipping a frame?
So, to match depth, image, and pose you really need to buffer multiple returns with their corresponding timestamps (ring buffer?) and then match them up by whichever value you want as master. Pose times are the most stable.
Note: I have not tried to get a pose for a particular "in between" time to see if the returned pose is interpolated between the values given by onPoseAvailable().
I have the logcat file and various awk extracts available. I am not sure how to post those (1000's of lines).
I think the fundamental question is how to sync the pose, depth, and color image data together into a single frame. To answer that, there are actually two steps:
1) Sync pose to either the color image or the depth data: the simplest way is to use the TangoService_getPoseAtTime function, which lets you query the pose at a given timestamp. For example, when a depth point cloud becomes available, it comes with the timestamp of that depth frame, and you can use that timestamp to query the corresponding pose.
2) Sync color image and depth image: currently, you have to buffer either the depth point cloud or the color image at the application level and, based on the timestamp of one, look up the matching data for the other in the buffer. There is a field named color_image in the TangoXYZij data structure, and the comment says it's reserved for future use, so a built-in sync feature might be coming in future releases.
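The application-level buffering could look roughly like the sketch below. It is written in Python only to show the matching logic (the Tango callbacks themselves are native code); the class name, capacity, and max_delta are made up for this example. It assumes the timestamps are in seconds, which the intervals in the log above (about 0.033 between poses, i.e. roughly 30 per second) suggest, and that they arrive in non-decreasing order.

```python
import bisect
from collections import deque
from typing import Any, Deque, Optional, Tuple


class TimestampBuffer:
    """Fixed-size buffer of (timestamp, data) pairs, kept in arrival order."""

    def __init__(self, capacity: int = 32) -> None:
        # The oldest entries are dropped automatically once capacity is reached.
        self._items: Deque[Tuple[float, Any]] = deque(maxlen=capacity)

    def push(self, timestamp: float, data: Any) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._items.append((timestamp, data))

    def nearest(
        self, timestamp: float, max_delta: float
    ) -> Optional[Tuple[float, Any]]:
        """Return the entry closest in time, or None if none is within max_delta."""
        if not self._items:
            return None
        times = [t for t, _ in self._items]
        i = bisect.bisect_left(times, timestamp)
        # Only the entries just before and just after the query can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - timestamp))
        if abs(times[best] - timestamp) > max_delta:
            return None
        return self._items[best]
```

The image callback would push each frame with its timestamp, and the depth callback would call nearest() with the depth timestamp. The delta between the two timestamps is then simply the difference of the doubles, and max_delta decides how much skew is acceptable.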

How to select the specific frame with object

I am detecting an object from a live camera using feature detection with an SVM. The program reads every frame from the camera and runs prediction on it, which hurts its speed. I want it to select only the frames that contain the object and ignore the frames that have no object, such as an empty street or parked cars; it should only detect moving objects.
For example, if the object enters the camera's view in the 6th frame, it stays in view for many frames until it leaves the camera's range, so the program should not count the same object again and should ignore those frames.
Explanation:
I am detecting vehicles in a video and want to ignore the empty frames, but how? I only want to check the frames that contain an object such as a vehicle. But if a vehicle takes, let's assume, 5 seconds to pass through the video, the same object appears in, say, 10 frames, so the program counts it as 10 vehicles, one per frame. I want to count it as 1, because it is one (the SAME) vehicle that spans 10 frames.
My video is already background-subtracted.
I explored two techniques:
1- Entropy ( Frame subtraction )
2- Keyframe extraction
This question is confusingly worded. What output do you want from this analysis? Here's the stages I see:
1) I assume each frame gives you an (x,y) or null for the position of each object in the frame. Can you do this?
2) If you might get multiple objects in a frame, you have to match them with objects in the previous frame. If this is not a concern, skip to (3). Otherwise, assign an index to each object in the first frame they appear. In subsequent frames, match each object to the index in the previous frame based on (x,y) distance. Clumsy, but it might be good enough.
3) Calculating velocity. Look at the difference in (x,y) between this frame and the last one. Of course, you can't do this on the first frame. Maybe apply a low-pass filter to position to smooth out any jittery motion.
4) Missing objects. This is a hard one. If your question is how to treat empty frames with no object in them, then I feel like you just ignore them. But, if you want to track objects that go missing in the middle of a trajectory (like maybe a ball with motion blur) then that is harder. If this is what you're going for, you might want to do object matching by predicting the next position using position, velocity, and maybe even object characteristics (like a histogram of hues).
I hope this was helpful.
You need an object tracker (many examples can be found on the web for tracking code). Then what you are looking for is the number of tracks. That's your answer.
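To make the "number of tracks" idea concrete, here is a deliberately simple sketch of a tracker that matches detections to existing tracks by (x,y) distance, as suggested in stage 2 above. It is not a production tracker: the class name and the max_distance and max_missed parameters are made up for this example, matching is greedy, and the input is assumed to be a list of object centroids per frame.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class CentroidTracker:
    """Greedy nearest-centroid tracker that counts distinct objects."""

    def __init__(self, max_distance: float = 50.0, max_missed: int = 5) -> None:
        self.max_distance = max_distance    # largest plausible move per frame
        self.max_missed = max_missed        # frames a track may go unseen
        self.tracks: Dict[int, Point] = {}  # track id -> last known position
        self.missed: Dict[int, int] = {}    # track id -> consecutive misses
        self.total_tracks = 0               # number of distinct objects so far

    def update(self, detections: List[Point]) -> Dict[int, Point]:
        """Feed the centroids found in one frame; returns the active tracks."""
        unmatched = list(detections)
        for track_id, position in list(self.tracks.items()):
            if unmatched:
                nearest = min(unmatched, key=lambda p: math.dist(p, position))
                if math.dist(nearest, position) <= self.max_distance:
                    # Same object as in the previous frame: update, don't recount.
                    self.tracks[track_id] = nearest
                    self.missed[track_id] = 0
                    unmatched.remove(nearest)
                    continue
            self.missed[track_id] += 1
            if self.missed[track_id] > self.max_missed:
                del self.tracks[track_id]
                del self.missed[track_id]
        for point in unmatched:  # anything left over is a new object
            self.tracks[self.total_tracks] = point
            self.missed[self.total_tracks] = 0
            self.total_tracks += 1
        return dict(self.tracks)
```

Empty frames are handled by passing an empty list, and after the last frame total_tracks is the vehicle count: a vehicle visible in 10 frames contributes one track, not 10.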

Face matching after detection

I am working on software that will match a captured image (face) with 3/4 images (faces) of the same person. There are 2 possibilities:
1- The captured image (face) is of the same person whose 3/4 images (faces) are already stored in the database
2- The captured image is of a different person
I want to get the results for the above 2 scenarios, i.e. matched in case 1 and not matched in case 2. I used 40 Gabor filters to get good results, and I get the results in an array (histogram). But it doesn't seem to work well, and environmental conditions such as lighting also influence the matching process. Can anyone suggest a good and efficient technique to achieve this?
Well, this is basically a face identification problem.
You can use LBP (Local Binary Patterns) to extract features from the images. LBP is a very robust method that is largely invariant to illumination changes.
You can try the following steps.
Training:
1) Extract the face region (using an OpenCV Haar cascade)
2) Resize all the extracted face regions to the same size
3) Divide the resized face into sub-regions (e.g. 8*9)
4) Extract LBP features from each region and concatenate them, because localization of the features is very important
5) Train an SVM on the concatenated features, with a different label for each person's images
Testing:
1) Take a face image and follow training steps 1 to 4
2) Predict using the SVM (which person's image this is)
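The feature extraction part (dividing the face into sub-regions, computing LBP per region, and concatenating) can be sketched as follows. This uses the basic 3x3 LBP operator and plain Python lists for clarity; the function names are made up for this example, face detection, resizing, and the SVM are left to whatever library you use, and a real implementation would work on NumPy arrays instead.

```python
from typing import List, Sequence

# Offsets (dy, dx) of the 8 neighbours, clockwise from the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(gray: Sequence[Sequence[int]]) -> List[List[int]]:
    """Basic 3x3 LBP code for every interior pixel of a grayscale image."""
    height, width = len(gray), len(gray[0])
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(NEIGHBOURS):
                # Set the bit when the neighbour is at least as bright as the center.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_grid_histogram(
    gray: Sequence[Sequence[int]], grid_rows: int, grid_cols: int
) -> List[float]:
    """Concatenate the normalised 256-bin LBP histograms of each grid cell."""
    codes = lbp_codes(gray)
    height, width = len(codes), len(codes[0])
    feature: List[float] = []
    for r in range(grid_rows):
        y0, y1 = r * height // grid_rows, (r + 1) * height // grid_rows
        for c in range(grid_cols):
            x0, x1 = c * width // grid_cols, (c + 1) * width // grid_cols
            hist = [0] * 256
            for y in range(y0, y1):
                for x in range(x0, x1):
                    hist[codes[y][x]] += 1
            total = max(1, (y1 - y0) * (x1 - x0))  # avoid dividing by zero
            feature.extend(count / total for count in hist)
    return feature
```

Because each code only records whether a neighbour is brighter than the center, adding a constant brightness offset to the whole image leaves the feature vector unchanged, which is where the robustness to lighting comes from. The resulting vector (256 values per cell) is what would be passed to the SVM.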

What is a 6-tap filter and how do they differ across codecs?

In a study on VP8 decoding I found the phrase "6-tap filter in any case will be a 6-tap filter, and the difference is usually only in the coefficients". So what is a 6-tap filter, and how does it work?
Can anyone please explain what a 6-tap filter is and how such filters differ across codecs?
There are two places in video codecs where these filters are typically used:
Motion estimation/compensation
Video codecs compress much better than still image codecs because they also remove the redundancy between frames. They do this using motion estimation and motion compensation. The encoder splits up the image into rectangular blocks of image data (typically 16x16) and then tries to find the block in a previously coded frame that is as similar as possible to the block that's currently being coded. The encoder then only transmits the difference, and a pointer to where it found this good match. This is the main reason why video codecs get about 1:100 compression, where image codecs get 1:10 compression. Now, you can imagine that sometimes the camera or an object in the scene didn't move by a full pixel, but actually by half or a quarter of a pixel. A better match is then found if the image is interpolated to sub-pixel positions, and these filters are used to do that. The exact way they do this filtering often differs per codec.
Deblocking
Another reason for using such a filter is to remove artifacts from the transform that's being used. Just like in still-image coding, there's a transform that maps the image data into a different space that "compacts the energy". For instance, after this transform, image sections that have a uniform color, like a blue sky, will result in data that has just a single number for the color, and then all zeros for the rest of the data. Compared to the original data, which stores blue for all of the pixels, a lot of redundancy has been removed. After the transform (Google for DCT, KLT, integer transform), the zeros are typically thrown away, and the remaining, less relevant data is coded with fewer bits than in the original. During decoding, since data has been thrown away, this often results in visible edges between neighboring 8x8 or 16x16 blocks. There's a separate smoothing filter that then smooths these edges again.
A 6-tap filter is a filter with six coefficients (taps), most likely an FIR filter, which would make it 5th order, since an N-tap FIR filter has order N-1. The coefficients determine the frequency response of the filter. Without knowing the structure, coefficients, and sample rate you can't really say much more about the filter.
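To make "the difference is usually only in the coefficients" concrete, here is a small sketch of a 6-tap FIR filter used for half-pixel interpolation along one row of samples. The two coefficient sets are the half-pixel luma taps of H.264 and the half-pixel taps of VP8's six-tap filter; the function name is made up for this example, and real codecs add details such as border handling and separate horizontal and vertical passes that are left out here.

```python
from typing import List, Sequence

# Same filter structure, different coefficients:
# H.264 half-pixel luma taps sum to 32, VP8 half-pixel taps sum to 128.
H264_HALF_PEL = (1, -5, 20, 20, -5, 1)
VP8_HALF_PEL = (3, -16, 77, 77, -16, 3)


def interpolate_half_pel(samples: Sequence[int], taps: Sequence[int]) -> List[int]:
    """Half-pixel values between samples[i + 2] and samples[i + 3] for each valid i."""
    scale = sum(taps)
    out = []
    for i in range(len(samples) - len(taps) + 1):
        # Weighted sum of six neighbouring samples: one tap per sample.
        acc = sum(t * samples[i + k] for k, t in enumerate(taps))
        value = (acc + scale // 2) // scale  # normalise with rounding
        out.append(min(255, max(0, value)))  # clip to the 8-bit range
    return out
```

Both filters reproduce flat and linearly changing regions, but they respond differently to sharp detail: the negative outer taps sharpen the interpolation compared to plain averaging of the two nearest pixels, and how strongly they do so is a per-codec design choice.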
