Saving a rigid-body simulation as an animation

I work in autonomous robotics. I will often simulate a robot without visualization, export position and rotation data to a file at ~30 fps, and then play that file back at a later time. Currently, I save the animation data in a custom-format JSON file and animate using three.js.
I am wondering if there is a better way to export this data?
I am not well versed in animation, but I suspect that I could be exporting to something like COLLADA or glTF and gain the benefits of using a format that many systems are already set up to import.
I have a few questions (some specific and some general):
How do animations usually get exported in these formats? It seems that most of them have something to do with skeletons or morphing, but neither of those concepts appears to apply to my case. (Could I get a pointer to an overview of general animation concepts?)
I don't really need key-framing. Is it reasonable to have key-frames at 30 to 60 fps without any need for interpolation?
Do any standard animation formats save data in a format that doesn't assume some form of interpolation?
Am I missing something? I'm sure my lack of knowledge in the area has hidden something that is obvious to animators.

You specifically mentioned autonomous robots, and position and rotation in particular. So I assume that the robot itself is the level of granularity that is supposed to be stored here. (Just to differentiate it from an articulated robot - basically a manipulator ("arm") with several rotational or translational joints that may have different angles)
For this case, a very short, high-level description about how this could be stored in glTF(*):
You would store the robot (or each robot) as one node of a glTF asset. Each of these nodes can contain a translation and rotation property (given as a 3D vector and a quaternion). These nodes would then simply describe the position and orientation of your robots. You could imagine the robot being "attached" to these nodes. (In fact, you can attach a mesh to these nodes in glTF, which could then be the visual representation of the robot.)
The animation data itself is then a description of how these properties (translation and rotation) change over time. The way this information is stored can be imagined as a table where you associate the translation and rotation with each time stamp:
time (s)         0.1     0.2    ...    1.0
translation x    1.2     1.3    ...    2.3
translation y    3.4     3.4    ...    4.3
translation z    4.5     4.6    ...    4.9
rotation x       0.12    0.13   ...    0.42
rotation y       0.32    0.43   ...    0.53
rotation z       0.14    0.13   ...    0.34
rotation w       0.53    0.46   ...    0.45
This information is stored in binary form and provided by so-called accessor objects.
The animation of a glTF asset then establishes the connection between this binary animation data and the node properties that it affects: each animation refers to such a "data table" and to the node whose translation and rotation will be updated as time progresses.
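To make this concrete, here is a minimal sketch in Python (standard library only) of what such an export could look like for the translation of a single node. The 30 fps sample data and file names are placeholders, and a rotation channel would be added the same way with a VEC4 accessor and the "rotation" target path; this is a sketch of the glTF 2.0 structure, not a drop-in exporter.

    import json
    import struct

    times = [i / 30.0 for i in range(300)]                 # 10 s at 30 fps
    positions = [(0.1 * i, 0.0, 0.0) for i in range(300)]  # placeholder data

    time_bytes = b"".join(struct.pack("<f", t) for t in times)
    pos_bytes = b"".join(struct.pack("<fff", *p) for p in positions)

    with open("robot_anim.bin", "wb") as f:
        f.write(time_bytes + pos_bytes)

    gltf = {
        "asset": {"version": "2.0"},
        "scene": 0,
        "scenes": [{"nodes": [0]}],
        "nodes": [{"name": "robot", "translation": [0, 0, 0], "rotation": [0, 0, 0, 1]}],
        "buffers": [{"uri": "robot_anim.bin", "byteLength": len(time_bytes) + len(pos_bytes)}],
        "bufferViews": [
            {"buffer": 0, "byteOffset": 0, "byteLength": len(time_bytes)},
            {"buffer": 0, "byteOffset": len(time_bytes), "byteLength": len(pos_bytes)},
        ],
        "accessors": [
            # keyframe times (the animation sampler input needs min/max)
            {"bufferView": 0, "componentType": 5126, "count": len(times),
             "type": "SCALAR", "min": [times[0]], "max": [times[-1]]},
            # translation values, one VEC3 per keyframe
            {"bufferView": 1, "componentType": 5126, "count": len(positions), "type": "VEC3"},
        ],
        "animations": [{
            "samplers": [{"input": 0, "output": 1, "interpolation": "LINEAR"}],
            "channels": [{"sampler": 0, "target": {"node": 0, "path": "translation"}}],
        }],
    }

    with open("robot_anim.gltf", "w") as f:
        json.dump(gltf, f, indent=2)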
Regarding interpolation:
In your case, where the output is sampled at a high rate from the simulation, basically each frame is a "key frame", and no explicit information about key frames or the interpolation scheme will have to be stored. Just declaring that the animation interpolation should be of the type LINEAR or STEP should be sufficient for this use case.
(The option to declare it as LINEAR interpolation is mainly relevant for playback. Imagine you stop your playback exactly at 0.15 seconds: should it then show the state that the robot had at time stamp 0.1, the state at time stamp 0.2, or one that is interpolated linearly between them? This, however, mainly applies to a standard viewer, and not necessarily to a custom playback.)
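As a small illustration of the difference, here is a hedged Python sketch of how a player could sample one scalar channel at playback time t under the STEP and LINEAR modes (for rotations, a real viewer would use spherical linear interpolation of the quaternions instead):

    import bisect

    def sample(times, values, t, mode="LINEAR"):
        # times is sorted; values holds one keyframe value per entry of times
        if t <= times[0]:
            return values[0]
        if t >= times[-1]:
            return values[-1]
        i = bisect.bisect_right(times, t) - 1           # keyframe at or before t
        if mode == "STEP":
            return values[i]
        u = (t - times[i]) / (times[i + 1] - times[i])  # LINEAR blend factor
        return values[i] + u * (values[i + 1] - values[i])

    print(sample([0.1, 0.2], [1.2, 1.3], 0.15))         # -> 1.25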
(*) A side note: on a conceptual level, the way the information is represented in glTF and COLLADA is similar. Roughly speaking, COLLADA is an interchange format for authoring applications, and glTF is a transmission format that can be transferred and rendered efficiently. So although the answers so far refer to glTF, you should consider COLLADA as well, depending on your priorities, your use cases, or how the "playback" that you mentioned is supposed to be implemented.
Disclaimer: I'm a glTF contributor as well. I also created the glTF tutorial section showing a simple animation and the one that explains some concepts of animations in glTF. You might find them useful, but they obviously build upon some of the concepts that are explained in the earlier sections.

The type of animation you describe is often called "baked" animation, where some calculation has been sampled, possibly at 30 ~ 60 fps, with keyframes saved at the high sample rate. For such animations, usually linear interpolation is applied. For example, in Blender, there's a way to run the Blender Game Engine and record the physics simulation to (dense) keyframes.
As for interpolation, here's a thought experiment: Consider for a moment a polygon-based render engine wants to render a circle, but must use only straight lines. Some limited number of points are calculated around the edge of the circle, and dozens or hundreds of small line segments fill in the gaps between the points. With enough density, or with the camera far enough back, it looks round, but the line segments ensure there are no leaks or gaps in the would-be circle. The same concept applies (in time rather than in space) to baked keyframes. There's a high sample density, and straight lines (linear interpolation) fill in the gaps. If you play it in super-slow motion, you might be able to detect subtle changes in speed as new keyframes are reached. But at normal speed, it looks normal, and the frame rate doesn't need to stay locked to the sample rate.
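In code, "baking" is little more than the following hedged sketch, where step and get_pose stand in for whatever physics engine is actually driving the simulation:

    def bake(step, get_pose, duration_s, rate_hz=30):
        """Run the simulation and store one dense keyframe per sample."""
        dt = 1.0 / rate_hz
        keyframes = []
        t = 0.0
        while t <= duration_s:
            keyframes.append((t, get_pose()))   # (time, (position, rotation))
            step(dt)                            # advance the physics by one sample
            t += dt
        return keyframes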
There's a section on animations for glTF 2.0 that I'll recommend reading here (disclaimer, I'm a glTF contributor and member of the working group). In particular, look at the descriptions of node-based animations with linear interpolation.
For robotics, you'll want to steer clear of skins and skeleton-based animation. Such things are not always compatible with node-based animations anyway (we've run into problems there just recently). The node-based animations are much more applicable to non-deforming robots with articulated joints and such.

Related

Augmented reality like zookazam

What algorithms are used for augmented reality like zookazam?
I think it analyzes the image and finds planes by contrast, but I don't know how.
What topics should I read about before starting an app like this?
[Prologue]
This is an extremely broad topic and mostly off topic in its current state. I re-edited your question, but to make it answerable within the rules/possibilities of this site
you should specify more closely what your augmented reality:
should do
adding 2D/3D objects with known mesh ...
changing light conditions
adding/removing body parts/clothes/hairs ...
It is a good idea to provide some example image (sketch) of the input/output you want to achieve.
what input it has
video, static image, 2D, stereo, 3D. For pure 2D input, specify what conditions/markers/illumination/laser patterns you have to help the reconstruction.
What will be in the input image? An empty room, persons, specific objects, etc.?
specify target platform
Many algorithms are limited by memory size/bandwidth, CPU power, special HW capabilities, etc., so it is a good idea to add a tag for your platform. The OS and language are also good to add.
[How augmented reality works]
acquire input image
If you are connecting to some device like a camera, you need to use its driver/framework (or some common API it supports) to obtain the image. This task is OS dependent. My favorite way on Windows is to use the VFW (Video for Windows) API.
I would start with some static file(s) instead, to ease debugging and the incremental build process (you do not need to wait for the camera and so on with each build). When your app is ready for live video, switch back to the camera...
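The answer above uses VFW; as a hedged, cross-platform alternative, here is a small OpenCV sketch that reads from a test file while developing and can be switched to a live camera later (test_scene.mp4 is a placeholder name):

    import cv2

    USE_CAMERA = False
    source = 0 if USE_CAMERA else "test_scene.mp4"   # 0 = first attached camera

    cap = cv2.VideoCapture(source)
    while True:
        ok, frame = cap.read()
        if not ok:
            break                                    # end of file or camera error
        cv2.imshow("input", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()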
reconstruct the scene into 3D mesh
If you use a 3D camera like Kinect, this step is not necessary. Otherwise you need to distinguish the objects by some segmentation process, usually based on edge detection or color homogeneity.
The quality of the 3D mesh depends on what you want to achieve and on your input. For example, if you want realistic shadows and lighting, you need a very good mesh. If the camera is fixed in some room, you can predefine the mesh manually (hard-code it) and compute just the objects in view. Also, object detection/segmentation can be done very simply by subtracting the empty-room image from the current view image, so the pixels with a big difference are the objects (a minimal sketch of this is given after this section).
You can also use planes instead of a real 3D mesh, as you suggested in the OP, but then you can forget about the more realistic quality of effects like lighting, shadows, intersections... If you assume the objects are standing upright, you can use room metrics to obtain the distance from the camera. See:
selection criteria for different projections
estimate measure of photographed things
For pure 2D input you can also use the illumination to estimate the 3D mesh. See:
Turn any 2D image into 3D printable sculpture with code
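Here is the background-subtraction segmentation sketch promised above, assuming a fixed camera and two placeholder images, empty.png (the empty room) and current.png (the current view); the thresholds are illustrative and would need tuning:

    import cv2

    background = cv2.imread("empty.png", cv2.IMREAD_GRAYSCALE)
    current = cv2.imread("current.png", cv2.IMREAD_GRAYSCALE)

    diff = cv2.absdiff(current, background)                     # per-pixel difference
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)   # big difference = object
    mask = cv2.medianBlur(mask, 5)                              # suppress pixel noise

    # OpenCV 4.x return signature; 3.x also returns the input image
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    objects = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]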
render
Just render the scene back to some image/video/screen with the added/removed features. If you are not changing the light conditions too much, you can also use the original image and render directly into it. Shadows can be achieved by darkening the pixels... For better results, the illumination/shadows/spots/etc. are usually filtered out of the original image first and then added back directly by rendering instead. See:
White balance (Color Suppression) Formula?
Enhancing dynamic range and normalizing illumination
The rendering process itself is also platform dependent (unless you are doing it with low-level graphics in memory). You can use things like GDI, DX, OpenGL, ... See:
Graphics rendering
You also need camera parameters for rendering like:
Transformation of 3D objects related to vanishing points and horizon line
[Basic topics to google/read]
2D
DIP digital image processing
Image Segmentation
3D
Vector math
Homogeneous coordinates
3D scene reconstruction
3D graphics
normal shading
platform dependent
image acquisition
rendering

Shading mask algorithm for radiation calculations

I am working on a software (Ruby - Sketchup) to calculate the radiation (sun, sky and surrounding buildings) within urban development at pedestrian level. The final goal is to be able to create a contour map that shows the level of total radiation. With total radiation I mean shortwave (light) and longwave(heat). (To give you an idea: http://www.iaacblog.com/maa2011-2012-digitaltools/files/2012/01/Insolation-Analysis-All-Year.jpg)
I know there are several existing software packages that do this, but I need to write my own, as this calculation is only part of a more complex workflow.
The (obvious) pseudo code is the following:
Select and mesh the surface for analysis
For each point of the mesh
    Cast n (see below) rays into the upper hemisphere (precalculated)
    For each ray, check whether it is in shade
        If in shade => extract properties from the intersected surface
        If not in shade => flag it
    loop
loop
loop
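The real implementation is Ruby inside Sketchup, but here is a hedged Python sketch of the same brute-force loop; mesh_points, hemisphere_rays and intersect_scene are hypothetical stand-ins for the geometry calls:

    def shadow_mask(mesh_points, hemisphere_rays, intersect_scene):
        # One entry per mesh point: for every ray either the properties of the
        # intersected surface (in shade) or None (the ray reaches the sky).
        mask = []
        for point in mesh_points:
            hits = []
            for ray in hemisphere_rays:                 # precalculated directions
                surface = intersect_scene(point, ray)   # None if nothing is hit
                if surface is not None:
                    hits.append(surface.properties)     # in shade: keep surface data
                else:
                    hits.append(None)                   # not in shade: flag it
            mask.append(hits)
        return mask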
The approach above is brute force, but it is the only one I can think of. The calculation time increases with the fourth power of the accuracy (Dx, Dy, Dazimuth, Dtilt). I know that software like Radiance uses a Monte Carlo approach to reduce the number of rays.
As you can imagine, the accuracy of the calculation for a specific point of the mesh strongly depends on the accuracy of the skydome subdivision. Similarly, the accuracy on the surface depends on the coarseness of the mesh.
I was thinking of a different approach using adaptive refinement based on the results of the calculations. The refinement could work for both the analyzed surface and the skydome: if the results at two adjacent points differ by more than a threshold value, a refinement is performed. This is usually done in fluid simulation, but I could not find anything about light simulation.
I also wonder whether there are algorithms, from computer graphics for example, that would minimize the number of calculations. For example: check the maximum height of the surroundings so as to exclude certain parts of the skydome for certain points.
I don't need extreme accuracy as I am not doing rendering. My priority is speed at this moment.
Any suggestion on the approach?
Thanks
n rays
At the moment I subdivide the sky by constant azimuth and tilt steps; this causes irregular solid angles. There are other subdivisions (e.g. Tregenza) that maintain a constant solid angle.
EDIT: Response to the great questions from Spektre
Time frame. I run one simulation for each hour of the year. The weather data is extracted from an epw weather file. It contains, for each hour, solar altitude and azimuth, direct radiation, diffuse radiation, and cloudiness (for atmospheric longwave diffuse). My algorithm calculates the shadow mask separately, then uses this shadow mask to calculate the radiation on the surface (and on a typical pedestrian) for each hour of the year. It is in this second step that I add the actual radiation. In the first step I just gather information on the geometry and properties of the various surfaces.
Sun paths. No, I don't. See point 1.
Include reflection from buildings? Not at the moment, but I plan to include it as an overall diffuse shortwave reflection based on sky view factor. I consider only shortwave reflection from the ground now.
Include heat dissipation from buildings? Absolutely yes. That is the reason why I wrote this code myself. Here in Dubai this is key, as building surfaces get very, very hot.
Surface albedo? Yes, I do. In Sketchup I have associated a dictionary with every surface, and in this dictionary I include all the surface properties: temperature, emissivity, etc. At the moment the temperatures are fixed (ambient temperature if not assigned), but I plan, in the future, to combine this with the results from a dynamic building thermal simulation that already calculates all the surface temperatures.
Map resolution. The resolution is chosen by the user and the mesh is generated by the algorithm. In terms of scale, I use this for masterplans. The scale goes from 100 m x 100 m up to 2000 m x 2000 m. I usually use a minimum resolution of 2 m. The limit is memory and simulation time. I also have the option to refine specific areas with a much finer mesh: for example, areas where there are restaurants or other amenities.
Framerate. I do not need to make an animation. Results are exported in a VTK file and visualized in Paraview and animated there just to show off during presentations :-)
Heat and light. Yes. Shortwave and longwave are handled separately. See point 4. The geolocalization is only used to select the correct weather file. I do not calculate all the radiation components. The weather files I need have measured data. They are not great, but good enough for now.
https://www.lucidchart.com/documents/view/5ca88b92-9a21-40a8-aa3a-0ff7a5968142/0
visible light
For a relatively flat, global base ground light map I would use projective shadow-texture techniques instead of ray-traced angular integration. It is way faster with almost the same result. This will not work on non-flat ground (many bigger bumps, which cast bigger shadows and also make the active light-absorption area anisotropic). Urban areas are usually flat enough (inclination does not matter), so the technique is as follows:
camera and viewport
The ground map is the target screen, so set the viewpoint underground, looking upwards towards the Sun direction. The resolution is at least your map resolution, and there is no perspective projection.
rendering light map 1st pass
First clear the map with the full radiation (direct + diffuse) (light blue), then render the buildings/objects with diffuse radiation only (shadow). This produces the base map, without reflections or soft shadows, in the magenta render target.
rendering light map 2nd pass
Now you need to add reflections from the building faces (walls). For that I would take every outdoor face of a building that is facing the Sun or heated enough, compute the reflection points onto the light map, and render the reflection directly to the map.
In this part you can add ray tracing for vertices only to make it more precise, and also to include multiple reflections (but in that case do not forget to add scattering).
project target screen to destination radiation map
Just project the magenta render-target image onto the ground plane (green). It is only a simple linear affine transform...
post processing
You can add soft shadows by blurring/smoothing the light map. To make it more precise, you can add info to each pixel on whether it is shadow or wall. Actual walls are just pixels at 0 m height above ground, so you can use Z-buffer values directly for this. The level of blurring depends on the scattering properties of the air, and of course pixels at 0 m ground height are not blurred at all.
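A hedged sketch of that post-processing step, assuming light_map and height_map are numpy float arrays of the same shape (pixels at 0 m height, i.e. walls, keep their hard values):

    import cv2
    import numpy as np

    def soften_shadows(light_map, height_map, ksize=9):
        blurred = cv2.GaussianBlur(light_map.astype(np.float32), (ksize, ksize), 0)
        hard = height_map <= 0.0            # wall pixels are not blurred at all
        blurred[hard] = light_map[hard]
        return blurred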
IR
This can be done in a similar way, but temperature behaves a bit differently, so I would make several layers of the scene at a few altitudes above ground, forming a volume render, and then post-process the energy transfers between pixels and layers. Also do not forget to add the cooling effect of green plants and water vaporisation.
I do not have enough experience in this field to make more suggestions; I am more used to temperature maps for very high temperature variances in specific conditions and materials, not outdoor conditions.
PS. I forgot: albedo for IR and visible light is very different for many materials, especially aluminium and some wall paints.

How does a game engine rotate models?

If I make a human model and import it into a game engine, does the game engine know all the point coordinates on the model and rotate each one? Models can consist of millions of points, so if I rotate a model by 90 degrees, does the game engine calculate the new location of millions of points? How does this work? Thanks
This is a bit of a vague question since each game engine will work differently, but in general the game engine will not touch the model coordinates.
Models are usually loaded with model space (or local space) coordinates - this simply means that each vertex is defined with a location relative to the origin of that model. The origin is defined as (0,0,0) and is the point around which rotations take place.
Now the game engine loads and keeps the model in this coordinate space. Then you provide your transformations (such as translation and rotation matrices) to place that model somewhere in your "world" (i.e. the global coordinate space shared by all objects). You also specify how you want to view this world with various other transforms, such as projection and view matrices.
The game engine then takes all of these transformations and passes them to the GPU (or software renderer, in some cases) - it will also set up other state such as textures, etc. These are usually set once per frame (or per object for a frame).
Finally, it then passes each vertex that needs to be processed to the renderer. Each vertex is then transformed by the renderer using all the transformations specified to get a final vertex position - first in world space and then in screen space - which it can use to render pixels based on various other information (such as textures and lighting).
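To the original question about millions of points: the engine stores one rotation matrix, and the renderer (normally the GPU's vertex shader) applies it to every model-space vertex. A hedged numpy sketch of that idea, not of any particular engine's internals:

    import numpy as np

    angle = np.radians(90.0)
    rot_y = np.array([                        # rotation of 90 degrees about the Y axis
        [ np.cos(angle), 0.0, np.sin(angle)],
        [ 0.0,           1.0, 0.0          ],
        [-np.sin(angle), 0.0, np.cos(angle)],
    ])

    vertices = np.random.rand(1_000_000, 3)   # model-space positions (placeholder data)
    world = vertices @ rot_y.T                # one matrix applied to every vertex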
So the point is, in most cases, the engine really has nothing to do with the rotation of the model/vertices. It is simply a way to manage the model and the various settings that apply to it.
Of course, the engine can rotate the model and modify its vertices, but this is usually only done during loading - for example if the model needs to be converted between different coordinate spaces.
There is a lot more going on, and this is a very basic description of what actually happens. There are many many sources that describe this process in great detail, so I won't even try to duplicate it. Hopefully this gives you enough detail to understand the basics.

Where to find information on 3D algorithms?

I am interested in learning about 3D video game development, but am not sure where to start really.
Instead of just making one, which could be done with various game makers, I am more interested in how it is done.
Ideally, I would like to know in which format general 3D models, etc. are stored (coordinate format and so on), and how the 3D data is represented on screen from a certain perspective, as in free-roaming 3D video games like Devil May Cry.
I have seen some links regarding 3D matrices but I really don't understand how they are used. Any help for beginners would be much appreciated.
Thanks
Video game development is a huge field requiring knowledge in game theory, computer science, math, physics and art. Depending on what you want to specialize in, there are different starting points. But as this is a site for programming questions, here are some insights on the programming part of it:
File formats
Assets (models, textures, sounds) are created using dedicated 3rd party tools (think of Gimp, Photoshop, Blender, 3ds Max, etc), which offer a wide range of different export formats. These formats usually have one thing in common: They are optimized for simple communication between applications.
Video games have high performance requirements and assets have to be loaded and unloaded all the time during gameplay. So the content has to be in a format that is compact and loads fast. Often 3rd party formats do not meet the specific requirements you have in your game project. For optimal performance you would want to consider developing your own format.
Examples of assets and common 3rd party formats:
Textures: PNG, JPG, BMP, TGA
3D models: OBJ, 3DS, COLLADA
Sounds: WAV, MP3
Additional examples
Textures in Direct3D
In my game project I use an importer that converts my textures from one of the aforementioned formats to DDS files. This is not a format I developed myself, still it is one of the fastest available for loading with Direct3D (Graphics API).
Static 3D models
The Wavefront OBJ file format is a very simple-to-understand, text-based format. Most 3D modelling applications support it. But since it is text based, the files are much larger than equivalent binary files, and they require lots of expensive parsing and processing. So I developed an importer that converts OBJ models to my custom high-performance binary format.
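Not the author's importer, just a hedged sketch of the idea: parse the text-based OBJ once offline (vertices and triangulated faces only) and dump the data as raw little-endian binary that can be read back in one go.

    import struct

    def obj_to_binary(src, dst):
        verts, faces = [], []
        with open(src) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                if parts[0] == "v":
                    verts.append(tuple(float(x) for x in parts[1:4]))
                elif parts[0] == "f":
                    # keep only the vertex index of each "v/vt/vn" token (OBJ is 1-based)
                    faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:4]))
        with open(dst, "wb") as out:
            out.write(struct.pack("<II", len(verts), len(faces)))
            for v in verts:
                out.write(struct.pack("<fff", *v))
            for face in faces:
                out.write(struct.pack("<III", *face))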
Wave sound files
WAV is a very common sound file format. Additionally, it is quite suitable for use in a game, so no custom format is necessary in this case.
3D graphics
Rendering a 3D scene at least 30 times per second at an average screen resolution requires quite a lot of calculations. For this purpose GPUs were built. While it is possible to write any kind of program for the GPU using very low-level languages, most developers use an abstraction like Direct3D or OpenGL. These APIs, while restricting the way of communicating with the GPU, greatly simplify graphics-related tasks.
Rendering using an API
I have only worked with Direct3D so far, but some of this should apply to OpenGL as well.
As I said, the GPU can be programmed. Direct3D and OpenGL both come with their own GPU programming language, a.k.a. Shading Language: HLSL (Direct3D) and GLSL. A program written in one of those languages is called a Shader.
Before rendering a 3D model the graphics device has to be prepared for rendering. This is done by binding the shaders and other effect states to the device. (All of this is done using the API.)
A 3D model is usually represented as a set of vertices. For example, 4 vertices for a rectangle, 8 for a cube, etc. These vertices consist of multiple components. The absolute minimum in this case would be a position component (3 floating-point numbers representing the X, Y and Z offsets in 3D space). Also, a position is just an infinitely small point, so additionally we need to define how the points are connected to form a surface.
When the vertices and triangles are defined, they can be written to the memory of the GPU. If everything is correctly set, we can issue a draw call through the API. The GPU then executes your shaders and processes all the input data. In the last step the rendered triangles are written to the defined output (the screen, for example).
Matrices in 3D graphics
As I said before, a 3D mesh consists of vertices with a position in 3D space. These positions are all embedded in a coordinate system called object space.
In order to place the object in the world, move, rotate or scale it, these positions have to be transformed. In other words, they have to be embedded in another coordinate system, which in this case would be called world space.
The simplest and most efficient way to do this transformation is matrix multiplication: From the translation, rotation and scaling amounts a 4x4 matrix is constructed. This matrix is then multiplied with each and every vertex. (The math behind it is quite interesting, but not in the scope of this question.)
Besides object and world space, there are also view space (the coordinate system of the 'camera'), clip space, screen space and tangent space (on the surface of an object). Vectors have to be transformed between those coordinate systems quite a lot. So you see why matrices are so important in 3D graphics.
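A hedged numpy sketch of that object-to-world step, composing translation, rotation and scale into one 4x4 matrix and applying it to a homogeneous position (column-vector convention, matrices applied right to left):

    import numpy as np

    def translation(tx, ty, tz):
        m = np.eye(4)
        m[:3, 3] = [tx, ty, tz]
        return m

    def scaling(s):
        m = np.eye(4)
        m[0, 0] = m[1, 1] = m[2, 2] = s
        return m

    def rotation_z(deg):
        a = np.radians(deg)
        m = np.eye(4)
        m[0, 0], m[0, 1] = np.cos(a), -np.sin(a)
        m[1, 0], m[1, 1] = np.sin(a),  np.cos(a)
        return m

    world = translation(5, 0, 0) @ rotation_z(90) @ scaling(2)
    p_object = np.array([1.0, 0.0, 0.0, 1.0])   # w = 1 marks a position
    p_world = world @ p_object                  # -> approximately [5, 2, 0, 1]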
How to continue from here
Find a topic that you think is interesting and start googling. I think I gave you quite a few keywords and I hope I gave you some idea of the topics you mentioned specifically.
There is also a Game Development Site in the StackExchange framework which might be better suited for this kind of questions. The top voted questions are always a good read on any SE site.
Basically the first decision is whether to use OpenGL or DirectX.
I suggest you use OpenGL because it is platform independent and can also be used on mobile devices.
For OpenGL here are some good tutorials to get you started:
http://www.opengl-tutorial.org/

What is an algorithm I can use to program an image compare routine to detect changes (like a person coming into the frame of a web cam)?

I have a web cam that takes a picture every N seconds. This gives me a collection of images of the same scene over time. I want to process that collection of images as they are created to identify events like someone entering into the frame, or something else large happening. I will be comparing images that are adjacent in time and fixed in space - the same scene at different moments of time.
I want a reasonably sophisticated approach. For example, naive approaches fail for outdoor applications. If you count the number of pixels that change, for example, or the percentage of the picture that has a different color or grayscale value, that will give false positive reports every time the sun goes behind a cloud or the wind shakes a tree.
I want to be able to positively detect a truck parking in the scene, for example, while ignoring lighting changes from sun/cloud transitions, etc.
I've done a number of searches, and found a few survey papers (Radke et al, for example) but nothing that actually gives algorithms that I can put into a program I can write.
Use color spectrum analysis without luminance: when the Sun goes down for a while, you will get a similar result; the colors do not change (too much).
Don't look for big changes, but for quick changes. If the luminance of the image changes by -10% over 10 minutes, it is the usual evening effect. But when the change is -5%, 0, +5% within seconds, it's a quick change.
Don't forget to adjust the reference values.
Split the image into smaller regions. Then, when all the regions change in the same way, you know it is a global change, like an eclipse or similar, but if only one region's parameters are changing, then something is happening there.
Use masks to create smart regions. If you're watching a street, filter out the sky, the trees (blown by wind), etc. You may set up different trigger values for different regions. The regions should overlap.
A special case of a region is a line. A line (a narrow region) contains fewer and more homogeneous pixels than a flat area. Mark, say, a green fence: it is easy to detect whether someone crosses it, since it makes a bigger change in the line than in a flat area.
If you can, change the real world. Repaint the fence in an unusual color to create a color spectrum that can be identified more easily. Paint tags on the floor and walls that can be OCRed by the program, so you can detect whether something hides them.
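A hedged OpenCV/numpy sketch of the region idea above: compare per-region mean intensity between two frames, and treat the case where almost every region moves together as a global lighting change rather than an event (grid size and thresholds are illustrative):

    import cv2
    import numpy as np

    def region_changes(prev_bgr, curr_bgr, grid=8, thresh=12.0):
        prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        h, w = prev.shape
        changed = []
        for gy in range(grid):
            for gx in range(grid):
                ys = slice(gy * h // grid, (gy + 1) * h // grid)
                xs = slice(gx * w // grid, (gx + 1) * w // grid)
                delta = abs(curr[ys, xs].mean() - prev[ys, xs].mean())
                if delta > thresh:
                    changed.append((gx, gy, delta))
        if len(changed) > 0.8 * grid * grid:
            return []          # nearly everything changed: global (lighting) change
        return changed         # local changes: something happened in these regions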
I believe you are looking for Template Matching.
I would also suggest you look into OpenCV.
We had to contend with many of these issues in our interactive installations. It's tough to not get false positives without being able to control some of your environment (sounds like you will have some degree of control). In the end we looked at combining some techniques and we created an open piece of software named OpenTSPS (Open Toolkit for Sensing People in Spaces - http://www.opentsps.com). You can look at the C++ source in github (https://github.com/labatrockwell/openTSPS/).
We use 'progressive background relearn' to adjust to the changing background over time. Progressive relearning is particularly useful in variable lighting conditions - e.g. if lighting in a space changes from day to night. This in combination with blob detection works pretty well, and the only way we have found to improve on it is to use 3D cameras like the Kinect, which cast out IR and measure it.
There are other algorithms that might be relevant, like SURF (http://achuwilson.wordpress.com/2011/08/05/object-detection-using-surf-in-opencv-part-1/ and http://en.wikipedia.org/wiki/SURF) but I don't think it will help in your situation unless you know exactly the type of thing you are looking for in the image.
Sounds like a fun project. Best of luck.
The problem you are trying to solve is very interesting indeed!
I think that you would need to attack it in parts:
As you already pointed out, a sudden change in illumination can be problematic. This is an indicator that you probably need to achieve some sort of illumination-invariant representation of the images you are trying to analyze.
There are plenty of techniques lying around, one I have found very useful for illumination invariance (applied to face recognition) is DoG filtering (Difference of Gaussians)
The idea is that you first convert the image to grayscale. Then you generate two blurred versions of this image by applying a Gaussian filter, one a little more blurry than the other (you could use a sigma of 1.0 and 2.0 in the Gaussian filter, respectively). Then you subtract the pixel intensities of the more-blurry image from the less-blurry image. This operation enhances edges and produces a similar image regardless of strong variations in illumination intensity. These steps can be very easily performed using OpenCV (as others have stated). This technique has been applied and documented here.
The paper adds an extra step involving contrast equalization. In my experience this is only needed if you want to obtain "visible" images from the DoG operation (pixel values tend to be very low after the DoG filter and are viewed as black rectangles onscreen), and performing a histogram equalization is an acceptable substitute if you want to be able to see the effect of the DoG filter.
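A hedged OpenCV sketch of that DoG step (sigma 1.0 and 2.0 as suggested, with histogram equalization only for display); the pipeline in the paper works on floating-point images, so the uint8 subtraction here is only an approximation:

    import cv2

    gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input
    blur_small = cv2.GaussianBlur(gray, (0, 0), 1.0)        # sigma = 1.0
    blur_large = cv2.GaussianBlur(gray, (0, 0), 2.0)        # sigma = 2.0
    dog = cv2.subtract(blur_small, blur_large)              # edges survive, illumination mostly cancels
    visible = cv2.equalizeHist(dog)                         # only needed to see the result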
Once you have illumination-invariant images you could focus on the detection part. If your problem can afford having a static camera that can be trained for a certain amount of time, then you could use a strategy similar to alarm motion detectors. Most of them work with an average thermal image - basically they record the average temperature of the "pixels" of a room view, and trigger an alarm when the heat signature varies greatly from one "frame" to the next. Here you wouldn't be working with temperatures, but with average, light-normalized pixel values. This would allow you to build up over time which areas of the image tend to have movement (e.g. the leaves of a tree in a windy environment), and which areas are fairly stable in the image. Then you could trigger an alarm when a large number of pixels already flagged as stable show a strong variation from one frame to the next.
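A hedged sketch of that "learned background" idea using OpenCV's running average; the source name, learning rate and thresholds are placeholders:

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("camera_feed.mp4")   # or 0 for a live camera
    ok, frame = cap.read()
    background = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)

    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        if cv2.countNonZero(mask) > 0.02 * mask.size:   # many "stable" pixels changed
            print("possible event")
        cv2.accumulateWeighted(gray, background, 0.05)  # slowly relearn the background
        ok, frame = cap.read()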
If you can't afford training your camera view, then I would suggest you take a look at the TLD tracker of Zdenek Kalal. His research is focused on object tracking with a single frame as training. You could probably use the semi-static view of the camera (with no foreign objects present) as a starting point for the tracker and flag a detection when the TLD tracker (a grid of points where local motion flow is estimated using the Lucas-Kanade algorithm) fails to track a large number of grid points from one frame to the next. This scenario would probably allow even a panning camera to work, as the algorithm is very resilient to motion disturbances.
Hope these pointers are of some help. Good luck and enjoy the journey! =D
Use one of the standard measures, like Mean Squared Error (MSE), to find the difference between two consecutive images. If the MSE is beyond a certain threshold, you know that there is some motion.
Also read about Motion Estimation.
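A minimal sketch of that MSE check, assuming frame_prev and frame_curr are two consecutive grayscale frames as numpy arrays and the threshold is found by experiment:

    import numpy as np

    def mse(a, b):
        return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

    motion_detected = mse(frame_prev, frame_curr) > 100.0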
If you know that the image will remain relatively static, I would recommend:
1) Look into neural networks. You can use them to learn what defines someone within the image, or what is not something of interest in the image.
2) Look into motion detection algorithms; they are used all over the place.
3) Is your camera capable of thermal imaging? If so, it may be worthwhile to look for hotspots in the images. There may be existing algorithms to turn your webcam into a thermal imager.
