Algorithm to monitor file changes

What's a good way to monitor and find the optimal times when specific files on remote sites change? I want to limit how often we have to download a file by finding the pattern of when the file is generally updated...
We download files (product feeds) with data ranging from 1Mb to 200Mb on a regular basis
Some of these files are updated every hour, some a few days a week, others once a month
The files aren't always updated at the exact same time, but there's generally a pattern within a certain period
We only want to download the files when we know they've changed
We want to download the files as soon as possible after they've changed
A simple way to solve this would be to check the files using an HTTP HEAD request every hour and trigger the download when we notice a change in Last-Modified or Content-Length. Unfortunately we can't rely on the HTTP headers, as they're generally missing or give no indication as to the actual time/size of the file. We often have to download the whole file just to determine if it's changed.
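For illustration, that header-check-with-download-fallback might be sketched as follows; this assumes LWP::UserAgent is available and uses a hypothetical in-memory %last_seen cache rather than whatever persistence the real system would need.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Digest::MD5 qw(md5_hex);

    # Hypothetical in-memory cache: URL => last Last-Modified value or body digest.
    my %last_seen;

    # Returns 1 if the feed appears to have changed since the previous check.
    sub feed_changed {
        my ($url) = @_;
        my $ua = LWP::UserAgent->new(timeout => 60);

        # Cheap path: trust Last-Modified when the server actually sends it.
        my $head = $ua->head($url);
        my $lm   = $head->is_success ? $head->header('Last-Modified') : undef;
        if (defined $lm) {
            my $changed = !defined $last_seen{$url} || $last_seen{$url} ne $lm;
            $last_seen{$url} = $lm;
            return $changed;
        }

        # Expensive path: download the file and compare a digest of the body.
        my $res = $ua->get($url);
        return 0 unless $res->is_success;
        my $digest  = md5_hex($res->content);
        my $changed = !defined $last_seen{$url} || $last_seen{$url} ne $digest;
        $last_seen{$url} = $digest;
        return $changed;
    }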
First I thought I could write a process that checks the file every 1, 2, 4, 8, ... hours (doubling for each iteration) until it found that the file had changed and then just stick with that number. This probably works, but it's not optimal.
To optimize it a bit I thought of tweaking the interval to find a sweet spot. Then all kinds of scenarios started appearing where my ideas would fail, such as weekends and public holidays when the files wouldn't be updated because people aren't at work. There is a pattern, but there are exceptions to it.
Next I started reading about "step detection" algorithms and soon realized I was way out of my depth. How do people solve these problems?
I'm guessing the solution will involve some form of data history, but I'm fumbling with how to optimize the algorithm that collects the data and how to derive the pattern from it. Hoping someone has dealt with this before.
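One way the "data history" idea might look in practice is to record the timestamps of observed changes, bucket them by hour of the week, and poll most aggressively in the hours where changes have historically clustered. A rough sketch; the thresholds and intervals are arbitrary placeholders:

    use strict;
    use warnings;
    use POSIX qw(strftime);

    # @change_times holds epoch timestamps of previously observed changes for one feed.
    # Returns a suggested check interval (in seconds) for the current hour of the week.
    sub suggested_interval {
        my (@change_times) = @_;

        my %bucket;    # "day-hour" => number of changes seen in that hour of the week
        for my $t (@change_times) {
            $bucket{ strftime('%u-%H', localtime $t) }++;   # e.g. "1-09" = Monday 09:xx
        }

        my $hits = $bucket{ strftime('%u-%H', localtime time) } // 0;

        # Placeholder policy: poll every 15 minutes in hours where changes cluster,
        # hourly where a change has been seen once, otherwise every 4 hours.
        return $hits >= 2 ? 15 * 60
             : $hits == 1 ? 60 * 60
             :              4 * 60 * 60;
    }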

Related

How duplicate file search is implemented in Gemini for Mac OS

I tried to search for duplicate files on my Mac via the command line.
The process took almost half an hour for 10 GB of data files, whereas the Gemini and CleanMyMac apps take far less time to find them.
So my question is: how is this speed achieved in these apps, what is the concept behind it, and in which language is the code written?
I tried googling for information but did not find anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files of equal size; then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate, but it is much quicker than classical hashes.
I contacted support, asking them what algorithm they use. Their response was that they compare parts of each file to each other, rather than hashing or comparing the whole file. As a result, they only have to check maybe 5% (or less) of each file when two files are reasonably similar in size, and still get a reasonably accurate result. Using this method, they don't have to pay the cost of comparing the whole file OR the cost of hashing files. They could be even more accurate if they used this method for the initial comparison and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
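Gemini's exact algorithm isn't public, but the idea described above (group files by size, then compare a few sampled chunks instead of hashing whole files) could be sketched roughly like this; the chunk size and sample offsets are made up for illustration:

    use strict;
    use warnings;

    # Compare a handful of fixed-position chunks of two same-sized files.
    # Far cheaper than hashing whole files, at the cost of possible false positives.
    sub probably_identical {
        my ($file_a, $file_b) = @_;

        my $size_a = -s $file_a;
        my $size_b = -s $file_b;
        return 0 unless defined $size_a && defined $size_b && $size_a == $size_b;

        open my $fa, '<:raw', $file_a or return 0;
        open my $fb, '<:raw', $file_b or return 0;

        my $chunk   = 4096;
        # Sample the start, the middle and the end of the files.
        my @offsets = (0, int($size_a / 2), $size_a > $chunk ? $size_a - $chunk : 0);

        for my $off (@offsets) {
            seek $fa, $off, 0;
            seek $fb, $off, 0;
            read $fa, my $buf_a, $chunk;
            read $fb, my $buf_b, $chunk;
            return 0 if $buf_a ne $buf_b;
        }
        return 1;   # every sampled chunk matched
    }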

Good algorithm for marking a media file as listened

I'm writing a media player and want to mark media files as listened in order to filter them.
However, I'm lacking a good idea for when to mark a song/video as listened/watched.
Movies tend to create the largest problem. You might not watch the credits in the last two minutes, and you might skip around.
I guess I could track the total number of played seconds, but this causes problems if the first half is watched twice for some reason. Keeping track of which parts of the movie have been played seems like a huge mess.
One of the best solutions I have come up with is to mark the movie/song as listened/watched if the user has played more than X seconds in the last 10% of the file. Then it would be reasonable to assume they have listened to most of it and/or watched what they wanted.
However, all the solutions above are bad, and I would really like some input.
What about another approach?
If the user doesn't hit the next/prev/random button or close the file within the first half of the media, then that file counts as listened/watched. You may need to track the time watched and take care of overlapping time (watching the first two minutes and then watching the first minute again doesn't mean the user watched three minutes of your file).
In my opinion, I'd work more with the skipping/closing behaviour rather than checking whether the user played more than X seconds in the last 10%: the user could simply jump to the last 10% and the media would be marked as viewed anyway.
However, my solution is not as accurate as it should be, and maybe a perfectly accurate one doesn't exist. Maybe you should also ask on the UX site.
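For the overlap bookkeeping mentioned above, merging the played intervals and measuring how much of the file they cover is fairly compact. A minimal sketch, assuming the player records (start, end) pairs in seconds as the user plays:

    use strict;
    use warnings;
    use List::Util qw(max);

    # @intervals: [start_sec, end_sec] spans the user actually played.
    # Returns the fraction of the file covered, counting overlapping spans only once.
    sub played_fraction {
        my ($duration, @intervals) = @_;
        return 0 unless $duration && @intervals;

        my @sorted = sort { $a->[0] <=> $b->[0] } @intervals;
        my @merged = ([ @{ $sorted[0] } ]);
        for my $iv (@sorted[ 1 .. $#sorted ]) {
            if ($iv->[0] <= $merged[-1][1]) {
                $merged[-1][1] = max($merged[-1][1], $iv->[1]);   # overlap: extend the span
            }
            else {
                push @merged, [ @$iv ];                           # gap: start a new span
            }
        }

        my $covered = 0;
        $covered += $_->[1] - $_->[0] for @merged;
        return $covered / $duration;
    }

    # e.g. mark as watched once played_fraction($length_sec, @spans) exceeds 0.8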

Webpage update detection algorithms

First of all, I'm not looking for code, just a plain discussion about approaches regarding what the subject says.
I was wondering lately what really is the best way to detect changes to website pages (as fast as possible). Assuming I have 100K websites, each with an unknown number of pages, does a crawler really need to visit each and every one of them once in a while?
Unless they have RSS feeds (which you would still need to poll to see if they have changed), there really isn't any way to find out when a site has changed except by going to it and checking. However, you can do some smart things to be more efficient. After you have been checking a site for a while, you can build a prediction model of when it tends to update. For example: this news site updates every 2-3 hours, but that blog only makes about one post a week. This can save you many checks, because the majority of pages don't actually update that often. Google does this to help with its polling. One simple algorithm that will work for this (depending on how cutting-edge you need your news to be) is the following, of my own design, loosely based on binary search:
    Start each site off with a time interval of ~1 day
    Visit the site when that time hits and check for changes
    if something has changed
        halve the time for that site
    else
        double the time for that site
    If after many iterations you find it hovering around 2-3 values
        fix the time at the greater of those values
Now, this is a simple algorithm for finding which check times are right, but you can probably do something more effective if you parse the text and spot patterns in the times when updates were actually posted.
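A minimal sketch of the halve/double scheme above; page_changed() is a hypothetical check implemented elsewhere, and the bounds are arbitrary:

    use strict;
    use warnings;

    # Given the current check interval (in hours) and whether the page changed
    # on this visit, return the interval to use before the next visit.
    sub next_interval {
        my ($interval_hours, $changed) = @_;

        my $min = 1;        # never check more often than hourly
        my $max = 24 * 7;   # never wait longer than a week

        $interval_hours = $changed ? $interval_hours / 2
                                   : $interval_hours * 2;

        $interval_hours = $min if $interval_hours < $min;
        $interval_hours = $max if $interval_hours > $max;
        return $interval_hours;
    }

    # Usage, with the hypothetical page_changed() check:
    # $site{interval} = next_interval($site{interval}, page_changed($url));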

Faster searching through files in Perl

I have a problem where my current algorithm uses a naive linear search algorithm to retrieve data from several data files through matching strings.
It is something like this (pseudo code):
    while count < total number of files
        open current file
        extract line from this file
        build an arrayofStrings from this line
        foreach string in arrayofStrings
            foreach file in arrayofDataReferenceFiles
                search in these files
        close file
        increment count
For a large real life job, a process can take about 6 hours to complete.
Basically I have a large set of strings that the program uses to search through the same set of data files (for example, 10 files in one run and maybe 3 in the next run). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking, since the search gets repetitive after a while, is it possible to prebuild an index of the locations of specific lines in the data reference files, without using any external Perl libraries, once the file array gets built (the files are known)? This script is going to be ported to a server that probably only has standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This will reduce IO over looping for each pattern.
On the other hand, if the matching is what's taking the time you're probably better off using a library which can simultaneously match lots of patterns.
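The "read each line once" suggestion might look like this: combine the search strings into a single alternation and scan every reference file in one pass (the data-structure names are made up):

    use strict;
    use warnings;

    # @search_strings and @reference_files stand in for the asker's data.
    # Combining the strings into one regex means each line is read and tested once.
    sub find_matches {
        my ($search_strings, $reference_files) = @_;

        my $pattern = join '|', map { quotemeta } @$search_strings;
        my $re      = qr/$pattern/;

        my %hits;   # file name => array ref of lines that matched any search string
        for my $file (@$reference_files) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            while (my $line = <$fh>) {
                push @{ $hits{$file} }, $line if $line =~ $re;
            }
            close $fh;
        }
        return \%hits;
    }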
You could probably replace this:
    foreach file in arrayofDataReferenceFiles
        search in these files
with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support:
dbmopen HASH,DBNAME,MASK
This binds a dbm(3), ndbm(3), sdbm(3), gdbm(3), or Berkeley DB file to a hash.
You'd normally access this stuff through tie, but that's not important; every Perl installation should have support for at least one hash-on-disk library without needing non-core packages installed.
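A rough sketch of that reverse index using only core modules (SDBM_File via tie); the word-splitting rule is a placeholder you would adapt to the actual data, and note that SDBM limits the size of each record:

    use strict;
    use warnings;
    use Fcntl;        # O_RDWR, O_CREAT
    use SDBM_File;    # core hash-on-disk module (records are size-limited)

    # Build an on-disk reverse index: word => comma-separated list of files.
    sub build_index {
        my ($index_name, @files) = @_;
        tie my %index, 'SDBM_File', $index_name, O_RDWR | O_CREAT, 0666
            or die "Cannot tie $index_name: $!";

        for my $file (@files) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            while (my $line = <$fh>) {
                for my $word (split /\W+/, lc $line) {   # crude tokenizer, adjust to taste
                    next unless length $word;
                    my $existing = $index{$word} // '';
                    next if $existing =~ /(?:^|,)\Q$file\E(?:,|$)/;   # already recorded
                    $index{$word} = $existing ? "$existing,$file" : $file;
                }
            }
            close $fh;
        }
        untie %index;
    }

    # Lookup later: tie the same index read-only and split $index{$word} on commas.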
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check a file, first look to see whether an index exists and, if not, build one before doing the search) would be an excellent approach: when a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then searching off it will be slower than a linear search by such a small margin as to be largely irrelevant.
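The on-demand variant is mostly a freshness check: rebuild the index only when the data file is newer than its index. A sketch, reusing a hypothetical build_index() like the one above:

    use strict;
    use warnings;

    # Rebuild the index for a data file only if the file changed since the index
    # was built. build_index() is the hypothetical helper sketched above; SDBM
    # stores its data in "$name.pag" and "$name.dir" files.
    sub ensure_index {
        my ($data_file) = @_;
        my $index_name = "$data_file.idx";

        my $data_mtime  = (stat $data_file)[9];
        my $index_mtime = (stat "$index_name.pag")[9];

        if (!defined $index_mtime || $index_mtime < $data_mtime) {
            build_index($index_name, $data_file);
        }
        return $index_name;
    }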

Synchronization algorithm

Basically I need to synchronize two folder/file structures where both folders and files get moved around and changed quite often. They both have a history of changes recorded, and deltas can be queried on request. I already have what I think is a reliable self-made sync algorithm, tuned on the go whenever a problem arises. I was wondering whether there is a mathematical background to this problem and perhaps some well-built theory and patterns I could reuse to improve my system.
Not sure I understand your question, but perhaps the longest common subsequence problem, which is the basis of diff programs: find out what the difference between the two states is (i.e. folders/files in your case) and encode the sequence of operations that translates state A into state B (which files need to be added, modified and removed for the two locations to have the same structure). This kind of solution works if one of the locations is the 'golden' copy (or 'master') and the other one is the 'slave': the slave has to reach the state of the master. When the situation is master-master (both sites accept writes), it is significantly more difficult to resolve, and you need some sort of automated conflict resolution.
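In the master/slave case, the "sequence of operations" often reduces to a set difference over relative paths, plus a content check for files present on both sides. A hedged sketch that compares two snapshots directly rather than using the recorded delta history:

    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5 qw(md5_hex);

    # Compare two directory trees and report what the slave needs to do to match
    # the master: which relative paths to add, update or remove.
    sub plan_sync {
        my ($master, $slave) = @_;
        my %m = snapshot($master);
        my %s = snapshot($slave);

        my (@add, @update, @remove);
        for my $path (keys %m) {
            if    (!exists $s{$path})      { push @add,    $path }
            elsif ($m{$path} ne $s{$path}) { push @update, $path }
        }
        for my $path (keys %s) {
            push @remove, $path unless exists $m{$path};
        }
        return { add => \@add, update => \@update, remove => \@remove };
    }

    # Walk one tree and return: relative path => digest of the file contents.
    sub snapshot {
        my ($root) = @_;
        my %snap;
        find(sub {
            return unless -f $_;
            open my $fh, '<:raw', $_ or return;
            my $data = do { local $/; <$fh> };               # slurp the whole file
            (my $rel = $File::Find::name) =~ s{^\Q$root\E/?}{};
            $snap{$rel} = md5_hex($data // '');
        }, $root);
        return %snap;
    }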
