Hotlinking increases traffic/pageviews? - image

Several websites are hotlinking my images. I'm planning to block them, but if hotlinking helps traffic or page views in any way, even at the cost of bandwidth, I won't.
Are there any advantages to allowing hotlinking?

I don't think that links from other sites to images on your site increase your site's PageRank in Google by themselves, so unless images are the primary content of your site, I would try to block those requests.
If they are, I would reconsider implementing hotlink protection, or at least the way in which it is implemented, as you may in fact start blocking Google image search results in the process.
Also, it seems that hotlinking may actually be bad for you if you do not take other SEO techniques into consideration, as explained below:

Well, consider the following:
Is your server fast? If it is, just ignore the hotlinking. If it is not, you should block it, because your server has to process every request behind the bandwidth it sends out, which slows things down for the real users of your website.
Track traffic from hotlinking: install Google Analytics or another tracker to see whether people actually come to your website because of the hotlinked images, which I really doubt.
Do they link back to you from those hotlinked images? If not, block.
Overall, hotlinking isn't that good. In fact, it may cause too many requests to your images rather than to your webpages.
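If you do decide to block, the decision described in the answers above can be sketched as a Referer check. This is only a minimal sketch: the host names are placeholders, and the rule of letting empty referers and search engines through is an assumption made here to avoid blocking image search results and visitors whose browsers omit the header.

```python
from urllib.parse import urlparse

# Hypothetical host lists, for illustration only.
OWN_HOSTS = {"example.com", "www.example.com"}
ALLOWED_HOSTS = {"www.google.com", "images.google.com"}


def should_block(referer, own_hosts=OWN_HOSTS, allowed_hosts=ALLOWED_HOSTS):
    """Return True if an image request looks like a hotlink worth blocking."""
    if not referer:
        # Some browsers and proxies omit the Referer header; blocking these
        # requests would hurt legitimate visitors.
        return False
    host = (urlparse(referer).hostname or "").lower()
    if not host:
        # Unparseable referer: err on the side of serving the image.
        return False
    return host not in own_hosts and host not in allowed_hosts
```

In practice the same rule is usually written in the web server configuration rather than in application code, but the logic is the same.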

Adding a watermark to an image when it is hotlinked is a good option to increase your traffic.


Detect sexual content in image and text

I have started a social networking app, and there is one user who won't stop uploading images of women who, well, are up to some sexual activities. He also adds offensive captions to them.
My question: how can I detect adult content in images and text and block them from my app? I think this is a problem that most people face who are making any kind of open networking app. It would be great if the solution was as fast and low-priced as possible.
Implement a system which essentially stores {SHA-256 image hash, human rating, computer rating} in a database.
Create an interface for the human rating and the computer rating which can judge and categorize images as well as an interface in your software which can use that information on how to handle such images.
Choose a tool, likely a convolutional neural network based algorithm, with an easy to use api. Here's a random result from searching:
Put everything together and you should have a system which can automatically guess how to handle images, but which also allows you to iterate through them. Doing so both corrects the database and trains the rating algorithm, which learns from the existing human-produced data.
Note: The status the software assigns to an image is not permanent unless a human rates it. Whenever an image is accessed, the latest state of the image detection decides how to handle it. If this happens too frequently to support, associate a time buffer with the image so that it isn't re-rated often.
Update: The advantage of this custom solution is that you can control things to work the way you want. You can define the rating system and how to handle each situation, and you have governance over whatever set of trained algorithms you are using. You always have the final say and you can see what is going on at all times. The catch is that you would need to implement this software as an extension to your project.
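The storage part of this answer can be sketched with the standard library alone. This is a rough sketch, not a full moderation system: the table layout, the function names and the rating labels are made up for illustration, and the classifier that would produce the computer rating is left out entirely.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the ratings table keyed by the image's SHA-256 hash."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings ("
        "sha256 TEXT PRIMARY KEY, human_rating TEXT, computer_rating TEXT)"
    )
    return conn


def image_key(image_bytes):
    return hashlib.sha256(image_bytes).hexdigest()


def set_rating(conn, image_bytes, column, rating):
    """Store a rating in either the human_rating or computer_rating column."""
    if column not in ("human_rating", "computer_rating"):
        raise ValueError("unknown rating column")
    key = image_key(image_bytes)
    conn.execute("INSERT OR IGNORE INTO ratings (sha256) VALUES (?)", (key,))
    conn.execute(f"UPDATE ratings SET {column} = ? WHERE sha256 = ?", (rating, key))
    conn.commit()


def effective_rating(conn, image_bytes):
    """Return the human rating if one exists, else the computer rating, else None."""
    row = conn.execute(
        "SELECT human_rating, computer_rating FROM ratings WHERE sha256 = ?",
        (image_key(image_bytes),),
    ).fetchone()
    if row is None:
        return None
    human, computer = row
    return human if human is not None else computer
```

Because the human rating always wins in `effective_rating`, the computer rating can be overwritten as often as the algorithm is retrained without undoing a moderator's decision.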
Not easily. It would require machine learning techniques and a ton of training. Not to mention, all modern techniques can easily be tricked.
There are a few moderation solutions, but they aren't ideal.
First, you could ban them. Not the best, as they could make another account, but it means that they have to create another email address for it.
Second, you could isolate them. I forget exactly how it works, but the idea is that they still think that they are posting on your app, but none of their content gets propagated to other people.
I don't know the legality of either of these; it's all up to your terms and conditions. But AI is not really a good option, especially if your app were to need to scale.
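The isolation idea from the second option could look roughly like the filter below. It is a sketch under assumptions: posts are plain dictionaries with an "author" field, and the set of isolated users is maintained somewhere else by a moderator.

```python
def visible_posts(posts, viewer_id, isolated_users):
    """Hide posts by isolated authors from everyone except the author."""
    return [
        post
        for post in posts
        if post["author"] not in isolated_users or post["author"] == viewer_id
    ]
```

The isolated user still sees their own uploads in the feed, so nothing looks different from their side.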

Ideal Hostnames Number to Parallelize Downloads

We are setting up a CDN to serve CSS, JS and images. We are wondering what the ideal number of hostnames to distribute the resources across would be. This technique is used by many websites to parallelize downloads and speed up page loading. However, DNS lookups slow down page loading, so it is not the case that the more hostnames you have, the better the performance.
I've read somewhere that the ideal number is between 2 and 4. I wonder whether there is a rule of thumb that applies to all webpages, or one that depends on the number and size of the resources being served.
Specific case: Our websites are composed of two kinds of pages. One kind serves a list of thumbnails (15-20 images or so, varying) and the other serves a Flash or Shockwave application (mostly games) with far fewer images. Obviously, we have regular JS, images and CSS on all pages. By regular, I mean correctly optimized elements: 1 CSS file, a few UI images, 2 or 3 JS files...
I would love to have answers for our specific case, but I would also be very interested in general answers!
We (Yahoo!) did a study back in 2007 that showed that 2 is a good number and that anything more than that doesn't improve page performance; in some cases, having more than 2 domains degraded performance.
What I would recommend is this: if you have an A/B testing infrastructure, go ahead and try it out on your site. Use different numbers of domains and measure the page load time for your end users.
If you don't have an A/B testing framework, just try a different value for a few days, measure it, try a new one, measure that, and so on, until you find the point where performance starts to degrade.
There is no silver bullet here. The answer depends a lot on how many assets there are on each page and on which browsers (and hence how many parallel downloads) your end users use. Hope that helps.
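Whatever number you settle on, each asset should always be served from the same hostname; otherwise the same file ends up under several URLs and the browser cache stops helping. A minimal sketch of one way to do that, assuming hypothetical hostnames and a CRC32 of the path as the selector:

```python
import zlib

# Hypothetical CDN hostnames; two matches the number suggested above.
HOSTNAMES = ["static1.example.com", "static2.example.com"]


def asset_url(path, hostnames=HOSTNAMES):
    """Map an asset path to one hostname, always the same one for that path."""
    index = zlib.crc32(path.encode("utf-8")) % len(hostnames)
    return f"https://{hostnames[index]}/{path.lstrip('/')}"
```

Changing the length of `HOSTNAMES` is then the only thing that has to vary between the values you are measuring.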

Is This Idea for Loading Online Content in Bulk Feasible?

I devised an idea a long time ago and never got around to implementing it, and I would like to know whether it is practical in that it would work to significantly decrease loading times for modern browsers. It relies on the fact that related tasks are often done more quickly when they are done together in bulk, and that the browser could be downloading content on different pages using a statistical model instead of being idle while the user is browsing. I've pasted below an excerpt from what I originally wrote, which describes the idea.
When people visit websites, I
conjecture that a probability
density function P(q, t), where q is a
real-valued integer representing the
ID of a website and t is another
real-valued, non-negative integer
representing the time of the day, can
predict the sequence of webpages
visited by the typical human
accurately enough to warrant
requesting and loading the HTML
documents the user is going to read in
advance. For a given website, have the
document which appears to be the "main
page" of the website through which
users access the other sections be
represented by the root of a tree
structure. The probability that the
user will visit the root node of the
tree can be represented in two ways.
If the user wishes to allow a process
to automatically execute upon the
initialization of the operating system
to pre-fetch webpages from websites
(using a process elaborated later)
which the user frequently accesses
upon opening the web browser, the
probability function which determines
whether a given website will have its
webpages pre-fetched can be determined
using a self-adapting heuristic model
based on the user's history (or by
manual input). Otherwise, if no such
process is desired by the user, the
value of P for the root node is
irrelevant, since the pre-fetching
process is only used after the user
visits the main page of the website.
Children in the tree described earlier
are each associated with an individual
probability function P(q, t) (this
function can be a lookup table which
stores time-webpage pairs). Thus, the
sequences of webpages the user visits
over time are logged using this tree
structure. For instance, at 7:00 AM,
there may be a 71/80 chance that I
visit the "WTF" section on Reddit
after loading the main page of that
site. Based on the values of the
probability function P for each node
in the tree, chains of webpages
extending a certain depth from the
root node where the net probability
that each sequence is followed, P_c,
is past a certain threshold, P_min,
are requested upon the user visiting
the main page of the site. If the
downloading of one webpage finishes
before another is processed, a
thread pool is used so that another
core is assigned the task of parsing
the next webpage in the queue of
webpages to be parsed. Hopefully, in
this manner, a large portion of those
webpages the user clicks may be
displayed much more quickly than they
would be otherwise.
I left out many details and optimizations since I just wanted this to be a brief description of what I was thinking about. Thank you very much for taking the time to read this post; feel free to ask any further questions if you have them.
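To make the chain selection concrete, here is a rough sketch of the step where chains with P_c past P_min are chosen. It assumes P_c is the product of the per-node probabilities along the chain and that P is a lookup table keyed by (page, hour); neither detail is fixed by the excerpt, and all names below are made up for illustration.

```python
def chains_to_prefetch(tree, probs, root, hour, p_min, max_depth):
    """Return (chain, P_c) pairs for chains below root with P_c >= p_min.

    tree maps a page to its child pages; probs maps (page, hour) to the
    probability that the user follows the link to that page at that hour.
    """
    results = []

    def walk(node, chain, p_chain, depth):
        if depth == max_depth:
            return
        for child in tree.get(node, []):
            # Assumption: the chain probability is the product of its links.
            p = p_chain * probs.get((child, hour), 0.0)
            if p >= p_min:
                results.append((chain + [child], p))
                walk(child, chain + [child], p, depth + 1)

    walk(root, [], 1.0, 0)
    return results
```

With the 71/80 example from the excerpt, a threshold of 0.4 and a depth of 2, the "WTF" section is selected, and a page below it is selected too if the user follows it at least about 45% of the time.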
Interesting idea -- and there have been some implementations of pre-fetching in browsers, though without the brains you propose -- which could help a lot. I think there are some flaws with this plan:
a) web page browsing, in most cases, is fast enough for most purposes.
b) bandwidth is becoming metered -- if I'm just glancing at the home page, do I as a user want to pay to download the other pages? Moreover, in the cases where this sort of thing could be useful (e.g. a slow 3G connection), bandwidth tends to be more tightly metered, and the connection is perhaps not so good at concurrency (e.g. CDMA 3G connections).
c) from a server operator's point of view, I'd rather just serve requested pages in most cases. Rendering pages that never get seen costs me cycles and bandwidth. If you are like a lot of folks and on some cloud computing platform, you are paying by the cycle and the byte.
d) it would require rebuilding lots of analytics systems, many of which still operate on the theory that request == impression
Or, the short summary is that there really isn't a need to presage what people will view in order to speed up serving and rendering pages. Now, places where something like this could be really useful would be in the "hey, if you liked X you will probably like Y" scenario, popping up links to said content (or products) for folks.
Windows does the same thing with disk access - it "knows" that you are likely to start let's say Firefox at a certain time and preloads it.
SuperFetch also keeps track of what times of day those applications are used, which allows it to intelligently pre-load information that is expected to be used in the near future.
Pointing out existing tech that does a similar thing:
RSS readers load feeds in the background, on the assumption that the user will want to read them sooner or later. There's no probability function that selects feeds for download, though; the user explicitly selects them.
Browser start page and pinned tabs: these load as you start your browser; again, the user gets to select which websites are worth having around all the time.
Your proposal boils down to predicting where the user is most likely to click next, given the current website and the current time of day. I can think of a few other factors that play a role here:
what other websites are open in tabs ("user opened a song on YouTube, preload lyrics and guitar chords!")
what other applications are running ("user is looking at an invoice in the e-mail reader, preload the internet bank")
which person is using the computer: use the webcam to recognize faces and learn which sites each one frequents

What is a good algorithm for showing the most popular blog posts?

I'm planning on developing my own plugin for showing the most popular posts, as well as counting how many times a post has been read.
But I need a good algorithm for figuring out the most popular blog post, and a way of counting the number of times a post has been viewed.
A problem I see when counting the number of times a post has been read is avoiding repeat counts when the same person opens the same post many times in a row, as well as avoiding web crawlers.
Comes in the form of a plugin. No muss, no fuss.
'Live' counters are easily implementable and a dime a dozen. If they become too cumbersome on high traffic blogs, the usual way is to parse webserver access logs on another server periodically and update the database. The period can vary from a few minutes to a day, depending on how much lag you deem acceptable.
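Whether the hits come from a live counter or from parsed access logs, the counting itself could be sketched like this. It is only a sketch: the record layout, the 30-minute repeat window and the user-agent markers used to spot crawlers are all assumptions, and the marker list is far from exhaustive.

```python
from collections import Counter

BOT_MARKERS = ("bot", "crawler", "spider")  # Assumed heuristic, not exhaustive
REPEAT_WINDOW = 30 * 60  # Seconds; hypothetical 30-minute window


def count_views(hits, window=REPEAT_WINDOW):
    """Count views per post from (timestamp, visitor, post, user_agent) tuples."""
    counts = Counter()
    last_seen = {}  # (visitor, post) -> timestamp of that visitor's last hit
    for timestamp, visitor, post, user_agent in sorted(hits):
        if any(marker in user_agent.lower() for marker in BOT_MARKERS):
            continue  # Looks like a crawler: ignore the hit entirely
        key = (visitor, post)
        previous = last_seen.get(key)
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous < window:
            continue  # Repeat view by the same visitor: not counted again
        counts[post] += 1
    return counts
```

`count_views(hits).most_common(5)` then gives the five most viewed posts for whatever period the hits cover.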
There are two ways of going about this:
You could consider the individual page hits [through the Apache/IIS logs] and use those
Use Google PageRank to emphasize pages that are strongly linked to [popular posts would then be based not on visits but on the number of pages that link to them]

What is the general consensus on preemptively loading images on web pages?

I've been wondering about this for a while. I'm currently building a website which is very image oriented. What do people think of preloading images? How do they do it? (JavaScript versus display:none CSS?)
As users, what do you think of it? Does the speed gained while using the website justify the extra time you have to spend waiting for it to load?
From a programmer's standpoint, what is better practice?
If you really have to preload (e.g. for rollover images), definitely use CSS. JavaScript can be blocked, and you can't rely on it.
Rather than pre-loading multiple images, I recommend you use CSS Sprites:
This is a technique where you consolidate multiple images into a single image (and use background-position to select the correct portion) to reduce the number of HTTP requests made to the web server and reduce the overall page load time.
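As a rough illustration of the background-position part, the helper below generates one CSS rule per icon for a sheet in which the icons are stacked top to bottom. The function, class names and file name are made up for the example; real sprite tools also handle padding and horizontal layouts.

```python
def sprite_css(sheet_url, icons):
    """Build CSS rules for icons stacked top to bottom in one sprite sheet.

    icons is a list of (class_name, width, height) tuples in sheet order.
    """
    rules = []
    offset_y = 0
    for class_name, width, height in icons:
        # A negative y offset shifts the sheet up so this icon shows through.
        position = "0 0" if offset_y == 0 else f"0 -{offset_y}px"
        rules.append(
            f".{class_name} {{ background: url({sheet_url}) no-repeat {position}; "
            f"width: {width}px; height: {height}px; }}"
        )
        offset_y += height
    return "\n".join(rules)
```

Calling `sprite_css("icons.png", [("icon-home", 16, 16), ("icon-mail", 16, 16)])` gives two rules that share a single image request, with the second one offset by 16 pixels.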
