User-Generated Content View Validation - algorithm

I am developing a user-generated content site. The goal is that users are rewarded if their content is viewed by a certain number of people. Whereas a user account is required to post content, an account is not required to view content.
I am currently developing the algorithm to count the number of valid views, and I am concerned that users might create bots to falsely inflate their view counts. I would exclude views from the content creator's own IP, but I do not want to exclude valid views from other users with the same external IP address. The same external IP address could in fact account for a large number of valid views in a college campus or corporate setting.
The site is implemented in Python and hosted on Apache servers. The question is more theoretical in nature: how can I establish whether or not traffic from the same IP is legitimate? I can't find any content management system that does this, so I was just going to implement it myself.

You cannot reliably do this. Any method you create can be automated.
That said, you can raise the bar. For instance, every page viewed can have a random number encoded into a piece of JavaScript that submits an AJAX request back to your server. Any view with a corresponding AJAX request probably came from a real browser, and is likely to be a real human, since few bots handle JavaScript correctly. But absolutely nothing stops someone from writing a script that drives a real browser.
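A minimal sketch of the server side of that idea, assuming Flask and in-memory dictionaries for storage (the framework, route names, and storage are illustrative assumptions, not part of the original setup):

```python
# Issue a one-time token with each page render; only count the view when the
# token comes back via the JavaScript beacon. Minimal sketch, not hardened.
import secrets
from flask import Flask, request, abort

app = Flask(__name__)
pending_tokens = {}   # token -> content_id (use a real datastore in practice)
view_counts = {}      # content_id -> confirmed view count

@app.route("/content/<int:content_id>")
def show_content(content_id):
    token = secrets.token_urlsafe(16)
    pending_tokens[token] = content_id
    # The real page template would embed this snippet alongside the content;
    # it POSTs the token back once the page has actually rendered.
    return f'<script>fetch("/confirm-view", {{method: "POST", body: "{token}"}});</script>'

@app.route("/confirm-view", methods=["POST"])
def confirm_view():
    token = request.get_data(as_text=True)
    content_id = pending_tokens.pop(token, None)
    if content_id is None:
        abort(400)  # unknown or already-used token: ignore
    view_counts[content_id] = view_counts.get(content_id, 0) + 1
    return "", 204
```

A bot that only fetches pages never hits /confirm-view, so its views are never counted; as noted above, a script driving a real browser still defeats this.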

Well... you can make them log in (through Facebook or Google ID, etc., if you don't want to create your own infrastructure). This way it is much easier to track views.

Related

What do I need to make a website that references a table of anonymous users to notify using SMS?

This is a project I'm working on for use between people at my university.
The idea is simple, it's a website where people can submit anonymous comments to other people based on a unique identifier, which is just a random number. People sign up with their unique identifier and their phone number, which would be saved together. Other people hop on the website and submit a comment with the unique identifier, which is sent via SMS to the corresponding phone number.
Conceptually I feel like this should be easy: the website just searches a table for the identifier and then uses an SMS API to send a message to the associated phone number. It also dynamically adds new rows to the table as people register.
I am really new to web development (if you couldn't tell), but I'm not afraid of a little code, so I'm figuring it out. My problem is that I have no idea, big-picture-wise, what building blocks I need to connect together. I think I found a good service called Twilio for the SMS API. I think I need to pay for web hosting, but do I need to rent server time? It's a really simple operation, but the data also needs somewhere to live. I want it to be a long-term installation, so I don't want to host it myself.
I would be very grateful if someone could quickly put together a shopping list of the components I need to make this happen, or share any other tips you have.
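To make the lookup-and-send flow described above concrete, here is a rough sketch assuming Flask, SQLite, and the Twilio Python client (the table name, routes, and phone numbers are placeholders):

```python
# Sketch: register (identifier, phone) pairs and relay comments by SMS.
# Assumes a table created with: CREATE TABLE users (identifier TEXT, phone TEXT)
import sqlite3
from flask import Flask, request
from twilio.rest import Client

app = Flask(__name__)
twilio = Client("ACCOUNT_SID", "AUTH_TOKEN")   # credentials from the Twilio console

def db():
    return sqlite3.connect("users.db")

@app.route("/register", methods=["POST"])
def register():
    # Save the (identifier, phone number) pair; one new row per signup.
    with db() as conn:
        conn.execute("INSERT INTO users (identifier, phone) VALUES (?, ?)",
                     (request.form["identifier"], request.form["phone"]))
    return "registered", 201

@app.route("/comment", methods=["POST"])
def comment():
    # Look up the phone number for the identifier and relay the comment by SMS.
    with db() as conn:
        row = conn.execute("SELECT phone FROM users WHERE identifier = ?",
                           (request.form["identifier"],)).fetchone()
    if row is None:
        return "unknown identifier", 404
    twilio.messages.create(to=row[0], from_="+15550001111",
                           body=request.form["comment"])
    return "sent", 200
```

With something along these lines, the shopping list is essentially: a domain name, a small host that can run a Python app and persist its database, and a Twilio account with a sending number.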

Can I use reCAPTCHA v3 to verify click traffic?

I have a website where people can interact with different objects to view specific content. I would like to know which objects get the most interactions by real people. For example there are thumbnails of images and I would like to know when a user clicks on a thumbnail to view an image.
To do this, I thought I would create a psql table with a thumbnail_id and an IP address, where every single view is stored (to ensure every combination of thumbnail and IP is only counted once and people can't just spam-click it).
So every time a click happens, a POST request to a /views endpoint with the thumbnail ID attached is made in the background.
The problem is, some people may be incentivized to create bots to auto-click certain images from many different IPs.
So I was wondering if I could use reCAPTCHA v3 to identify real users as opposed to bots, by including a token with every view request.
But would this be too much for my backend to handle (since it would have to talk to Google's servers every time anybody views an image, which might be every few seconds per user, and I would be billed while the server waits for a response), or would it be too expensive, since I have to pay Google on every request? Or is there some other obvious problem with this?
I'm asking since I have only ever found reCAPTCHA used for single-form validation and never for traffic measurement, even though that seems like a pretty obvious use case.
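For reference, the verification step being considered would look roughly like this on the server. The siteverify endpoint and the v3 score field are part of reCAPTCHA itself; the /views route, threshold, and secret below are placeholders:

```python
# Verify the reCAPTCHA v3 token attached to each view request before counting it.
import requests
from flask import Flask, request, abort

app = Flask(__name__)
RECAPTCHA_SECRET = "your-secret-key"

@app.route("/views", methods=["POST"])
def record_view():
    token = request.form.get("g-recaptcha-response", "")
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token,
              "remoteip": request.remote_addr},
        timeout=5,
    ).json()
    # v3 returns a score between 0.0 (likely bot) and 1.0 (likely human).
    if not resp.get("success") or resp.get("score", 0) < 0.5:
        abort(400)
    # ... insert (thumbnail_id, ip) into the psql table here ...
    return "", 204
```

Whether the extra round trip per view is acceptable is exactly the open question; one possible mitigation is to verify only a sample of view requests, at the cost of noisier counts.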

How to analyze a large number of URI logs

I have about 1 million URI log entries of user activity on my network. I want to know how many of those 1 million are for Facebook, how many are for Twitter, and so on.
It's easy to link URIs like cdn.xyz.twitter.com or platform.twitter.com to Twitter.
However, the problem I'm facing is that I'm not able to link more than 40% of the captured URLs to real websites. A URL like xyz.1234.com can be something inside Facebook, for example, but there is no link between that URL and the facebook.com domain, so it will just be listed as a stand-alone website, which is wrong (or not what I want).
Also, API calls won't be easily linked to their domains either, because some websites may be using Amazon Web Services, and that's what gets logged.
And many of the URIs are generated by ad services; I want to know where the ad came from (on what website or mobile application did the user click the ad?).
Snapshots of the URIs, so you can see the whole picture:
https://imgur.com/a/2Ocqi
https://imgur.com/a/bmhNv
So you're trying to match up outgoing requests? How do you expect to know that a user who accessed xyz.1234.com did it through Facebook rather than independently by typing the URL into the address bar? Or by clicking a link from some other page? Your log doesn't contain information that tells you which URLs are linked from which page. Without another source of information, you can't be sure.
You could examine the requests for multiple users and infer relationships. That is, if you notice that all (or a majority of) requests to xyz.1234.com occur after a Facebook request, you can infer that the request occurred as a result of a click on a Facebook page. Doing so will require some interesting pattern matching. How well it works will depend on how much data you have to work with, how well you write the pattern matching, and how much time you're willing to let the algorithm run.
There's no simple answer, though. If you don't have data that explicitly says, "this request was made by clicking on a link from Twitter," then you have to either get another source of information or you have to write code that will infer that information.
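A rough sketch of that inference idea, assuming the logs can be parsed into (timestamp, client IP, host) tuples (the field layout, window size, and known-domain list are illustrative assumptions):

```python
# For each unknown host, count how often a request to it closely follows a
# request to a known domain from the same client, then attribute it to the
# known site it most often follows.
from collections import defaultdict
from datetime import timedelta

KNOWN = {"facebook.com": "Facebook", "twitter.com": "Twitter"}
WINDOW = timedelta(seconds=10)

def matches(host, domain):
    return host == domain or host.endswith("." + domain)

def classify(log_entries):
    """log_entries: iterable of (timestamp, client_ip, host), sorted by timestamp."""
    last_known = {}                                  # client_ip -> (timestamp, site)
    votes = defaultdict(lambda: defaultdict(int))    # unknown host -> site -> count
    for ts, ip, host in log_entries:
        site = next((name for dom, name in KNOWN.items() if matches(host, dom)), None)
        if site:
            last_known[ip] = (ts, site)
            continue
        prev = last_known.get(ip)
        if prev and ts - prev[0] <= WINDOW:
            votes[host][prev[1]] += 1
    return {host: max(counts, key=counts.get) for host, counts in votes.items()}
```

This only produces a statistical guess, as the answer above notes, and it will misattribute hosts that users also visit directly.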

Mixpanel API web segmentation and personalisation

Hi, I am interested in using Mixpanel on a web site to track customer events. I would like to know if there is any way to use the API to personalise the web site per customer, similar to segmentation for emails.
I would like to query the API for a single customer, asking whether they have achieved several events.
For example, something like:
If the customer has clicked out and their last visit was more than a month ago, display a banner advert.
Mixpanel does not seem like the right tool for the job you describe here.
While theoretically this might be possible (via Mixpanel's HTTP API), this will create unnecessary architectural complexity and add extra latency. If you need to customize your web site per user, store any user state in a database like MySQL or PostgreSQL. This will be both faster and easier.
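A minimal sketch of that database approach, assuming PostgreSQL via psycopg2 (the customer_state table and its columns are made up for illustration):

```python
# Keep per-customer state in your own database and branch on it at render time,
# rather than querying Mixpanel on every page view.
from datetime import datetime, timedelta
import psycopg2

conn = psycopg2.connect("dbname=app user=app")

def should_show_banner(customer_id):
    """Show the banner if the customer has clicked out and last visited over a month ago."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT has_clicked_out, last_visit FROM customer_state WHERE customer_id = %s",
            (customer_id,),
        )
        row = cur.fetchone()
    if row is None:
        return False
    has_clicked_out, last_visit = row
    return has_clicked_out and last_visit < datetime.utcnow() - timedelta(days=30)
```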

Browser Overlay for Data Entry: Client or Server Side?

I am developing a Django app that functions basically as a data entry tool for websites. The use case has a trusted user or paid technician browsing the web. As they browse, they enter data into an overlaid bar, similar to what you see on many proxy websites, but containing a form that allows the user to write metadata about the website (in this case, training classification data for an ML algorithm) and submit it to my app.
See http://hidemyass.com/proxy/ for an example of a proxy website that inserts an overlay into browsed sites.
I have heard conflicting suggestions on how to approach this.
Serve Websites as Proxy
Pipe all URL requests through the Django app with something like http://httpproxy.yvandermeer.net/, and rewrite the responses to include the overlay header.
Pros
I can process the responses with sexy scientific libraries like the NLTK
AJAX-free failover. Users can submit human data (albeit with more of a hassle) without the need to submit computed data.
Cons
Greatly increased traffic. Now my webapp has to retrieve all websites and upload them to the user.
Some websites might block proxy requests. My intention is to deploy this on Heroku, but they might frown on an app that generates so many requests.
User Browses in an iFrame
The overlay is separated from the content by an iframe, and I use JavaScript to inform the overlay of the page that is currently being browsed.
Pros
Distributed Computing. User machines are used to make requests and do any necessary computations. The server is no longer a bottleneck.
Tighter Ajax integration. I can just post a JSON object representative of my entire Model.
Cons
iframes weren't really designed for full-scale browsing. Some websites force themselves out of iframes, and I worry that it won't be a reliable method of browsing.
I don't get to use all those sexy Python libraries. My language processing will have to be done in JavaScript.
Question
I've never done anything like this before. I'm pretty new to all the tools involved, and seriously having trouble choosing between the two very different approaches.
Which method would you suggest? Why? Are there any considerations I have missed?
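For reference, the proxy-rewrite option boils down to something like the naive sketch below, assuming the requests library and plain string injection (a real version would also have to rewrite relative links, assets, and cookies):

```python
# Django view: fetch the requested page server-side and inject the overlay bar.
import requests
from django.http import HttpResponse

OVERLAY = ('<div id="annotation-bar"><form action="/submit" method="post">'
           '<input name="label"><button>Save</button></form></div>')

def proxy(request):
    target = request.GET.get("url", "")
    if not target:
        return HttpResponse("missing ?url=", status=400)
    upstream = requests.get(target, timeout=10)
    html = upstream.text
    # Naive injection: assumes a bare <body> tag; real pages need smarter rewriting.
    html = html.replace("<body>", "<body>" + OVERLAY, 1)
    return HttpResponse(html, content_type=upstream.headers.get("Content-Type", "text/html"))
```

The iframe option avoids all of this server-side work, but trades it for the frame-busting and JavaScript-only processing concerns listed above.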
OKFN's Annotator provides, IMHO, a good basis for what you are trying to accomplish: http://okfn.github.com/annotator/
