How to analyze a large number of URI logs - algorithm

I have about 1 million URI logs of user activity on my network, and I want to know how many of those 1 million are for Facebook, how many are for Twitter, and so on.
It's easy to link URIs like cdn.xyz.twitter.com or platform.twitter.com to Twitter.
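For reference, the "easy" part of the matching is essentially collapsing each URI to its registered domain and counting per domain. A rough sketch of what I'm doing, using the tldextract package:

```python
# A rough sketch of the "easy" matching: collapse each logged URI to its
# registered domain and count hits per domain (uses the tldextract package).
from collections import Counter

import tldextract

def count_by_domain(uris):
    counts = Counter()
    for uri in uris:
        ext = tldextract.extract(uri)        # cdn.xyz.twitter.com -> (cdn.xyz, twitter, com)
        counts[ext.registered_domain] += 1   # counted under "twitter.com"
    return counts

# count_by_domain(["cdn.xyz.twitter.com", "platform.twitter.com", "xyz.1234.com"])
# -> Counter({'twitter.com': 2, '1234.com': 1})
```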
However, the problem I'm facing is that I can't link more than 40% of the captured URLs to real websites. A URL like xyz.1234.com can belong to Facebook, for example, but there is no visible link between that URL and the facebook.com domain, so it just gets listed as a stand-alone website, which is wrong (or not what I want).
Also, API calls can't easily be linked to their domains, because some websites may be using Amazon Web Services and that is what gets logged.
Many of the URIs are also generated by ad services, and I want to know where each ad came from (on what website or mobile application did the user click on the ad?).
Here are snapshots of the URIs so you can see the whole picture:
https://imgur.com/a/2Ocqi
https://imgur.com/a/bmhNv

So you're trying to match up outgoing requests? How do you expect to know that a user who accessed xyz.1234.com did it through Facebook rather than independently by typing the URL into the address bar? Or by clicking a link from some other page? Your log doesn't contain information that tells you which URLs are linked from which page. Without another source of information, you can't be sure.
You could examine the requests for multiple users and infer relationships. That is, if you notice that all (or a majority of) requests to xyz.1234.com occur after a Facebook request, you can infer that the request occurred as a result of a click on a Facebook page. Doing so will require some interesting pattern matching. How well it works will depend on how much data you have to work with, how well you write the pattern matching, and how much time you're willing to let the algorithm run.
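As a rough illustration of that inference, here is a minimal sketch. It assumes the log can be reduced to (user, timestamp, domain) records and attributes an unknown domain to the last known site the same user hit within a short window; the seed list of known sites and the window size are assumptions you would have to tune, not something the log gives you.

```python
# Sketch: attribute "unknown" domains to the last known site the same user
# visited within WINDOW_SECONDS. Both constants are assumptions to tune.
from collections import Counter, defaultdict

KNOWN_SITES = {"facebook.com", "twitter.com"}    # assumed seed list
WINDOW_SECONDS = 10                              # assumed attribution window

def attribute_hits(log):
    """log: iterable of (user, unix_timestamp, domain) tuples."""
    by_user = defaultdict(list)
    for user, ts, domain in log:
        by_user[user].append((ts, domain))

    attribution = Counter()
    for hits in by_user.values():
        hits.sort()                              # order each user's traffic by time
        last_known = None                        # (ts, domain) of the last known-site hit
        for ts, domain in hits:
            if domain in KNOWN_SITES:
                last_known = (ts, domain)
                attribution[domain] += 1
            elif last_known and ts - last_known[0] <= WINDOW_SECONDS:
                attribution[last_known[1]] += 1  # inferred: triggered by that site
            else:
                attribution[domain] += 1         # stand-alone, as far as we can tell
    return attribution
```

How useful the result is depends entirely on how representative the window and the seed list are, which is exactly the pattern-matching effort described above.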
There's no simple answer, though. If you don't have data that explicitly says, "this request was made by clicking on a link from Twitter," then you either have to get another source of information or write code that will infer it.

Related

Can I use recaptcha v3 to verify click traffic?

I have a website where people can interact with different objects to view specific content. I would like to know which objects get the most interactions by real people. For example, there are thumbnails of images, and I would like to know when a user clicks on a thumbnail to view an image.
To do this I thought I would create a psql table with thumbnail_id and an IP address, where every single view is stored (to ensure every combination of thumbnail and IP is only counted once and people can't just spam-click it).
So every time a click happens, a POST request to a /views endpoint with the thumbnail id attached is made in the background.
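Roughly what I have in mind (a sketch with psycopg2; the table name and column types are just my own guesses at this point):

```python
# Sketch of the view table and insert; the UNIQUE constraint makes repeated
# (thumbnail, IP) views a no-op, so spam clicks from one IP count only once.
import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS thumbnail_views (
    thumbnail_id INTEGER NOT NULL,
    ip_address   INET    NOT NULL,
    UNIQUE (thumbnail_id, ip_address)
);
"""

def record_view(conn, thumbnail_id, ip_address):
    with conn, conn.cursor() as cur:   # commits on success, rolls back on error
        cur.execute(
            "INSERT INTO thumbnail_views (thumbnail_id, ip_address) "
            "VALUES (%s, %s) ON CONFLICT DO NOTHING",
            (thumbnail_id, ip_address),
        )

# conn = psycopg2.connect("dbname=mysite")
# with conn, conn.cursor() as cur:
#     cur.execute(SCHEMA)
# record_view(conn, 42, "203.0.113.7")
```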
The problem is, some people may be incentivized to create bots to auto-click certain images from many different IPs.
So I was wondering if I could use reCAPTCHA v3 to tell real users apart from bots, by including a token with every view request.
But would this be too much for my backend to handle (since it would have to talk to Google's servers every time anybody views an image, which might be every few seconds for each user, and I would be billed while the server waits for a response), or would it be too expensive, since I have to pay Google on every request? Or is there some other obvious problem with this?
I'm asking since I have only ever found reCAPTCHA used for single form validation and never for traffic measurement, even though that seems like a pretty obvious use case.
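For concreteness, the round trip I'm imagining looks something like this. It's only a sketch: the /views endpoint, the JSON field name, and the score threshold are my own choices, while the siteverify call is Google's documented verification endpoint.

```python
# Sketch: verify the reCAPTCHA v3 token before counting a view.
import requests
from flask import Flask, abort, request

app = Flask(__name__)
RECAPTCHA_SECRET = "..."   # server-side secret key
SCORE_THRESHOLD = 0.5      # assumed cut-off for "probably human"

@app.route("/views", methods=["POST"])
def record_view():
    token = (request.get_json(silent=True) or {}).get("recaptcha_token", "")
    verdict = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET,
              "response": token,
              "remoteip": request.remote_addr},
        timeout=5,               # this round trip is the load I'm worried about
    ).json()
    if not verdict.get("success") or verdict.get("score", 0.0) < SCORE_THRESHOLD:
        abort(400)               # don't count views from likely bots
    # ... store (thumbnail_id, ip) here ...
    return "", 204
```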

How do browsers (Firefox more specifically) know which cookies are tracking cookies

I came across a situation where Firefox in incognito mode blocks some of the cookies on my site, more specifically Google Analytics cookies like _ga, _gid, etc. Searching the internet, I came across this article. So browsers like Firefox somehow identify these cookies as tracking cookies. But how? How does the browser know which cookies are tracking cookies and which are not? I need to know this because the next time I set cookies on my server I don't want them to be blocked by browsers.
In the context of the article, it just means blocking referral links. For instance, it blocks sending the referral information from, say, Facebook to other sites.
Other sites use the referral information to decide who to pay to get more traffic and stuff like that.
There's like 100 different versions of the idea of "tracking" though.
As the article points out, your ISP always knows every DNS lookup you do and every call to an IP, so they always know ALL your traffic and are "tracking" it.
There's also "ad tracking", where all those Google calls send out what the crawler says is on the page in order to create targeted ads and all that.
I think, based on what you wrote, you're just talking about tracking links, which is just scrubbing the referral part, though.
You'd have to be more specific if that's not what you're looking at.

Google Analytics event tracking dependent on source of visit

I am looking to test different traffic patterns within Google Analytics (direct traffic is abnormally high). I was curious if anyone knows how to create an event that fires when source = wildcard. To make this more difficult, this would be set up within Google Tag Manager using Universal Analytics.
I see the 6 event tags, but none of them sounds like it would do what I need.
Thanks
Google Tag Manager is not a tracking tool and knows nothing about the traffic source, so no preconfigured macro could be used in a rule to fire tags depending on source.
If you use "classic" asynchronous analytics you can set up a macro that reads the _utmz cookie and checks in a rule if it contains a source string ("direct", "cpc", etc.).
However, Universal Analytics determines the traffic source on the server and does not store it client-side, so with UA this would not work.
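For the classic-analytics case, the check that macro has to do is essentially the following (written here as a Python sketch of the parsing logic; in GTM itself it would be a custom JavaScript macro):

```python
# Sketch of what the _utmz-reading macro checks (classic/asynchronous GA only).
# A _utmz value looks like:
#   "<hash>.<timestamp>.<n>.<m>.utmcsr=google|utmccn=(organic)|utmcmd=organic"
def utmz_source_medium(utmz_value):
    campaign_part = utmz_value.split(".", 4)[-1]       # drop the 4 numeric fields
    fields = dict(p.split("=", 1) for p in campaign_part.split("|") if "=" in p)
    return fields.get("utmcsr"), fields.get("utmcmd")  # e.g. ("(direct)", "(none)")

# utmz_source_medium("123.1700000000.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)")
# -> ("(direct)", "(none)")
```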
A few traffic sources are easily recognizable on the respective landing page (see the sketch after this list):
If no referrer is present, it's a direct visit/bookmark.
If there are campaign (utm) parameters in the URL, you can use those.
If there is a gclid parameter in the URL, you know it's google/cpc.
If the referrer is a Google domain with a country TLD and the parameter "q" is present (it will be empty with encrypted search but should still be there), it's an organic Google search.
If the referrer is a Bing domain with the parameter "q" present, it's an organic Bing search (and similarly for other search engines).
However, this will only work on landing pages. You need to write your own cookie to store the source for subsequent pages.
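A sketch of those landing-page rules as a single classifier over the page URL and referrer (the function name and the returned labels are mine, and the host checks are deliberately crude):

```python
# Sketch of the landing-page rules above as one classifier over the page URL
# and document.referrer. Labels and host matching are intentionally simplistic.
from urllib.parse import parse_qs, urlparse

def classify_source(url, referrer):
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    ref = urlparse(referrer) if referrer else None
    ref_host = ref.netloc.lower() if ref else ""
    ref_params = parse_qs(ref.query, keep_blank_values=True) if ref else {}

    if "utm_source" in params:                        # campaign-tagged traffic
        return params["utm_source"][0], params.get("utm_medium", ["(none)"])[0]
    if "gclid" in params:                             # AdWords auto-tagging
        return "google", "cpc"
    if "google." in ref_host and "q" in ref_params:   # "q" may be empty but present
        return "google", "organic"
    if "bing." in ref_host and "q" in ref_params:
        return "bing", "organic"
    if not referrer:
        return "(direct)", "(none)"
    return ref_host, "referral"
```

Whatever this returns on the landing page is what you would then persist in your own cookie so the source survives onto subsequent pages.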
You can refine this approach to give rather similar results to Google Analytics but it will never match perfectly.
One of the most common reasons for abnormally high direct traffic is that no campaign parameters are present in paid traffic, either because you forgot to enable autotagging in your AdWords campaigns or because you have redirects that strip out campaign parameters (so paid traffic is lumped together with direct). The above approach would not help you discover this, so I suggest you check it manually before you do anything else.

User-Generated Content View Validation

I am developing a user-generated content site. The goal is that users are rewarded if their content is viewed by a certain number of people. Whereas a user account is required to post content, an account is not required to view content.
I am currently developing the algorithm to count the number of valid views, and I am concerned about the possibility that users create bots to falsely increase their number of views. I would exclude views from the content generator's IP, but I do not want to exclude valid views from other users with the same external IP address. The same external IP address could in fact account for a large number of valid views in a college campus or corporate setting.
The site is implemented in Python and hosted on Apache servers. The question is more theoretical in nature: how can I establish whether or not traffic from the same IP is legitimate? I can't find any content management systems that do this, and was just going to implement it myself.
You cannot reliably do this. Any method you create can be automated.
That said, you can raise the bar. For instance every page viewed can have a random number encoded into a piece of JavaScript that will submit an AJAX request. Any view where you have that corresponding AJAX request is probably a real browser, and is likely to be a real human since few bots handle JavaScript correctly. But absolutely nothing stops someone from having an automatic script to drive a real browser.
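A minimal sketch of that idea, assuming a Flask app; the endpoint names and the in-memory nonce store are made up for illustration:

```python
# Sketch of the "random number + AJAX callback" idea: a view only counts once
# the page's JavaScript echoes back the nonce that was embedded in the page.
import secrets
from flask import Flask, render_template_string, request

app = Flask(__name__)
pending = {}   # nonce -> content_id; use a real store (DB/Redis) in production

PAGE = """
<!-- ... the content itself ... -->
<script>
  fetch('/confirm', {method: 'POST',
                     headers: {'Content-Type': 'application/json'},
                     body: JSON.stringify({nonce: '{{ nonce }}'})});
</script>
"""

@app.route("/content/<int:content_id>")
def show(content_id):
    nonce = secrets.token_urlsafe(16)
    pending[nonce] = content_id            # remember which page issued this nonce
    return render_template_string(PAGE, nonce=nonce)

@app.route("/confirm", methods=["POST"])
def confirm():
    nonce = (request.get_json(silent=True) or {}).get("nonce", "")
    content_id = pending.pop(nonce, None)
    if content_id is not None:
        pass   # count a "probably a real browser" view for content_id here
    return "", 204
```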
Well... you can make them log in (through Facebook or Google ID etc., if you don't want to create your own infrastructure). This way it is much easier to track ratings.

How to fill out AJAX form programmatically and scrape results?

Basically, I want to use the Facebook Ads Manager Tool to estimate the number of users targeted by a particular set of targeting parameters. I know there is a published API available, but it is only usable if you are on their advertising application "whitelist." I am sure what I am asking is possible. Plus, it would be interesting to learn more about scraping.
Facebook's Ads Manager Tool is basically an AJAX UI for their ads API. In the process of creating a campaign, you can specify targeting parameters, and the page will dynamically report the number of users targeted as you modify the parameters. From what I've read on the web and here on Stack Overflow, it is possible to use Firebug or a similar tool to pick apart what requests are being made by the page and to where, and then mimic these calls to get the information you want.
I'm having trouble interpreting the panels of Firebug. I think the URI I'm trying to send a request to is www.facebook.com/ajax/inventory_estimator.php, though I'm not sure how to form a call.
So, if I want to write a script or program that takes a list of words to use as keywords and returns the estimated number of users for each keyword, how could I do it?
Link to Facebook's Ads Manager Tool, Campaign Creation Page:
http://www.facebook.com/ads/create
Yes, using an extension like Firebug to examine the HTTP requests is a good way to do this.
The Net tab is the one you want (last one).
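Once you can see the call in the Net tab, replaying it usually comes down to copying the URL, method, form fields, and cookies into something like Python's requests library. This is a sketch of the general shape only; every field, header, and cookie below is a placeholder you would have to replace with what Firebug actually shows for the captured request.

```python
# Sketch of replaying an AJAX call seen in Firebug's Net tab with requests.
# Everything in data/cookies/headers is a placeholder -- copy the real values
# (including your logged-in session cookies) from the captured request.
import requests

ENDPOINT = "https://www.facebook.com/ajax/inventory_estimator.php"

def estimate_reach(keyword, session_cookies):
    resp = requests.post(
        ENDPOINT,
        data={"keyword": keyword},              # placeholder form fields
        cookies=session_cookies,                # placeholder: your session cookies
        headers={"User-Agent": "Mozilla/5.0"},  # mimic a normal browser
        timeout=10,
    )
    return resp.text                            # parse the estimate out of this

# for kw in ["hiking", "photography"]:
#     print(kw, estimate_reach(kw, {"session": "..."}))
```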
Have you tried the iRobotSoft web scraper? It has good AJAX support.
Check their forum here: http://irobotsoft.org/bb/YaBB.pl

Resources