How do web analytics tools work? [closed] - web-analytics

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form.
Closed 10 years ago.
I am in the process of gathering information about web analytics tools (like Google Analytics) for my next assignment, but I am not able to find any good information.
I am looking for:
Key terms used.
What mediums are available for data collection and how they work.
Any reference books, white papers, etc. (both technical and non-technical).
Any open source implementation (especially in .NET).

Here are the key terms used:
Hit (internet)
Page view
Visit / Session
First Visit / First Session
Visitor / Unique Visitor / Unique User
Repeat Visitor
New Visitor
Impression
Singletons
Bounce Rate
% Exit
Visibility time
Session Duration
Page View Duration / Time on Page
Page Depth / Page Views per Session
Frequency / Session per Unique
Click path
Methods used:
Web server logfile analysis
Page tagging
Web server logfile analysis
In this method you write a script to parse details out of your log files and then write them to your database. This method will not give you real-time statistics. You can read more about web log analysis software here.
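For a concrete idea of what such a script does, here is a minimal sketch in C# (my own illustration, not any particular tool) that parses one line of a W3C/IIS-style log. It assumes a "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)" directive; real logs vary, so read the #Fields header and map the indexes accordingly.

using System;

// Rough sketch: parse one line of a W3C/IIS-style log file.
public sealed class LogHit
{
    public DateTime Timestamp;
    public string ClientIp;
    public string Method;
    public string Path;
    public int StatusCode;
    public string UserAgent;
}

public static class LogParser
{
    public static LogHit ParseLine(string line)
    {
        if (line.StartsWith("#")) return null;   // skip header/comment lines
        string[] f = line.Split(' ');
        if (f.Length < 7) return null;           // malformed line

        return new LogHit
        {
            Timestamp  = DateTime.Parse(f[0] + " " + f[1]),
            ClientIp   = f[2],
            Method     = f[3],
            Path       = f[4],
            StatusCode = int.Parse(f[5]),
            UserAgent  = f[6]
        };
    }
}

Each parsed hit would then be written to the database and rolled up into the metrics listed above (page views, visits, and so on).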
Page tagging
Add a snippet of JavaScript, or just an image, to the page, and then use it to capture details about the page, referrer, visitor, etc.
...these were images included in a web page that showed the number of times the image had been requested, which was an estimate of the number of visits to that page. In the late 1990s this concept evolved to include a small invisible image instead of a visible one, and, by using JavaScript, to pass along with the image request certain information about the page and the visitor. This information can then be processed remotely by a web analytics company, and extensive statistics generated...
If you are adding analytics to your own website, you can use the code provided by Eytan Levit in the next answer.
Credit: Wikipedia. More information can be found there.
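On the server side, the "invisible image" is simply an endpoint that records the query string and returns a 1x1 GIF. Below is a minimal sketch of such a collector as an ASP.NET IHttpHandler; the handler name, the "page" and "ref" parameters, and the SaveHit call are placeholders of my own, not any particular product's API.

using System;
using System.Web;

// Sketch of a tracking-pixel collector. The page embeds something like
//   <img src="/t.gif?page=/home&ref=..." width="1" height="1" />
// (usually written by a small JavaScript snippet so it can append the referrer, screen size, etc.)
public class TrackingPixelHandler : IHttpHandler
{
    // A transparent 1x1 GIF, so the tag renders as nothing visible.
    private static readonly byte[] OnePixelGif = Convert.FromBase64String(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7");

    public void ProcessRequest(HttpContext context)
    {
        // Record whatever the tag passed along, plus what the request itself tells us.
        SaveHit(context.Request.QueryString["page"],
                context.Request.QueryString["ref"],
                context.Request.UserHostAddress,
                context.Request.UserAgent);

        context.Response.ContentType = "image/gif";
        context.Response.Cache.SetCacheability(HttpCacheability.NoCache);
        context.Response.BinaryWrite(OnePixelGif);
    }

    public bool IsReusable { get { return true; } }

    private static void SaveHit(string page, string referrer, string ip, string userAgent)
    {
        // Placeholder: write to your database or a queue here.
    }
}

The JavaScript part of the tag just builds that image URL and appends whatever extra details it wants to report.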

Well, I'm no expert, but here is some common data you can retrieve to build your own analytics:
// Note: Request.UrlReferrer can be null (e.g. on a direct visit), so guard it in real code.
string str = "";
str += "Referrer:" + Request.UrlReferrer.AbsolutePath + "<br>";
str += "Form data:" + Request.Form.ToString() + "<br>";
str += "User Agent:" + Request.ServerVariables["HTTP_USER_AGENT"] + "<br>";
str += "IP Address:" + Request.UserHostAddress + "<br>";
str += "Browser:" + Request.Browser.Browser + " Version: " + Request.Browser.Version + " Platform: " + Request.Browser.Platform + "<br>";
str += "Is Crawler: " + Request.Browser.Crawler.ToString() + "<br>";
str += "QueryString: " + Request.QueryString.ToString() + "<br>";
You can also parse the search keyword that brought the user to your website, like this:
protected string GetKeywordFromReferrer(string url)
{
    if (url.Trim() == "")
    {
        return "no url";
    }

    string urlEscaped = Uri.UnescapeDataString(url).Replace('+', ' ');
    string terms = "";
    string site = "";

    // Most search engines put the query in a "q" or "p" parameter.
    Match searchQuery = Regex.Match(urlEscaped, @"[\&\?][qp]\=([^\&]*)");
    if (searchQuery.Success)
    {
        terms = searchQuery.Groups[1].Value;
    }
    else
    {
        // No search query found - fall back to the referring domain.
        Match siteDomain = Regex.Match(urlEscaped, @"http\:\/\/(.+?)\/");
        if (siteDomain.Success)
        {
            site = siteDomain.Groups[1].Value;
        }
    }

    if (terms != "")
    {
        return terms;
    }
    if (site != "")
    {
        return site;
    }
    return "Direct Access";
}
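A hypothetical usage example, with a null check since direct visits have no referrer:

// Hypothetical usage inside a page or handler:
string referrerUrl = Request.UrlReferrer != null ? Request.UrlReferrer.ToString() : "";
string keywordOrSite = GetKeywordFromReferrer(referrerUrl);
// Returns e.g. "web analytics" for a search referral, "somesite.com" for a plain link,
// "no url" when the referrer is empty, or "Direct Access" when nothing matches.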
Hope this has helped a bit.

1. Key terms used
As with answer 1
2. What mediums are available for data collection and how they work
Log files from Apache or IIS; HTTP handlers or modules in ASP.NET (or your actual page); and JavaScript includes (the objects available to JavaScript give you most of the information you need about the client).
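As a rough illustration of the handler/module option (my own sketch with placeholder names, not a specific library), an ASP.NET IHttpModule can record every request as it completes:

using System;
using System.Web;

// Sketch: an IHttpModule that logs every request the site serves.
// Register it in web.config under <system.webServer><modules> (or <httpModules> for the classic pipeline).
public class AnalyticsModule : IHttpModule
{
    public void Init(HttpApplication application)
    {
        application.EndRequest += OnEndRequest;
    }

    private static void OnEndRequest(object sender, EventArgs e)
    {
        HttpContext context = ((HttpApplication)sender).Context;
        // Placeholder persistence call - queue or batch this in a real implementation.
        SaveHit(context.Request.RawUrl,
                context.Request.UserHostAddress,
                context.Request.UserAgent,
                context.Response.StatusCode);
    }

    private static void SaveHit(string url, string ip, string userAgent, int status)
    {
        // Write to your analytics store here.
    }

    public void Dispose() { }
}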
3. Any reference books, white papers, etc. (both technical and non-technical)
The HTTP RFC is useful; it gives you most of the request headers that are capturable.
4. Any open source implementation (especially in .NET)
I wrote one that has the parsing part of the analysis done (in my view the hardest part). It needs a bit of tweaking in certain areas as it's 4 years old:
Statmagic (for log files)
It's missing a DAL, which is harder than it sounds - the main hurdle is making sure you don't replicate the exact data that each row of the log has, as you then may as well just use the log files. The other part is displaying this aggregated data in a nice format. My goal was to have it stored in SQL Server, and also db4o format to cater for smaller websites.
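To make that point concrete, here is a rough sketch (my own, not Statmagic's actual code) of rolling raw hits up into daily page-view counts before storing them, so the database holds aggregates rather than a copy of every log row:

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: aggregate raw hits into one row per (day, path) before persisting.
public static class LogAggregator
{
    // hits: (timestamp, path) pairs produced by the log parser.
    public static Dictionary<string, int> DailyPageViews(IEnumerable<Tuple<DateTime, string>> hits)
    {
        return hits
            .GroupBy(h => h.Item1.ToString("yyyy-MM-dd") + " " + h.Item2)
            .ToDictionary(g => g.Key, g => g.Count());
    }
}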
The 'sad' part of the Statmagic project is Google came along and completely wiped out the competition and any point in me finishing it.

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I can successfully scrape building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @@key = key
    @@link_key = link_key
  end

  def crawl_propshark_single
    agent = Mechanize.new{ |agent|
      agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = "#{@@key}"
    page = agent.submit(form)
    page = form.submit

    page.links.each do |link|
      if link.text.include?("#{@@link_key}")
        if link.text.include?("PropertyShark")
          property_page = link.click
        else
          next
        end
        if property_page
          data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
          data_name = property_page.css("div.cols").css("th")[4].text
          @result_hash["#{data_name}"] = data_value
        else
          next
        end
      end
    end

    return @result_hash
  end
end #endof: class PropShark
# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key,key_link)
puts spider.crawl_propshark_single
I get the following error, but in an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I loop over multiple addresses using the above code, I delay the process with sleep 80 between addresses.
The first thing you should do, before you do anything else, is to contact the website owner(s). Right now, your actions could be interpreted as anything from overly aggressive to illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to be depending on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all of the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural, human-like delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You may even consider using something like Selenium, which uses a real browser instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page but some random error page. A simple retry may be all you need to get the data in question. When scraping, a poorly functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many online web-scraping / API-creation / data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem. So let's focus on that line of code for a second.
Say the first time around your columns are columns = [1,2,3,4,5]; then columns[4] will return 5 (the element at index 4).
Now, for fun, let's assume the next time around your columns are columns = ['a','b','c','d']; then columns[4] will return nil because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, leading to nil.text and the error you are receiving.

lookup country name and return flag image to cell in Google Sheets

I have a country list of 245 countries.
Is there any way I can use a VLOOKUP in Google Sheets to import their respective flags?
I was thinking of potentially using a resource such as Wiki or http://www.theodora.com/flags/, but I'm not sure if I can.
Sample file
Related article
Step 1. Get links
A1 = http://www.sciencekids.co.nz/pictures/flags.html
B1 = //@src[contains(.,'flags96')]
A3 = =IMPORTXML(A1,B1)
Step 2. Use image function
B3 = =IMAGE(substitute(A3,"..","http://www.sciencekids.co.nz"))
Bonus. Country name:
C1 = ([^/.]+)\.jpg$
C3 = =REGEXEXTRACT(A3,C1)
Update:
After writing this and doing a bit more curious Googling, I found the following APIs:
https://www.countryflags.io/ (for building a country flag url from a country code)
https://restcountries.eu/ (for getting a country code from a name or partial name)
Which allowed me to create this one-liner formula instead:
=IMAGE(CONCATENATE("https://www.countryflags.io/", REGEXEXTRACT(INDEX(IMPORTDATA(CONCAT("https://restcountries.eu/rest/v2/name/", F3)), 1, 3),"""(\w{2})"""), "/flat/64.png"))
(if anyone knows of a better way to import & parse json in Google Sheets - let me know)
Since these are official APIs rather than "sciencekids.co.nz" it would theoretically provide the following benefits:
It's a bit more "proper" to use a purpose-built API than some random website
Maybe slightly more "future proof"
Availability: more likely to be available in the future
Updated/maintenance: more likely to be updated to include new countries/updated flags
But, big downside: it seems to be limited to 64px-wide images (even the originally posted "sciencekids" solution provided 96px-wide images). So if you want higher-quality images, you can adapt the original formula to:
=IMAGE(SUBSTITUTE(SUBSTITUTE(QUERY(IMPORTXML("http://www.sciencekids.co.nz/pictures/flags.html","//@src[contains(.,'flags96')]"),CONCATENATE("SELECT Col1 WHERE Col1 CONTAINS '/", SUBSTITUTE(SUBSTITUTE(A1, " ", "_"), "&", "and") ,".jpg'")),"..","http://www.sciencekids.co.nz"), "flags96", "flags680"))
which provides 680px-wide images on the "sciencekids.co.nz" site. (If anyone finds an API that provides higher-quality images, please let me know. There's got to be one out there)
Original Post:
To add on to Max's awesome answer, here's the whole thing in a single function:
=IMAGE(SUBSTITUTE(QUERY(IMPORTXML("http://www.sciencekids.co.nz/pictures/flags.html","//@src[contains(.,'flags96')]"),CONCATENATE("SELECT Col1 WHERE Col1 CONTAINS '/", SUBSTITUTE(SUBSTITUTE(A1, " ", "_"), "&", "and") ,".jpg'")),"..","http://www.sciencekids.co.nz"))
(If anyone wants to simplify that a bit, be my guest)
Put this in A2, and put a country name in A1 (e.g. "Turkey" or "Bosnia & Herzegovina"), and it will show a flag for your "search".

Firebase many to many performance

I'm wondering about the performance of Firebase when making n + 1 queries. Let's consider the example in this article https://www.firebase.com/blog/2013-04-12-denormalizing-is-normal.html where a link has many comments. If I want to get all of the comments for a link I have to:
Make 1 query to get the index of comments under the link
For each comment ID, make a query to get that comment.
Here's the sample code from that article that fetches all comments belonging to a link:
var commentsRef = new Firebase("https://awesome.firebaseio-demo.com/comments");
var linkRef = new Firebase("https://awesome.firebaseio-demo.com/links");
var linkCommentsRef = linkRef.child(LINK_ID).child("comments");

linkCommentsRef.on("child_added", function(snap) {
  commentsRef.child(snap.key()).once("value", function(commentSnap) {
    // Render the comment on the link page.
  });
});
I'm wondering if this is a performance concern compared to the equivalent query in a SQL database, where I could fetch all the comments in a single query: SELECT * FROM comments WHERE link_id = LINK_ID.
Imagine I have a link with 1000 comments. In SQL this would be a single query, but in Firebase this would be 1001 queries. Should I be worried about the performance of this?
One thing to keep in mind is that Firebase works over web sockets (where available), so while there may be 1001 round trips, there is only one connection that needs to be established. Also, a lot of the round trips will be happening in parallel. So you might be surprised at how little time this takes.
Should I worry about this?
In general people over-estimate the amount of use they'll get. So (again: in general) I recommend that you don't worry about it until you actually have that many comments. But from day 1, ensure that nothing you do today precludes optimizing later.
One way to optimize is to further denormalize your data. If you already know that you need all comments every time you render an article, you can also consider duplicating the comments into the article.
A fairly common scenario:
/users
  twitter:4784
    name: "Frank van Puffelen"
    otherData: ....
/messages
  -J4377684
    text: "Hello world"
    uid: "twitter:4784"
    name: "Frank van Puffelen"
  -J4377964
    text: "Welcome to StackOverflow"
    uid: "twitter:4784"
    name: "Frank van Puffelen"
So in the above data snippet I store both the user's uid and their name for every message. While I could look up the name from the uid, having the name in the messages means I can display the messages without the lookup. I'm also keeping the uid, so that I can provide a link to the user's profile page (or other messages).
We recently had a good question about this, where I wrote more about the approaches I consider for keeping the derived data up to date: How to write denormalized data in Firebase

How to bind the Unique Ids of Computer System into Software [duplicate]

This question already has answers here:
Is there really any way to uniquely identify any computer at all
(5 answers)
Closed 7 years ago.
I have a Windows stand-alone application that I need to be password protected. For this I have hashed the unique IDs of the system and put the hash into a License text file in the application folder.
Here is my code to get the License key
public string Value()
{
    if (string.IsNullOrEmpty(fingerPrint))
    {
        fingerPrint = GetHash("CPU >> " + cpuId() + "\nBIOS >> " + biosId() + "\nBASE >> " + baseId()
        //+"\nDISK >> "+ diskId() + "\nVIDEO >> " + videoId() +"\nMAC >> "+ macId()
        );
    }
    return fingerPrint;
}

private string GetHash(string s)
{
    Label2.Text = s;
    MD5 sec = new MD5CryptoServiceProvider();
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] bt = enc.GetBytes(s);
    return GetHexString(sec.ComputeHash(bt));
}
Now, here are my doubt points:
How to make the software valid for a specified period of time.
How to check the license key and validate the time duration each time the software starts.
Is my approach correct?
I have tried to implement it this way because I have heard that the Windows registry is not a secure way to implement licensing, as anyone can easily copy it.
Please help me with your valuable suggestions.
Thanks in advance.
You now have a key that you can use to hardware-lock the license to the machine, and that's one step toward copy protecting your application.
Now you need to save the key to a file and protect it using a private/public key mechanism to prevent the user from tampering with the file. In this file you can also save the time duration and any other info you want.
Here you can find a sample on how to do it using the RSA keys, the SignedXml object and how to validate: http://www.dotnetlicensing.net/
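As a simpler illustration of the same idea as the SignedXml sample (this is my own sketch, not the code from that link), the license file can hold a payload such as "<fingerprint>|<expiry-utc>" plus an RSA signature. The vendor signs with the private key; the shipped application embeds only the public key and checks both the signature and the content at every startup, which also covers points 1 and 2 above. The class name and the '|' delimited payload format are assumptions of mine.

using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

// Sketch: sign and validate a license payload of the form "<fingerprint>|<expiry-utc>".
static class LicenseCheck
{
    // Run by the vendor when issuing a license (private key never ships with the app).
    public static string Sign(RSA privateKey, string payload)
    {
        byte[] sig = privateKey.SignData(
            Encoding.UTF8.GetBytes(payload),
            HashAlgorithmName.SHA256,
            RSASignaturePadding.Pkcs1);
        return Convert.ToBase64String(sig);
    }

    // Run by the application at startup, using only the embedded public key.
    public static bool IsValid(RSA publicKey, string payload, string signatureBase64, string currentFingerprint)
    {
        // 1. Has the license file been tampered with?
        bool signatureOk = publicKey.VerifyData(
            Encoding.UTF8.GetBytes(payload),
            Convert.FromBase64String(signatureBase64),
            HashAlgorithmName.SHA256,
            RSASignaturePadding.Pkcs1);
        if (!signatureOk) return false;

        // 2. Does it belong to this machine, and is it still within the licensed period?
        string[] parts = payload.Split('|');
        return parts.Length == 2
            && parts[0] == currentFingerprint
            && DateTime.Parse(parts[1], CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind) > DateTime.UtcNow;
    }
}

Keep in mind that none of this stops a determined attacker who patches the executable; it only raises the bar, which is generally the most you can expect from client-side licensing.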
You could also try SmartBind (from Wibu Systems). That would also cover the doubting points you mention. It directly ties your software to mac-adress, ip number, bios, cpu-id, hard drives, sid, basically anything that is present in the end-users system.

Extremely long load time for Ember.js application

I am using Ember.js to build a website for my company.
The problem I am having is that the initial load time of the page is around 10 seconds.
I can't give you the profiling data from Chrome because I can't get it out of work.
However, what I noticed when looking at it is that there is a function called "Get" which takes around 8.5 seconds in total. I realize this is probably just many uses of Ember.get(), but still, this is just the initial page load.
I don't know if this is normal or not but it's extremely unpleasant. Is there something I can do about this?
Thanks, Jason
Try using a production release (the minified version of ember.js); it uses a significantly faster get.
Are you rendering some very large lists? If so, look into using List View.
If you have a ton of fields being bound that don't ever change modify them to be unbound.
{{unbound someField}}
If you are having some weird issue where a template is taking a long time, yet you aren't sure which one it is, you can add some timestamp logging to the beginning of your templates to track down the culprit. I whipped up a quick helper, shown at the bottom. In your template you could use it like so; it will print out a timestamp along with the data point passed in.
{{logTime this}}
{{logTime name}}
Ember.Handlebars.helper('logTime', function(someField) {
  var d = new Date(),
      timestamp = d.toTimeString().replace(/.*(\d{2}:\d{2}:\d{2}).*/, "$1") + "." + d.getMilliseconds();
  console.log(timestamp + " - " + someField);
  return "";
});
