LCP Field Data Anomaly - core-web-vitals

I have an oddity with my LCP field data score, which currently stands at a dire 3.2s according to https://pagespeed.web.dev/
However, according to every test I run, the score is between 0.6s and 1s.
I've tried Chrome Lighthouse, online tools like https://www.webpagetest.org/, and I've tried logging my LCP data to Google Analytics and analysing it with https://web-vitals-report.web.app/ - all show an LCP score of less than 1s.
I've looked at this:
https://web.dev/debug-web-vitals-in-the-field/#the-web-vitals-report-tool
But there doesn't seem to be a way to show which element is causing the LCP delay in the field?
Any advice gratefully received!

The post you linked details how to gather the LCP element just above the section you linked to: https://web.dev/debug-web-vitals-in-the-field/#usage-with-the-web-vitals-javascript-library.
The next version of the web-vitals.js library will include this code in the library itself, in an Attribution build that makes it easy to report elements back to the likes of Google Analytics.
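For reference, a minimal sketch of that approach using the standard PerformanceObserver API (simplified; the web-vitals library handles the awkward edge cases such as background tabs and reporting the final value on page hide, so prefer it in production). sendToAnalytics is a hypothetical stand-in for however you beacon data to Google Analytics:

```typescript
// sendToAnalytics() is a hypothetical helper - swap in gtag() or your own beacon.
function sendToAnalytics(data: Record<string, unknown>): void {
  navigator.sendBeacon('/analytics', JSON.stringify(data));
}

// Report the element behind the latest LCP candidate, so field data can be
// broken down by element the same way the lab tools report it.
new PerformanceObserver((entryList) => {
  const entries = entryList.getEntries();
  const lastEntry = entries[entries.length - 1] as any; // LargestContentfulPaint entry
  if (lastEntry) {
    sendToAnalytics({
      metric: 'LCP',
      value: lastEntry.startTime,
      // e.g. 'img' or 'h1' - enough to see which element drives slow loads
      element: lastEntry.element ? lastEntry.element.tagName.toLowerCase() : '(unknown)',
    });
  }
}).observe({ type: 'largest-contentful-paint', buffered: true });
```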
If you are not seeing slow LCP in the field data captured by web-vitals.js, then that is promising and suggests the issue may already be fixed. The CrUX data is measured over a trailing 28-day window, so it could just be an old issue working its way out.
However, it could still be a live issue that is simply not showing up: either because you occasionally get traffic from slower devices and connections, or because there is something the web-vitals library cannot measure. In particular, JavaScript solutions cannot always measure cross-site content accurately (e.g. iframes or images from other domains). More detail in this post: https://web.dev/crux-and-rum-differences/
Unfortunately, without some more details (is it origin-level LCP that's an issue, or URL-level? What is your LCP element typically? Is the proportion of page loads passing, as given by the CrUX API, increasing each day?) it is difficult to give more advice than that.

Related

Efficient way to represent locations, and query based on proximity?

I'm pondering over how to efficiently represent locations in a database, such that given an arbitrary new location, I can efficiently query the database for candidate locations that are within an acceptable proximity threshold to the subject.
Similar things have been asked before, but I haven't found a discussion based on my criteria for the problem domain.
Things to bear in mind:
Starting from scratch, I can represent the data in any way (e.g. longitude & latitude, etc.)
Any result set is time-sensitive, in that it loses validity within a short window of time (~5-15mins) so I can't cache indefinitely
I can tolerate some reasonable margin of error in results, for example if a location is slightly outside of the threshold, or if a row in the result set has very recently expired
A language agnostic discussion is perfect, but in case it helps I'm using C# MVC 3 and SQL Server 2012
A couple of first thoughts:
Use an external API like Google's, but this will generate thousands of requests and the latency will be poor
Use the Haversine function, but this looks expensive and so should be performed on a minimal number of candidates (possibly even as a stored procedure!) - a sketch of the formula appears a little further down
Build a graph of postcodes/zipcodes, such that from any node I can find the postcodes/zipcodes that border it, but this could involve a lot of data to store
Some optimization ideas to reduce possible candidates quickly:
Cache result sets for searches, and when we do subsequent searches, see if the subject is within an acceptable range to a candidate we already have a cached result set for. If so, use the cached result set (but remember, the results expire quickly)
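For reference, here is a minimal sketch of the Haversine calculation mentioned above (TypeScript purely for illustration; the same formula ports to a SQL Server stored procedure since it only needs trig functions). In practice you would run it only against a pre-filtered candidate set, e.g. after a bounding-box or geohash filter:

```typescript
// Haversine great-circle distance between two lat/long points.
// Degrees in, kilometres out; 6371 km is the mean Earth radius.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(a));
}

// e.g. haversineKm(51.5074, -0.1278, 48.8566, 2.3522) ≈ 344 km (London to Paris)
```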
I'm hoping the answer isn't just raw CPU power, and that there are some approaches I haven't thought of that could help me out?
Thank you
ps. Apologies if I've missed previously asked questions with helpful answers, please let me know below.
What about using GeoHash? (refer to http://en.wikipedia.org/wiki/Geohash)
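To make the GeoHash suggestion concrete, here is a minimal encoder sketch (standard base-32 geohash). Nearby points usually share a long common prefix, so storing the geohash in an indexed column lets you pre-filter candidates with a prefix match (plus the eight neighbouring cells, to handle points that sit near a cell boundary) and then apply the exact Haversine check only to the survivors:

```typescript
const BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

// Encode a lat/long pair into a geohash string of the given length.
// A normal B-tree index on the geohash column can then serve prefix filters.
function geohashEncode(lat: number, lon: number, length = 7): string {
  const latRange: [number, number] = [-90, 90];
  const lonRange: [number, number] = [-180, 180];
  let hash = '';
  let bits = 0;
  let charValue = 0;
  let evenBit = true; // geohash interleaves bits, starting with longitude

  while (hash.length < length) {
    const range = evenBit ? lonRange : latRange;
    const value = evenBit ? lon : lat;
    const mid = (range[0] + range[1]) / 2;
    charValue <<= 1;
    if (value >= mid) {
      charValue |= 1;
      range[0] = mid;
    } else {
      range[1] = mid;
    }
    evenBit = !evenBit;
    if (++bits === 5) {
      hash += BASE32[charValue];
      bits = 0;
      charValue = 0;
    }
  }
  return hash;
}

// e.g. geohashEncode(57.64911, 10.40744) === 'u4pruyd' (the classic Wikipedia example)
```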

Detecting duplicate webpages among large number of URLs

From a quote on the official Google blog:
"In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates . . . "
How does Google detect those exact duplicate webpages or documents? Any idea on Algorithm that Google uses?
According to http://en.wikipedia.org/wiki/MinHash:
A large scale evaluation has been conducted by Google in 2006 [10] to compare the performance of Minhash and Simhash[11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling[12] and using Minhash and LSH for Google News personalization.[13]
A search for Simhash turns up these pages:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: Detecting near-duplicates for web crawling
Abstract:
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
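As a rough illustration of the simhash idea from those papers (a toy 32-bit version, not the 64-bit scheme the paper actually describes): hash each feature of a document, sum +1/-1 weights per bit position across all features, and keep the sign of each sum as one bit of the fingerprint. Near-duplicate documents then tend to produce fingerprints within a small Hamming distance of each other:

```typescript
// Toy 32-bit simhash: just enough to show the "sum signed bit weights, keep the signs" idea.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function simhash32(tokens: string[]): number {
  const weights = new Array<number>(32).fill(0);
  for (const token of tokens) {
    const h = fnv1a32(token);
    for (let bit = 0; bit < 32; bit++) {
      weights[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (weights[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

// Hamming distance between fingerprints: a small distance suggests near-duplicates.
function hammingDistance(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x) {
    count += x & 1;
    x >>>= 1;
  }
  return count;
}
```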
Possible solutions:
Exact methods:
1) Brute force: compare every new page to all visited pages (very slow and inefficient)
2) Calculate a hash of every visited page (MD5, SHA-1), store the hashes in a database, and look up every new page's hash in the database (a minimal sketch of this appears after this list)
3) Standard Boolean model of information retrieval (BIR)
...many other possible methods
Near-exact methods:
1) Fuzzy hashing
2) Latent semantic indexing
...
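A minimal sketch of exact method 2 above, with an in-memory Set standing in for the database table of hashes (Node's crypto module assumed; the normalisation step is an illustrative choice, not part of the method):

```typescript
import { createHash } from 'node:crypto';

// Exact-duplicate check via a content hash (method 2 above).
const seenHashes = new Set<string>();

function contentFingerprint(html: string): string {
  // Crude normalisation so trivially different whitespace doesn't defeat the hash.
  const normalised = html.replace(/\s+/g, ' ').trim().toLowerCase();
  return createHash('sha1').update(normalised).digest('hex');
}

function isExactDuplicate(html: string): boolean {
  const fingerprint = contentFingerprint(html);
  if (seenHashes.has(fingerprint)) return true;
  seenHashes.add(fingerprint);
  return false;
}
```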

Webpage updates detection algorithms

First of all, I'm not looking for code, just a plain discussion about approaches regarding what the subject says.
I was wondering lately what the best way really is to detect (as fast as possible) changes to website pages. Assuming I have 100K websites, each with an unknown number of pages, does a crawler really need to visit each and every one of them once in a while?
Unless they have RSS feeds (which you would still need to pull to see if they have changed), there really isn't any way to find out when a site has changed except by going to it and checking. However, you can do some smart things to be more efficient. After you have been checking a site for a while, you can build a prediction model of when it tends to update. For example: this news site updates every 2-3 hours, but that blog only makes about a post a week. This can save you many checks, because the majority of pages don't actually update that often. Google does this to help schedule its crawling.
One simple algorithm which will work for this (depending on how cutting-edge you need your news to be) is the following, of my own design, based on binary search:
Start each site off with a time interval ~ 1 day
Visit the site when that time hits and check for changes
if something has changed
    halve the time for that site
else
    double the time for that site
If after many iterations you find it hovering around 2-3 numbers
    fix the time on the greater of the numbers
Now, this is a simple algorithm for finding which check times are right, but you can probably do something more effective if you parse the text and spot patterns in the times when updates were actually posted.
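A minimal sketch of that halve-on-change, double-on-no-change scheduling idea (the names and bounds here are mine, not from the answer; the actual fetch-and-diff step is left out):

```typescript
// Adaptive re-check interval per site: halve on change, double on no change,
// clamped to sane bounds. Once it stabilises between two values, pin it to the larger.
interface SiteSchedule {
  url: string;
  intervalMs: number;
  nextCheckAt: number;
}

const ONE_HOUR = 60 * 60 * 1000;
const ONE_DAY = 24 * ONE_HOUR;

function reschedule(site: SiteSchedule, changed: boolean, now: number): void {
  site.intervalMs = changed
    ? Math.max(ONE_HOUR, site.intervalMs / 2)      // changed: check twice as often
    : Math.min(7 * ONE_DAY, site.intervalMs * 2);  // unchanged: back off
  site.nextCheckAt = now + site.intervalMs;
}
```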

What is a good algorithm for showing the most popular blog posts?

I'm planning on developing my own plugin for showing the most popular posts, as well as counting how many times a post has been read.
But I need a good algorithm for figuring out the most popular blog post, and a way of counting the number of times a post has been viewed.
A problem I see when it comes to counting the number of times a post has been read is how to avoid counting when the same person opens the same post many times in a row, as well as how to avoid counting web crawlers.
http://wordpress.org/extend/plugins/wordpress-popular-posts/
Comes in the form of a plugin. No muss, no fuss.
'Live' counters are easily implementable and a dime a dozen. If they become too cumbersome on high traffic blogs, the usual way is to parse webserver access logs on another server periodically and update the database. The period can vary from a few minutes to a day, depending on how much lag you deem acceptable.
There are two ways of going about this:
You could consider the individual page hits [through the Apache/IIS logs] and use that
Use Google PageRank to emphasize pages that are strongly linked to [popular posts would then be based not on visits but on the number of pages that link to them]
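A rough sketch of the counting side that addresses the two concerns in the question, repeat views and crawlers (the visitor id, the 30-minute window and the user-agent test are illustrative choices, not a standard):

```typescript
// Count a view only if the same visitor hasn't hit the same post recently,
// and skip obvious crawlers. Visitor key and storage are placeholders.
const DEDUP_WINDOW_MS = 30 * 60 * 1000; // 30 minutes
const recentViews = new Map<string, number>(); // "visitorId|postId" -> last counted time

function looksLikeBot(userAgent: string): boolean {
  return /bot|crawler|spider|slurp|curl/i.test(userAgent);
}

function shouldCountView(visitorId: string, postId: string, userAgent: string, now = Date.now()): boolean {
  if (looksLikeBot(userAgent)) return false;
  const key = `${visitorId}|${postId}`;
  const last = recentViews.get(key);
  if (last !== undefined && now - last < DEDUP_WINDOW_MS) return false;
  recentViews.set(key, now);
  return true; // the caller increments the per-post counter in the database
}
```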

What searching algorithm/concept is used in Google?

What searching algorithm/concept is used in Google?
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Indexing
If you want to get down to basics:
Google uses an inverted index of the Internet. What this means is that Google has an index of all pages it's crawled based on the terms in each page. For instance, the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.
Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
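A toy version of that term-to-pages mapping, to make the idea concrete:

```typescript
// A toy inverted index: term -> set of document ids containing that term.
type DocId = string;

function buildInvertedIndex(docs: Record<DocId, string>): Map<string, Set<DocId>> {
  const index = new Map<string, Set<DocId>>();
  for (const [id, text] of Object.entries(docs)) {
    for (const term of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term)!.add(id);
    }
  }
  return index;
}

// A query for "google" is then just a lookup:
// buildInvertedIndex({a: 'Google home page', b: 'Wikipedia article for Google'}).get('google')
// => Set { 'a', 'b' }
```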
For veteran users:
Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.
In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.
Ranking
To get down to basics:
Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as important as a document that has 1 page and lists the word Google ten times.
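A sketch of one common tf-idf variant (term frequency normalised by document length, times the log of inverse document frequency), which is why the short page that mentions Google ten times outscores the long one that mentions it once:

```typescript
// tf-idf for a term in one document, given the whole tokenised corpus (one common variant).
function termFrequency(term: string, docTokens: string[]): number {
  const count = docTokens.filter((t) => t === term).length;
  return count / docTokens.length;
}

function inverseDocumentFrequency(term: string, corpus: string[][]): number {
  const docsWithTerm = corpus.filter((doc) => doc.includes(term)).length;
  return Math.log(corpus.length / (1 + docsWithTerm)); // +1 avoids division by zero
}

function tfIdf(term: string, docTokens: string[], corpus: string[][]): number {
  return termFrequency(term, docTokens) * inverseDocumentFrequency(term, corpus);
}
```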
For veteran users:
Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned, patented PageRank algorithm. In short, PageRank enhances the tf-idf algorithm by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us about. However, at the most basic of levels, Google can tell that one page is more important than another because loads and loads of other pages link to it.
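The core of that idea can be written as a short power iteration (a textbook sketch, not Google's production algorithm): each round, a page's rank is split across its outgoing links and collected by the pages those links point to, with a damping factor around 0.85. The sketch assumes every page appears as a key in the link map and glosses over dangling pages:

```typescript
// Textbook PageRank power iteration; assumes every page appears as a key in `links`.
function pageRank(
  links: Map<string, string[]>,
  damping = 0.85,
  iterations = 50
): Map<string, number> {
  const pages = [...links.keys()];
  const n = pages.length;
  let rank = new Map<string, number>(pages.map((p) => [p, 1 / n] as [string, number]));

  for (let i = 0; i < iterations; i++) {
    // Everyone starts each round with the "random jump" share of rank.
    const next = new Map<string, number>(pages.map((p) => [p, (1 - damping) / n] as [string, number]));
    for (const [page, outLinks] of links) {
      const share = (rank.get(page) ?? 0) / Math.max(outLinks.length, 1);
      for (const target of outLinks) {
        next.set(target, (next.get(target) ?? 0) + damping * share);
      }
    }
    rank = next;
  }
  return rank; // pages with many well-ranked in-links end up with high rank
}
```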
Google's patented PigeonRank™
Wow, they initially posted this 7 years ago from Wednesday ...
PageRank is a link analysis algorithm used by Google for the search engine, but the patent was assigned to Stanford University.
I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated.
Here is a recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems
An inverted index and MapReduce are the basics of most search engines (I believe). You create an index on the content and run queries against that index to determine relevance. Google, however, does much more than a simple index of where each word occurs: it also records how many times a word appeared, where it appears, where it appears in relation to other words, the ordering, etc. Another simple concept that's used is "stop words", which may include things like "and", "the", and so on (basically "simple" words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
MapReduce is basically taking one job and dividing it into smaller jobs, then letting those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to make use of "average" computers to distribute jobs to instead of server-grade computers. Since the processing capability of a single computer is reaching a peak, much of the industry is heading towards cloud computing, where a job is done by many physical machines.
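The MapReduce idea in miniature, as a single-process sketch (the real point is that both phases can be spread across many ordinary machines): map each document to (term, 1) pairs, partition by term, and reduce by summing, which is also one way to build the counts an index needs:

```typescript
// MapReduce in miniature: map docs to (term, 1) pairs, shuffle by key, reduce by summing.
type Pair = [string, number];

function mapPhase(doc: string): Pair[] {
  return doc.toLowerCase().split(/\W+/).filter(Boolean).map((term): Pair => [term, 1]);
}

function reducePhase(pairs: Pair[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const [term, value] of pairs) {
    counts.set(term, (counts.get(term) ?? 0) + value);
  }
  return counts;
}

// In a real cluster, mapPhase runs on many machines over many documents, and the
// pairs are partitioned by term before reducePhase runs, also on many machines.
const docs = ['the cat sat', 'the dog sat'];
const counts = reducePhase(docs.flatMap((doc) => mapPhase(doc)));
// counts: Map { 'the' => 2, 'cat' => 1, 'sat' => 2, 'dog' => 1 }
```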
I'm not sure how much searching Google actually does; it's more accurately described as crawling. The difference is that they just start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
While being interested in the PageRank algorithm and similar, I was disturbed to discover that the introduction of personalised search at the turn of the year (not widely commented on) seems to change quite a lot - see Failure of the Google Gold Standard and Google's Personalized Results.
This question cannot be answered canonically. The algorithms used by Google (and other search engines) are among their most closely guarded secrets and change constantly. Every correct answer can be invalid a month or a year later.
(I know this doesn't really answer the question, but that's the point, there is no possible answer.)
