Data Structure for uniquely storing links - data-structures

As part of building a web crawler, I have extracted links for the crawler to visit.
What kind of data structure would be suitable for storing each URL with a unique identifier, so that before visiting a page I can test whether it has already been visited?

An approach: take the unique identifier to be the page/URL title or a unique hash calculated from the URL, for example:
URL:
http://stackoverflow.com/questions/18102087/data-structure-for-uniqurly-storing-links
Id: 18102087 OR UNIQUE-HASH (MD5 etc)
Root: http://stackoverflow.com
Other URLs: Root/questions/tagged/java, Root/questions/18102124/mysql-database-using-matlab
Data-structure :
Map [ROOT-URL, Map[ID, URL]]
Fetch/Read :
Given URL, extract ROOT and ID (a string parsing/regex function)
Lookup ROOT, and LOOKUP ID in returned map
Get all URLs of a ROOT:
Given URL, extract ROOT and ID
Lookup ROOT
Benefit:
Grouping by root or base URL, which can be used for various purposes (say fix-deep structure)
Fewer hash collisions
Cons:
Memory: maintaining the extra ROOT string (say, millions of times). A single-Map approach would hold only ID and URL
Two lookups instead of one compared with the single-Map approach, but that should be fine since it is a HashMap
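The Map[ROOT-URL, Map[ID, URL]] layout above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names split_url and LinkStore are hypothetical, the ID is taken to be the first all-digit path segment, and an MD5 hash of the full URL is used when no such segment exists.

```python
import hashlib
import re
from collections import defaultdict
from urllib.parse import urlsplit


def split_url(url):
    """Hypothetical helper: return (root, id) for a URL."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"
    # Assumption: the ID is the first path segment made only of digits.
    match = re.search(r"/(\d+)(?:/|$)", parts.path)
    if match:
        page_id = match.group(1)
    else:
        # Fallback from the question: a hash of the URL (MD5 here).
        page_id = hashlib.md5(url.encode("utf-8")).hexdigest()
    return root, page_id


class LinkStore:
    """Two-level map: root URL -> {id -> full URL}."""

    def __init__(self):
        self._by_root = defaultdict(dict)

    def add(self, url):
        """Store the URL; return False if its (root, id) was already present."""
        root, page_id = split_url(url)
        urls = self._by_root[root]
        if page_id in urls:
            return False
        urls[page_id] = url
        return True

    def contains(self, url):
        root, page_id = split_url(url)
        return page_id in self._by_root.get(root, {})

    def urls_for_root(self, root):
        return list(self._by_root.get(root, {}).values())
```

Calling add() before each visit gives the "already visited?" test, and urls_for_root() gives the grouping by root described under Benefit.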

A HashSet is probably the way to go. In this case each URL (a string) is its own unique identifier. You can also implement an IEqualityComparer for custom comparison.
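The same idea in Python, as a rough sketch: a built-in set plays the role of the HashSet, and a normalization function stands in for the custom comparer. The names normalize and should_visit are hypothetical, and the rule that URLs differing only in host case or fragment count as the same page is an assumption, not something the answer specifies.

```python
from urllib.parse import urlsplit, urlunsplit

visited = set()


def normalize(url):
    """Hypothetical equality rule: ignore host case and the #fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def should_visit(url):
    """Return True the first time a URL is seen, False afterwards."""
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True
```

Membership tests and insertions on a set take constant time on average, so the check stays cheap as the number of crawled pages grows.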

Related

Defining right API endpoint REST/RPC

I am developing an API in a microservice for the Invoice entity that takes as input a list of Purchase Order Item (PO Item) identifiers; for example, PO# and productIdentifier together identify a PO Item uniquely. The response of the API is the invoiced quantity of each PO Item.
Input Model -
input GetInvoicedQuantityForPOItemsRequest {
poItemIdentifierList : POItemIdentifierList
}
Structures
list POItemIdentifierList {
member : POItemIdentifier
}
structure POItemIdentifier {
purchaseOrderNumber : String,
productIdentifier : Long
}
Invoiced Quantity of a POItem = SUM of Quantity of Invoice Items created from that PO Item.
Note : A single PO can be used to create multiple Invoices. An Invoice can be created from multiple POs.
I am quite new to REST, and so far we have been using RPC endpoints in our legacy service. Now I am building a new service whose endpoints are defined in REST format (for example, CreateInvoice has been changed to POST /invoice), and I would like suggestions from the Stack Overflow community on the right approach for defining the REST endpoint of this API, or on whether we should keep it in RPC format.
RPC endpoint for this API in legacy system : POST /getInvoicedQuantityForPOItems
Our first attempt at a REST endpoint for this is: POST /invoice/items/invoicedQuantityForPOItems. But this URI does not look like a noun; it is a verb.
"This URI does not look like a noun; it is a verb."
REST doesn't care what spelling conventions you use for your resource identifiers.
Example: this URI works exactly the same way that every other URI on the web works, even though "it looks like a verb"
https://www.merriam-webster.com/dictionary/post
The explanation is that, in HTTP, the semantics of the request are not determined by parsing the identifier, but instead by parsing the method token (GET, POST, PUT, etc).
So the machines just don't care about the spelling of the identifier (besides purely mechanical concerns, like making sure it satisfies the RFC 3986 production rules).
URIs are identifiers of resources. Resources are generalizations of documents. Therefore, human beings are likely to be happier if your identifier looks like the name of a document, rather than the name of an action.
Where it gets tricky: HTTP is an application protocol whose application domain is the transfer of files over a network. The methods in HTTP are about retrieving documents and metadata (GET/HEAD) or are about modifying documents (PATCH/POST/PUT). The notion of a function, or a parameterized query, doesn't really exist in HTTP.
The usual compromise is to make the parameters part of the identifier for a document, then use a GET request to fetch the current representation of that document. On the server, you parse the identifier to obtain the arguments you need to generate the current representation of the document.
So the identifier for this might look something like
/invoicedQuantityForPOItems?purchaseOrder=12345&productIdentifiers=567,890
An application/x-www-form-urlencoded representation of key value pairs embedded in the query part of the URI is a common spelling convention on the web, primarily because that's how HTML forms work with GET actions. Other identifier conventions can certainly work, though you'll probably be happier in the long term if you stick to a convention that is easily described by a URI template.
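As a small illustration of that convention, the sketch below builds and parses the example identifier with Python's standard urllib.parse module. The helper names build_identifier and parse_identifier are hypothetical, and the comma-separated productIdentifiers value simply follows the example URI above; it is one possible spelling, not a requirement.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_identifier(purchase_order, product_identifiers):
    """Client side: encode the arguments into the query part of the URI."""
    query = urlencode(
        {
            "purchaseOrder": purchase_order,
            "productIdentifiers": ",".join(str(p) for p in product_identifiers),
        },
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"/invoicedQuantityForPOItems?{query}"


def parse_identifier(identifier):
    """Server side: recover the arguments from the identifier."""
    params = parse_qs(urlsplit(identifier).query)
    purchase_order = params["purchaseOrder"][0]
    product_ids = [int(p) for p in params["productIdentifiers"][0].split(",")]
    return purchase_order, product_ids
```

The server would call something like parse_identifier on the target of a GET request and use the result to generate the current representation of the document.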

URL Shortener Algorithm - Remove Duplicates

I've read multiple posts on SO regarding the topic:
How to code a URL shortener?
How do URL shortener calculate the URL key? How do they work?
PHP URL Shortening Algorithm
Every post recommends storing the URL in the database. The database returns the id, which you pass to a hash function that returns a tiny id.
My question is: what happens if the same URL is submitted for shortening again? Bitly.com returns the same tiny URL again for the same URL.
What is the best way to ensure that URLs are not duplicated?
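The posts quoted here do not say how Bitly does it. One possible approach, sketched below under stated assumptions, is to look the URL up before inserting it, so that the same URL always maps to the same row id and therefore to the same tiny id. The dictionary stands in for a database table with a unique index on the URL column, base-62 encoding stands in for the "hash function", and the names Shortener and encode_base62 are hypothetical.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters: 0-9, a-z, A-Z


def encode_base62(n):
    """Turn a non-negative integer id into a short base-62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class Shortener:
    def __init__(self):
        # Stand-in for a table with a unique index on the URL column.
        self._id_by_url = {}
        self._next_id = 1

    def shorten(self, url):
        row_id = self._id_by_url.get(url)
        if row_id is None:
            # Only a URL that has not been seen before gets a new row id.
            row_id = self._next_id
            self._next_id += 1
            self._id_by_url[url] = row_id
        return encode_base62(row_id)
```

In a real database the lookup and the insert would need to happen atomically (for instance by relying on the unique index and handling the conflict), otherwise two concurrent requests for the same URL could still create two rows.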

How to get context ID and search ID

In JMeter, I am doing a search, and every time it generates a context id, a search id and a session id. I managed to get the session id from the HTTP request and pass it to the API call. But I don't see the context id and search id anywhere, so I am not able to pass them to the API. What can I do to get them?
From the chat, we learned that these are client-generated values.
It was suggested to use the following links to solve the issue by generating a random string, int, alphanumeric value, UUID, etc., depending on the data:
For Random Int: http://jmeter.apache.org/usermanual/functions.html#__Random
For Random String (alphanumeric): http://jmeter.apache.org/usermanual/functions.html#__RandomString
For Random values, defined by you: http://jmeter.apache.org/usermanual/functions.html#__RandomFromMultipleVars
For random UUID: http://jmeter.apache.org/usermanual/functions.html#__UUID
The function you choose depends on the type of data you are sending.

How to check if a url path exists in the service worker cache

I need to check if a particular URL path exists in the service worker cache.
For example, suppose my URL is:
/myserviceworker/service?a=110&b=70
This URL exists in the cache, but there are many such entries with different values of a and b.
Now, suppose I want to refresh all of these URLs, how can I do that?
I want to know how to access the key values from Service Worker cache.
If I know the key, my plan is as follows:
var url = new URL(key);
if (url.pathname === "/myserviceworker/service") {
  // refetch the key
}
But I am not sure how to get the cache key and in what format it is. I mean, is it a string or is it already a URL?
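The Cache API has a match() method that returns a promise resolving to a Response object if a match exists, or to undefined if none does. The second parameter is an options object in which you can set ignoreSearch to true so that the query string of the URL is ignored.
At the time of writing, the ignoreSearch option was supported only by Firefox.
On the other hand, to retrieve all the cache entries, you can use the keys() method. It returns a promise resolving to an array of Request objects, so each key is a Request rather than a plain string; its url property holds the URL as a string, which can be passed to new URL().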

What to call a method that finds or creates records in db

This question might seem stupid, but I nonetheless believe it's worth asking.
I work on a web application where we use tags that are attached to articles.
When adding a new article, the user provides a list of tags to associate with it.
So when the form is submitted and my MVC controller processes the request, I have a string with tags from the form. I split the tags string and get an array of tag words. Then I need a list of tag ids from the db. So I came up with a method that takes a list of tag words and checks whether each tag already exists in the db. If a given tag is already in the db, then the tag's id is appended to the result array. If the tag does not exist in the db, it is created, and then the id of the just-created tag is appended to the result array.
My question is: what is the best name for such a method? This method creates tags only if necessary and returns a list of tag ids.
I tried these names, but none of them looks right to me:
fetchTagsIds(List tagWordsList)
createOrFindsTagsIds(List tagWordsList)
obtainTagsIds(List tagWordsList)
I need a name that really reflects what the method does. Thanks for your help :)
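For reference, the find-or-create behaviour described in the question can be sketched as follows. This is a minimal illustration, not the asker's code: it assumes an SQLite table named tags with columns id and word, and the function name tag_ids_for is a hypothetical placeholder rather than a naming recommendation.

```python
import sqlite3


def tag_ids_for(conn, tag_words):
    """Return one id per tag word, creating rows for words not yet stored."""
    ids = []
    for word in tag_words:
        row = conn.execute(
            "SELECT id FROM tags WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            # Tag not in the db yet: create it and use the new row's id.
            cursor = conn.execute("INSERT INTO tags (word) VALUES (?)", (word,))
            ids.append(cursor.lastrowid)
        else:
            ids.append(row[0])
    return ids


# Assumed schema for the sketch: an in-memory table with a unique word column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL)"
)
```

From the caller's point of view the method only maps words to ids; whether a row was found or created is not visible in the result.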
I'd drop the "s" in "Tags". So that the method is:
FetchTagIds(List tagWordsList)
IdsOfTags. That it creates them is an implementation detail (as is the fact that it uses a relational database in the first place). All that matters is that it gives you unique ids for tags (e.g. it could also hash them with a perfect hash function, or look up an ID on the web).
If you're having trouble coming up with a name that accurately describes what your method does, that could be because your method does too much. Perhaps it would be better to refactor it into two:
findTagsByName(List tagNames) would return a list of Tag objects based on list of names you pass in
persistTags(List tags) - or saveTags(), or createTags() - would persist a list of Tag objects.
getTagIDs?
Get suggests that the code will do something to get an ID when one doesn't exist to fetch.

Resources