Better approach to saving Webpage content to database for caching - caching

i want to know which approach is better to saving Webpage content to database for caching?
Using ntext data-type and save content as flat string
Using ntext, but compress content and then save
Using varbinary(MAX) to save content (how i can convert flat string to binary? ;-))
An other approach which you are suggest to me
UPDATE
in more depth i have many table (URLs, Caches, ParsedContents, Words, Hits and etc) which for each url in URLs table i'm sending request and save response into caches table. this is Downloader (URLResolver of Google) section of my engine. then indexer section act was to perform parsing and etc tasks which associated with this. and Compress/Decompress performs only when new content goes to be caching or parsing

The better approach would be to use the built-in caching features in ASP.NET. Searching StackOverflow for [asp.net] [caching] is a good start, and after (or before) that, similar searches on both www.asp.net and Google will get you quite far.
In response to your comment, I would probably save the data as a flat string. It might not be the best option performance-wise when it comes to storage, but if you're going to perform searches on the text content, you don't want to have to compress/decompress or convert to/from binary every time, since there is probably no (easy) way to do this inside SQL Server. Just make sure you have all intexes and full-text features you need set up correctly.

Related

Converting from google translate widget to cloud api using javascript and ajax

We develop and maintain a large number websites which have used the 'old' translate widget for quite some time. Recently, we've undertaken an effort to make all these sites ADA compliant. As it turns out, the widget's implementation is NOT ADA compliant and, it's being deprecated anyway, so our strategy is to move forward and implement the Cloud Translation API.
Many of the site pages are quite large and contain a lot of markup within the body. The body of most site's home pages is in the vicinity of 20KB. Other site pages are probably somewhat smaller. So, rather than doing a POST to an endpoint on the server which would, in turn, post to the api and then have to return the content to the browser, we believe the correct approach is to access the api directly from the browser and clearly, if we were to post the html content of the body, the api should return the body with the markup intact with the translated text.
The only example we've been able to find shows code with a non-ajax $.get(...) translating a short text string. We're wondering if there might be other examples out there which more closely address what we're trying to accomplish.
One other side note: removing the markup from one of these 20KB bodies results in a reduction in size to a bit over 5KB, so potentially doing this could result in a significant cost savings for our clients. If we were to do this by creating an array of strings to translate as part of the post, is it possible to instruct the api to do a batch translate, which would allow us to replace the original strings with the translated ones.
Right now the only available batch requests for translations would be this [1]. This requires the use of cloud storage, where the files should be and where the translated files go. As per your explanation, I am unsure if this could be of use for you.
I have found this post [2] which has a workaround that may be of use for you if it is possible for you to concatenate what needs to be translated. Basically, the workaround would be creating a string which is a concatenation of the strings that need to be translated and split it once it is translated based on a delimiter value.
[1] https://cloud.google.com/translate/docs/advanced/batch-translation
[2] Bulk translation of a big set of records via google translate

Aspose Markup fields too wide

I'm using the reporting engine for aspose and everything is working fine.
The issue I have is that my document model has some larger names in it. I didn't build the model and I would rather not create a new one just for reporting, but if that's my only option I will. Thought I'd check here first.
Example:
<<[NatureOfInjury]>><<[NatureOfInjuryOptionA]>>
Hurt Hit thumb with hammer
The markup for <<[NatureOfInjury]>> is wider on the word document than the value that will end up going in there, and it's making formatting the document difficult.
Is there any way other than changing the object model to make the markup smaller, independent of the actual text values that will go in there?
Thanks very much in advance.
Unfortunately, there is no other way to populate template tag with data source field name using LINQ Reporting. You need to use the same name in template document and data source.
I work with Aspose as Developer Evangelist.

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with ElasticSearch yet, but from what I read I think it suits most my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into SE and offer a front end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages of one host often only contain a little bit of unique content, a great part of the pages are common. Footer, header, menu etc. are the same on every page.
Does ElasticSearch have some build in intelligence that can filter out the common parts and only search the real content??
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy you need to adopt here would be to create a unique key per document. Taking checksum using sha1 or similar algorithm will do the job for getting the unique key. Make this the document ID so that only one page occurs at all point of time. Again use _create API to index if you dont want new duplicates to be indexed ( More efficient ) , and in case you want the new ones to be the document use normal indexing.
In case you need to modify the orginal document in case of disocvery of duplicate document , use upser.
I have explained a great deal of this in this blog.

Should JSON data contain formatted data?

When using JSON to populate a section of a page I often encounter that data needs special formatting - formatting that need to match that already on the page, which is done serverside.
A number might need to be formatted as a currency, a special date format or wrapped in for negative values.
But where should this formatting take place - doing it clientside will mean that I need to replicate all the formatting that takes place on the serverside. Doing it serverside and placing the formatted values in the JSON object means a less generic and reusable data set.
What is the recommended approach here?
The generic answer is to format data as late/as close to the user as is possible (or perhaps "practical" is a better term).
Irritatingly this means that its an "it depends" answer - and you've more or less already identified the compromise you're going to have to make i.e. do you remove flexibility/portability by formatting server side or do you potentially introduct duplication by doing it client side.
Personally I would tend towards client side unless there's a very good reason not to do so - simply because we're back to trying to format stuff as close to the user as possible, though I would be a bit concerned about making sure that I'm applying the right formatting rules in the browser.
JSON supports the following basic types:
Numbers,
Strings,
Boolean,
Arrays,
Objects
and Null (empty).
A currency is usually nothing else than a number, but formatted according to country-specific rules. And dates are not (yet) included in JSON at all.
Whatever is recommendable depends on what you do in your application and what kind of JScript libraries you are already using. If you are already formatting alot of data in your server side code, then add it there. If not, and you already have some classes included, which can cope with formatting (JQuery and MooTools have some capabilities), do it in the browser.
So either format them in the client or format them before sending them over - both solutions work.
If you want to delve deeper into this, i recommend this wikipedia article about JSON.

Dynamically updating RDF File

Is it possible to update an rdf file dynamically from user generated input through a webform? The exact scenario would beskos concept definitions being created and updated through user input to html forms.
I was considering xpath but is there a better / generally accepted / best practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file you use a Graph which is stored as a named graph in a Triple Store that supports triple level updates (i.e. you can change individual Triples in a Graph). For example you could use a store like Virtuoso or a Jena based store (Jena SDB/TDB) to do this, basically any store that supports the SPARUL language or has it's own equivalent.
2 - Using a fixed RDF file and altering it
From your mention of XPath I assume that you are intending to store your file as RDF/XML. While XPath would potentially work for this it's going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files then they'll be no guarantees over how the RDF has been serialized into RDF/XML so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML then you can keep it in a format that your XPath will work on.
From my point of view the simplest way to do this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory and then persist the whole thing back to disk when the user is done (or at regular intervals or whatever is appropriate to your application)

Resources