How to keep track of the language the user selected - internationalization

What is the best way to keep track of the language a user selected?
a) storing the language setting in a session variable or cookie
b) reading the preferred language (locale) from the HTTP headers the browser sends
c) appending the language to the URL of every link in the application

d) Store the regional preferences in the user profile.
Of course this requires the whole burden of having profiles and passwords, but it is the best option in the context of web applications.
That also means you should offer your log-in page based on the contents of the HTTP Accept-Language header, then use whatever language the user selected in the profile. The regional preferences page should default to the first supported language from the Accept-Language list (so that the user may skip the selection if it is already correct).
If you still have doubts that this is the best option, keep in mind that the regional preferences page can store a whole bunch of information, such as:
Preferred language (this is what you are asking for)
The country that user lives in
Preferred regional formats (things like date, time and numbers)
The time zone
Preferred writing system (aka script, e.g. Cyrillic vs. Latin)
Measurement system (metric vs. imperial)
Preferred encoding and the format of email messages (text vs. HTML)
Many of these things are simply undetectable, or hardly detectable, because web browsers do not send this information. And you have the flexibility of adding more preferences in response to users' requests (e.g. some users may prefer the 24-hour time format to the 12-hour format).
In that case, what you need to track is the user session. You can put these preferences in the session object, so that they are easily available to any locale-sensitive class.
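For instance, in PHP the whole bundle of preferences can live in the session so any locale-sensitive code can reach it. A minimal sketch; the keys and default values are illustrative, not prescribed by this answer:

<?php
// Minimal sketch: keep the regional preferences in the session.
// Keys and default values here are illustrative assumptions.
session_start();

// Populate once, e.g. after the user saves the preferences page.
if (!isset($_SESSION['prefs'])) {
    $_SESSION['prefs'] = array(
        'language'    => 'en',      // preferred UI language
        'country'     => 'US',      // country the user lives in
        'date_format' => 'Y-m-d',   // preferred regional format
        'timezone'    => 'UTC',     // IANA timezone name
        'measurement' => 'metric',  // metric vs. imperial
    );
}

// Any locale-sensitive code can then read them from the session.
$lang = $_SESSION['prefs']['language'];
?>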

B & C
First read the preferred language from the browser headers, then redirect to the appropriate URL.
This way the same URL will not serve different content, which would confuse users and search engines.

I would recommend the following:
All three options.
For new users, read the locale from the headers and write it into a cookie.
If the user changes their preferred language, you can simply change it in the cookie and redirect to the desired URL.
If you just use B & C, users won't be able to change their preferred language
(at least it wouldn't be stored anywhere, so they would have to change it on every visit).
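A minimal PHP sketch of that combined approach; the supported-language list and cookie name are illustrative assumptions, and Accept-Language q-values are ignored for brevity:

<?php
// Sketch: detect the language for new visitors from Accept-Language,
// persist the choice in a cookie, and redirect to a language-prefixed
// URL so each language gets its own address.
$supported = array('en', 'de', 'fr');   // illustrative
$default   = 'en';

function preferredLanguage(array $supported, $default) {
    $header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : '';
    // Header looks like "de-DE,de;q=0.9,en;q=0.8"; q-values ignored here.
    foreach (explode(',', $header) as $entry) {
        $code = strtolower(substr(trim($entry), 0, 2));
        if (in_array($code, $supported, true)) {
            return $code;
        }
    }
    return $default;
}

if (isset($_COOKIE['lang']) && in_array($_COOKIE['lang'], $supported, true)) {
    $lang = $_COOKIE['lang'];  // returning visitor: honor the stored choice
} else {
    $lang = preferredLanguage($supported, $default);
    setcookie('lang', $lang, time() + 365 * 24 * 3600, '/');
}

// Reflect the language in the URL (option c) and redirect.
header('Location: /' . $lang . '/');
exit;
?>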

Related

Edge browser cookie store location and access

How can I programmatically enumerate and delete Edge browser's cookies?
They don't appear to be among the IE cookies in temporary internet files, and therefore seem not to be returned by the FindFirstUrlCacheEntry/FindNextUrlCacheEntry API calls.
I can see cookie files in
C:\Users\...\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe\AC\#!001\MicrosoftEdge\Cookies
C:\Users\...\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe\AC\#!002\MicrosoftEdge\Cookies
C:\Users\...\AppData\Local\Packages\Microsoft.MicrosoftEdge_8wekyb3d8bbwe\AC\MicrosoftEdge\Cookies
What is the distinction between the three directories? How can they be accessed, and how can individual cookies be deleted programmatically?
This isn't going to be a perfect answer, but the perfect is the enemy of the good, etc.
It seems like Edge still uses at least the first two locations. I don't see any recent cookies in the last one; however, maybe that is just a coincidence.
I've tried running several windows and tabs to see if different folders get used by different content processes, but I've not been able to figure out much in that department, either.
What I can tell you is the format of these files: they are cookie collections separated by "*\n". Every cookie has a number of fields, which are "\n"-separated.
Edit (2015/12/16): Just stumbled across my own answer here again, and I need to note that some of the cookie field values can themselves end with "*", in which case searching for the "*\n" delimiter will make it look as if the cookie record finishes early. No, the values are not escaped (which would have made sense...). So your best bet is really to just count the number of lines, which is unfortunate. This was fixed in the first portion of this patch for Firefox, which is present in Firefox 44 and later.
The cookie fields are documented in Firefox's source code:
The cookie file format is newline-separated values, with a "*" used as the delimiter between multiple records.
Each cookie has the following fields:
name
value
host/path
flags
Expiration time most significant integer
Expiration time least significant integer
Creation time most significant integer
Creation time least significant integer
At least, this seems to have been the format in IE, and the format here seems to be so similar that I would be surprised if they were materially different.
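Purely as an illustration of the format described above, here is a sketch of a parser that counts lines per record (as the 2015/12/16 edit recommends) instead of splitting on "*\n". The field names come from the list above; the assumption of exactly eight fields per record is untested, and the choice of PHP is incidental:

<?php
// Sketch: parse a cookie file of the format described above.
// Eight "\n"-separated fields per cookie, records separated by a "*"
// line. Counting lines avoids the trap of values that end in "*".
$fields = array(
    'name', 'value', 'host_path', 'flags',
    'expiry_hi', 'expiry_lo', 'created_hi', 'created_lo',
);

function parseCookieFile($path, array $fields) {
    $lines     = file($path, FILE_IGNORE_NEW_LINES);
    $perRecord = count($fields) + 1;   // 8 fields + the "*" separator line
    $cookies   = array();
    for ($i = 0; $i + count($fields) <= count($lines); $i += $perRecord) {
        // Take exactly 8 lines as one record; the stride skips the "*".
        $record    = array_slice($lines, $i, count($fields));
        $cookies[] = array_combine($fields, $record);
    }
    return $cookies;
}
?>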
I just submitted a patch for using Firefox's existing IE cookie reading code for Edge's cookies, and that seemed to work. Here's the reviewboard review for it, and the revlink in hg.

How to read a file from the browser's cache?

I have made a GIF file with more (hidden) information in it than just the picture data.
Like so:
<?php
// set variables
$naam = "gebruikersinformatie"; // Dutch for "user information"
$info['age'] = 27;
$info['number'] = '1234.56.789';
$info['name'] = 'Arie Noniem';
$info['unique_hash'] = base64_encode(implode("|", array($_SERVER['HTTP_USER_AGENT'],$_SERVER['HTTP_ACCEPT'],$_SERVER['REMOTE_ADDR'],$_SERVER['REMOTE_PORT'],$_SERVER['HTTP_ACCEPT_LANGUAGE'])));
// build the information
$info = base64_encode(http_build_query($info));
// build the image: emit the 1x1 GIF first, then append the encoded
// payload after the image data (appending it inside base64_decode()
// would corrupt the payload)
header('Content-type: image/gif');
echo base64_decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7") . $info;
?>
Now, I would like to check whether the file is already in the user's cache.
If not: make a new one (with the code above).
If it is in the cache: read that gif.gif file with PHP, as there is more information stored in it.
Question: how do I check whether the file is in the cache? ("How to get the browser to cache images, with php?" doesn't work correctly.)
And how do I read the cached file, so that PHP gets the real contents from the cached file?
Reason: to avoid the EU cookie law.
I'm afraid you won't be avoiding the EU Cookie Law.
Although it has commonly become known as the cookie law, it is a privacy directive whose principles apply to any technology where information is placed on, held on, or read from the user's device. So the necessity for compliance includes, for instance, Flash files (Locally Stored Objects), tracking pixels (invisible one-pixel images typically used for tracking email opens), and other things too.
Just for reference, this answer is based on our experience putting together ukcookieslaw.co.uk to deal specifically with the UK implementation of the EU Directive (noticing the Dutch in your code :-).
Assuming that, at its least privacy-invasive, your solution was doing the same as a session cookie and providing a necessary function (like maintaining a log-in), one could argue your solution is actually less compliant, as a session cookie will (usually) be destroyed at the latest when the user quits the browser.
Your more obscured, difficult-to-inspect, deliberately hidden (I appreciate there's no malicious intent) payload can hang around for longer, and given that most people do not empty their cache each time they quit, it will. In fact, in a way you're relying on that.
Without the details one can't take a view, but it may be that the information is more available to third parties; i.e., is there a possibility of the image being cached by intermediaries in the network that you would have to protect against?
You would still have to describe your use of personal data, and rely on either implied or explicit consent for placing data on the user's device for your site's compliance. The problem is that any consent must be INFORMED consent, and it would appear on the face of it that informing the user is the furthest thing from your mind.
I think you need a better reason for your engineering effort :-)
kind regards,
Philip

AJAX every form element?

Is it better practice to AJAX every form element separately (e.g. send a request onChange, etc.) or to collect all the data, then submit it with one click of Save?
Essentially, auto-save or user-initiated-save?
I would generally say that a user-initiated save is the way to go for most web applications. If for nothing else, this is how users are used to interacting with web apps; familiarity and ease of use are extremely important in web applications. Not to mention it can cut down on unnecessary traffic.
This is not to say that auto-saving does not have its place, but often it can cause unnecessary traffic. For example, if I am auto-saving a contact form, fill out my name, then my email, then go back to the name to change it, that is already three requests sent with no benefit: extra work for no added advantage.
Once again, I think it has a lot to do with your application and where you are planning on using it. Inline edits often use auto-saving, and there I think it is useful, whereas for a contact form or signup form it would not be a good idea.
I'd say that depends on the nature of your application and whether "auto-save" is a behaviour desired by your users.
"User initiated save" is what a user would expect from their experience with web forms nowadays - I would not deviate from that unless there's a good reason.
It depends on the following factors:
What kind of data are you trying to save? E.g., is it okay to save the data partly, or do you need to save it all at once?
How much data do you want to save? If you have many fields, you might want to send the data in chunks (in the case of wizards) or save everything at once.
It's also a good idea to have data saved in the background for large forms, in a temporary way, if the user may take a long time to fill in the data (e.g. emails saved as drafts); see the sketch after this list.
It also depends on your web app and the way you have designed your forms. In some forms you may allow certain fields to be modified and saved in place, so that you can fetch additional data, for example.
In most cases it would be good to have an explicit "Save" action for your data forms.
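As a sketch of the background draft idea mentioned above: the endpoint below stashes whatever the form posts into a per-session draft. The storage choice (the session) and the blanket merge of $_POST are illustrative assumptions, not a recommendation:

<?php
// Sketch: background draft-save endpoint. The client posts the current
// form state periodically; we merge it into a per-session draft rather
// than into the real record.
session_start();

$_SESSION['draft'] = array_merge(
    isset($_SESSION['draft']) ? $_SESSION['draft'] : array(),
    $_POST
);
$_SESSION['draft_saved_at'] = time();

header('Content-Type: application/json');
echo json_encode(array('status' => 'draft saved'));
?>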

Date/Time and Internationalization for Enterprise Application -- Development Guidelines

Together with another developer, I have embarked on a journey to create a hosted 'CRM Style' application that will cater to enterprise level businesses. These businesses will be accessing our application remotely and so the hosted nature of the application will require certain features. For example, to guarantee a level of professional service the following things must be true:
internationalization requires multiple languages and presentation of date/time for various timezones and locales
transactional capability for batch processing of tasks and rollback capabilities
security concerns for keeping data safe and remote invocations secure from attack
etcetera, the list goes on and on
Due to these concerns, and my role as the developer most responsible for the server-side development, I am very interested in the choices I make early on. Regarding timezones and languages, for example, are there issues related to my choice of database or data fields? Do I choose a UTC timestamp or date field throughout the application, and if so, is there a standard format for that? Also, regarding different languages, am I supposed to ensure the data is stored in the database as UTF-8 or Unicode?
I really want to avoid laying down the infrastructure of the system only to discover later that a fundamental decision was incorrect, or not big enough, wide enough, smart enough, etc. Can someone point me in the right direction regarding these basic 'early' decisions?
EDIT: OK, I appreciate the broad responses, and now I see my question was a little too non-specific. I'd like to focus on the more specific elements that WERE present in the question, such as how to choose the proper format for storing a UTC date/time, or how to save my text data (do I specify a UTF format?).
If you are targeting enterprise CRM, then you will need a very high level of customizability and integrations with all kinds of systems. You will make mistakes in the design. Your only hope is to isolate each little piece of the code so that you can have a chance of fixing it later.
In short, basic software engineering principles are your best bet.
What you are discussing is called a multi-tenant application, wherein you have the same code base used by multiple customers (tenants) with logical or physical separation of data. Remember the fundamental rule of development: flexibility is proportional to complexity. The more flexible you make the system, the more complicated it will be.
RE: UTC
For a CRM application that stores things like when calls were made and when meetings took place, I would definitely store all of those in UTC and let the user set their local timezone. However, you might run into dates which are timezone-agnostic, and for those I would store whatever date was entered.
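A short PHP sketch of that convention; the timezone and formats are illustrative:

<?php
// Sketch: store event times in UTC, convert to the user's timezone
// only for display.
$callTime = new DateTime('2024-03-01 14:30:00', new DateTimeZone('UTC'));

// What goes into a UTC timestamp column:
$stored = $callTime->format('Y-m-d H:i:s');

// On display, convert to the timezone from the user's profile.
$display = clone $callTime;
$display->setTimezone(new DateTimeZone('Asia/Hong_Kong'));
echo $display->format('Y-m-d H:i');

// A timezone-agnostic date (e.g. a birthday) is stored as entered,
// with no conversion:
$birthday = '1980-06-15';
?>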
RE: Unicode
Yes, I would use Unicode for all user-entered data. However, that will not get you localization. If, for a single company, you have a user in Hong Kong entering text in Chinese and a user in Amsterdam entering text in Dutch, you are not going to get automatic translation. Things like dates and number formats can be localized, but raw text like names used in drop-down lists can be a chore to localize.
As you have not mentioned what you think about the issue, you may find my answer, or parts of it, rather basic.
If you don't need to, don't use a low-level language. I'd usually use Python for the first version of a CRM application (with the hope that it would be good enough for the next versions), but this decision also depends on the domain community.
Try to write minimal code on your own, relying instead on third-party libraries. People may disagree on this, but I would write the code myself only as a last resort. The next point is important, though.
When selecting a library or framework, make sure the party behind it is going to last, the library is stable, and the software license suits your needs.
Other general rules apply: focus on the customer, use continuous integration and testing, and use good software practices like logging.
Nothing is ever stored as "Unicode", because this is an abstract concept. Unicode is always stored in some kind of Unicode transformation format (UTF) (well, or UCS, but I have never seen that used anywhere). The most commonly used UTF is UTF-8, but I suggest using what is native/default to your platform (see Wikipedia).
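For example, with MySQL accessed via PDO you can make UTF-8 explicit at every layer; the DSN, credentials, and table below are placeholders:

<?php
// Sketch: make UTF-8 explicit end to end (placeholder DSN/credentials).
$pdo = new PDO(
    'mysql:host=localhost;dbname=crm;charset=utf8mb4', // utf8mb4 covers all of Unicode
    'user',
    'password'
);

// Declare the encoding of the pages you serve, too.
header('Content-Type: text/html; charset=utf-8');

// User-entered text in any script now round-trips safely.
$stmt = $pdo->prepare('INSERT INTO contacts (name) VALUES (?)');
$stmt->execute(array('张伟'));
?>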

Techniques to reduce data harvesting from AJAX/JSON services

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON-type services on the server (intended to supply AJAX functions) being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.
First, I would like to be clear on this:
"It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source."
It will be pretty obvious that the information is being sent encrypted to the Flash client, and it won't be that hard for an attacker to find out from your compiled Flash program what is being used for this, replicate it, and get all that data.
If the data does happen to have the value you are thinking of, you can count on the above.
If this is public information, embrace that and don't combat it; instead, find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage, as others have said, and have measures that act on it.
The first thing to do to prevent bots from stealing your data is not technological, it's legal. First, make sure you have the right language in your site's Terms of Use so that what you're trying to prevent is actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can get access to your sensitive Ajax calls. This allows you to simply monitor per-user usage of your Ajax calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period. (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot (see the "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA).
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users), but sophisticated attackers can get around this by distributing the load. There is also likely sophisticated software which watches things like request timing, request patterns, etc. and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk), find the top N IP addresses hitting your site, and then do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a competitor's domain name in the list, you can block their domain or follow up with your lawyers.
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, map manufacturers catch plagiarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who is re-using your data. This can be done by embedding unique text strings in your text output and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, track who is downloading it, and look for patterns you can use to bust the freeloaders.
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your AJAX data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addresses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these); otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo #kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data that is particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourses will be limited if someone steals your data. But all of those together is admittedly an unusual case.
P.S. - another way to think about this problem, which may or may not apply in your case: sometimes it's easier to change how your data works in a way that obviates securing it. For example, can you tie your data in some way to a service on your site, so that the data isn't very useful unless it's being used in conjunction with your code? Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know whether any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.
If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
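A minimal sketch of that idea using the PHP Memcached extension; the threshold and server address are illustrative:

<?php
// Sketch: one counter per IP with a one-hour expiry.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

$key   = 'hits:' . $_SERVER['REMOTE_ADDR'];
$limit = 1000;   // illustrative requests-per-hour threshold

// add() only succeeds if the key doesn't exist yet, which starts the
// one-hour window; increment() bumps the counter on every hit.
$memcached->add($key, 0, 3600);
$hits = $memcached->increment($key);

if ($hits !== false && $hits > $limit) {
    header('HTTP/1.1 429 Too Many Requests');
    exit;   // "fry the connection"
}
?>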
This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create a JavaScript function that provides encryption/decryption functions. The JavaScript would need to be built dynamically, and it would include an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone from listening to your web traffic and pulling information out of your database.
I'm with kdgregory though; it sounds like your data is too open.
Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad Behavior is a nice tool to help. If you don't use PHP, it can still give some ideas on how to filter (see its How it works page).
Incredibill's blog gives nice tips, lists of user agents / IP ranges to block, etc.
Here are a variety of suggestions:
Issue tokens required for redemption along with each AJAX request, and expire the tokens (see the sketch after this list).
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage, such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct them.
Check user agents. Many bots don't completely replicate the user-agent info of a browser, and you can eliminate programmatic scraping of your data using this method.
Change the front-end component of your website to redirect to a CAPTCHA (or some other human-verifying mechanism) once a request threshold is exceeded.
Modify your logic so the response data is returned in a few different ways, to complicate the code required to parse it.
Obfuscate your client-side JavaScript.
Block the IPs of offending clients.
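As a sketch of the token suggestion (the first item in the list), here is one session-backed way it might look in PHP; the field names and five-minute lifetime are illustrative assumptions:

<?php
// Sketch: single-use, expiring AJAX tokens kept in the session.
session_start();

// When rendering the page, mint a token and embed it in the markup:
$_SESSION['ajax_token']         = bin2hex(random_bytes(16));
$_SESSION['ajax_token_expires'] = time() + 300;   // five minutes

// In the AJAX endpoint, redeem it:
$sent  = isset($_POST['token']) ? $_POST['token'] : '';
$valid = isset($_SESSION['ajax_token'])
    && hash_equals($_SESSION['ajax_token'], $sent)
    && time() < $_SESSION['ajax_token_expires'];

if (!$valid) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// Rotate so the token cannot be redeemed twice.
$_SESSION['ajax_token']         = bin2hex(random_bytes(16));
$_SESSION['ajax_token_expires'] = time() + 300;
?>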
Bots usually don't parse JavaScript, so your AJAX code won't be instantly executed. And even if they do, bots usually don't maintain sessions/cookies either. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request on the parent page).
This does not protect you from the human hazard, though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintain blacklists with IP addresses and user agents, but that is going to extremes.
