I was wondering how the windows host-name resolution system works.
More precisely I wonder about the use, or lack thereof, of local caching in the process.
According to Microsoft TCP/IP Host Name Resolution Order, the process is as follows:
1. The client checks to see if the name queried is its own.
2. The client then searches a local Hosts file, a list of IP addresses and names stored on the local computer.
3. Domain Name System (DNS) servers are queried.
4. If the name is still not resolved, the NetBIOS name resolution sequence is used as a backup. This order can be changed by configuring the NetBIOS node type of the client.
What I was wondering is whether stage (2), the Hosts file lookup, is cached in some way.
The sudden interest arose these last few days, as I installed a malware-protection tool (SpyBot) that utilizes the HOSTS file. In fact, the file is now 14K entries big, and counting...
The file is currently sorted by host name, but of course it doesn't have to be.
lg(14K) ≈ 14, meaning about 14 steps through the file for each resolution request. These requests probably arrive at a rate of a few per second, and usually for the same few hundred hosts (tops).
My view of how this should work is like this:
On system startup, the Windows DNS-resolution mechanism loads the HOSTS file a single time.
It performs a single pass over it that sorts the entries; a working copy is kept in memory.
The original HOSTS file will not be read again for the lifetime of the resolution process.
All network processes (IE, Firefox, MSN...) work via this process/mechanism.
No other process directly reads the HOSTS file.
Upon receiving a name resolution request, the process checks its memory-resident cache.
If it finds the proper IP, it answers appropriately.
Otherwise (it's not cached), the resolution process falls back to the memory-resident (sorted) HOSTS file and does a quick binary search over it (see the sketch below). From here on, the process continues as originally described.
The result of the resolution is cached for further use.
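For illustration only - this is not how Windows actually implements it (see the answer about dnsrslvr below) - here is a rough sketch in C of the lookup step I have in mind. The table layout and function names are hypothetical:

/* Hypothetical sketch of the lookup the question describes: a HOSTS table
 * sorted once at startup, then searched with binary search on each request. */
#include <stdlib.h>
#include <string.h>

struct host_entry {
    char name[256];   /* host name: the sort/search key */
    char ip[46];      /* IPv4 or IPv6 address in text form */
};

static int compare_by_name(const void *a, const void *b)
{
    return strcmp(((const struct host_entry *)a)->name,
                  ((const struct host_entry *)b)->name);
}

/* Done once at startup: sort the in-memory copy of the HOSTS entries. */
static void sort_hosts(struct host_entry *table, size_t count)
{
    qsort(table, count, sizeof *table, compare_by_name);
}

/* Done per request: O(log n) lookup, roughly 14 comparisons for 14K entries. */
static const char *lookup(const struct host_entry *table, size_t count,
                          const char *name)
{
    struct host_entry key;
    strncpy(key.name, name, sizeof key.name - 1);
    key.name[sizeof key.name - 1] = '\0';
    const struct host_entry *hit =
        bsearch(&key, table, count, sizeof *table, compare_by_name);
    return hit ? hit->ip : NULL;
}

Either way, ~14 string comparisons per miss is negligible next to even a fast DNS round trip.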
Though I am not sure how significant these details are, I would really appreciate an answer.
I just want to see if my reasoning is right, and if not, why not?
I am aware that in this age of always-on PCs the cache must be periodically (or incrementally) purged. I'm ignoring this for now.
In the DNS Client service (dnsrslvr) you can see a function called LoadHostFileIntoCache. It goes something like this:
file = HostsFile_Open(...);
if (file)
{
    // Read the hosts file line by line and record each entry in the cache
    while (HostsFile_ReadLine(...))
    {
        Cache_RecordList(...);
        ...
    }
    HostsFile_Close(...);
}
So how does the service know when the hosts file has been changed? At startup a thread is created which executes NotifyThread, and it calls CreateHostsFileChangeHandle, which calls FindFirstChangeNotificationW to start monitoring the drivers\etc directory. When there's a change the thread clears the cache using Cache_Flush.
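For anyone curious what that monitoring looks like, here is a minimal standalone sketch in C of the same Win32 pattern: watch the drivers\etc directory and react (flush/reload a cache) when anything in it changes. This is not the service's actual code, and the hardcoded path would normally be derived from %SystemRoot%:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Watch the directory that contains the hosts file. */
    HANDLE change = FindFirstChangeNotificationW(
        L"C:\\Windows\\System32\\drivers\\etc",
        FALSE,                            /* do not watch subdirectories   */
        FILE_NOTIFY_CHANGE_LAST_WRITE |   /* file contents modified        */
        FILE_NOTIFY_CHANGE_FILE_NAME);    /* file created/renamed/deleted  */

    if (change == INVALID_HANDLE_VALUE)
        return 1;

    for (;;)
    {
        if (WaitForSingleObject(change, INFINITE) != WAIT_OBJECT_0)
            break;

        /* Something in drivers\etc changed; this is where dnsrslvr
         * would flush its cache and reload the hosts entries. */
        printf("drivers\\etc changed; flush and reload here\n");

        if (!FindNextChangeNotification(change))
            break;
    }

    FindCloseChangeNotification(change);
    return 0;
}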
Your method does not work when the IP address of a known hostname is changed in the hosts file without adding or changing the name itself.
Technet says that the file will be loaded into the DNS client resolver cache.
IMO this is mostly irrelevant: a lookup in a local file (once it's in the disk cache) will still be several orders of magnitude faster than asking the DNS servers of your ISP.
I don't think that each process maintains its own cache. If there is a cache, it probably exists in the TCP/IP stack or kernel somewhere, and even then, only for a very short while.
I've had situations where I'll be tinkering around with my hosts file and then using the addresses in a web browser and it will update the resolved names without me having to restart the browser.
Related
I have Nginx running in a Docker container, and it serves some static files. The files will never change at runtime - if they actually do change, the container will be stopped, the image will be rebuilt, and a new container will be started.
So, to improve performance, it would be perfect if Nginx would read the static files only one single time from disk and then serve them from memory forever. I have found some configuration options for caching, but at least from what I have seen, none of them provides this "forever" behavior that I'm looking for.
Is this possible at all? If so, how do I need to configure Nginx to achieve this?
Nginx as an HTTP server cannot do memory-caching of static files or pages.
Nginx is a capable and mature HTTP and proxy server, but there seems to be some confusion about its capabilities with respect to caching. The Nginx server cannot memory-cache files when running as a pure web server. And… wait, what!? Let me rephrase: the Nginx HTTP server cannot memory-cache files or pages.
Possible Workaround
The Nginx community's answer is: no problem, let the OS do memory caching for you! The OS is written by smart people (true) and knows the what, when, where, and how of caching (a mere opinion). So, they say, just cat your static files to /dev/null periodically and trust it to cache your stuff for you! For those who are wondering and pondering what this cat-to-/dev/null trick has to do with caching: read on to find out more (hint: don't do it!).
How does it work?
It turns out that Linux is a fine-tuned beast that's hawk-eyed about what goes in and out of its cache thingy. That cache thingy is called the Page Cache: the memory store where frequently accessed files are partially or entirely kept so they're quickly accessible. The kernel is responsible for keeping track of which files are cached in memory, when they need to be updated, and when they need to be evicted. The more free RAM that's available, the larger the page cache and the "better" the caching.
The operating system does in-memory caching by default; it's called the page cache. In addition, you can enable sendfile in Nginx to avoid copying data between kernel space and user space.
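To make that last point concrete, here is a minimal sketch (assuming Linux) of the sendfile(2) system call that Nginx's sendfile directive uses under the hood: the file's pages go from the page cache straight to the socket inside the kernel, with no copy through user-space buffers. The function name and the already-connected socket are placeholders:

/* Serve one file over an already-connected socket using sendfile(2),
 * so data moves from the page cache to the socket inside the kernel.
 * Error handling is kept minimal for brevity. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_static_file(int client_socket, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel copies directly from the page cache to the socket. */
        ssize_t sent = sendfile(client_socket, fd, &offset,
                                st.st_size - offset);
        if (sent <= 0)
            break;
    }

    close(fd);
    return (offset == st.st_size) ? 0 : -1;
}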
Currently our users have to enter http://biglongservername:9000/sonar in order to access our site. Can it be configured to correspond to http://sonar? Our DNS guys say they can't do any more than change the CNAME so that pinging "sonar" takes you to biglongservername.domainname.org, which doesn't help our users much, but might be a start. Is this possible?
There are 3 parts to this:
DNS configuration to alias http://biglongservername to http://sonar. Your DNS guys have already said they can make this happen for you, so take them up on it. The address becomes http://sonar:9000/sonar
Dropping the :9000 from the address. This is a matter of SonarQube configuration. In $SONARQUBE_HOME/conf/sonar.properties, set sonar.web.port to 80, the default port. Restart. The address becomes http://sonar/sonar
Dropping the "/sonar" from the end of the address. This is again a matter of configuration. In $SONARQUBE_HOME/conf/sonar.properties (yes, the same file) comment out sonar.web.context. Restart. The address becomes http://sonar.
Note that I would test each of these steps before moving on to the next one. And while step #1 can happen transparently to your users, they will certainly notice steps #2 and #3. You may want to set up a brief outage window.
I have a Rails app that reads from a .yml file each time that it performs a search. (This is a full-text search app.) The .yml file tells the app which URL it should be making search requests to, because different versions of the search index reside on different servers, and I occasionally switch between indexes.
I have an admin section of the app that allows me to rewrite the aforementioned .yml file so that I can add new search urls or remove unneeded ones. While I could manually edit the file on the server, I would prefer to be able to also edit it in my site admin section so that when I don't have access to the server, I can still make any necessary changes.
What is the best practice for making edits to a file that is actually used by my app? (I guess this could also apply to, say, an app that had the ability to rewrite one of its own helper files, post-deployment.)
Is it a problem that I could be in the process of rewriting this file while another user connecting to my site wants to perform a search? Could I make their search fail if I'm in the middle of a write operation? Should I initially write my new .yml file to a temp file and only later replace the original .yml file? I know that a write operation is pretty fast, but I just wanted to see what others thought.
UPDATE: Thanks for the replies everyone! Although I see that I'd be better off using some sort of caching rather than reading the file on each request, it helped to find out what the best way to actually do the file rewrite is, given that I'm specifically looking to re-read it each time in this specific case.
If you must use a file for this then the safe process looks like this:
Write the new content to a temporary file of some sort.
Use File.rename to atomically replace the old file with the new one.
If you don't use separate files, you can easily end up with a half-written broken file when the inevitable problems occur. The File.rename class method is just a wrapper for the rename(2) system call and that's guaranteed to be atomic (i.e. it either fully succeeds or fully fails, it won't leave you in an inconsistent in-between state).
If you want to replace /some/path/f.yml then you'd do something like this:
begin
  # Write your new content to the temporary file first, e.g.:
  File.write('/some/path/f.yml.tmp', new_yaml_content)
  # Then atomically swap it into place:
  File.rename('/some/path/f.yml.tmp', '/some/path/f.yml')
rescue SystemCallError => e
  # Log an error, complain loudly, fall over and cry, ...
end
As others have said, a file really isn't the best way to deal with this, and if you have multiple servers, using a file will fail when the servers get out of sync. You'd be better off using a database that several servers can access; then you could:
Cache the value in each web server process.
Blindly refresh it every 10 minutes (or whatever works).
Refresh the cached value if connecting to the remote server fails (with extra error checking to avoid refresh/connect/fail loops).
Firstly, let me say that reading that file on every request is a performance killer. Don't do it! If you really, really need to keep that data in a .yml file, then you need to cache it and reload it only after it changes (based on the file's timestamp).
But don't check the timestamp on every request - that's almost as bad. Check it on a request if it's been n minutes since the last check. Probably in a before_filter somewhere. And if you're running in threaded mode (most people aren't), be careful that you're using a Mutex or something.
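Language aside, the pattern being described looks roughly like this (a sketch in C rather than Ruby, with a hypothetical load_config(); in Rails the equivalent would live in a before_filter and use a Ruby Mutex):

/* Keep the parsed config in memory; at most once every RECHECK_INTERVAL
 * seconds, re-check the file's mtime under a mutex and reload only if it
 * actually changed. load_config() is a hypothetical placeholder. */
#include <pthread.h>
#include <sys/stat.h>
#include <time.h>

#define RECHECK_INTERVAL 600        /* seconds between timestamp checks */

extern void *load_config(const char *path);   /* hypothetical loader */

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static time_t last_check;           /* when we last looked at the mtime */
static time_t last_mtime;           /* mtime of the file we last loaded */
static void *cached_config;         /* whatever load_config() returns   */

void *get_config(const char *path)
{
    pthread_mutex_lock(&cfg_lock);

    time_t now = time(NULL);
    if (cached_config == NULL || now - last_check >= RECHECK_INTERVAL) {
        struct stat st;
        last_check = now;
        if (stat(path, &st) == 0 &&
            (cached_config == NULL || st.st_mtime != last_mtime)) {
            cached_config = load_config(path);   /* reload only on change */
            last_mtime = st.st_mtime;
        }
    }

    void *cfg = cached_config;
    pthread_mutex_unlock(&cfg_lock);
    return cfg;
}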
If you really want to do this via overwriting files, use the filesystem's locking features to block other threads from accessing your configuration file while it's being written. Maybe check out something like this.
I'd strongly recommend not using files for configuration that needs to be changed without re-deploying the app though. First, you're now requiring that a file be read every time someone does a search. Second, for security reasons it's generally a bad idea to allow your web application write access to its own code. I would store these search index URLs in the database or a memcached key.
edit: As #bioneuralnet points out, it's important to decide whether you need real-time configuration updates or just eventual syncing.
I plan to insert some initialization code into the OnStart() method of my class derived from RoleEntryPoint. This code will make some permanent changes to the host machine, so if it is run a second time on the same machine it will have to detect that those changes are already there and react appropriately, which will require some extra code on my part.
Is it possible OnStart() is run for the second time before the host machine is cleared? Do I need this code to be able to run for the second time on the same machine?
"Is it possible OnStart() is run for the second time before the host machine is cleared?"
Not sure how to interpret that.
As far as permanent changes go: any installed software, registry changes, and other modifications should be repeated with every boot. If you're writing files to local (non-durable) storage, you have a good chance of seeing those files next time you boot, but there's no guarantee. If you are storing something in Windows Azure Storage (blobs, tables, queues) or SQL Azure, then your storage changes will persist through a reboot.
Even if you were guaranteed that local changes would persist through a reboot, these changes wouldn't be seen on additional instances if you scaled out to more VMs.
I think the official answer is that the role instance will not run its job more than once in each boot cycle.
However, I've seen a few MSDN articles that recommend you make startup tasks idempotent - e.g. http://msdn.microsoft.com/en-us/library/hh127476.aspx - so probably best to add some simple checks to your code that would anticipate multiple executions.
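As a rough illustration of what such a check can look like (sketched in C rather than .NET; the marker path and apply_permanent_change() are hypothetical), the idea is simply to record that the permanent change has been made and skip it on later runs:

#include <stdio.h>

static int change_already_applied(const char *marker_path)
{
    FILE *f = fopen(marker_path, "r");
    if (f) {
        fclose(f);
        return 1;    /* marker exists: a previous run already did the work */
    }
    return 0;
}

int idempotent_startup(void)
{
    /* Hypothetical marker location; anything that survives alongside the change. */
    const char *marker = "C:\\Resources\\startup_done.marker";

    if (!change_already_applied(marker)) {
        /* First run on this VM (or the marker was lost with a re-image):
         * apply_permanent_change(); */

        FILE *f = fopen(marker, "w");
        if (f)
            fclose(f);   /* record that the change has been made */
    }

    /* Either way, continue with normal startup. */
    return 0;
}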
I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk of each blade. Each server is running Windows Server 2003 and the files are mounted on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to only ship new content. In our own system, logs are shipped via a service that replicates the logs as they are written. That required a small service to handle the log files as they are written, but it reduced the latency in capturing logs and cut bandwidth use immensely.
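A sketch of the "keep track of the last byte sent" approach could look like this (in C, with a hypothetical ship_bytes() standing in for whatever transfer mechanism is used; persisting the returned offset between runs is left out):

/* Remember how far into each log we have already shipped, seek past that
 * point, and ship only what is new since the previous run. */
#include <stdio.h>

extern int ship_bytes(const char *buf, size_t len);   /* hypothetical sender */

/* Ship everything after last_offset; return the new offset to persist. */
long ship_new_content(const char *log_path, long last_offset)
{
    FILE *log = fopen(log_path, "rb");
    if (!log)
        return last_offset;

    /* Skip the part that was already shipped previously. */
    if (fseek(log, last_offset, SEEK_SET) != 0) {
        fclose(log);
        return last_offset;
    }

    char buf[64 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, log)) > 0) {
        if (ship_bytes(buf, n) != 0)
            break;                       /* stop on transfer failure */
        last_offset += (long)n;          /* advance the high-water mark */
    }

    fclose(log);
    return last_offset;                  /* caller persists this for next time */
}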
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB; the Windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight?
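As a rough sketch of that staggering idea (the constant and example address are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minutes after midnight at which this server should start its upload. */
int stagger_minutes(const char *ip_address, int minutes_per_octet)
{
    const char *last_dot = strrchr(ip_address, '.');
    if (!last_dot)
        return 0;                           /* not a dotted-quad address */

    int last_octet = atoi(last_dot + 1);    /* 0..255 */
    return (last_octet * minutes_per_octet) % (24 * 60);   /* wrap at 24h */
}

int main(void)
{
    /* e.g. 10.1.7.23 with 5 minutes per octet -> 23 * 5 = 115 -> 01:55 */
    int m = stagger_minutes("10.1.7.23", 5);
    printf("start upload at %02d:%02d\n", m / 60, m % 60);
    return 0;
}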
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read/write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server and does the following:
Pulls compressed files when the network/CPU is not too busy.
(This can be multi-threaded.)
Uses the information passed to it from the end computers to determine which log to get next.
Uncompresses the files and loads the data into your database continuously.
This should give you a solution which provides up-to-date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can then process. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.