Monday, January 24, 2011

The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads


As you write out data ultimately intended for disk, Linux caches this information in an area of memory called the page cache. You can find out basic info about the page cache using tools like free, vmstat or top. See

http://gentoo-wiki.com/FAQ_Linux_Memory_Management

to learn how to interpret top's memory information, or atop to get an improved version.

Full information about the page cache only shows up by looking at /proc/meminfo. Here is a
sample from a system with 4GB of RAM:

MemTotal: 3950112 kB
MemFree: 622560 kB
Buffers: 78048 kB
Cached: 2901484 kB
SwapCached: 0 kB
Active: 3108012 kB
Inactive: 55296 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 3950112 kB
LowFree: 622560 kB
SwapTotal: 4198272 kB
SwapFree: 4198244 kB
Dirty: 416 kB
Writeback: 0 kB
Mapped: 999852 kB
Slab: 57104 kB
Committed_AS: 3340368 kB
PageTables: 6672 kB
VmallocTotal: 536870911 kB
VmallocUsed: 35300 kB
VmallocChunk: 536835611 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB


The size of the page cache itself is the "Cached" figure; in this example it's 2.9GB. As pages are written, the size of the "Dirty" section will increase. Once writes to disk have begun, you'll see the "Writeback" figure go up until the write is finished. It can be very hard to catch the Writeback value going high, as it is very transient and only increases during the brief period when I/O is queued but not yet written.
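To watch these figures over time, the file can be parsed with a few lines of script. The following is a minimal Python sketch, not part of any standard tool; the function names parse_meminfo and page_cache_summary are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Values carrying a "kB" suffix are returned in kB; bare counters such as
    HugePages_Total are returned as plain integers.
    """
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def page_cache_summary(info):
    """Return the three page cache figures discussed above, in kB."""
    return {name: info.get(name, 0) for name in ("Cached", "Dirty", "Writeback")}
```

Feeding it the contents of /proc/meminfo once a second while a large write is running is one way to try to catch the transient Writeback value.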

Linux usually writes data out of the page cache using a process called pdflush. At any moment, between 2 and 8 pdflush threads are running on the system. You can monitor how many are active by looking at /proc/sys/vm/nr_pdflush_threads. Whenever all existing pdflush threads are busy for at least one second, an additional pdflush daemon is spawned. The new ones try to write back data to device queues that are not congested, aiming to have each device that's active get its own thread flushing data to that device. Each time a second has passed without any pdflush activity, one of the threads is removed. There are tunables for adjusting the minimum and maximum number of pdflush processes, but it's very rare they need to be adjusted.

pdflush tunables

Exactly what each pdflush thread does is controlled by a series of parameters in
/proc/sys/vm:

/proc/sys/vm/dirty_writeback_centisecs (default 500): In hundredths of a second, this is
how often pdflush wakes up to write data to disk. The default wakes up the two (or more) active threads every five seconds.

There can be undocumented behavior that thwarts attempts to decrease dirty_writeback_centisecs in an attempt to make pdflush more aggressive. For example, in early 2.6 kernels, the Linux mm/page-writeback.c code includes logic that's described as "if a writeback event takes longer than a dirty_writeback_centisecs interval, then leave a one-second gap". In general, this "congestion" logic in the kernel is documented only by the kernel source itself, and how it operates can vary considerably depending on which kernel you are running. Because of all this, it's unlikely you'll gain much benefit from lowering the writeback time; the thread spawning code assures that they will automatically run themselves as often as is practical to try and meet the other requirements.

The first thing pdflush works on is writing pages that have been dirty for longer than it deems acceptable. This is controlled by:

/proc/sys/vm/dirty_expire_centisecs (default 3000): In hundredths of a second, how long data can be in the page cache before it's considered expired and must be written at the
next opportunity. Note that this default is very long: a full 30 seconds. That means that under normal circumstances, unless you write enough to trigger the other pdflush method, Linux won't actually commit anything you write until 30 seconds later.

The second thing pdflush will work on is writing pages if memory is low. This is
controlled by:

/proc/sys/vm/dirty_background_ratio (default 10): Maximum percentage of active memory that can be filled with dirty pages before pdflush begins to write them.

Note that some kernel versions may internally put a lower bound on this value at 5%.

Most of the documentation you'll find about this parameter suggests it's in terms of total memory, but a look at the source code shows this isn't true. In terms of the meminfo output, the code actually looks at


MemFree + Cached - Mapped

So on the system above, where this figure gives 2.5GB, with the default of 10% the system actually begins writing when the total for Dirty pages is slightly less than 250MB--not the 400MB you'd expect based on the total memory figure.
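The arithmetic for the sample system can be checked with a short Python sketch. It assumes the formula is exactly as quoted above; the function name is hypothetical, and any further adjustments a particular kernel version might apply are not modeled here.

```python
def dirty_background_threshold_kb(mem_free_kb, cached_kb, mapped_kb, ratio_percent=10):
    """Dirty-page level (in kB) at which background writeback would begin.

    Uses the formula quoted above: (MemFree + Cached - Mapped) * ratio / 100.
    """
    base_kb = mem_free_kb + cached_kb - mapped_kb
    return base_kb * ratio_percent // 100
```

With the sample figures (MemFree 622560, Cached 2901484, Mapped 999852) this gives 252419 kB, about 246 MiB, while 10% of MemTotal would be 395011 kB.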

Summary: when does pdflush write?

In the default configuration, then, data written to disk will sit in memory until either a) it's more than 30 seconds old, or b) the dirty pages have consumed more than 10% of the active, working memory. If you are writing heavily, once you reach the dirty_background_ratio driven figure worth of dirty memory, you may find that all your writes are driven by that limit. It's fairly easy to get in a situation where pages are always being written out by that mechanism well before they are considered expired by the dirty_expire_centisecs mechanism.

Other than laptop_mode, which changes several parameters to optimize for keeping the hard
drive spinning as infrequently as possible (see http://www.samwel.tk/laptop_mode/
for more information) those are all the important kernel tunables that control the pdflush threads.

Process page writes


There is another parameter involved though that can spill over into management of user
processes:


/proc/sys/vm/dirty_ratio (default 40): Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes.

Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers. This can cause what is perceived as an unfair behavior where one "write-hog" process can block all I/O on the system. The classic way to trigger this behavior is to execute a script that does "dd if=/dev/zero of=hog" and watch what happens. See Kernel Korner: I/O Schedulers for examples showing this behavior.
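Putting the three limits described so far side by side, here is a deliberately simplified Python model of which mechanism is in play for a given amount of dirty data. It is an illustration only: the function name, the labels, and the idea of passing precomputed limits in kB are assumptions, and the real kernel logic (congestion handling in particular) is considerably more involved.

```python
def writeback_actions(dirty_kb, oldest_dirty_age_cs, background_limit_kb,
                      throttle_limit_kb, expire_cs=3000):
    """Return the set of mechanisms active in this simplified model.

    dirty_kb            -- current "Dirty" figure from /proc/meminfo
    oldest_dirty_age_cs -- age of the oldest dirty page, in hundredths of a second
    background_limit_kb -- dirty level derived from dirty_background_ratio
    throttle_limit_kb   -- dirty level derived from dirty_ratio
    expire_cs           -- dirty_expire_centisecs
    """
    actions = set()
    if oldest_dirty_age_cs > expire_cs:
        actions.add("expire")      # pdflush writes pages that have been dirty too long
    if dirty_kb > background_limit_kb:
        actions.add("background")  # pdflush writes because too much memory is dirty
    if dirty_kb > throttle_limit_kb:
        actions.add("throttle")    # writing processes are forced to flush themselves
    return actions
```

On the sample system, a write-hog that pushes Dirty past both limits would show up as both "background" and "throttle", which is the state in which every writer on the system stalls.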

Tuning Recommendations for write-heavy operation

The usual issue that people who are writing heavily encounter is that Linux buffers too much information at once, in its attempt to improve efficiency. This is particularly troublesome for operations that require synchronizing the filesystem using system calls like fsync. If there is a lot of data in the buffer cache when this call is made, the system can freeze for quite some time to process the sync.

Another common issue is that because so much must be written before any physical writes start, the I/O appears more bursty than would seem optimal. You'll have long periods where no physical writes happen at all, as the large page cache is filled, followed by writes at the highest speed the device can achieve once one of the pdflush triggers is tripped.

dirty_background_ratio: Primary tunable to adjust, probably downward. If your goal is to reduce the amount of data Linux keeps cached in memory, so that it writes it more consistently to the disk rather than in a batch, lowering dirty_background_ratio is the most effective way to do that. It is more likely the default is too large in situations where the system has large amounts of memory and/or slow physical I/O.

dirty_ratio: Secondary tunable to adjust only for some workloads. Applications that can cope with their writes being blocked altogether might benefit from substantially lowering this value. See "Warnings" below before adjusting.

dirty_expire_centisecs: Test lowering, but not to extremely low levels. You can shorten how long pages sit dirty in memory here, but doing so will considerably slow average I/O speed because writes are batched much less efficiently. This is particularly true on systems with slow physical I/O to disk. Because of the way the dirty page writing mechanism works, lowering this value to less than a few seconds is unlikely to work well. Constantly trying to write dirty pages out will just trigger the I/O congestion code more frequently.


dirty_writeback_centisecs: Leave alone. The timing of pdflush threads set by this parameter is so complicated by rules in the kernel code for things like write congestion that adjusting this tunable is unlikely to cause any real effect. It's generally advisable to keep it at the default so that this internal timing tuning matches the frequency at which pdflush runs.
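Before changing any of these, it is worth recording what the running kernel currently uses. A minimal Python sketch follows; the function name read_vm_tunables and the base parameter (there so the code can be pointed at a test directory) are hypothetical, and a tunable that does not exist on a given kernel is simply reported as None.

```python
import os

# The four tunables discussed in the recommendations above.
TUNABLES = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_tunables(base="/proc/sys/vm", names=TUNABLES):
    """Return a dict mapping each tunable name to its integer value, or None."""
    values = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None  # missing, unreadable, or not a plain integer
    return values
```

Saving this output before and after an experiment makes it easy to put the defaults back if a change makes things worse.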

Swapping

By default, Linux will aggressively swap processes out of physical memory onto disk in order to keep the disk cache as large as possible. This means that pages that haven't been used recently will be pushed into swap long before the system even comes close to running out of memory, which is an unexpected behavior compared to some operating systems. The /proc/sys/vm/swappiness parameter controls how aggressive Linux is in this area.

As good a description as you'll find of the numeric details of this setting is in section 4.15 of
http://people.redhat.com/nhorman/papers/rhel4_vm.pdf
It's based on a combination of how much of memory is mapped (that total is in /proc/meminfo) as well as how difficult it has been for the virtual memory manager to find pages to use.


A value of 0 will avoid ever swapping out just for caching space. Using 100 will always favor making the disk cache bigger. Most distributions set this value to be 60, tuned toward moderately aggressive swapping to increase disk cache.

The optimal setting here is very dependent on workload. In general, high values maximize throughput: how much work your system gets done during a unit of time. Low values favor
latency: getting a quick response time from applications. Some desktop users so favor low latency that they set swappiness to 0, so that user applications are never swapped to disk
(as can happen when the system is executing background tasks while the user is away). That's perfectly reasonable if the amount of memory in the system exceeds the usual working set for the applications used. Servers that are very active and usually throughput bound could justify setting it to 100. On the flip side, a desktop system that is so limited in memory that every active byte helps might also prefer a setting of 100.


Since the size of the disk cache directly determines things like how much dirty data Linux will allow in memory, adjusting swappiness can greatly influence that behavior even though it's not directly tied to that.

Warnings

-There is a currently outstanding Linux kernel bug that is rare and difficult to trigger even intentionally on most kernel versions. However, it is easier to encounter when reducing dirty_ratio setting below its default. An introduction to the issue starts at http://lkml.org/lkml/2006/12/28/171 and comments about it not being specific to the current kernel release are at http://lkml.org/lkml/2006/12/28/131

-The standard Linux memory allocation behavior uses an "overcommit" setting that allows processes to allocate more memory than is actually available were they to all ask for their pages at once. This is aimed at increasing the amount of memory available for the page cache, but can be dangerous for some types of applications. See http://www.linuxinsight.com/proc_sys_vm_overcommit_memory.html for a note on the settings you can adjust. An example of an application that can have issues when overcommit is turned on is PostgreSQL; see "Linux Memory Overcommit" at http://www.postgresql.org/docs/current/static/kernel-resources.html for their warnings on this subject.

References: page cache


Neil Horman, "Understanding Virtual Memory in Red Hat Enterprise Linux 4"

http://people.redhat.com/nhorman/papers/rhel4_vm.pdf

Daniel P. Bovet and Marco Cesati, "Understanding the Linux Kernel, 3rd edition", chapter 15 "The Page Cache". Available on the web at

http://www.linux-security.cn/ebooks/ulk3-html/


Robert Love, "Linux Kernel Development, 2nd edition", chapter 15 "The Page Cache and Page Writeback"

"Runtime Memory Management",

http://tree.celinuxforum.org/CelfPubWiki/RuntimeMemoryMeasurement


"Red Hat Enterprise Linux-Specific [Memory] Information",

http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/admin-guide/s1-memory-rhlspec.html


"Tuning Swappiness",

http://kerneltrap.org/node/3000


"FAQ Linux Memory Management",

http://gentoo-wiki.com/FAQ_Linux_Memory_Management


From the Linux kernel tree:

  • Documentation/filesystems/proc.txt (the meminfo documentation there originally from http://lwn.net/Articles/28345/)
  • Documentation/sysctl/vm.txt
  • mm/page-writeback.c


References: I/O scheduling

While not directly addressed here, the I/O scheduling algorithms in Linux actually handle the writes themselves, and some knowledge or tuning of them may be synergistic with adjusting the parameters here. Adjusting the scheduler only makes sense in the context where you've already configured the page cache flushing correctly for your workload.

D. John Shakshober, "Choosing an I/O Scheduler for Red Hat Enterprise Linux 4 and the 2.6 Kernel" http://www.redhat.com/magazine/008jun05/features/schedulers/

Robert Love, "Kernel Korner: I/O Schedulers",

http://www.linuxjournal.com/article/6931


Seelam, Romero, and Teller, "Enhancements to Linux I/O Scheduling",

http://linux.inet.hr/files/ols2005/seelam-reprint.pdf


Heger, D., Pratt, S., "Workload Dependent Performance Evaluation of the Linux 2.6 I/O
Schedulers",

http://linux.inet.hr/files/ols2004/pratt-reprint.pdf


Upcoming Linux work in progress


-There is a patch in testing from SuSE that adds a parameter called dirty_ratio_centisecs to the kernel tuning which fine-tunes the write-throttling behavior. See "Patch: per-task predictive write throttling" at http://lwn.net/Articles/152277/ and Andrea Arcangeli's article (which has a useful commentary on the existing write throttling code) at

http://www.lugroma.org/contenuti/eventi/LinuxDay2005/atti/Arcangeli-MemoryManagementKernel26.pdf

-SuSE also has suggested a patch at http://lwn.net/Articles/216853/ that allows setting the
dirty_ratio settings below the current useful range, aimed at systems with very large memory capacity. The commentary on this patch also has some helpful comments on improving dirty buffer writing, although it is fairly specific to ext3 filesystems.

-The stock 2.6.22 Linux kernel has substantially reduced the default values for the dirty memory parameters. dirty_background_ratio defaulted to 10, now it defaults to 5. dirty_ratio defaulted to 40, now it's 10.

-A recent lively discussion on the Linux kernel mailing list discusses some of the
limitations of the fsync mechanism when using ext3.

Friday, January 21, 2011

keytool

Java Keytool Commands for Creating and Importing

These commands allow you to generate a new Java Keytool keystore file, create a CSR, and import certificates. Any root or intermediate certificates will need to be imported before importing the primary certificate for your domain.

  • Generate a Java keystore and key pair

    keytool -genkey -alias mydomain -keyalg RSA -keystore keystore.jks -keysize 2048

  • Generate a certificate signing request (CSR) for an existing Java keystore

    keytool -certreq -alias mydomain -keystore keystore.jks -file mydomain.csr

  • Import a root or intermediate CA certificate to an existing Java keystore

    keytool -import -trustcacerts -alias root -file Thawte.crt -keystore keystore.jks

  • Import a signed primary certificate to an existing Java keystore

    keytool -import -trustcacerts -alias mydomain -file mydomain.crt -keystore keystore.jks

  • Generate a keystore and self-signed certificate (see How to Create a Self Signed Certificate using Java Keytool for more info)

    keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass password -validity 360 -keysize 2048
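Because the import order matters (root and intermediate certificates first, then the primary certificate), it can help to generate the command sequence rather than type it by hand. This Python sketch only builds the argument lists, using the same flags as the import commands above; the function name keytool_import_commands is hypothetical, and nothing is executed.

```python
def keytool_import_commands(keystore, primary_alias, primary_cert, ca_certs):
    """Build keytool -import argument lists, CA certificates first.

    ca_certs is a list of (alias, cert_file) pairs for the root and
    intermediate certificates, in the order they should be imported.
    """
    def import_cmd(alias, cert_file):
        return ["keytool", "-import", "-trustcacerts",
                "-alias", alias, "-file", cert_file, "-keystore", keystore]

    commands = [import_cmd(alias, cert_file) for alias, cert_file in ca_certs]
    commands.append(import_cmd(primary_alias, primary_cert))  # primary goes last
    return commands
```

Each list can then be handed to subprocess.run; keytool will still prompt for the keystore password.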

Java Keytool Commands for Checking

If you need to check the information within a certificate, or Java keystore, use these commands.

  • Check a stand-alone certificate

    keytool -printcert -v -file mydomain.crt

  • Check which certificates are in a Java keystore

    keytool -list -v -keystore keystore.jks

  • Check a particular keystore entry using an alias

    keytool -list -v -keystore keystore.jks -alias mydomain

Other Java Keytool Commands

  • Delete a certificate from a Java Keytool keystore

    keytool -delete -alias mydomain -keystore keystore.jks

  • Change a Java keystore password

    keytool -storepasswd -new new_storepass -keystore keystore.jks

  • Export a certificate from a keystore

    keytool -export -alias mydomain -file mydomain.crt -keystore keystore.jks

  • List Trusted CA Certs

    keytool -list -v -keystore $JAVA_HOME/jre/lib/security/cacerts

  • Import New CA into Trusted Certs

    keytool -import -trustcacerts -file /path/to/ca/ca.pem -alias CA_ALIAS -keystore $JAVA_HOME/jre/lib/security/cacerts

If you need to move a certificate from Java Keytool to Apache or another type of system, check out these instructions for converting a Java Keytool keystore using OpenSSL. For more information, check out the Java Keytool documentation or check out our Tomcat SSL Installation Instructions which use Java Keytool.

Wednesday, January 12, 2011

How To Build A Heartbeat Cluster


Today we will install and configure a basic high-availability cluster working as a very simple web server. I am using Ubuntu Linux and a VMWare environment for this How-to, just for the sake of simplicity. This howto is meant to give you a working ha-cluster to have a starting point for testing and further research. Please remember: what we install and configure here is not necessarily ready for production. I make some shortcuts one might not want to do in a production environment. This mainly applies to the mechanism for detecting a failed node.

Preparations
We need two identical machines whose only difference is their IP address. Then we also need a third IP address that is used for the highly available service. In our case the service will be a simple Apache web server, running on both cluster nodes.

We create two machines with Ubuntu 8.04 server 64-bit and choose "openssh server" during the installation, nothing else. After installation perform the usual apt-get update and apt-get dist-upgrade. Take care that all usernames and passwords are the same between the two cluster nodes. Give the nodes a static IP address. I gave "hacluster1" the address 192.168.35.81 and node "hacluster2" the address 192.168.35.82. Of course you have to adapt the IP addresses to your infrastructure. Now we install the heartbeat software:
apt-get install heartbeat-2 heartbeat-2-gui xauth

For the floating IP address to work we need to append the following line to /etc/sysctl.conf:
net/ipv4/ip_nonlocal_bind = 1

Now we're configuring the heartbeat cluster. Edit /etc/ha.d/authkeys (the file doesn't exist yet):
auth 3
3 md5 somerandomstring

After saving it, change the file's permissions: "chmod 600 /etc/ha.d/authkeys". The file defines how the communication between cluster nodes is authenticated.
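If you'd rather not invent the random string yourself, the file can be generated. Here is a minimal Python sketch, assuming the same "auth 3" and md5 layout as above; the function name write_authkeys is made up, and since both nodes need the same key, the value returned on the first node has to be passed in on the second.

```python
import os
import secrets


def write_authkeys(path, key=None):
    """Write a Heartbeat authkeys file that only its owner can read or write.

    Returns the key that was written so it can be reused on the other node.
    """
    if key is None:
        key = secrets.token_hex(16)  # 32 random hexadecimal characters
    content = "auth 3\n3 md5 {}\n".format(key)
    # Request mode 0600 at creation time so a new file is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.chmod(path, 0o600)  # also tightens permissions if the file already existed
    return key
```

Run it as root with the path /etc/ha.d/authkeys on the first node, then call it again on the second node with the returned key.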

Next file to edit is "/etc/ha.d/ha.cf" (the file might not exist yet):
logfacility local0
node hacluster1 hacluster2
bcast eth0
crm on

The second line defines which machines are part of the cluster, thus "hacluster1" and "hacluster2" should be hostnames. "bcast eth0" tells the heartbeat software to communicate with the other nodes via broadcast packets on eth0.

The last thing to do is to set a password for user hacluster on both machines. Now we have a fully configured cluster of two nodes, and we should log into the VMWare control center to make snapshots of each node. That way we can go back to a vanilla cluster every time we want.

Highly available web server.
Now that the cluster is ready for work we also need a service to be managed by the cluster. For the sake of simplicity we will install an Apache web server on both nodes: after installing the server software with "apt-get install apache2" remove the symlink for apache in /etc/rc2.d. On a normal server machine the web server is started automatically at system start-up via these symlinks. But on a cluster only the cluster software is responsible for starting and stopping the "clustered" services.

Edit "/var/www/index.html" and change "It works" for "hacluster1" on the first node and "hacluster2" on the second. Thus we can easily see in the browser from which node the web pages are being served.

Now set the password for user hacluster, else we cannot log into the gui. Just choose any password you like.

Everything you did until now had to be done on each of the nodes. But now that the cluster is prepared, the remaining configuration is done only once and will be propagated among the nodes automatically. Log into one of the nodes from your local X11 xterm and start "hb_gui", then connect to 127.0.0.1 as user hacluster with the password you've chosen. Remember: if you want to use a remote X11 app, you have to log in from a local xterm with "ssh -XC ". On Mac OS you would need to install X11; the Terminal won't do. Under Windows you would need something like Cygwin; plain PuTTY won't do either.

OK, we're logged into the cluster gui. While the cluster is generally working, it doesn't do anything at the moment because nothing is configured yet. What we'll do is configure a highly available web server cluster, where either node one or node two will serve static web pages.

First we need to define a shared IP address for the web server. So right-click on "Resources", choose "Add new item", leave type as "native". Now scroll down in the list and choose "IPaddr" in the column "Name". In the "Parameters" field below, type the shared IP address into the "Value" field, hit RETURN and then click on "Add" (lower right).


The second item will be the apache resource itself: in the main gui window right-click on "Resources" again and "Add new item", type "native". Choose "apache2" from the list, no additional parameters needed. Click "Add" in the lower right.

We could now start the resources and the web service would work already, serving from either node1 or node2. But to have a well configured cluster, we need the Apache service to be "bound" to the IP address resource, so that Apache is always running on the same node where its IP address is running. So we create a co-location: right-click on the "Colocations" entry in the "Constraints" list, choose "colocation", give something meaningful as "ID". Choose your IP resource as "From" and the Apache resource as "To", leave the score as "INFINITY", click "OK".

We also have to make sure that Apache is always started on a particular node only after its IP address has been activated, else the web server might not work. Thus we need an "Orders" rule: Right-click on "Orders" in the constraints list, "Add new item", leave type as "order". As "From" choose the IP address resource, leave "Type" as "before" and choose the Apache resource as "To". Click "OK".

Now everything is ready to be started: right-click on both resources and select "start".


If you now try to view the web site in your browser, it should show either "hacluster1" or "hacluster2". Let's test the fail-over process: right-click on the "cd" node and choose "standby". You should see the two resources quickly move to the other node. If you now reload the page in your browser, it should show the name of the other node. Once you switch the standby node back to "active", the resources move back as well.
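The browser check can also be scripted, which is handy for repeating the fail-over test. This is a rough Python sketch, assuming the index.html pages were edited as described so that each contains its node name; the function names are hypothetical, and the URL you pass in would be built from the shared IP address you configured.

```python
import urllib.request


def serving_node(body, nodes=("hacluster1", "hacluster2")):
    """Return the first node name found in the page body, or None."""
    for node in nodes:
        if node in body:
            return node
    return None


def fetch_serving_node(url, timeout=2.0):
    """Fetch the page at url and report which node's index.html was served."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return serving_node(body)
```

Calling fetch_serving_node before and after putting a node into standby should return two different names if the fail-over worked.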

That's all for now. You have a working cluster as a start for further testing and research. The clustered Apache in this example would be useful in a production environment only if it serves completely static content that doesn't change frequently. And you have to make sure that both the Apache configurations as well as the web files are identical on both nodes.

But as almost all web servers these days use databases and their content is updated very often (like this blog, ahem), the clustered Apache as described here doesn't make much sense. More on that in the next instalment.


http://blog.taggesell.de/index.php?/archives/83-How-To-Build-A-Heartbeat-Cluster.html