Managing load balanced production environments
In December we launched a site onto a multi-server, load-balanced production environment. It was certainly the first time I'd had to deal with a site running on two web servers, and the setup presented three challenges, which I cover in this post.
Challenge #1: ip_address() was returning the internal address of the server, not the actual user's IP address
This caused us real problems with the user login process: every failed login attempt was being stored against the same IP address, which in turn quickly locked everyone out.
Solution #1:
The fix was simple enough in our environment: add a code block to settings.php that manually injects the correct address from the HTTP headers.
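A minimal sketch of such a block, assuming the load balancer passes the client address in the X-Forwarded-For header (the header name and its comma-separated format are assumptions; check what your balancer actually sends):

```php
<?php
// settings.php sketch. Assumes the balancer sets X-Forwarded-For, which PHP
// exposes as $_SERVER['HTTP_X_FORWARDED_FOR'].
if (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
  // The header may hold a comma-separated chain: "client, proxy1, proxy2".
  $forwarded = explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']);
  $client_ip = trim($forwarded[0]);

  // Only overwrite REMOTE_ADDR with a syntactically valid IP address.
  if (filter_var($client_ip, FILTER_VALIDATE_IP) !== FALSE) {
    $_SERVER['REMOTE_ADDR'] = $client_ip;
  }
}
```

The leftmost entry can be forged by the client unless the balancer overwrites the header, so this is only safe behind a balancer you trust. Drupal 7 also ships its own reverse_proxy and reverse_proxy_addresses settings, which may be a better fit if you know the balancer addresses.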
Challenge #2: The user login form was considered cacheable by Drupal (and therefore Varnish)
Coupled with the above issue, we were getting Varnish cache hits on the login form. This meant that all users were sharing a form_build_id, and therefore the same form cache entry. The upshot was that as soon as anybody entered a valid user name, it was stored for everybody else attempting to log in. That in turn meant that failed attempts were all registered against a single account by flood control, and that account would quickly get locked out.
Solution #2:
We don't actually have an answer as to the cause of this yet. It could be something specific to our site or it could be related to other problems with the environment, but the login page is obviously important, so we've added a manual call to
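One way to keep a form out of the page cache in Drupal 7 is to call drupal_page_is_cacheable(FALSE) while the form is being built. The sketch below assumes a custom module named mymodule (a hypothetical name) and is an illustration of the approach, not necessarily the exact call used on this site:

```php
<?php
/**
 * Implements hook_form_FORM_ID_alter() for the login page form.
 *
 * "mymodule" is a hypothetical module name used for illustration.
 */
function mymodule_form_user_login_alter(&$form, &$form_state) {
  // Tell Drupal not to store or serve this response from the page cache.
  drupal_page_is_cacheable(FALSE);
}

/**
 * Implements hook_form_FORM_ID_alter() for the login block form.
 */
function mymodule_form_user_login_block_alter(&$form, &$form_state) {
  drupal_page_is_cacheable(FALSE);
}
```

With the page marked as uncacheable, Drupal should fall back to its default no-cache headers, which a typical Varnish configuration respects. It is worth confirming this against your own VCL.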
Challenge #3: Views data export, batch processes and temp files
The batch API splits a large job into smaller jobs, each of which is processed in a separate HTTP request, held together by an AJAX-based page. We are using the views_data_export module to build a relatively large CSV, so we enabled batch mode. What we hadn't considered is that each server has its own temporary directory. Because the requests are load balanced between the two servers, each server only wrote the rows from the requests it handled, and the CSV was being split roughly into two halves.
Solution #3:
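If you're lucky enough to be using the Acquia platform, there is a module for this: https://drupal.org/project/acquia_cloud_sticky_sessions. As the name suggests, it relies on sticky sessions, so that a user's requests (including every step of a batch) keep going to the same web server.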
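If sticky sessions are not an option, a different approach (not the one we used) is to point Drupal's temporary directory at storage that every web server can see. This sketch assumes the export writes its intermediate file to Drupal's temporary directory and that a shared mount exists; the path is hypothetical:

```php
<?php
// settings.php sketch: use one temporary directory shared by all web servers.
// '/mnt/shared/tmp' is a hypothetical path; replace it with a directory on
// storage mounted by every server and writable by the web server user.
$conf['file_temporary_path'] = '/mnt/shared/tmp';
```

A value set in settings.php overrides whatever is saved on the File system configuration page, so every server picks up the same directory.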
Conclusion
Each of these challenges only presented itself once we had made it to the production environment and had a realistic amount of user traffic. This made them all the harder to detect and solve. We're lucky enough to have the support of Acquia in solving the issues quickly, but if we were building out this kind of environment ourselves, things could have been very different.