Hack Manchester Junior

At FARM, we love Hack Manchester. We've supported, attended, and worked behind the scenes in past years, so when we were offered the chance to support Hack Manchester Junior this year, we jumped at it.

Jon Atkinson, our Tech Director, will be part of the judging panel for this fantastic event, judging the "Pre-University Challenge" category, for teams made up of students from a single school. FARM has also sponsored some prizes for this category: we've arranged experience days for the winners, Aqua Zorbing (which looks like too much fun) and group paintballing.

Hack Manchester Junior is a two-day coding competition for those aged 18 and under; teams of up to four turn up with an idea and have just two days to present a working product!

Django Performance Recipes at The London Django Meetup

Slides: Django Performance Recipes, by Jon Atkinson.

A video of this talk is also available on the SkillsMatter page.

Migrating Django projects from Postgres to MySQL


We recently needed to move a few of our Django sites from their existing Postgres databases to MySQL.

We didn't use any Postgres-specific database features, so our initial approach was to use MySQL Workbench to migrate the databases directly. After much tweaking and several failed attempts, we realised there was a more straightforward solution: use JSON as an intermediary format, and let Django's fixture system do the work.

The management command we're after is dumpdata; this extracts data to a database-neutral JSON representation, which we can then import into a new MySQL database using loaddata.
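
For reference, the JSON produced is Django's standard fixture format: a list of objects, something like this (the model and field values here are purely illustrative):

[
  {
    "model": "auth.user",
    "pk": 1,
    "fields": {
      "username": "jon",
      "email": "jon@example.com"
    }
  }
]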

Though this is a fairly reliable process (certainly much more reliable than converting between SQL dialects), it does have a few notable bugs, and it is slow. Django loads every row in your database into memory, creates the JSON representation, and then writes it to disk, so this approach takes a lot longer than a normal database insert operation, especially if you are dealing with large data sets.

Enough warnings over, on to the process.

First, we need to get our Django settings in place. In your settings file you're likely going to have something like this:

import dj_database_url

DATABASES = {
    'default': dj_database_url.config(
        default='postgis://postgres@localhost/my-database'
    )
}

You need to add a new MySQL database entry alongside it:

DATABASES = {
    'default': dj_database_url.config(
        default='postgis://postgres@localhost/my-database'
    ),
    'mysql': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'my-database',
        'USER': 'my-user',
        'PASSWORD': 'my-password',
        'HOST': 'localhost',
    }
}

Once that's complete, we create the tables in the new MySQL database (note that on older versions of Django you'll need to use syncdb instead):

manage.py migrate --database=mysql --no-initial-data

Now, despite the --no-initial-data argument, it's worth manually checking that the new tables are empty, otherwise you might find some foreign key conflicts on import. If you want a quick set of SQL statements to truncate all your tables, use:

manage.py sqlflush --database=mysql
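
This prints the statements to run rather than executing them; for the MySQL backend the output looks something like the following (the table names will depend on your installed apps):

SET FOREIGN_KEY_CHECKS = 0;
TRUNCATE `django_content_type`;
TRUNCATE `auth_user`;
SET FOREIGN_KEY_CHECKS = 1;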

Next up, exporting our data from Postgres. First the command:

manage.py dumpdata --natural --all --indent=2 --exclude=sessions > /tmp/data.json

  • --natural tells the command to use natural keys.
  • --all ensures we use Django's base manager, to avoid conflicts with custom managers.
  • --indent is optional, but makes the file a little more human-readable.
  • --exclude is also optional, but handy for reducing the export time; the sessions table is rarely worth migrating.

Now that we have a JSON file containing all our data, it's time to get it into our newly prepared database:

manage.py loaddata data.json --database=mysql

Now, all of your data should be available in your new MySQL database exactly as it was before. To make this your new default database, it's just a case of changing the default database over in your settings, and you're finished.
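
Before making that switch, it's worth sanity-checking that every table made it across intact. A minimal sketch, run from manage.py shell; it assumes the two database aliases configured above, and uses each model's base manager to mirror the --all flag:

from django.apps import apps

# Compare row counts between the two configured databases.
for model in apps.get_models():
    old_count = model._base_manager.using('default').count()
    new_count = model._base_manager.using('mysql').count()
    if old_count != new_count:
        print('%s: %d (postgres) vs %d (mysql)' % (
            model._meta.db_table, old_count, new_count))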

The above process tends to work fine for straightforward database schemas, but you might find the loaddata stage fails if you have some complex foreign key relationships. The most likely cause of these errors is the order in which the data is restored, and this is where the --exclude argument is required.

It is simple enough to figure out which apps/models need to be imported first based on the error messages; you can then pass the relevant arguments when dumping your data to export those models first, and follow up with a second export of the rest of the data.
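
If you'd rather not rely on the error messages alone, you can ask Django for the order it would serialize models in. A rough sketch; note that sort_dependencies is an internal helper used by dumpdata rather than public API, so treat this purely as a diagnostic:

from django.apps import apps
from django.core.serializers import sort_dependencies

# None means "all models in this app"; this mirrors what dumpdata builds.
app_list = [(app, None) for app in apps.get_app_configs()]
for model in sort_dependencies(app_list):
    print('%s.%s' % (model._meta.app_label, model.__name__))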

As a guide to the arguments, the example below exports all model data from the test app other than the Foo model, as well as the UserProfile model from the users app:

manage.py dumpdata --natural --all --indent=2 test users.UserProfile --exclude=test.Foo > data.json 

When importing, Django's JSON importer creates blank entries in any required foreign key tables, with the relevant foreign key data filled in but all other fields empty. This means the primary keys of those entries are generated automatically, and you'll likely end up with broken foreign keys. The solution:

  • export all the models that don't cause conflicts on import.
  • import them into the new database.
  • manually truncate the offending auto-populated table(s) using something like:

    BEGIN;
    SET FOREIGN_KEY_CHECKS = 0;
    TRUNCATE `profile_userprofile`;
    SET FOREIGN_KEY_CHECKS = 1;
    COMMIT;
    
  • export the remaining models.
  • import those into the new database.

This will be a process of trial and error, and might need a couple of extra load/delete steps for more complex setups. Of course, it helps to have a good understanding of the underlying models and their relationships, which should give you a good idea of what's likely to break.

Setting up a DevPi server


Downloading third-party packages is an important part of our build and deployment process. While PyPI is generally excellent, it does present some risks for us:

  • PyPI does go down from time to time. While availability is much better than it was a few years ago, downtime does still happen.
  • Older Python packages are occasionally removed from PyPI, leaving holes in our requirements.
  • Downloading packages to our servers requires external connectivity, and counts towards our bandwidth usage.

I was encouraged by this article (and the subsequent commentary) to investigate using DevPI as a proxy for PyPI:

The MIT-licensed devpi system features a powerful PyPI-compatible server and a complementary command line tool to drive packaging, testing and release activities with Python.

Main features and usage scenarios:

fast PyPI mirror: use a local self-updating pypi.python.org caching mirror which works with pip and easy_install. After an initial cache-fill it can work off-line and will re-synchronize when you get online again.

This is a quick walk-through of setting up a DevPI server. We use uWSGI and nginx to serve the application, and we made it available on a host reachable over the private interface of our internal network. It's assumed that the server has a working supervisor daemon running.

First, we need to do some basic setup. We create a folder for the application, and a data folder. Then build a virtualenv, and install the devpi-server package:

$ mkdir /srv/www/devpi/ && mkdir /srv/www/devpi/data/ && cd /srv/www/devpi
$ virtualenv env
$ ./env/bin/pip install -q -U devpi-server

Check that it's installed correctly:

$ ./env/bin/devpi-server --version
2.2.2

Now, generate the configuration files:

$ ./env/bin/devpi-server --port 4040 --serverdir ./data/ --gen-config
wrote gen-config/supervisor-devpi.conf
wrote gen-config/nginx-devpi.conf
wrote gen-config/crontab
wrote gen-config/net.devpi.plist
wrote gen-config/devpi.service

Link the supervisor config file into place (switch to a superuser for this):

$ ln -s /srv/www/devpi/gen-config/supervisor-devpi.conf /etc/supervisor/conf.d/99-devpi.conf

Supervisor will need restarting to launch the new process:

$ sudo service supervisor restart

You need to link the nginx configuration file into place (again, as a superuser), and then restart nginx. I edited the server_name directive in the nginx config so the host would be addressable with my local DNS:

$ ln -s /srv/www/devpi/gen-config/nginx-devpi.conf /etc/nginx/sites-enabled/99-devpi.conf
$ sudo service nginx restart

Check that the nginx configuration is working, and the server is responding (from your local machine):

$ curl pypi.hostname.com
{
  "type": "list:userconfig",
  "result": {
    "root": {
      "username": "root",
      "indexes": {
        "pypi": {
          "type": "mirror",
          "bases": [],
          "volatile": false
        }
      }
    }
  }
}

Now, we need to set up the root user on this devpi instance. The root user's password is blank by default:

$ devpi use http://pypi.hostname.com
using server: http://pypi.hostname.com/ (not logged in)
no current index: type 'devpi use -l' to discover indices
~/.pydistutils.cfg     : no config file exists
~/.pip/pip.conf        : no index server configured
~/.buildout/default.cfg: no config file exists
always-set-cfg: no
$ devpi login root --password ''
logged in 'root', credentials valid for 10.00 hours
$ devpi user -m root password=verysecurepassword
user modified: root
$ devpi logoff

You should also set up a regular user account:

$ devpi user -c jonathan password=verysecurepassword email=jon@wearefarm.com

Next, log in as that user, and set the root/pypi repository as that user's default index. Any packages requested through this index will be automatically downloaded and cached by your devpi instance. This step will amend your ~/.pip/pip.conf file to reflect the new index, and pip will use it by default from now on:

$ devpi login jonathan
password for user jonathan:
$ devpi use --set-cfg root/pypi
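
For reference, after --set-cfg your ~/.pip/pip.conf should contain something like this (the exact URL will reflect your own hostname and index):

[global]
index_url = http://pypi.hostname.com/root/pypi/+simple/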

Let's just do a quick check to make sure the index is working, and pip is configured correctly:

$ mkdir /tmp/devpi-test && cd /tmp/devpi-test
$ virtualenv env
New python executable in env/bin/python
Installing setuptools, pip, wheel...done.
$ source env/bin/activate
$ pip install --trusted-host pypi.hostname.com Django
Collecting Django
Downloading http://pypi.hostname.com/root/pypi/+f/a5d/397c65a880228/Django-1.8.3-py2.py3-none-any.whl (6.2MB)
100% |████████████████████████████████| 6.2MB 2.7MB/s
Installing collected packages: Django
Successfully installed Django-1.8.3

Note that you need to use the --trusted-host flag to suppress the pip trust warning. Setting up this host over SSL is simple enough, and can be done in the nginx configuration file.
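
As a sketch, the HTTPS server block might look like the following; the certificate paths and server_name are placeholders for your own, and the X-outside-url header tells devpi which URL clients are actually using:

server {
    listen 443 ssl;
    server_name pypi.hostname.com;

    ssl_certificate     /etc/ssl/certs/pypi.hostname.com.crt;
    ssl_certificate_key /etc/ssl/private/pypi.hostname.com.key;

    location / {
        proxy_pass http://localhost:4040;
        proxy_set_header Host $host;
        proxy_set_header X-outside-url $scheme://$host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}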

Recycling, reuse, and codebase heuristics


Did you know that people tend to recycle unused sheets of paper, but tend to throw used paper in the bin? Or that people are more likely to throw away a crushed drinks can, while they'll recycle a pristine one? It turns out that we make very weird decisions about what to re-use.

During an experiment, marketing professor Remi Trudel noticed a pattern in what his volunteers were recycling versus throwing in the garbage. He then went through his colleagues' trash and recycling bins at Boston University for more data.

He found the same pattern, says NPR's Shankar Vedantam: "Whole sheets of paper typically went in the recycling, but paper fragments went in the trash."

Same type of paper, different shapes, different bins.

Trudel and fellow researcher Jennifer Argo conducted experiments to figure out why that might be. Volunteers received full pieces of paper as well as fragments, and they also received cans of soda.

"After the volunteers had drunk the soda, when the cans were intact, the cans went in the recycling," Vedantam tells Morning Edition host David Greene. "But if the cans were dented or crushed in any way, the volunteers ended up putting those crushed cans in the trash."

It's a really interesting phenomenon. You can read more about it, and listen to the radio segment, in the original NPR piece.

What does this have to do with anything?

It's no secret that the most successful agencies are those which can commit to delivering quality software without crumbling under deadline pressure. We typically have ten to twenty simultaneous projects in our studio, so the potential for technical debt is always high. We constantly need to make decisions about when to invest time into our core software and infrastructure, and a large part of making that investment pay off is re-using our existing code.

Since our very early projects at FARM, we have followed a model of build, adapt, extract: we first build bespoke software, adapt it to fit more than one project, then extract it into its own repository for future re-use. And, like all software companies, we struggle with this. A given code module might be just too specific to the needs of the project for which it was written, or adapting it to serve multiple projects would stretch the code to breaking point. It's a difficult problem to address, but a very important one.

In this environment, where the adapt and extract stages are often initiated by our developers independently of any 'big picture' strategy, it's interesting to consider what drives those decisions. Often we look at code which solves 90% of the problem at hand, yet we choose to rewrite, rather than re-use. Why is this?

I think there are some interesting unconscious decisions people make. In the same way people make snap decisions about what to recycle, I think most programmers develop a mental heuristic for what constitutes 'good' code, and quite often those decisions don't appear to be rational. There are a few things which I look for:

  • Recent activity. Most of the code which we need to evaluate on the fly is open-source, or is available internally via Bitbucket. The very first thing I look at is 'freshness'.
  • Documentation. A sure sign of a poorly maintained module is that it has no documentation, or a generic README which doesn't actually describe the project.
  • Convention. For external packages, I look for good Python packaging practices. I want to install software which is a good citizen, and won't screw up my project namespace. For internal projects, I look for evidence of a proper virtualenv, frozen requirements.txt, and sensible naming conventions.
  • Concepts. We generally deal with Django projects, and looking at the models which are defined in a module is a good indicator of the high-level concepts. If these overlap enough with the problem domain, the re-use is probably going to be beneficial.

A lot of these factors are very easy to assess. In the same way an experienced developer can 'eyeball' a source file, with practice you can 'eyeball' at this level of abstraction, too. The first three items above are easy to develop a gut feeling for (less so the final one), having read enough code. Just like the blank sheet of paper is more likely to be recycled, the package that looks pristine is more likely to be reused.