Two years ago, I had the chance to introduce version control to a small development team that had not used any kind of VCS for its code before. This is a small case study of how it worked out over those two years.
The company’s primary profile was marketing and media, and development was mostly considered a second- or third-class citizen. Most of the projects were very simple websites: either to advertise a client’s new product, or to serve as the backend for a Facebook app, or 90% CRUD for summarizing various surveys. Development time for projects was between 2 and 8 weeks, without a written specification.
When I came to work on my first day, the project that was assigned to me was already running late. The lead developer was on vacation, but I remembered him mentioning something about SVN at the interview, so I assumed that was the VCS in use. I figured I probably wouldn’t get access to it until he came back, so I set up an SVN server locally and began working. After I finished coding, I pushed all the files out over FTP.
After the lead came back from vacation, I asked about the central SVN server. It turned out that yes, there was a central SVN server, but it was not really used. The workflow looked like this:
- You get an FTP account to the project’s directory on the production server (e.g. project.com)
- Both development and testing are done there
- After release, you get a different user/pw to a different FTP directory, and development continues there (e.g. project-dev.com)
- When you are ready to release, you copy the files from project-dev.com to project.com
Needless to say, it was horribly inefficient and tiresome. Navigating between the files was extremely annoying due to network latency, you had no way of grepping files, and you had no access to server logs. Working in parallel on the same project was not possible, because you could overwrite each other’s files. The wiser guys were using an FTP client that detected changes and uploaded the files automatically, but others were uploading the files by hand each time they saved something. I cringe every time this comes to my mind.
This workflow also made it impossible to use version control. I’m not going to list all the advantages of version control, since the internet is already full of such lists. The lead dev wanted me to introduce the whole local development and version control methodology to the team.
So which VCS to pick? The contenders were hg, because I felt at home in it; svn, because two people had already used it; and git, since it was gaining an amazing amount of traction.
I looked into the central svn server to see what was there. It turned out that they really weren’t too serious about it: only a few commits, with messages that were mostly variations of “sddfasfasf”. I crossed svn out, since that much experience can be had in just 20 minutes with any VCS. I pondered git. If you know hg, you pretty much know git, and vice versa. I was afraid that hg might get entirely knocked out by git, since git was spreading so fast among developers. Still, if I was going to teach others how to use something, I had better know it confidently enough, so I settled on hg.
I set up an internal dev server to host the authoritative repository and to present our sites for internal testing. I used hgwebdir with Apache, since it seemed easy to set up, and the web interface comes in very handy when you have to show others what changes you made: you can just pass around the URL of the changeset instead of the diffs (e.g. how you fixed SQL injections in the code). There are several options for serving repositories; look through the documentation and pick one suitable for you.
I also wrote a small bash script with an HTML frontend that let you deploy changes from the central repository to the (internally) public-facing web directories. After pushing your changes, all you had to do was navigate to the internal deploy site and push a button there.
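A minimal sketch of what the core of such a deploy step might look like, written in Python rather than the original bash. It assumes the web directory is itself a Mercurial clone of the central repository; the function names and the example path are hypothetical, and only `hg pull` and `hg update --clean` are used.

```python
import subprocess


def build_deploy_commands(web_dir, revision="default"):
    """Return the hg commands that bring web_dir to the given revision."""
    return [
        # Fetch new changesets from the clone's default remote.
        ["hg", "--cwd", web_dir, "pull"],
        # Move the working directory to the revision, discarding local edits.
        ["hg", "--cwd", web_dir, "update", "--clean", revision],
    ]


def deploy(web_dir, revision="default", runner=subprocess.check_call):
    """Run each command in order; check_call raises if a command fails."""
    for command in build_deploy_commands(web_dir, revision):
        runner(command)
```

Passing a different `runner` (for example a list's `append`) gives a dry run that only records the commands. Note that `--clean` throws away uncommitted edits in the web directory, which is exactly what happens to fixes made directly on a server and never committed.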
I offered to give the team a presentation about hg and distributed version control in general. I’d love to speak at some public conferences in the future, so it would have been a good opportunity to practice speaking. The lead told me it would be good, but we didn’t have time for that; maybe later, people could figure out how to use it. Well, okay.
Fast forward two years
That presentation from the last paragraph? He should be slapped for that, and I should have pushed harder. There are some peculiar things about hg and distributed version control that are crucial to understand. Depending on the developers, they might get it from the start, but it’s essential to nail a few things down:
The concept of which revision is the parent of the working directory
Basically: which version am I working off of and seeing right now? Just because you have pulled from the remote does not mean that you will see the changes. You have to update to the latest (or any other) revision to see them. I had colleagues tell me that code had disappeared. This was always tracked down to forgetting to merge. You finish working on something, commit, pull, and update. The code that you just finished has disappeared. This is because your commit and the freshly pulled revision are now two separate heads, and you need to merge your changes into the pulled revision.
If you have worked on two completely different parts of the project, but on the same branch, you still have to merge [1]. This is considered normal.
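The "disappearing code" effect can be reproduced with a toy model. The sketch below is not Mercurial's API; `ToyRepo` and its methods are hypothetical names that only model the revision graph and the parent of the working directory.

```python
class ToyRepo:
    """Toy model: revisions form a graph, and the working directory
    always sits on exactly one revision (its parent)."""

    def __init__(self):
        self.parents = {0: ()}    # revision number -> parent revisions
        self.working_parent = 0   # revision the working directory is based on

    def _add(self, *parents):
        rev = len(self.parents)
        self.parents[rev] = parents
        return rev

    def commit(self):
        self.working_parent = self._add(self.working_parent)
        return self.working_parent

    def pull(self, parent):
        # A pull only adds revisions; the working directory does not move.
        return self._add(parent)

    def update(self, rev):
        self.working_parent = rev

    def merge(self, other):
        # Models "merge, then commit the merge" as a single step.
        self.working_parent = self._add(self.working_parent, other)
        return self.working_parent

    def visible(self):
        """Revisions whose changes are present in the working directory."""
        seen, todo = set(), [self.working_parent]
        while todo:
            rev = todo.pop()
            if rev not in seen:
                seen.add(rev)
                todo.extend(self.parents[rev])
        return seen


repo = ToyRepo()
mine = repo.commit()            # my finished work, based on revision 0
theirs = repo.pull(parent=0)    # a colleague's work, also based on revision 0
after_pull = repo.visible()     # still only my own line of history
repo.update(theirs)
after_update = repo.visible()   # my commit is no longer in the working directory
repo.merge(mine)
after_merge = repo.visible()    # both lines of work are present again
```

After the update, revision `mine` is still in the repository; it is just not an ancestor of the working directory until the merge.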
If you have pushed something, that is there to stay
One of the advantages of using version control, is that you cannot lose code, and this goes both ways. If you commit a really big file, or some sensitive data like passwords, it will be really hard to erase it. You have several ways of reverting changes: backout, rollback, strip, and revert-merge.
Backout and revert-merge are safe, but do not actually delete anything, they merely patch a fix on top of the mistake. Rollback and strip are actual deletes, they will remove everything without a trace, but they cannot be executed reliably. There is no method that is both easy to execute, and removes things without a trace.
- Backout will calculate the inverse of a single changeset, and commit that, essentially removing it.
- Rollback will remove your latest, and only your latest commit. You cannot rollback an earlier commit. You also cannot do consecutive rollbacks, but nice try.
- Strip allows you to remove any changeset(s). You will need the mq extension enabled to use it.
- Revert-merge [2] is used when you need to back out multiple changesets.
Strip and rollback will not work if you have already pushed. You will have to repeat the strip/rollback on each, and every single computer that has the project checked out, or ask everyone to do it. Why? Because hg will see that they have a change that the remote server does not have, and will push it, which leads to everyone else pulling it in again. It’s like a virus that keeps popping up, no matter how many times you get rid of it.
So, to safely remove something, you have to use backout, or revert-merge, but the removed code (file, password) will stay in the repo, allowing everyone to update to that revision, and still see it.
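Why a backout leaves the mistake in the repository can be shown with a toy model. This is an illustration only, not how Mercurial stores data: a changeset here is a hypothetical dict mapping a path to its (old, new) content, and the file name and password are made up.

```python
def inverse(changeset):
    """Swap old and new content for every file in the changeset."""
    return {path: (new, old) for path, (old, new) in changeset.items()}


def apply(files, changeset):
    """Return a copy of files with the changeset applied."""
    result = dict(files)
    for path, (old, new) in changeset.items():
        if new is None:
            result.pop(path, None)   # the file was deleted
        else:
            result[path] = new
    return result


history = []
files = {}

# Someone commits a file containing a password (made-up example).
leak = {"config.php": (None, "password=hunter2")}
history.append(leak)
files = apply(files, leak)

# A backout commits the inverse on top; nothing is removed from history.
undo = inverse(leak)
history.append(undo)
files = apply(files, undo)
```

The working copy is clean again, but `history` now holds two changesets, and the first one still contains the password for anyone who updates to it.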
Three way merges
Phil Karlton once said:
There are only two hard problems in Computer Science: cache invalidation and naming things.
I’m tempted to add three way merges to that list. A three way merge happens when two (or more) people have started working from a common revision and want to merge their changes together. If Mercurial cannot merge the changes itself (it is very smart when it comes to merging), your default merge tool will pop up. On Windows this will probably be KDiff3 (it is installed along with TortoiseHg), and it will cause some very honest “huh?” moments the first time. Spend some time getting familiar with a merge tool, and show everyone at least once what to do when it shows up.
Empty directories are not tracked
This is my only real annoyance with hg. You cannot record empty directories, they must have at least one file in them. This also means that if you delete all the files from a directory, the directory will also vanish from hg’s perspective. This is a documented behavior, and last time I checked, the consensus was that you should let your build script handle empty directories. It can be argued with, but whatever. We solved this by placing ‘.readme’ files into the directories, explaining the directory’s purpose.
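The placeholder approach can be automated. Below is a minimal sketch that drops a `.readme` file into every empty directory under a project root; the function name and the default text are assumptions, and the files still have to be added and committed afterwards.

```python
import os


def add_placeholders(root, name=".readme",
                     text="Placeholder so this directory is tracked.\n"):
    """Create a placeholder file in every empty directory below root."""
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        is_empty = not dirnames and not filenames
        # Never descend into the repository's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        if is_empty:
            path = os.path.join(dirpath, name)
            with open(path, "w") as handle:
                handle.write(text)
            created.append(path)
    return sorted(created)
```

Running it a second time creates nothing, since the directories are no longer empty. In practice the placeholder text should explain what the directory is for, as described above.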
Ignoring files and directories
I assumed this would be obvious, but there are some files that should not go into version control. These are usually generated files, like what your ORM generates, logs made by your application, or user uploaded files. Another example is the files that IDEs like Eclipse and NetBeans generate. They have a tendency to place junk around in your actual project’s directories, instead of their own internal directories.
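As a starting point, here is a sketch that writes a small `.hgignore` into a repository root. The `uploads` directory name is a hypothetical example; the Eclipse and NetBeans entries are the files those IDEs commonly create. Adjust the list to your own project.

```python
import os

HGIGNORE_LINES = [
    "syntax: glob",
    "# logs written by the application",
    "*.log",
    "# user uploaded files (hypothetical directory name)",
    "uploads",
    "# Eclipse project files",
    ".project",
    ".settings",
    "# NetBeans private settings",
    "nbproject/private",
]


def write_hgignore(repo_root):
    """Write the ignore patterns to .hgignore in the repository root."""
    path = os.path.join(repo_root, ".hgignore")
    with open(path, "w") as handle:
        handle.write("\n".join(HGIGNORE_LINES) + "\n")
    return path
```

Glob patterns in `.hgignore` are not anchored to the repository root, so a name like `uploads` matches at any depth. The `.hgignore` file itself should be committed so the whole team shares it.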
I think if you have everything from above explained, you have pretty much covered 80% of the whole version control thing. You can now start to move on to some more advanced things: database versioning, a stable branch, and a build/deploy system.
I know a lot of people will say that database versioning should also be set up from the beginning. While I agree 100% with this, database versioning is, pardon the wording, fucking hard to get right. Better to leave it until your team is at least somewhat comfortable with version control.
The solutions usually revolve around keeping your initial database schema under version control, and every time you make a change to it, you place the diff (e.g. CREATE TABLE statements) in a separate file, next to the initial db. This is enough as long as you don’t need to switch back to earlier revisions or to different branches, because then you would need to drop your whole database and reconstruct it from the diffs. If you can live with this, use this method, because it’s simple and easy.
A better solution is to have two folders for db changes, one for going forward and one for going backward. With each change, you also store its inverse, e.g. if you had to CREATE TABLE, you also create a change file with a DROP TABLE in the backwards directory. This way, you can undo any db change without DROPping your whole database and recreating it from scratch. In my opinion, this method, with the help of some automation (so you won’t have to apply the db changes by hand), has the best ratio of effort to payoff.
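The forward/backward idea can be sketched in a few lines. This is an illustration under stated assumptions, not a tool recommendation: it uses SQLite, the table names are made up, the change pairs live in a list instead of two directories, and the current version is stored in SQLite's `user_version` pragma.

```python
import sqlite3

# Hypothetical change files: in a real project each string would live in its
# own file in the forward or backward directory.
MIGRATIONS = [
    ("CREATE TABLE surveys (id INTEGER PRIMARY KEY, title TEXT)",
     "DROP TABLE surveys"),
    ("CREATE TABLE answers (id INTEGER PRIMARY KEY, survey_id INTEGER, body TEXT)",
     "DROP TABLE answers"),
]


def current_version(conn):
    """Number of forward changes currently applied to the database."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn, target):
    """Apply forward or backward changes until the database is at target."""
    version = current_version(conn)
    while version < target:
        conn.execute(MIGRATIONS[version][0])   # go forward one change
        version += 1
    while version > target:
        version -= 1
        conn.execute(MIGRATIONS[version][1])   # undo the latest change
    conn.execute(f"PRAGMA user_version = {int(version)}")
    conn.commit()


def table_names(conn):
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]
```

Calling `migrate(conn, 2)` on an empty database creates both tables, and `migrate(conn, 1)` afterwards drops only `answers`, without rebuilding anything from scratch. A backward change that drops a table also drops its data, so this is for switching revisions during development, not a substitute for backups.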
There are also some excellent slides by Harrie Verveer about database versioning; I recommend flipping through them.
Designate a branch as stable, and make it mandatory that everything that goes into this branch gets deployed to production. Always develop in branches separate from stable, and merge those branches into stable when you are ready to deploy them.
Why is this good:
The features that you develop no longer depend on each other. If two people use the same branch for two different features, and continuously commit-pull-merge, the revisions will depend on each other. When they finish, they can either deploy both features or neither of them; there is no easy way to deploy just one. Unlike svn, in hg (or git) you are encouraged to work on separate branches, because merging works as intended.
Having a stable branch also allows you to quickly fix bugs that went out to production. You can update to the stable branch and fix the bug there, without worrying that code currently in development will get in your way (since it is not there).
A build/deploy system
Get a build system running. Also, don’t roll your own. In the beginning my simple script was enough, but later on we needed support for applying db changes, creating and chmodding directories, and executing scripts like the ORM’s generate proxies task, and it quickly became overwhelming to patch more and more functionality onto it. Look into proper tools like Maven, Jenkins, or Phing when you need to do more complex tasks during deploy.
I call the following people problems, because they usually have to do with people being lazy, or simply not caring, and they should be corrected as early as possible.
Some people cannot be arsed to write a commit message. Threaten to fire the idiots who cannot do this after their third warning. Both TortoiseHg, and the CLI version present you with the files that you have changed when committing. It does not take more than a minute (usually less than 10 seconds), to summarize what you did. It helps everyone to see what has already been developed, and speeds up looking for the changeset that could have possibly introduced a bug. It is also really nice to read through the logs, and see that the project is making progress.
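If warnings are not enough, a check can be automated. The function below is a minimal sketch of such a check; its name and thresholds are arbitrary assumptions. It could be called from a commit hook, but the hook wiring is not shown here.

```python
def is_acceptable_message(message, min_words=2, min_length=10):
    """Reject empty messages and single-word keyboard mashing."""
    text = " ".join(message.split())   # collapse whitespace and newlines
    return len(text) >= min_length and len(text.split()) >= min_words
```

With these thresholds, a message like "sddfasfasf" is rejected because it is a single word, while "Fix SQL injection in survey search" passes. A length check cannot tell whether a message is actually meaningful, so it only catches the laziest cases.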
Then there is the monolithic commit: someone forgets to commit, works for a week, and commits everything in a single monolithic changeset.
Lead: Hey, DeveloperA, you have been working for 5 days on feature X, any problems with it?
DeveloperA: Ah no, I already finished it two days ago.
Lead: I don’t see it in the VCS logs.
DeveloperA: Oh wait, I forgot to commit.
DeveloperA: *commits 7 days of work*
There is no excuse for forgetting to commit. Committing like this makes it harder for you to revert mistakes, because you cannot single out a change: you have to revert everything, and then reconstruct the code without the mistake. Whenever you are done solving a particular problem or task, commit. Don’t leave it to the end of the day or week like some chore.
People working on the production server
There was a problem with disappearing changes. A bugfix was completed and deployed, it was verified by multiple people to be working, and the bug reappeared a week later. The developer who made the fix confirmed that the code was indeed in the “unfixed” state, and nobody had a clue why. I asked him if he had committed the change into the VCS. He said no, it was only one file with a small change. I had to explain that if you do not commit something, it will get overwritten at the next deploy.
I’m not in the “developers should have no access to production” camp. IMO, it is okay to work on production when you have a good reason, like when the whole site whitescreens with no error message anywhere (common with PHP), or when the production server has something unique in its environment, like when you suspect that something gets messed up because of the load balancer, or when there is a bug that you cannot reproduce locally. But after fixing the problem on production, apply the fix locally as well, and push it to the VCS.
I really should have pushed harder for that presentation. It would have spared a lot of man hours to get some misconceptions out of the way early. Also, the build system. We still don’t have anything reasonable running (“we don’t have time for that”), even though it would spare us more hours in the long run, than we would spend getting it up and running.
I’d say it’s working out. We still have some problems with three way merges, but now we can see who made which changes to the code, fear is no longer associated with making big changes, we can work in parallel, and we are protected from losing code to a mistyped rm -rf. It’s definitely an improvement, but it could be even better.
[1] You can also use rebase, if you have the rebase extension enabled. It will reorder your changes so that they sit on top of the incoming (or any chosen) changeset.
[2] Revert-merge is when you update to the last changeset that you want to remove, revert everything back to the first changeset that you want to remove, and merge everything after the last changeset that you want to remove. I first read about it on Ehsan Akhgari’s blog.