Josh Duck

Twitter is CRUD

June 2nd, 2008

Any technical person is interested in solving big challenges. How to scale well (and cheaply) is one of the really big ones; and something that no one really seems to get right. I’ve read some interesting posts discussing Twitter’s problems lately (I’m a little slow to jump on this bandwagon, but like that’s going to stop me). I haven’t been a Twitter-er up until now, but it’s hard to miss talk of them in the blogosphere.

While my first, naive, assumptions were that Twitter was a simple read/write system that could be knocked up in a week, numerous people have pointed out the many challenges that the website faces. TechCrunch has revealed that Twitter handles 3 million messages a day, from 200,000 active users. Many people have pointed out the fact that some Twitter users have upwards of 30,000 followers, and are themselves following that number of members themselves. This lead to the general consensus that Twitter is really just a giant messaging system, not just your basic CRUD setup.

I found this conclusion to be a lot more insightful than the the usual “Rails sux” arguments that are floating around. The view was echoed by a Twitter obsessed friend the other day who asked my opinion on how a hypothetical Twitter clone could be built. He was also of the opinion that the difficult part would be creating a robust messaging system to distribute messages to storage shards and other distribution systems (SMS and any other “push” delivery mechanisms). Viewed from that perspective, it becomes an interesting engineering challenge. I’m sure that a stable system could be built without too many problems.

I was pretty surprised to read today that Twitter is running on a single master MySQL database with just two slaves serving out reads. Ouch. If they ran smoothly then there would be no reason to doubt that the company know what they are doing. However, this is obviously not the case. Even more confusing is an old blog post in which Twitter developers reveal that their biggest problems occur when users with many followers post. I can’t really imagine what setup would account for that. It sounds like a denormalized database contained within a single physical server, which just sounds like lot of effort for little gain.

For a company that has just secured a second, $15 million, round of funding their performance is ridiculous. Out of a over a dozen staff they have only three or four tech guys. If I were investing I’d be asking what the hell is going on. I’m not trying to pretend that Twitter would be an easy fix, I don’t envy the engineers. They have to maintain their current (failing) product while simultaneously pushing to build a completely different system behind the scenes. I’ve been there, and it’s not fun. Apart from having to siphon off precious developer time to keep the old system patched up, they have to face the uphill task of rebuilding everything. Having a huge user-base makes this even more difficult, as they need to get this right first try. No one is going to be happy if an updated version falls over, drops features, or is any way inferior to what they have now. It wouldn’t be surprising to see the new version of their back end systems continuously delayed far into the future.

It should be interesting to see how and if Twitter recovers from all of this. The user base seems to have been fairly forgiving up until now. Even I signed up knowing all the problems they have. Having many bloggers shouting their praise wont hurt them much. Even the many complaints are free press, really. I doubt an stable Twitter would have been written up on TechCrunch quite so much. They do face the threat of someone stealing their glory though. Each day they have outages is a gift to the dozens of Twitter clones that are undoubtedly out there. Just think Friendster; I’m sure they’ll be stable any time now.

A First Look at Python

April 11th, 2008

So I’ve been looking at and using Python recently. I thought I’d share some of my thoughts for those who haven’t had a chance to play with the language yet. I’ll try to avoid a preachy OMG-I’ve-just-discovered-the-best-thing-ever post, or to simply write another Python tutorial. I’ll look at the good and bad points of the language.I first looked at Python a month or two ago. The guy and girls over at programming.reddit.com push it as the language to end all languages, so I decided to grab a copy of the (free!) Dive Into Python book. I started putting together a smallish personal project, but with no external pressure it petered out. When a discussion came up at work (a PHP shop) on how to quickly write a reliable server daemon I pushed the idea of Python. It took a little convincing, but the results speak for themselves.

Picking an Editor

I believe that the editor can make a huge difference to how you perceive a new language. What might be a massively frustrating time-sapping bug in one could be pointed out and corrected be another. Python doesn’t have any kind of officially recommended editor, and it can take a few hours to really get the feel of an IDE. Picking an editor when you don’t even know a language is even worse.

Dive Into Python suggested using Activestate’s Pythonwin. Unfortunately this advice seems to be very dated; the IDE was functional, but basic. I appreciated the in-built debugger and auto complete, but the editor itself seemed a little too rudimentary.

I eventually switched over to Open Komodo, also by ActiveState. The IDE is very good and does a lot of hand-holding that newbies appreciate, like looking after whitespace issues, providing auto-complete and picking up syntax errors. It does have a few drawbacks: it’s an editor, not an IDE, so you will need to run and debug your code outside the editor. It’s missing a few basic features, like a function list, but the problems are fairly minor. It is built on Firefox’s XUL platform so perhaps we’ll see more extensions becoming available in the near future.

Interactive Shell

The number one tool for making Python easy for beginners would have to be the interactive shell. There have been countless times when I’ve just jumped over to the shell to test how certain Python functions work. Even copying and playing with examples in the Dive Into Python book helps you better absorb the information. In other languages testing functionality would mean I’d need to create a new file, add my code, save it to a temporary location and then execute it. The interactive shell actively encourages me to experiment with how object behave, rather than adopting a cargo-cult mentality of just using what has worked in the past.

The White Space Issue

Python’s use of white space seems to be a big issue amongst those that are not familiar with the language. It was one of the objections that I faced when proposing it at work. It tends to put Python in the “weird language” basket, which is unfortunate. After using the language I’d say that the white space issue is largely irrelevant.

It does have some negative effects. It means you’ll have to watch out for tab/space issues, but a good IDE should do that for you. It also make refactoring a little bit more difficult; I like to sometimes comment out a conditional statements, but that is not possible in Python. Sometimes I also like to indent my code for readability, for example if I’m printing out HTML I’ll indent some child elements to indicate that they are related to previous lines. Again, that’s not possible. The biggest issue I see is that there is nothing stopping you from accidentally breaking the flow of your code. If you took a Python code file and removed the whitespace you can loose meaning:

for item in list:

if item.available:

item.update()

item.check_stock()

It’s impossible to tell where the statements should be. Are we calling check_stock on every item, or just the available ones? Sure, this is contrived, but I can see something like this happening.

The advantages of the white space convention become very obvious very quickly. Python code is very compact. Not “what-the-hell-is-this-Perl-code-doing” compact, but actual readable compact non-ugly code. I’ve heard some people describe it as “prose”. That is going a bit far, but it is very neat and easy to read.

Lists, Dicts and Tuples

Python makes working with data sets incredibly easy. It has made me realise how much of my programming is actually just munging sets. Something I’d envisage as the driving part of a module can be converted from a nest of loops and conditional statements into a single line.

Python list comprehension is the magic that makes this happen. What makes it even better is that the code is just as readable, perhaps even more so, than the verbose multi-line version. Python’s syntax makes list invocations feel like a natural extension of for loops, meaning it is a great way to get programmers stuck in the imperative mindset on board.

#Hmmm

new_list = []

for item in old_list:

if item % 2 == 0:

new_list.append(item * 2)

#Yay, list comprehensions sort it all out

new_list = [item * 2 for item in old_list if item % 2 == 0]

I’ve used map and filter functions to do the same thing in the past, but the lambda functions feel like they are out of place when transforming a list.

Syntactic Sugar

People love to repeatedly trot out one or two new features in the blog posts they write when they’ve just discovered a language. I’m no different. However I usually end up look at these contrived examples with a skeptical “but how often do you really use that?” So I’ll do you a favour and share some features that will become second nature to you in Python.

Tuple (and list) unpacking is a really neat feature. It makes a lot of code very concise. In this example the range function returns a list of [0, 1, 2]. The values are unpacked and assigned to the three variables a, b and c.

a, b, c = range(3)

This gives you the cool feature of multiple return values in a way that fits into the language and doesn’t feel bolted on. Even better, it does it without needing to implement some one-off syntax to achieve it. You can use the same unpacking feature anywhere in your code. And yes, you will use it.

data = [(1, 3), (3, 6), (4, 7)]

print [x + y for x, y in data]

#x and y are automatically unpacked

Another neat feature is the way Python treats everything as an object. This means the following code is perfectly valid.

"!" * 5

"Hello world".split()

", ".join(values)

Note that join works on the string variable, not a list as you may expect. This is not as widely used as the tuple unpacking, but does have it’s place with string munging.

Modules

One of the (few) compliments given to PHP is that it has modules for pretty much anything. From my experience, albeit limited, with Python I’d have to say that it deserves the same accolades. The project I have been working on has made use of threads, queues, HTTP servers and clients, config and command line parsers as well as sub processes (including writing to stdin and reading stdout). Everything I’ve wanted has been available as a pre-packaged module. (OK, I lie, I wanted a HTML generator. I ended up downloading markup.py and was up and running in a couple of minutes). Each of the pre-packaged modules seems well written too. Their components can be set up and used with little or no boilerplate and does what I need with very few exceptions.

The actual module system is a welcome sight for a PHP programmer. It’s a lot better than the “throw everything into the global namespace” method I’ve grown used to. It also means you can be sure that your required modules actually exist at runtime, instead of facing the prospect of your program failing mid-execution with “function X does not exist”

Documentation

PHP has spoilt me with its fantastic documentation. Python documentation is adequate, but it could be better. The API reference shows the functions and classes within a module, but could benefit code examples, gotchas and so on. It would also be great to find more links between related parts of the Python documentation. Usually I’ll need to supplement the Python documentation with a Google search for more information. PHP’s comment section is great in this regard. If someone posts to the documentation then they usually have something worth saying.

It’s Guido

One cools thing I’ve noticed when searching for Python examples is that Guido van Rossum’s name keeps appearing all over the place. I don’t know much about Guido, but it’s cool to see a language creator being so heavily involved in his creation. It is a kind of vote of confidence that makes me feel the language is worth learning.

Am I a Convert?

So, Python’s a great language, but am I a convert? Is this the end of PHP for me? Well, no. It’s a bit soon to be making that call. I’ve only used the language for a few weeks, and the first couple of weeks in a new project are always the most productive.

There are still quite a few things I like about PHP too: its documentation, easy integration with Apache and the new OO features are making it much more bearable. My knowledge and experience with PHP is not something I want to throw away on a whim. I know PHP’s strengths and weaknesses. I know exactly how far I can push it before things go bad. That knowledge is not something to underestimate. At this stage Python is still an unknown. I have no idea how will it perform in a web environment or how it will handle itself under a large load.

So, Python is a language that I can choose to use when appropriate. I can honestly say it’s been enjoyable so far and I’ll look forward to learning more.

So You Want More?

If you want to learn Python right now then check out the free Dive Into Python book.

If you are looking for something more lightweight, then the Python in Ten Minutes tutorial is a good one.

Unfortunately I can’t find anything that fills in the gaps between these two resources. If you do know of something then please let me know by leaving a comment below.

Securing Your PHP Code - Databases

April 5th, 2008

SQL injection is a well trodden topic so I won’t go into too much detail.

For those who don’t know, the problem occurs when you fail to properly escape variables being placed into your strings. For example the SQL statement "SELECT * FROM users WHERE name = '$name'" will fail if $name is set to ' or '1' = '1. The string will be expanded to produce SELECT * FROM users WHERE name = '' or '1' = '1'. This is obviously not what you wanted, and could lead to very bad results when coupled with DELETE or UPDATE queries.

Read the rest of this entry »

Securing Your PHP Code - Server Security

April 5th, 2008

When protecting your server environment you’ll want to ensure that two things happen. Firstly, you’ll want to keep your scripts from prying eyes; you want to make sure that you don’t accept input that will break your code. Secondly, and most importantly, you want to stop anyone from executing their own code on your servers.

Read the rest of this entry »

Securing Your PHP Code - XSS

April 5th, 2008

Today I’m going to start a three part series looking at security issues affecting web developers. The specifics apply to PHP developers, but the general concepts carry across all technologies.

Any significant website is going to consist of three core layers: the client side code (HTML and JavaScript), server code (PHP) and a storage layer (MySQL). As a developer you should be aware of the security implications of each layer of technology and how you can best secure your code.

Read the rest of this entry »

Rainbow Tables

February 8th, 2008

Rainbox over Vancouver - by Progie

To most of you the term ‘rainbow table’ is probably familiar. You are probably aware that they are used to aid the reversing of one-way hashes, usually when trying to crack a password. I personally think that they are a nifty little hack, and so I’d like to explain a little about how they are implemented.

Back to Square One

When storing a password, any developer worth their salt is going to hash it using a one-way hashing function. This lets them check that a given password is valid, without risking the possibility that someone could steal their user’s passwords plain text passwords *cough*.

Of course, if that was a foolproof method of protecting passwords then this would be a very short post. If you do happen to find yourself in possession of a hashed password and you want to recover the plain text password then you need to figure our how to crack it. You can’t reverse a hash but you can try and search for a text string that produces an identical hash. The obvious method of attacking a password hash is a brute force attack. You can generate a list of every possible password and hash each until you have output that matches you hashed password.

The simplicity of this solution is defeated by the fact that you are going to be trying a lot of combinations to find your target. Say you’re searching for a five character password containing lowercase letters only. A naïve search is going to produce 26^5 possibilities. That’s 11,881,376 tests to find a very trivial password. Even if you are testing thousands of hashes a second it could take hours to find a match. If you’re clever then you’ll prepare a collection of passwords that require cracking so you can process them all at once. One sweep through all possible passwords will be enough to complete your entire collection.

This reuse of generated hashes leads us to the next obvious improvement; what if we save all possible passwords and hashes in a table and just do a quick lookup whenever we have a hash to check? This seems like an ideal solution at first glance, however when you look more deeply you’ll begin to appreciate exactly how much memory this would require. Each password is 5 bytes and its associated MD5 hash is 16 bytes. The minimum storage required for a naïve lookup table is (16 + 5) bytes x 11,881,376 possibilities. That’s 200MB. Don’t forget that we’re talking about 5 letter passwords here. Any increase in complexity (either in length or possible characters) increases the size requirements exponentially. To crack a secure password would require petabytes of storage for our table. Storing and searching that data efficiently becomes a whole new problem (Google has managed to inadvertently solve this problem).

So there you have it, the two problems we face are time and memory. What we need is a compromise. Wouldn’t it be fantastic if we could only store every thousandth hash, and extrapolate from there? The most obvious problem with that is that hashes are evenly distributed. Two passwords that vary by a single letter produce hashes that are as different as passwords that are completely different. So as it becomes difficult to use the hash/plain text passwords pairs we know to crack hashes we aren’t storing a plain text value for. Rainbow tables get around this in an interesting way.

Read the rest of this entry »