For many of them, if you look at them, you can see that if you knew they were handwritten digits you would probably guess they were 2's, but it's very hard to say what it is that makes them 2's. There is nothing simple that they have in common. In particular, if you try to overlay one on another, you will see that they don't fit each other. Even if you skew them, it's very hard to make them overlay on each other. So a template isn't going to do the job; in particular, it's going to be very hard to find a template that fits all of those 2's. And then take the 2's that really are harder to recognize: they definitely wouldn't fit the template. This is what makes recognizing handwritten digits a good task for machine learning.
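To make the overlay problem concrete, here is a tiny sketch. The two 5×5 "digits" below are made-up toy data, not real handwriting: the second is just the first shifted sideways, standing in for a different writing style. Even so, well under half of the ink lines up when you lay one on top of the other, which is the trouble a fixed template runs into.

```python
import numpy as np

# Two toy 5x5 binary "2"s (hypothetical data): the second is the
# first shifted right by one column, a crude stand-in for a
# different writing style of the same digit.
two_a = np.array([[1, 1, 1, 1, 0],
                  [0, 0, 0, 1, 0],
                  [0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 0],
                  [1, 1, 1, 1, 0]])
two_b = np.roll(two_a, 1, axis=1)

# Fraction of ink pixels the two images share when overlaid.
overlap = (np.logical_and(two_a, two_b).sum()
           / np.logical_or(two_a, two_b).sum())
print(round(overlap, 2))  # well under half the ink coincides
```

Both arrays are perfectly reasonable 2's, yet a template matched to one of them misses most of the other.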
Now I don't want you to think that's the only thing we can do with machine learning; it's a relatively simple thing for a machine learning system to do now. So let's talk about some examples of much more difficult things. We now have neural nets with a hundred million parameters that can recognize a thousand different object classes in 1.3 million high-resolution training images taken from the web.
There was a competition, and the best system got a 47% error rate in identifying the images. Jitendra Malik, who is an eminent neural net skeptic and a leading computer vision researcher (see https://people.eecs.berkeley.edu/~malik/ for more about him), has said that this competition is a good test of whether deep neural networks can work well for object recognition. A very deep neural network can now do considerably better than the systems in that competition: it can get less than 40% error for its first choice and less than 20% error for its top 5 choices.
Another task that neural nets are very good at is speech recognition. Speech recognition systems have several stages. First they pre-process the sound wave to get a vector of acoustic coefficients for each 10 ms of sound wave. That means they get a hundred of those vectors per second. They then take a few adjacent vectors of acoustic coefficients, and they need to place bets on which part of which phoneme is being spoken. They look at this little window and ask: in the middle of this window, what do I think the phoneme is, and which part of the phoneme is it? A good speech recognition system will have many alternative models for each phoneme, and each model might have three different parts, so it might have many thousands of alternative fragments that it thinks this might be.
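The framing-and-windowing step above can be sketched as follows. This is a minimal sketch: the 16 kHz sample rate, the crude 13-coefficient features, and the 5-frames-of-context window are all assumptions for illustration, not the exact choices a real front end makes.

```python
import numpy as np

SAMPLE_RATE = 16_000              # assumed sample rate (Hz)
FRAME_LEN = SAMPLE_RATE // 100    # 10 ms of samples -> 100 frames/second

def frame_signal(signal):
    """Split a waveform into non-overlapping 10 ms frames."""
    n_frames = len(signal) // FRAME_LEN
    return signal[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

def stack_context(features, context=5):
    """Widen each frame with `context` neighbours on each side,
    padding the edges by repeating the first/last frame."""
    padded = np.concatenate([np.repeat(features[:1], context, axis=0),
                             features,
                             np.repeat(features[-1:], context, axis=0)])
    windows = [padded[i:i + 2 * context + 1].ravel()
               for i in range(len(features))]
    return np.array(windows)

# One second of (random) audio -> 100 frames, each widened with context.
wave = np.random.randn(SAMPLE_RATE)
frames = frame_signal(wave)                      # shape (100, 160)
acoustic = np.abs(np.fft.rfft(frames))[:, :13]   # crude 13-coefficient stand-in
windows = stack_context(acoustic)                # shape (100, 13 * 11)
```

Each row of `windows` is the "little window" the recognizer looks at when placing its bets about the phoneme at the centre frame.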
And it has to place bets on all those thousands of alternatives. Once it has placed those bets, there is a decoding stage that does the best job it can of taking the plausible bets and piecing them together into a sequence that corresponds to the kinds of things people say. Currently deep neural networks pioneered by George Dahl (http://www.cs.toronto.edu/~gdahl/) and Abdel-rahman Mohamed (http://www.cs.toronto.edu/~asamir/) at the University of Toronto are doing better than previous machine learning methods for the acoustic model and are beginning to be used in practical systems.
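The decoding idea can be sketched with a toy Viterbi search. Everything here is made up for illustration: two labels instead of thousands of phoneme fragments, hand-picked per-frame scores, and a single transition penalty standing in for a model of what people actually say.

```python
import numpy as np

def viterbi(frame_scores, trans_scores):
    """Most likely label sequence given per-frame log-scores (T x N)
    and label-to-label transition log-scores (N x N)."""
    T, N = frame_scores.shape
    best = frame_scores[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = best[:, None] + trans_scores   # score of each (prev, cur) pair
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + frame_scores[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two toy labels over three frames: the middle frame's bet favours
# label 1, but switching labels costs 2, so decoding smooths it out.
frames = np.array([[0., -1.], [-1., 0.], [0., -3.]])
trans = np.array([[0., -2.], [-2., 0.]])
print(viterbi(frames, trans))  # [0, 0, 0] -- not the greedy [0, 1, 0]
```

The point is that the decoder doesn't just take the best bet at each frame; it trades frame-level bets off against how plausible the whole sequence is.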
Dahl and Mohamed developed a system that uses many layers of binary neurons to take some acoustic frames and place bets about the labels. They were working with a fairly small database and used 183 alternative labels, and to get the system to work well they did some pre-training. After standard post-processing they got a 20.7% error rate on a very standard benchmark dataset, which is like the MNIST of speech. The best previous result on that benchmark was 24.4%. A very experienced speech researcher at Microsoft realized that this was a big enough improvement that it would probably change the way speech recognition systems were done, and indeed it has.
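As a rough sketch of the shape of such a system (this is not Dahl and Mohamed's actual model: the layer sizes and the 143-dimensional input below are hypothetical, the weights are random rather than learned, and their real system also involved generative pre-training of each layer), a deep net that turns an acoustic window into bets over 183 labels looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    """Logistic units, matching the binary-neuron description."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical layer sizes: an acoustic window in, 183 labels out.
sizes = [143, 512, 512, 512, 183]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]

def forward(x):
    """Forward pass: logistic hidden layers, softmax bets at the top."""
    for w in weights[:-1]:
        x = logistic(x @ w)
    return softmax(x @ weights[-1])

bets = forward(rng.normal(size=(1, 143)))  # one row of bets over 183 labels
```

The softmax output is what gets handed to the decoding stage as the per-frame bets.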
If you look at recent results from several different leading speech recognition groups: Microsoft showed that this kind of deep neural network, when used as the acoustic model in a speech recognition system, reduced the error rate from 27.4% to 18.5%. Alternatively, you can view it as reducing the amount of training data needed from 2,000 hours down to 309 hours to get comparable performance. IBM, which has the best system for standard speech recognition tasks, showed that even its very highly tuned system, which was getting 18.8%, can be beaten by one of these deep neural networks. Google fairly recently trained a deep neural network on a larger amount of speech, 5,800 hours. That was still much less than they had trained their previous model on, but even with much less data it did a lot better than the technology they had before: it reduced the error rate from 16% to 12.3%, and the error rate is still falling. In the latest version of Android, if you do a voice search, it uses one of these deep neural networks to do very good speech recognition.