SELECT * FROM Celko December 2011

www.ted.com is a website that you ought to know as well as you know Google and Wikipedia. TED is a nonprofit organization, started in 1984, that bills itself as “devoted to ideas worth spreading.” The name is short for “Technology, Entertainment, Design,” and it began as two annual conferences held each spring in Long Beach and Palm Springs, California. Today, they host the TEDGlobal conference in Edinburgh, UK, each summer and run a video website, the Open Translation Project, TED Conversations, TED Fellows, TEDx programs, and the annual TED Prize.

I just watched a presentation by Luis von Ahn. He is a great speaker and an associate professor of Computer Science at Carnegie Mellon University.

You know him as one of the guys who invented Captcha (sold to Google in 2009). That is the annoying safety check on website accounts. If you have been living without Internet access, this is the distorted sample word that the user must read and re-type as a response. Bet you did not know that CAPTCHA stands for Completely Automated Public Turing Test To Tell Computers and Humans Apart. That explains the product in less than 245 words. The human brain is good at reading all kinds of fonts and distorted letters, and at dropping background noise in text. Computer OCR (Optical Character Recognition) software is not even close; most of the books published before 1950 cannot be scanned reliably.

Von Ahn’s talk is about taking this basic tool and using it for a crowd-sourcing project, reCAPTCHA. The two scrambled words you see are one machine-generated string and one actual word taken from a scanned book or document. When enough humans agree on the actual scrambled word, you can be pretty sure that is what it is.
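The "enough humans agree" idea can be sketched in a few lines of Python. This is only an illustration and not reCAPTCHA's actual rule: the function name `consensus` and the thresholds `min_votes` and `min_share` are assumptions made up for the example.

```python
from collections import Counter


def consensus(transcriptions, min_votes=3, min_share=0.75):
    """Return the agreed-upon word, or None if readers have not converged.

    min_votes and min_share are hypothetical thresholds chosen for
    illustration; the talk does not say what values reCAPTCHA uses.
    """
    # Ignore case and stray whitespace so "Morning" and "morning " match.
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(normalized) < min_votes:
        return None  # not enough readers yet
    word, count = Counter(normalized).most_common(1)[0]
    # Accept the most common reading only if enough readers share it.
    return word if count / len(normalized) >= min_share else None


print(consensus(["Morning", "morning", "morning ", "moming"]))  # morning
print(consensus(["cat", "cot"]))                                # None
```

Until a word clears the threshold, it simply keeps being shown to more people.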

Your first thought is that scanning a book one word at a time is going to take too much time. Nope. First of all, the work is done in a massively parallel fashion and secondly, CAPTCHA is used 200 million times a day. The average novel is 50 thousand words.
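A back-of-the-envelope calculation shows why the volume is not a problem. The sketch below takes the usage figure as 200 million a day and the novel length as 50 thousand words; the number of readers needed per word (`READERS_PER_WORD`) is my own assumption, since the article gives no figure for it.

```python
CAPTCHAS_PER_DAY = 200_000_000  # daily usage, taken as 200 million
WORDS_PER_NOVEL = 50_000        # average novel length from the text
READERS_PER_WORD = 10           # assumption: readers needed before a word is trusted


def novels_per_day(captchas=CAPTCHAS_PER_DAY,
                   words_per_novel=WORDS_PER_NOVEL,
                   readers_per_word=READERS_PER_WORD):
    """Whole novels transcribed per day if every CAPTCHA yields one reading."""
    return captchas // (words_per_novel * readers_per_word)


print(novels_per_day())                    # 400
print(novels_per_day(readers_per_word=1))  # 4000
```

Even with ten people checking every single word, that is hundreds of novels a day.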

Von Ahn’s new project is Duolingo, which aims to get 100 million people translating the Web in every major language. The idea is that a huge number of people on Earth are learning languages, and people can give a better translation than machines. In China, for example, attempts at signs in English are so funny and off the mark that this “language” is called “Chinglish,” which is also the name of a current Broadway comedy by Tony Award winner David Henry Hwang (M. Butterfly) about the misadventures of miscommunication. If you want to see some examples, look at http://www.dailymail.co.uk/news/article-497544/Chinglish-Hilarious-examples-signs-lost-translation.html.

But even within a language, there are style and judgment problems. I have had a regular column in a trade paper in the UK. Translating my US English into UK English, even in a technical publication, showed that more than the spelling separates us. The choices of sentence structure were just a bit different. I hope this project can eventually put all of Wikipedia and Project Gutenberg on the Web for all mankind.

But can these techniques be of use to me and my little company? I think so. Have you taken an online survey? I look at the cash register slips of every place I eat or shop. More and more, I find a customer survey website, log on, type in several nine- or ten-digit passwords and finally get to fill in five pages of check boxes. Like the lab rat, I would get my “food pellet” … well, a $5 coupon for my next meal at that establishment.

When I worked on the other side of a parking lot from a Buca di Beppo, and ate there four to five times a week, I did this ritual every day after lunch. Check the bank account to see the meal was billed and start clicking. It was ritual, and ritual is not information.

The surveys are also fundamentally wrong. There is a subjective experience tool called the Likert scale, which lets you rate something from 1 to 5. That range is big enough to do statistics and small enough to be repeatable over time. Most of the surveys use a 10 point scale. The questions are “store manager” things – Was the restroom clean? Were you served quickly? Was the quality of the food good? The free-text fields were what was really useful. You could comment on a good employee or a really bad one. People are more apt to attack bad service than to praise good service.

In short, the company is not getting good information from a self-selected sub-population (i.e., regular customers with idle time and a computer).

Ever play a pop culture bar game where two choices are thrown out and the guys are to pick who is hotter? This probably goes back to Biblical times when Roman soldiers were hanging around a tavern in Alexandria, playing Duodecim Scripta (ancestor of backgammon) and arguing, “Who do you think is hotter? Cleopatra or Bathsheba?” over a horn of wine.

In the 1960s, this game would evolve into such great questions as: Ginger or Mary Ann, Jeannie versus Samantha, Morticia versus Lily, Wilma versus Betty, and Veronica versus Betty? I understand that women play a version of this game with various singers and actors who played James Bond.

What makes this game work? Well, there is beer. But after that, there is a common cultural experience. You know about Gilligan’s Island, I Dream of Jeannie and Bewitched, The Addams Family and The Munsters, The Flintstones and Archie Comics.

The two “products” are similar and you have a general impression of each one. WOW! That is a lot of high level cognition! Have you played “coin toss” for a restaurant? When everyone says that they cannot decide between, say, Chinese or Italian, you toss a coin – heads Chinese, tails Italian. Catch the coin and cover it, then ask if they were hoping for heads or tails. Four times out of five, people will have a preference brought into focus. Put your coin in your pocket; the decision has been made. This is more high level cognition than reading a Captcha string!

The presentation of two choices over and over until an ordering is found is a well-known technique. If you want to look at software to make this technique practical for large option sets, look up Choice Analyst.
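To make the paired-choice idea concrete, here is a minimal Python sketch that turns a pile of "A beat B" votes into an ordering by ranking each option on its share of wins. This is a deliberately simple win-share tally, not the method Choice Analyst uses; the function name `rank_by_wins` and the third menu item are hypothetical, added only for the example.

```python
from collections import defaultdict


def rank_by_wins(votes):
    """Order options by the share of pairings they won.

    votes is an iterable of (winner, loser) pairs, one per answered question.
    Ties in win share are broken alphabetically.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return sorted(appearances,
                  key=lambda item: (-wins[item] / appearances[item], item))


votes = [
    ("Summer Squash Ravioli", "Calamari Ravioli"),
    ("Summer Squash Ravioli", "Lasagna"),   # "Lasagna" is a made-up third option
    ("Calamari Ravioli", "Lasagna"),
    ("Summer Squash Ravioli", "Calamari Ravioli"),
]
print(rank_by_wins(votes))
# ['Summer Squash Ravioli', 'Calamari Ravioli', 'Lasagna']
```

Each respondent answers only one quick either/or question, and the ordering falls out of the accumulated answers.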

Why not pop up a choice pair all over the web? No need to pay $5 for a detailed personal opinion. Drunks in a bar are happy to shoot pool and scream out assertions like “Ginger is a slut!” or “Wilma and Barney were cheating on Fred!” or worse.

Would it not be more useful to survey a huge population and find that people prefer Summer Squash Ravioli with pesto over Calamari Ravioli with Lemon Butter sauce as opposed to spending a fortune on learning that the restrooms are clean from a small self-selected population?

About Joe Celko

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.
