Another new bot? How CAPTCHA trains artificial intelligence and helps read ancient texts

A CAPTCHA is essentially a test aimed at distinguishing between bots and humans, blocking the former and admitting the latter. That’s why it’s called a Completely Automated Public Turing test to tell Computers and Humans Apart. But owing to its widespread use and the text-recognition power of the people solving it, CAPTCHA has been put to work in many other ways. Take a look.

Helping researchers read ancient texts

For decades, libraries, publishing houses and media companies trying to digitize texts have faced a timeless problem: people gathering around a single word and wondering, what does that word say?!

In some cases, the manuscripts were worn out; in others, old newsprint had faded.

It turns out that, through CAPTCHA, ragtag crowds can help, by metaphorically coming together and hazarding guesses at what the illegible letters might be.

Here’s how it works: Letters and words that character recognition software couldn’t reliably decode were reused as CAPTCHA challenges and presented in their distorted forms. When Internet users around the world typed in what they thought they saw, the results often produced an aha moment, confirming what the archivists had guessed. Other times, the test generated user consensus strong enough to produce high-confidence transcriptions for major archives, including that of The New York Times.
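
The consensus step can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual reCAPTCHA implementation: the function name `consensus_transcription`, the `min_votes` and `min_agreement` thresholds and the sample answers are all made up for the example.

```python
from collections import Counter
from typing import Optional


def consensus_transcription(
    answers: list[str],
    min_votes: int = 3,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Return the agreed reading of a word, or None if users disagree too much.

    Hypothetical thresholds: at least `min_votes` usable answers, and the
    most common answer must make up at least `min_agreement` of them.
    """
    # Normalise answers so that case and stray spaces do not split the vote.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None
    word, count = Counter(cleaned).most_common(1)[0]
    if count / len(cleaned) >= min_agreement:
        return word
    return None


# Illustrative answers typed by different users for one faded word.
typed = ["morning", "Morning ", "morning", "mourning", "morning"]
print(consensus_transcription(typed))
```

In this sketch, a word that too few people have seen, or on which the crowd is split, stays unresolved instead of being written into the archive with a guess.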

That wasn’t all. The technology would go on to change how archiving itself is done.

In 2007 and 2008, The New York Times reached out to Carnegie Mellon University, where CAPTCHA was developed, for help in digitizing its archive, which dates back to 1851.

By May 2009, the results were clear. “So far, puzzling words in archives spanning about 30 years have been deciphered using reCAPTCHA tests,” Marc Frons, chief technology officer for digital operations at The New York Times, said in the newspaper’s own article about the program.

Before that, typists spent nearly 10 years transcribing material, covering only 27 years’ worth of archives. Using reCAPTCHA, by 2011, the entire archive had been digitized and made publicly searchable.

Fading logo

Should users be told when a CAPTCHA serves a larger purpose?

The obvious answer is yes. But CAPTCHAs are designed to keep friction to a minimum, since they impede traffic simply by existing. So it was assumed that users did not need to know what else the test could be used for.

This, as with so much on the Internet, can be a slippery slope.

As early as 2007, some websites began experimenting with ad-supported, interactive CAPTCHAs, asking users to rate images or solve branded prompts, with undisclosed revenue models lurking in the background. Such attempts to monetize verification eventually raised privacy concerns and lost relevance.

Free AI training materials

Image-based reCAPTCHAs have been widely used, and continue to be used, to train algorithms ranging from artificial intelligence models to self-driving car software.

Every time we click on a grid of images to identify parts of fire hydrants, bicycles or traffic lights, we are contributing to huge data sets. In this way, AI has learned from all of us what a dove looks like; self-driving programs have learned how to see the world.
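
How many users’ clicks might turn into training labels can be pictured with a small sketch. It is an assumption-laden illustration, not a description of any real pipeline: the function `label_tiles`, the 9-tile grid, the 0.5 threshold and the sample clicks are all hypothetical.

```python
from collections import Counter


def label_tiles(
    responses: list[set[int]],
    num_tiles: int = 9,
    threshold: float = 0.5,
) -> dict[int, bool]:
    """Label a tile True if more than `threshold` of users selected it."""
    if not responses:
        return {tile: False for tile in range(num_tiles)}
    # Count how many users clicked each tile index.
    votes = Counter(tile for selected in responses for tile in selected)
    total = len(responses)
    return {tile: votes[tile] / total > threshold for tile in range(num_tiles)}


# Four hypothetical users asked to "select all tiles with a traffic light".
clicks = [{0, 1}, {0, 1, 4}, {0}, {0, 1}]
labels = label_tiles(clicks)
print([tile for tile, has_object in labels.items() if has_object])
```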

Users are not told that this is what their clicks are for. At the same time, those who did not contribute, or could not, found themselves disadvantaged.

reCAPTCHA was particularly difficult for the differently abled. The tendency of these tests to shut out some humans is now being addressed more broadly.

Government-run websites in India and around the world increasingly offer audio and textual alternatives to images. Major public platforms in multilingual countries have begun to support verification in multiple languages.

Developers have also begun experimenting with design-based tests in which the user must use an understanding of symmetry and spatial reasoning, rather than culture-specific references such as “pedestrian crossing” or “traffic cone.”

Keep your robotic assistants at bay

An interesting gray area in the world of bots vs. humans involves CAPTCHA solving services.

Such services employ very low-paid workers to solve thousands of challenges every day, on behalf of automated systems. Since humans are the ones doing the solving, this does not violate existing laws. But since it is done so that a bot can enter a space that is otherwise off-limits to it, it is not entirely above board either.

Such services cause problems for websites that offer time-bound services (concert ticketing, transport bookings) or that hold highly sensitive data (banking and healthcare platforms). These sites are forced to deploy increasingly complex CAPTCHAs in an attempt to deter and defeat them.

For example, the Indian Railway Catering and Tourism Corporation (IRCTC) uses complex image- and mathematics-based tests in its booking windows. Other sites have reduced the time frame within which an answer must be submitted.
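
A time-limited arithmetic challenge of the kind mentioned above might look something like the following sketch. It is not how IRCTC or any other site actually implements its tests; the names `MathChallenge`, `new_challenge` and `verify`, and the 30-second limit, are assumptions chosen for illustration.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class MathChallenge:
    question: str
    answer: int
    issued_at: float    # clock reading when the challenge was issued
    ttl_seconds: float  # how long the user has to respond


def new_challenge(ttl_seconds: float = 30.0,
                  now: Optional[float] = None) -> MathChallenge:
    """Create a simple addition puzzle with a deadline (hypothetical format)."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    issued = time.monotonic() if now is None else now
    return MathChallenge(f"{a} + {b} = ?", a + b, issued, ttl_seconds)


def verify(challenge: MathChallenge, submitted: str,
           now: Optional[float] = None) -> bool:
    """Accept only a correct answer that arrives before the deadline."""
    current = time.monotonic() if now is None else now
    if current - challenge.issued_at > challenge.ttl_seconds:
        return False  # answered too late
    try:
        return int(submitted.strip()) == challenge.answer
    except ValueError:
        return False  # not a number at all


challenge = new_challenge()
print(challenge.question)
```

Shortening `ttl_seconds` in a sketch like this leaves less time for a challenge to be relayed to a remote human solver, though it also leaves less time for genuine users.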

Handy hacker tool?

Ironically, the test meant to prove you’re human is now being used by hackers to get in somewhere they don’t belong. Fake CAPTCHA screens are designed to exploit the user’s instinct to solve a puzzle quickly. Once the user has been drawn in this way, the program behind the screen downloads malware onto the device and starts stealing data, hijacking the browser or carrying out other such activity.