Sample 3


 * Topic: Captcha Recognition **

 * Proposed by Shao Yun & Derek **

 * Background **

CAPTCHA, which stands for "Completely Automated Public Turing test to tell Computers and Humans Apart", is very commonly used in Web 2.0 applications. CAPTCHAs serve a simple and specific purpose: to block automated programs from abusing a system. CAPTCHAs are also unobtrusive, requiring only a few seconds of the user's time to read the characters in the image. They typically take the form of an image containing slightly warped text, although audio CAPTCHAs are starting to appear on the web for visually impaired users. In recent years, there has been much hype about automated programs, termed 'bots', decoding CAPTCHA messages.

Optical Character Recognition (OCR) is a field that deals with recognising text in images. OCR technology has been advancing rapidly, resulting in tools that allow quick “soft-copying” of hard-copy material such as books, as is done for the online library Google Books. Conventionally this was a daunting task, as human intervention was required to convert the images of a book's pages into selectable, readable text. With OCR, however, images containing text can be transcribed quickly into files, eliminating the need to refer to the physical book. This also preserves the book's contents: even if the book is later damaged beyond repair, its content remains stored digitally.

 * Aim **

Through our project, we hope to investigate the various methods of breaking a Captcha image using optical character recognition technology and, in doing so, show once again that Captchas are not foolproof.

 * Methodology **

1) Preliminary research
   a. Captcha research
   b. OCR method research
2) Programming

 * Research Scope **

· Effectiveness of Captchas
· Ways of identifying text in images
· Case studies of OCR
   * Google Books (using the open-source Tesseract)

 * End product **

A web application that accepts a Captcha as an input, processes it in the backend, and then displays the text in standard ASCII characters. The target Captchas that we are using are alphanumeric and in the English language.

 * Proposed Timeline **

· T1 Week 5 Friday – Completion of Proposal + Literature Review
· T1 Week 6 – Preliminary research + Coding (Learning network)
· T1 Week 8 – Research Paper 50% + Coding (Image splitting)
· T2 Week 1 – Research Paper 80% + Coding (Refining & Troubleshooting)
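
As a rough illustration of the "image splitting" step named in the timeline, the sketch below cuts a Captcha into per-character pieces by looking for blank columns. It is not the project's actual method. It assumes the image has already been binarised into rows of 0/1 pixels (1 = ink) and that characters do not touch, which warped Captchas often violate. All names (`column_ink`, `split_columns`, `split_characters`) are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: a list of rows, each a list of 0/1 pixels (1 = ink).
Image = List[List[int]]


def column_ink(image: Image) -> List[int]:
    """Count ink pixels in each column (a vertical projection profile)."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def split_columns(image: Image) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per run of inked columns."""
    ranges: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(column_ink(image)):
        if ink and start is None:
            start = x                    # a character begins here
        elif not ink and start is not None:
            ranges.append((start, x))    # a blank column ends the character
            start = None
    if start is not None:                # character runs to the right edge
        ranges.append((start, len(image[0])))
    return ranges


def split_characters(image: Image) -> List[Image]:
    """Crop the image into one sub-image per detected character."""
    return [[row[s:e] for row in image] for s, e in split_columns(image)]


if __name__ == "__main__":
    sample = [
        [1, 1, 0, 0, 1, 0, 1, 1],
        [1, 0, 0, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 1, 0, 1, 1],
    ]
    print(split_columns(sample))  # [(0, 2), (4, 5), (6, 8)]
```

Each cropped piece could then be passed to whichever recogniser the project settles on. Captchas with overlapping or joined characters would need a more robust segmentation than this blank-column rule.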