Neural networks offer a general-purpose solution to pattern recognition problems. They generalize much better than traditional programs and can run faster. Neural networks are not limited to any fixed set of characters and can learn to recognize just about anything, including tools, mechanical parts, aircraft, and cancerous cells.
Neural networks are also useful for determining context in conjunction with traditional OCR applications. For example, when reading a book or journal, a neural network can look at the words and tell whether it is reading a title, an author, a publisher, or a date. It has been difficult to get traditional programs to provide such contextual information quickly.
Electronic Data Publishing, Inc. (Brooklyn, NY) has incorporated a neural network into its OCR/database system. The system reads documents such as journals and papers, and places the information into a database for later retrieval into reports or catalogs. The neural network classifies the material read in by an OCR program into categories such as author, title, abstract, publisher, or date, so that it can be tagged before it is stored. "The neural network has saved $20,000 of labor costs in the first two months and allows the same number of people to get four times as much data through the system," said Ken Blackstein, designer of the neural network. The printed material contains too many variations to be classified effectively using a Prolog decision tree. The neural network approach was chosen for its ability to generalize well when given ample data.
This neural network is one of the largest, most successful designs known. The 1440-input, 20-output network was trained with 200 megabytes of data using BrainMaker running on the BrainMaker accelerator board. After roughly 100 training runs, the neural network converged to 96% accuracy on the training examples. In three months of use with new data, it has made no errors.
Before being read by a scanner, the material is photocopied, perhaps enlarged, and cleaned up by people, who may also use a felt pen to block out extraneous printed material. The printed pages are then scanned into a PC with the OmniPage (Caere Corporation) OCR program running under Windows. The words are then processed through the Soundex algorithm, which reduces the number of characters and produces a "word" that is similar to a phoneme.(1) This helps the neural network to generalize, because nearly identical printed words such as "Johnson," "Jonson," and "Johnsen" will appear the same to the neural network. It also reduces the number of inputs to the neural network, because Soundex "words" contain fewer characters than English words. The overall system is depicted in figure 1. The design is similar to Sejnowski's famous "NetTalk", except that a full line of text is input rather than seven characters, and the output is a classification rather than a phoneme for speech production.(2)
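To make the Soundex step concrete, here is a minimal Python sketch. The article does not say which Soundex variant or code length the system used, so this assumes the common four-character "American Soundex" rules. The function name `soundex` and its `length` parameter are illustrative only, not part of the described system.

```python
import string

# Digit classes of the common "American Soundex" variant.
# Assumption: the article does not specify which variant was used.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str, length: int = 4) -> str:
    """Return the Soundex code of `word` (hypothetical helper)."""
    # Keep ASCII letters only, so trailing punctuation from OCR is ignored.
    letters = [c for c in word.upper() if c in string.ascii_uppercase]
    if not letters:
        return ""

    code = letters[0]                     # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")     # its digit still suppresses repeats

    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != prev:             # collapse runs of the same digit
                code += digit
            prev = digit
        elif ch not in "HW":
            prev = ""                     # a vowel (or Y) separates repeats
        # H and W are skipped and do not separate repeated digits
        if len(code) == length:
            break

    return code.ljust(length, "0")        # pad short codes with zeros


for name in ("Johnson", "Jonson", "Johnsen"):
    print(name, soundex(name))
```

Under these assumptions, all three spellings reduce to the same code, J525, so the network sees them as identical inputs.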
The output of the neural network is used to place the text into a database developed with Netware (Novell, Inc.). Currently, medical literature is online with 600 megabytes of data, which is roughly equivalent to 200,000 pages of printed information. Electronic Data Publishing, Inc. plans an engineering database, which would require training another neural network that understands engineering terms.
(1) The Soundex algorithm (unknown source at this time)
(2) T. Sejnowski and C. Rosenberg, "NetTalk: A Parallel Network That Learns to Read Aloud," NeuroComputing, 1986.