Helsinki, 1 July 2005
The Revolutionary Morfessor Method – Computer Learns Word Structure on Its Own
The multitude of languages in the world, even in Western Europe alone, creates many problems for software developers and, unfortunately, for software users as well. Internet search engines are generally unable to deal with compound words or inflected forms. Although rare in English, compound words are the rule rather than the exception in many other languages, such as Finnish, German, and Turkish.
If you use Google to search for Finnish recipes for rhubarb pie, you almost have to be a linguist to succeed: you must take the word "raparperipiirakka" (rhubarb pie), split the compound into "raparperi + piirakka", and try to generate all relevant inflected forms of the words or of the compound ("raparperi+a", "piirakka+an", "raparperipiiraka+ssa", "raparperipiiraka+n", etc.). Then you try these one at a time and in different combinations in the search. This is both slow and very tiresome. Isn't this just the kind of work that computers should do for us?
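To give a feeling for the manual work described above, here is a minimal Python sketch that builds such a query by hand. The helper name `expand` is hypothetical, the stems and endings are only the few forms listed above, and joining the terms with `OR` assumes a search engine that accepts that operator.

```python
def expand(stem, endings):
    """Attach each ending to the stem, giving one search term per form."""
    return [stem + ending for ending in endings]

# Stems and endings copied from the forms listed above; a real search
# would need many more of them, chosen by someone who knows Finnish.
terms = (
    expand("raparperi", ["", "a"])
    + expand("piirakka", ["", "an"])
    + expand("raparperipiiraka", ["ssa", "n"])
)

# Assumption: the search engine treats OR as "match any of these terms".
query = " OR ".join(terms)
print(query)
```

Even this tiny list has to be written out by a person who already knows how the words inflect, which is exactly the effort one would like to hand over to the computer.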
Facing the Challenge: Software That Learns
The practical problem is that coding all this information manually for every language is a huge amount of linguistic work: prohibitive, in fact, for many smaller languages with few research resources. To make things harder, it is impossible to foresee all the new words and inflected forms that a system will need to handle in the next few years. One solution is to develop methods that learn by themselves, in this case just by looking at large amounts of text.
At Helsinki University of Technology, Finland, we have developed a method and a software package called Morfessor that automatically learns to segment words into meaningful units. No grammar or language-specific rules need to be given, just a collection of text in the relevant language. From this text Morfessor learns statistically which short segments a word most probably consists of.
So far the program has been applied to Finnish, English, and Turkish, and it seems to work quite well in these very different languages.
For example, from English text the software has learned that the word "masterpieces" probably consists of the segments "master + piece + s". Other words that contain the segment "master" include "schoolmaster" and "concertmaster".
From Finnish text Morfessor has learned that the segment "ssa" ("in") is likely to be a suffix (word ending), since it appears in many different word forms and often near the end of the word. It has seen examples such as "Sisilia + ssa" (in Sicily) and "auto + ssa + mme + kin" (also in our car). Therefore, when it faces a new word, say "Kaledoniassa", Morfessor infers that it probably consists of the segments "Kaledonia + ssa" (in Caledonia). On the other hand, it does not incorrectly divide the word "kissa" (cat) as "ki + ssa".
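The idea of choosing the most probable segmentation can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the Morfessor algorithm itself: the probabilities in `SEGMENT_PROB`, the penalties `UNKNOWN_PROB` and `CHAR_PROB`, and the function names are all invented for this illustration, whereas a real system would estimate its model from a large text collection.

```python
import math

# Hypothetical probabilities, invented for this illustration.
SEGMENT_PROB = {"sisilia": 0.001, "auto": 0.01, "kissa": 0.005,
                "ssa": 0.05, "mme": 0.02, "kin": 0.02}
UNKNOWN_PROB = 0.001  # assumed penalty for any segment never seen before
CHAR_PROB = 0.1       # assumed extra penalty per letter of an unseen segment

def log_prob(piece):
    """Log probability of one segment under the toy model."""
    if piece in SEGMENT_PROB:
        return math.log(SEGMENT_PROB[piece])
    return math.log(UNKNOWN_PROB) + len(piece) * math.log(CHAR_PROB)

def segment(word):
    """Return the most probable split of `word`, found by dynamic programming."""
    word = word.lower()
    # best[i] holds (log probability, segments) for the best split of word[:i]
    best = [(0.0, [])]
    for end in range(1, len(word) + 1):
        candidates = []
        for start in range(end):
            score, pieces = best[start]
            piece = word[start:end]
            candidates.append((score + log_prob(piece), pieces + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]

for w in ["Kaledoniassa", "kissa", "autossammekin"]:
    print(w, "->", " + ".join(segment(w)))
```

With these made-up numbers the sketch splits "Kaledoniassa" into "kaledonia + ssa" even though the stem is not in its table, because one unseen stem plus a frequent ending is cheaper than one long unseen word. It leaves "kissa" whole, because the known word is far more probable than an unseen "ki" followed by "ssa".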
How Do We Hear Words From Foreign Speech?
When you listen to a completely unfamiliar language, it is at first impossible even to tell where one word ends and the next begins. Trying to write down what was said without understanding any of it seems like an immensely hard task. This is what automatic speech recognition programs try to do.
When a computer attempts to recognize human speech, that is, to convert the sound signal into text, it must have an idea of which words it may encounter. One could say that it cannot even "hear" a word unless the word is already in its vocabulary. At the same time, to keep up with the speed of natural speech, the vocabulary must stay reasonably small.
Unfortunately, in a language such as Finnish, words have far too many inflected forms to be listed as such in any vocabulary: a single noun can appear in 2000 different inflected forms. Such word lists also rapidly become outdated: new compound words are invented all the time, and new foreign names rise into the spotlight of the news. With Morfessor the vocabulary can consist of shorter word segments instead. This means fewer and shorter entries in the vocabulary, and a better ability to analyze totally new words.
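The difference between a word vocabulary and a segment vocabulary can be sketched in a few lines of Python. The two tiny vocabularies and the function name `covered_by_segments` are hypothetical and chosen only for illustration; a real recognizer works with sound signals and probabilities, not with a simple yes-or-no lookup like this.

```python
# Hypothetical toy vocabularies, invented for this illustration.
WORD_VOCAB = {"auto", "autossa", "talo", "talossa"}       # whole word forms
SEGMENT_VOCAB = {"auto", "talo", "ssa", "mme", "kin"}     # shorter segments

def covered_by_segments(word, segments):
    """Return True if `word` can be built by concatenating vocabulary segments."""
    # reachable[i] is True when word[:i] can be built from segments
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in segments
            for start in range(end)
        )
    return reachable[-1]

for w in ["autossa", "talossammekin", "autossammekin"]:
    print(w, w in WORD_VOCAB, covered_by_segments(w, SEGMENT_VOCAB))
```

In this sketch the word list recognizes only "autossa", while the five segments cover all three words, including the long forms "talossammekin" (also in our house) and "autossammekin" that nobody listed in advance.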
Better Language Tools
When Morfessor was applied to the recognition of continuous Finnish speech, the error rate dropped remarkably, to nearly half that of a standard word-based recognizer. Similar improvements were obtained in the recognition of Turkish speech.
We predict that the Morfessor method could also be useful in automatic or semi-automatic machine translation, but that is a topic for another story yet to be written.
It is also conceivable that students of a foreign language, say Finnish, might benefit from a method that can tell the probable segmentation points of very long foreign words. After all, most Finnish words in newspaper text cannot be found in dictionaries as such, since dictionaries list no inflected forms and only a limited number of compound words.
Demonstration and Free Software Package
To see for yourself how well the method actually works, you can try the demonstration on any Finnish or English words at the address www.cis.hut.fi/projects/morpho/. A free software package is provided at the same location. By making the software freely available we hope to encourage putting these research results into practice. Hopefully, with these kinds of tools, the developers of language applications can indeed make our daily lives easier!
Dr. Krista Lagus is a lecturing researcher at the Laboratory of Computer and Information Science at Helsinki University of Technology. Her research concentrates on adaptive language modeling. The Morfessor method has been designed in collaboration with Mathias Creutz.