The target audience for this website includes many people whose first language is not English. Our log statistics show that, though most visitors are in English-speaking countries, this site is also popular with visitors from Brazil, Germany, France, Italy, and Belgium.
The British Council estimates that the world has about 375 million people who speak English as a first language, another 375 million who speak it regularly as a second language in a country where English has some semi-official status (such as India), and about 750 million more people who speak English as a foreign language. For more details, see Barbara Wallraff's article What Global Language? from The Atlantic Monthly for November 2000. The implication I drew from this article was that at least half the people who know English don't know it perfectly.
Thinking of my own limitations, I've tried to make this website easily understood by readers whose English is at about the same level as my French. I assume that if you're reading this page you are well educated in your own language, otherwise you wouldn't be reading about the technical subject of media research methods. I guess that you've completed your secondary education, and perhaps some tertiary education as well. I also guess that you've been learning English for many years, and know at least 10,000 words.
If you are finding this page difficult to read, perhaps you could try one of the many online translation systems. Their translations are clumsy, but they are slowly improving, and at least they save you from having to look up every unfamiliar word in a dictionary.
I guessed that a computerized simplicity checker might exist. It would work like Microsoft Word, which puts red squiggles under words whose spelling it doesn't know, and green squiggles under grammar that it dislikes. The simplicity checker could be a third-party add-in to Word. You could set it to different vocabulary levels, such as 5,000 or 10,000 or 20,000 words. It would put blue squiggles under rare words (not in its built-in dictionary).
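The "blue squiggle" idea can be sketched in a few lines of Python. This is only an illustration of the principle, not a real add-in for Word: the tiny KNOWN_WORDS set is a made-up stand-in for a 5,000 or 10,000 or 20,000 word list, and the function name is my own invention.

```python
import re

# Hypothetical stand-in for a real vocabulary list of 5,000 to 20,000 words.
KNOWN_WORDS = {"the", "cat", "sat", "on", "a", "mat", "and", "looked", "at", "moon"}

def flag_rare_words(text, known_words):
    """Return words in text that are not in known_words, in order of first appearance."""
    flagged = []
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        if word not in known_words and word not in flagged:
            flagged.append(word)
    return flagged

sample = "The cat sat on the mat and looked at the gibbous moon."
print(flag_rare_words(sample, KNOWN_WORDS))  # ['gibbous']
```

A real checker would need a much larger list, and some way of recognising different forms of the same word.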
If people are going to look up a word in the dictionary - or have translation software do it for them - it would be better to avoid a very common word with multiple meanings and replace it by a less common word with only one meaning: for example, writing "gift" instead of "present". Because this software could not know what was intended in a particular instance, the simplicity checker would warn the writer that a word was ambiguous, perhaps using an orange squiggly underline. With the simplicity checker, clicking the right mouse button on a word such as "present" would produce a list of synonyms, and let the writer choose the most specific word.
The third squiggle could be a purple one, placed under unclear and ambiguous common phrases. Native English speakers use these phrases without thinking, but they often puzzle learners. Why don't these learners get with it? Let's see to it that they do. They should catch on fast.
The purple phrases above (sorry, your browser can't do squiggles) would puzzle many learners of English. If they looked up those words in a dictionary, they would find so many different meanings for each word that it would be impossible to work out the meaning of the purple sentences. (I found the same problem with Vietnamese, when I was trying to learn that language.)
The Voice of America broadcasts some programs in what it calls "Special English" (a development of Ogden's "Basic English," which was popular in the 1950s). "Special English" uses a 1500-word vocabulary, concentrating on terms found in news bulletins. But the problem with reducing the vocabulary is that phrases such as the ones shown in purple above all fit inside the 1500-word limit, yet are still not easily understood. For more, see these pages on Special English and Basic English vs Special English.
Unfortunately, there seems to be no such thing as simplicity-checking software - even remotely like the description above. Audience Dialogue tried to persuade a few software developers to make their fortunes by writing this software, but with no success so far. In the meantime, there are a few widely available tools to use.
Back in the 1940s, Rudolf Flesch and others produced some readability formulas, designed to help in producing school textbooks. These formulas are very simple, mostly based on the number of syllables per word and the number of words per sentence. They don't take meaning into account at all, so it would be possible for a passage to achieve a high readability score, but still to be difficult to understand.
Most high-powered word processors now include readability statistics. Microsoft Word has this facility - though as so often with Microsoft, it's hard to find. (WordPerfect does it better.) Readability statistics are also produced by Lotus Wordpro, Nisus, and Star Writer, though we haven't tried the latest versions of those. Their readability statistics replace only the blue squiggles of our nonexistent program (rare words), not the orange squiggles (ambiguous words) or the purple squiggles (unclear phrases). However, the statistics are better than nothing.
To see readability statistics in Word, you have to first enable the grammar check, then check the whole document. Only then does it deign to show you the readability statistics. This can be very slow. To speed up the grammar checking, turn off most of the settings under Tools > Options > Spelling and Grammar.
Word provides Flesch scores and Flesch-Kincaid levels. The Flesch scale ranges from 0 (very difficult) to 100 (very easy). When writing for well-educated international audiences, I aim for a score of around 60. If the score is below 40, the text becomes much more difficult to read. If the score is above 80, the text is very easy, but only the simplest ideas can be expressed.
This page (at the time of first writing) had a Flesch score of 54.1. I thought that the page could be a little simpler, so I edited it, to make it more readable. Simply by splitting a few long sentences into shorter ones, the Flesch score increased to 56.9.
The other readability score shown by Word is the Flesch-Kincaid score, which corresponds to the US school grade: a number between 1 (children aged about 6) and 12 (aged about 18). On this website we mostly aim for a Flesch-Kincaid level of 9 to 10 - equivalent to an average 15-year-old in an American school. If that seems too easy, bear in mind a 1996 study by Doak, Doak, and Root. They found that 50% of US adults don't understand text beyond the grade 10 level. The Flesch-Kincaid score for this page was originally 9.9, but shortening the sentences changed it to 8.8.
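To show how simple these formulas are, here is a rough Python sketch of both scores. The two formulas are the published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is my own crude approximation (it counts groups of vowels), so the results will not exactly match those from Word or any other program.

```python
import re

def count_syllables(word):
    """Crude estimate: count vowel groups, ignoring a final silent 'e'."""
    word = word.lower()
    if word.endswith("e") and not word.endswith("le"):
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid grade) for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return ease, grade

one_sentence = ("I thought that the page could be a little simpler, "
                "so I edited it, to make it more readable.")
two_sentences = ("I thought that the page could be a little simpler. "
                 "So I edited it, to make it more readable.")
print(readability(one_sentence))
print(readability(two_sentences))
```

The two samples contain exactly the same words. Splitting the sentence in two raises the Flesch score and lowers the grade level, which is the effect described above.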
I've found that instead of checking a whole document in Word, it's better to check sections of it: a screenful at a time. If you know that a section is not very readable, you know where to fix it. But when a whole document scores poorly, you don't know where the problem is.
WordPerfect's reading-ease section is easier to use than Word's, and it has a wider range of tests, too. But we've stopped using WordPerfect, simply because most other people have too. A pity.
Possibly the most comprehensive software for readability calculation is from Micro Power and Light. It's an odd name (sounds like a small electricity company) but their software offers nine different measures of readability. Another useful website on this topic is Kathy Schrock's Home Page, which discusses readability measurement.
How would a simplicity checker choose which 10,000 or 20,000 words to use? One clue is to study the dictionaries that learners of English are using. In my work in developing countries, I've noticed that by far the commonest English dictionary used by people I've worked with is the Oxford Advanced Learners' Dictionary - usually an old edition. Unlike other Oxford dictionaries (which don't describe word meanings in a way that's clear to non-native speakers), the Advanced Learners' Dictionary (often called OALD) does this well. It has about 25,000 headwords: useful for advanced readers (if you're studying Shakespeare or Dickens), but many of those words aren't necessary for modern writers. I suspect you could express almost anything in English with far fewer words than the OALD contains.
Another excellent dictionary for non-native speakers of English is the new Longman Dictionary of Contemporary English. All its definitions use a standard 2000-word vocabulary, and the commonest 3000 words are shown in a different colour. Collins Cobuild Learners' Dictionary is also very good. For a long list of online dictionaries and wordlists, see the page from www.puzzlers.org on "Our Collected Wordlists".
There are at least 3 ways of counting words, so let's be clear what we mean by vocabulary. Take the simple word "jump". This is one word, but it can be either a verb or a noun. When it is a verb, it has 4 forms: jump, jumps, jumping, and jumped. When it is a noun, it has 2 forms: jump (singular) and jumps (plural). Then there are 2 extra nouns: jumper and jumpers.
The total number of different meanings is 8, but the number of different words is 6 (because jump and jumps have two meanings each - verb and noun). All of them belong to one word family: jump. In a dictionary, there would be 3 headwords: jump (noun), jump (verb) and jumper (noun). The technical term is lemma, which generally means the same as "headword."
Another term is non-orthographic words. These are phrases, spelled as two or more words, but used as if they were one word. Their meaning cannot usually be guessed, even if you know all the words. For example, "jump at" (="eagerly accept") has a different meaning from the verb "jump" and could be considered a non-orthographic word.
So depending on exactly what you mean by "word", the term "jump" can be anything from one to 8 words - not counting the non-orthographic words. On this page, when I say "word" (without otherwise defining it) I mean different spellings: such as the 6 forms of "jump." This is also the standard used in computerized spelling dictionaries.
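The different ways of counting "jump" can be checked with a short Python sketch. The table below simply lists the eight forms named above, and the column layout is my own choice for this illustration.

```python
# Each entry: (spelling, part of speech, headword, word family)
FORMS = [
    ("jump",    "verb", "jump (verb)",   "jump"),
    ("jumps",   "verb", "jump (verb)",   "jump"),
    ("jumping", "verb", "jump (verb)",   "jump"),
    ("jumped",  "verb", "jump (verb)",   "jump"),
    ("jump",    "noun", "jump (noun)",   "jump"),
    ("jumps",   "noun", "jump (noun)",   "jump"),
    ("jumper",  "noun", "jumper (noun)", "jump"),
    ("jumpers", "noun", "jumper (noun)", "jump"),
]

counts = {
    "meanings": len(FORMS),                                       # every form and use
    "spellings": len({spelling for spelling, _, _, _ in FORMS}),  # what a spelling checker counts
    "headwords": len({headword for _, _, headword, _ in FORMS}),  # dictionary entries
    "families": len({family for _, _, _, family in FORMS}),       # word families
}
print(counts)  # {'meanings': 8, 'spellings': 6, 'headwords': 3, 'families': 1}
```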
Word comes with many English dictionaries: British, American, and Australian, as well as others. They're all large, though. I looked for a small dictionary, fitting the vocabulary of many learners of English - perhaps about 10,000 words, instead of the usual 100,000. In this I was unsuccessful. I thought of editing the standard dictionary in order to delete unwanted words, but could not do it, because the dictionary is in some encrypted or compressed format.
Next I investigated creating an "exclusion dictionary," which puts red squiggles under any words in that dictionary - but I discovered that exclusion dictionaries are limited to 10,000 words, so are not much use in this case. At best, an exclusion dictionary might reduce the main dictionary from about 100,000 words to about 90,000.
However, if you want to point out a limited number of words that are ambiguous or difficult, an exclusion dictionary could be useful. The main problem, if you are not a perfect speller, is that words underlined in red would then be either spelling errors or deliberately excluded words - and you couldn't tell which. For more on Word exclusion dictionaries, see the article How to "remove" a word from Word's main Spelling Dictionary, by Suzanne S. Barnhill and Dave Rado.
After trying various combinations of formats and directories, at last I made it work. The exclusion dictionary must have one word per line, in standard text file format. If you create it with Notepad, it will automatically be in that format. The first line is ignored, so perhaps this is reserved for a heading. I found that, using Word 2000 with Windows 2000, and "Australian English" as the regional setting, the exclusion dictionary must be called Mssp2_ea.exc and it must be in the directory
C:\Program Files\Common Files\Microsoft Shared\Proof
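If your list of excluded words is long, you could build the file with a short script instead of Notepad. The Python sketch below follows the layout described above: a heading on the first line (which is ignored), then one word per line. The function names are my own, and the cp1252 encoding and Windows line endings are my assumptions about what "standard text file format" means here. I have not tested this with other versions of Word.

```python
import os
import tempfile

def exclusion_lines(words, heading="Exclusion list"):
    """Heading first (the first line is ignored), then one unique word per line."""
    unique = sorted({w.strip() for w in words if w.strip()})
    return [heading] + unique

def write_exclusion_dictionary(path, words, heading="Exclusion list"):
    # Assumption: plain text as Notepad would save it (cp1252, CR+LF line ends).
    with open(path, "w", encoding="cp1252", newline="\r\n") as handle:
        handle.write("\n".join(exclusion_lines(words, heading)) + "\n")

# Demo: write to a temporary folder, not to the Proof directory itself.
demo_path = os.path.join(tempfile.mkdtemp(), "Mssp2_ea.exc")
write_exclusion_dictionary(demo_path, ["thee", "wont", "thee"])
print(demo_path)
```

After checking the file, you would copy it by hand into the directory named above.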
Finally I realized that the best use of an exclusion dictionary (for those who have no problems with spelling) is to put red lines under words that really exist in English, but are so rare that when you type one it's probably a mistake. For example, if you write "thee" you probably intended to write "there" or "these" or "three". So a large spelling dictionary is not always an advantage.
How would you choose the common words? This is not as difficult as it might seem. Many word lists are available, showing the frequency of word use in different types of English. The BNC (British National Corpus) is the most formidable of these, offering hundreds of thousands of words in descending order of use. A more succinct list is the 7,726 words in the BNC that occur at least 10 times per million words. This is from the book Word Frequencies in Written and Spoken English by Geoffrey Leech, Paul Rayson, and Andrew Wilson - published by Longman, London, in 2001. Another list, drawn from a combination of the well-known Brown (US) and LOB (UK) Corpuses, gives the 5,066 commonest words (those occurring more than 10 times per million words) in writings mostly from around the 1960s.
Whether the figure is 7,726 (at least 10 times per million) or 5,066 (more than 10 times per million), that's not a lot of words. 10 times per million words is once per 100,000 words - which is once in a 300-page book.
Research by Hirsch and Nation in 1992 [link not working, March 2007] shows that people who know fewer than 98% of the words in a document usually don't understand it fully. But people who know 99% of the words generally don't have problems. Thus it seems that if only 1% of the words are unknown, people can work out their meanings from their context (as I can do with a French newspaper - helped by the fact that the technical words in French can usually be guessed from their English equivalents). But if 2% of the words in a document are unknown, it becomes too difficult.
At first it's surprising that there's such a big difference between knowing 98% and knowing 99% of the words, but consider this: in the British National Corpus, the words that make up 98% of spoken English are those occurring at least 7 times in 4 million words. This list has 9,633 different words (including some non-orthographic words).
But if you want to know the words that occur 99% of the time, this involves including all the words that appear at least 3 times in 4 million words. Now we have 16,386 different words. So a person who knows 99% of words in a text, with a vocabulary of 16,386 words, knows almost twice as many different words as the person who knows 98% of the words (with a vocabulary of 9,633 words).
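If you have a frequency list, you can work out for yourself how many words are needed to reach 98% or 99% coverage. The Python sketch below shows one way of doing it. The function name is my own, and the frequency list is an artificial one (each word is used about half as often as the word ranked at half its position), not the British National Corpus. The numbers it prints describe only that artificial list.

```python
from collections import Counter

def vocabulary_for_coverage(frequencies, target):
    """Smallest number of commonest words whose uses reach the target share of all words."""
    total = sum(frequencies.values())
    covered = 0
    for size, (_, count) in enumerate(frequencies.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return size
    return len(frequencies)

# Artificial frequency list: the word at rank i is used about 1,000,000 / i times.
toy = Counter({f"word{i}": 1_000_000 // i for i in range(1, 20_001)})

for target in (0.98, 0.99):
    print(target, vocabulary_for_coverage(toy, target))
```

Because the rarest words are each used so seldom, the last 1% of coverage needs many more words than the 1% before it.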
Combining all that information, it seems that if you know about 15,000 or more different words in English, that level of knowledge will be self-sustaining. This is the level that the psychologist Vygotsky called the "Zone of proximal development" - when a bird learns to fly, for example. Many learners reach this level after studying English for about 5 years. Here's a list of the 15,000 commonest words, based on the British National Corpus. This is the level we're aiming at on this website. If we use a word not in the most common 15,000 we'll try to explain it (just as "lemma," "headword," and "non-orthographic" have been explained in this page).
You may have noticed a hidden assumption: that the commonest words in English are those that learners are most likely to know. Though this is probably true in general, I suspect that some fairly rare words are well known to students of English, and some common phrases are not at all well known except to native speakers. Some scientific words will be easily recognized, though they are rare in most writing, because they are similar in many languages - e.g. "photosynthesis."
Around the world, millions of people have been learning English, and many of them don't know the language very well. As English is the commonest language on the Internet, you'd expect that a lot of web pages would exist with advice or suggestions on writing English for a world audience.
But in a thorough search in 2002, we found only two websites that cover this issue: one from New Zealand poet Rachel McAlpine on Quality Web Content and the other by Martin Schell, a freelance editor based in Indonesia, on Writing English as a Global Language.
Another page, by Rita Raley on What is Global English? has some interesting background information, but nothing useful on how to write Global English. A few other sites, though not addressing the issue directly, provide relevant information. One is Vocabulary resources - comments on word lists and corpuses. This site from Rob Waring in Japan describes and links to frequency lists for the 2000 commonest words. Though 2000 doesn't sound like a lot, most documents don't have many more words than that. For example, this page (at the moment of checking) had a total of 2438 words, of which 749 were different, according to Catherine Ball's software (see below), but 2392 words according to the Edict software. The difference of 46 words may be because the two programs use different definitions of exactly what a word is. (For example: is 2392 a word? Is a web address a word? Are HTML tags counted as words?)
But whether the page has 2392 or 2438 words in total, that's a lot less than the 15,000 mentioned above. To avoid confusion, think of the 2392 (etc) as drawn from the 15,000.
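Here is a small Python sketch of how two programs can count the same text differently. The two patterns are my own guesses at possible definitions of a "word". I don't know which definitions Catherine Ball's software or the Edict software really use.

```python
import re

def count_words(text, pattern):
    """Return (total words, different words) using the given definition of a word."""
    tokens = [token.lower() for token in re.findall(pattern, text)]
    return len(tokens), len(set(tokens))

# Definition 1: a word is a run of letters (apostrophes and hyphens allowed inside).
LETTERS_ONLY = r"[A-Za-z]+(?:['-][A-Za-z]+)*"
# Definition 2: a word is anything between spaces.
ANY_NON_SPACE = r"\S+"

sample = "Is 2392 a word? Is www.example.org a word? Is <b>bold</b> a word?"
print(count_words(sample, LETTERS_ONLY))   # (15, 8)
print(count_words(sample, ANY_NON_SPACE))  # (12, 6)
```

The first definition ignores the number but breaks the web address and the HTML tags into pieces. The second keeps each of them as one "word", punctuation included.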
Several programs that you can use to check the vocabulary of web pages are available on the Web. The best of these are...
The Edict Word Frequency Text Profiler compares any document with the 2,000 and 5,000 most common word families in English. You can use this program online at the Edict Virtual Language Centre. I submitted this page to it. It told me the page has 2392 words, of which 86% are in the most common 5000 families. Because that figure was 86% instead of the 98% threshold explained above, this page will be too difficult for you if your vocabulary is only 5000 word families.
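The basic calculation behind such a profiler can be sketched in Python. This is not the Edict program's own method: Edict works with word families, while this simplified version compares exact spellings against a small made-up list. The function name and the list are mine.

```python
import re

# Hypothetical stand-in for a list of the commonest 5000 words.
COMMON_WORDS = {"the", "cat", "sat", "on", "mat", "near", "moon"}

def known_word_coverage(text, known_words):
    """Share of the words in text (counting repeats) that are in known_words."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in known_words) / len(tokens)

sample = "The cat sat on the mat near the gibbous moon"
coverage = known_word_coverage(sample, COMMON_WORDS)
print(f"{coverage:.0%} of the words are known")  # 90% of the words are known
print("Probably readable" if coverage >= 0.98 else "Probably too difficult")
```

With one unknown word in ten, the sample falls well below the 98% threshold, just as this page does for a reader who knows only 5000 word families.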
The Edict Profiler is the closest thing I've found to the simplicity checker described above, because it re-displayed this page showing the unknown and rare words in two different colours. Not surprisingly, "Addis" and "Monde" (in the top section of this page) weren't in the top 5000. But nor were "English-speaking" or "Brazil" or "improving."
The Word Frequency Indexer is an online program by Catherine N Ball that produces frequency lists of all the words in a document. I experimented with this, and found that (other things being equal) the larger a document, the more different words it contains - simply because a lot of words are seldom used.
Paul Nation's site in New Zealand has two downloadable programs for Windows PCs: Range and Word. Range compares the vocabularies in several texts, and Word makes a list of word frequencies in a text.
Another program for checking word frequencies and producing concordances is TextSTAT, which can be downloaded from the Free University of Berlin. This is freeware, and runs under Windows.