Foreign language spam is a problem everyone shares, not just folks who read English. If you don't understand a language, and can't read the characters of the script used, you probably won't get much value from messages you receive. Worse, you won't understand the "text" of the links embedded in the message.
Common sense would dictate that you don't click on something you can't read, but decades of evidence prove otherwise. A practical solution to foreign language spam is to prevent it from appearing in your In Box.
If you are an Apple Mail (Mail.app) user, and your organization isn't filtering foreign languages at a mail server or antispam gateway, you'll find that filtering foreign spam is problematic for several reasons. First, there are thousands of languages. Many of them can be written using characters from more than one script, and each script can in turn be encoded using any of several character sets. I receive mail daily composed in Arabic, Chinese, Russian, Japanese, and Korean. Examining the raw From: and Subject: headers of these messages reveals that they are encoded using UTF-8, windows-1251, GB18030, iso-2022-jp, and KOI8-R. A Mail Rule that filters on these encodings alone quickly becomes messy.
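To see which character sets a raw header declares, Python's standard `email.header.decode_header` can be used. This is a minimal sketch for inspecting headers outside Mail.app; the helper name `header_charsets` and the sample headers are made up for illustration.

```python
from email.header import decode_header


def header_charsets(raw_header):
    """Return the charsets declared in a header's RFC 2047 encoded-words."""
    # decode_header yields (value, charset) pairs; charset is None for
    # plain ASCII runs that are not encoded-words.
    return {charset for _, charset in decode_header(raw_header) if charset}


# Hypothetical sample headers, not taken from real messages.
print(header_charsets("=?windows-1251?Q?abc?="))
print(header_charsets("Plain ASCII subject"))
```

Running this over a mailbox's Subject: headers gives a quick inventory of the encodings a charset-based rule would have to enumerate.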
Filtering based on the "language" found in email messages has similar scaling and messiness issues. The problem is compounded by the fact that email messages encoded to support languages that cannot be represented in 7-bit ASCII can also have multiple body "parts" (as well as attachments). Body parts are typically distinguished and bounded by the use of Content-Type: headers. Apple's Mail.app seems to examine only the first header of this type that it encounters, which is often (perhaps intentionally by spammers) multipart/alternative or multipart/mixed, so a rule that is supposed to filter text/plain; charset="windows-1251" as junk will fail to match.
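The multipart problem is easier to see when each part's Content-Type is listed. The following is a rough sketch using Python's standard `email` package; the sample message and the helper name `part_charsets` are invented for illustration, and it says nothing about how Mail.app itself parses messages.

```python
import email

# Hypothetical message: the charset appears only on the inner parts.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="windows-1251"

hello
--XYZ
Content-Type: text/html; charset="windows-1251"

<p>hello</p>
--XYZ--
"""


def part_charsets(raw_message):
    """List (content type, charset) for the message and every nested part."""
    msg = email.message_from_string(raw_message)
    # walk() visits the top-level message first, then each body part.
    return [(p.get_content_type(), p.get_content_charset()) for p in msg.walk()]


for content_type, charset in part_charsets(SAMPLE):
    print(content_type, charset)
```

The top-level entry is multipart/alternative with no charset at all, which is consistent with a rule that only looks at the first Content-Type: header never seeing windows-1251.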
I found a solution at Buzzdroid that works on the principle that certain words (of, the, and, is, a) are found in nearly any rationally composed English message body. Buzzdroid's logic is straightforward: if these words are not present, the message is probably not "readable English". A Mail Rule of this kind looks like this:
My variation on this theme (below) is working well, as the Junk folder illustrates:
This is an imperfect solution to be sure. For example, Exchange Calendar invitations match the rule. Tempting... I created an exception rule "Calendar" and made certain the rule preceded the Foreign Language Spam rule. Occasionally I receive email composed as if it were a text or Tweet, and for these I may whitelist the sender. After 10 days of testing, I'm happier with a 1-in-10 false positive rate in my Junk folder than with the distraction of messages I cannot read in my In Box.
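The common-word idea can be tried outside Mail.app with a few lines of Python. This is only a sketch of the heuristic, not the Mail Rule itself: the function name `looks_like_english` is made up, the word list is the one quoted above, and whole-word matching is an assumption (Mail.app's "contains" conditions may behave differently).

```python
import re

# The common English words named in the Buzzdroid approach.
COMMON_WORDS = {"of", "the", "and", "is", "a"}


def looks_like_english(body, common_words=COMMON_WORDS):
    """Return True if the body contains at least one of the common words."""
    # Split into runs of letters (any script), lowercased, and match whole words.
    words = set(re.findall(r"[^\W\d_]+", body.lower()))
    return not common_words.isdisjoint(words)


# Hypothetical message bodies for illustration.
print(looks_like_english("The invoice is attached"))
print(looks_like_english("Привет, как дела?"))
print(looks_like_english("Meeting 3pm room 4"))
```

The last sample is terse English with none of the common words, so it is classified the same way as the Russian text. That is the same kind of false positive described above for messages written like a text or Tweet.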
As the Buzzdroid folks say, "Any Apple Mail Rule should only be the last line of defense against spam." This problem can be handled at mail servers or antispam gateways but there is an administrative cost for organizations with language diversity: while it might serve me to block all languages that do not use characters from Latin alphabets, many of my colleagues would be inconvenienced.
I'm curious to hear from others whether a similar approach would work for other languages. If you come up with an Arabic, Chinese, Japanese, Korean, or Russian "only" Rule for Mail.app, let me know.